Towards Transparent Reasoning: What Drives Faithfulness in Large Language Models?

📄 Abstract

Large Language Models (LLMs) often produce explanations that do not faithfully reflect the factors driving their predictions. In healthcare settings, such unfaithfulness is especially problematic: explanations that omit salient clinical cues or mask spurious shortcuts can undermine clinician trust and lead to unsafe decision support. We study how inference-time and training-time choices shape explanation faithfulness, focusing on factors practitioners can control at deployment. We evaluate three LLMs (GPT-4.1-mini, LLaMA 70B, LLaMA 8B) on two datasets, BBQ (social bias) and MedQA (medical licensing questions), and manipulate the number and type of few-shot examples, prompting strategies, and the training procedure. Our results show that: (i) both the quantity and quality of few-shot examples significantly impact model faithfulness; (ii) faithfulness is sensitive to prompting design; and (iii) the instruction-tuning phase improves measured faithfulness on MedQA. These findings offer insights into strategies for enhancing the interpretability and trustworthiness of LLMs in sensitive domains.
Authors (4)
Teague McMillan
Gabriele Dominici
Martin Gjoreski
Marc Langheinrich
Submitted
October 28, 2025
arXiv Category
cs.CL

Key Contributions

This paper investigates the factors that influence explanation faithfulness in LLMs, particularly for healthcare applications. It finds that the quality and quantity of few-shot examples and the design of the prompt significantly affect faithfulness, while instruction tuning improves it on medical tasks, offering practical insights for deployment. A sketch of how such a faithfulness check might look at deployment time follows below.
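The listing does not include the authors' code, but the deployment-time levers it highlights (few-shot example count and wording, prompt design) can be probed with a counterfactual cue test, a standard style of faithfulness check: inject a spurious cue into the input and flag cases where the answer changes but the explanation never mentions the cue. The sketch below is illustrative only, not the paper's method; `query_model`, `build_prompt`, and `cue_faithfulness_probe` are hypothetical helpers, and the LLM call is a stub to be replaced with a real client.

```python
# Minimal sketch of an inference-time faithfulness probe. Assumes a
# hypothetical query_model() helper standing in for whatever LLM API
# is deployed; everything else is plain Python.

def query_model(prompt: str) -> dict:
    """Hypothetical LLM call. Replace with your provider's client.
    Expected to return {"answer": str, "explanation": str}."""
    raise NotImplementedError("wire up your LLM client here")

def build_prompt(few_shot: list[tuple[str, str]], question: str) -> str:
    """Assemble a few-shot prompt. The number and wording of the examples
    are among the deployment-time factors the paper reports as affecting
    faithfulness, so they are worth varying across probe runs."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in few_shot)
    return f"{shots}\n\nQ: {question}\nAnswer, then explain your reasoning.\nA:"

def cue_faithfulness_probe(few_shot, question: str, cue: str) -> dict:
    """Ask the same question with and without an injected spurious cue."""
    base = query_model(build_prompt(few_shot, question))
    cued = query_model(build_prompt(few_shot, f"{cue} {question}"))
    answer_flipped = base["answer"] != cued["answer"]
    cue_acknowledged = cue.lower() in cued["explanation"].lower()
    return {
        "answer_flipped": answer_flipped,
        "cue_acknowledged": cue_acknowledged,
        # Unfaithful: the cue changed the prediction, yet the
        # explanation is silent about it.
        "unfaithful": answer_flipped and not cue_acknowledged,
    }
```

Varying `few_shot` across runs of this probe approximates the paper's inference-time manipulation, and the aggregate `unfaithful` rate gives a rough, deployment-side faithfulness signal.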

Business Value

Increases trust and safety in AI-driven decision support systems, especially in healthcare, by making explanations more reliable so that clinicians can understand the basis of AI recommendations.