📄 Abstract
Hallucinations in foundation models arise from autoregressive training
objectives that prioritize token-likelihood optimization over epistemic
accuracy, fostering overconfidence and poorly calibrated uncertainty. We define
medical hallucination as any model-generated output that is factually
incorrect, logically inconsistent, or unsupported by authoritative clinical
evidence in ways that could alter clinical decisions. We evaluated 11
foundation models (7 general-purpose, 4 medical-specialized) across seven
medical hallucination tasks spanning medical reasoning and biomedical
information retrieval. General-purpose models achieved significantly higher
proportions of hallucination-free responses than medical-specialized models
(median: 76.6% vs 51.3%, difference = 25.2%, 95% CI: 18.7-31.3%, Mann-Whitney U
= 27.0, p = 0.012, rank-biserial r = -0.64). Top-performing models such as
Gemini-2.5 Pro exceeded 97% accuracy when augmented with chain-of-thought
prompting (base: 87.6%), while medical-specialized models like MedGemma ranged
from 28.6-61.9% despite explicit training on medical corpora. Chain-of-thought
reasoning significantly reduced hallucinations in 86.4% of tested comparisons
after FDR correction (q < 0.05), demonstrating that explicit reasoning traces
enable self-verification and error detection. Physician audits confirmed that
64-72% of residual hallucinations stemmed from causal or temporal reasoning
failures rather than knowledge gaps. A global survey of clinicians (n = 70)
validated real-world impact: 91.8% had encountered medical hallucinations, and
84.7% considered them capable of causing patient harm. The underperformance of
medical-specialized models despite domain training indicates that safety
emerges from sophisticated reasoning capabilities and broad knowledge
integration developed during large-scale pre-training, not from narrow
optimization.
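The group comparison and multiple-comparison correction reported above can be reproduced in outline with standard statistical tooling. The sketch below is illustrative only: the per-model hallucination-free proportions and per-task p-values are hypothetical placeholders rather than the study's data, and the rank-biserial formula shown is one common convention whose sign depends on which group's U statistic is reported.

```python
# Illustrative sketch of the group comparison and FDR correction described in
# the abstract. All numeric inputs below are hypothetical placeholders, not
# the paper's measured data.
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

# Hallucination-free response proportions (%) per model -- placeholder values.
general_purpose = [97.2, 88.0, 81.5, 76.6, 74.1, 70.3, 65.8]   # 7 models
medical_specialized = [61.9, 55.0, 47.5, 28.6]                 # 4 models

# Two-sided Mann-Whitney U test comparing the two model groups.
u_stat, p_value = mannwhitneyu(general_purpose, medical_specialized,
                               alternative="two-sided")

# Rank-biserial effect size under one common convention,
# r = 1 - 2U / (n1 * n2); the sign depends on which group's U is reported.
n1, n2 = len(general_purpose), len(medical_specialized)
rank_biserial = 1 - 2 * u_stat / (n1 * n2)
print(f"U = {u_stat:.1f}, p = {p_value:.3f}, rank-biserial r = {rank_biserial:.2f}")

# Benjamini-Hochberg FDR correction across per-task CoT-vs-base comparisons.
cot_p_values = [0.001, 0.004, 0.020, 0.030, 0.080, 0.300]      # placeholders
rejected, q_values, _, _ = multipletests(cot_p_values, alpha=0.05, method="fdr_bh")
print(f"Comparisons significant after FDR (q < 0.05): {rejected.sum()}/{len(cot_p_values)}")
```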
Authors (27)
Yubin Kim
Hyewon Jeong
Shan Chen
Shuyue Stella Li
Chanwoo Park
Mingyu Lu
+21 more
Submitted
February 26, 2025
Key Contributions
The paper defines and evaluates medical hallucinations in foundation models, finding that general-purpose models outperform medical-specialized ones. It attributes hallucinations to autoregressive training objectives that prioritize token likelihood over epistemic accuracy, and shows that augmenting Gemini-2.5 Pro with chain-of-thought prompting significantly increased its proportion of hallucination-free responses.
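The chain-of-thought augmentation credited with that gain works by eliciting an explicit reasoning trace before the final answer. A minimal prompting sketch follows; the question and template wording are hypothetical illustrations, not the paper's benchmark items or prompts.

```python
# Hypothetical base vs. chain-of-thought (CoT) prompt templates. The question
# and wording are illustrative only, not items from the paper's benchmarks.
QUESTION = ("A patient stabilized on warfarin starts a new antibiotic and their "
            "INR rises. Which mechanism most likely explains the change?")

# Base condition: the model is asked to answer directly.
base_prompt = f"Answer the following medical question.\n\n{QUESTION}\n\nAnswer:"

# CoT condition: the model is asked to reason step by step and check each step
# against clinical evidence before committing to a final answer, giving it an
# opportunity to self-verify -- the effect the abstract reports.
cot_prompt = (
    "Answer the following medical question. Reason step by step about the "
    "relevant mechanisms, verify each step against established clinical "
    "evidence, and only then state your final answer.\n\n"
    f"{QUESTION}\n\nStep-by-step reasoning:"
)

print(base_prompt)
print("---")
print(cot_prompt)
```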
Business Value
Crucial for ensuring patient safety and trust in AI-driven healthcare applications by identifying and mitigating the risks posed by inaccurate, AI-generated medical information.