Research Paper · Audience: AI Researchers, Medical AI Developers, Healthcare Professionals, Regulators in Healthcare

Medical Hallucinations in Foundation Models and Their Impact on Healthcare

📄 Abstract

Hallucinations in foundation models arise from autoregressive training objectives that prioritize token-likelihood optimization over epistemic accuracy, fostering overconfidence and poorly calibrated uncertainty. We define medical hallucination as any model-generated output that is factually incorrect, logically inconsistent, or unsupported by authoritative clinical evidence in ways that could alter clinical decisions. We evaluated 11 foundation models (7 general-purpose, 4 medical-specialized) across seven medical hallucination tasks spanning medical reasoning and biomedical information retrieval. General-purpose models achieved significantly higher proportions of hallucination-free responses than medical-specialized models (median: 76.6% vs 51.3%, difference = 25.2%, 95% CI: 18.7-31.3%, Mann-Whitney U = 27.0, p = 0.012, rank-biserial r = -0.64). Top-performing models such as Gemini-2.5 Pro exceeded 97% accuracy when augmented with chain-of-thought prompting (base: 87.6%), while medical-specialized models like MedGemma ranged from 28.6-61.9% despite explicit training on medical corpora. Chain-of-thought reasoning significantly reduced hallucinations in 86.4% of tested comparisons after FDR correction (q < 0.05), demonstrating that explicit reasoning traces enable self-verification and error detection. Physician audits confirmed that 64-72% of residual hallucinations stemmed from causal or temporal reasoning failures rather than knowledge gaps. A global survey of clinicians (n = 70) validated real-world impact: 91.8% had encountered medical hallucinations, and 84.7% considered them capable of causing patient harm. The underperformance of medical-specialized models despite domain training indicates that safety emerges from sophisticated reasoning capabilities and broad knowledge integration developed during large-scale pre-training, not from narrow optimization.
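
The abstract's group comparison and FDR-corrected chain-of-thought results rest on standard nonparametric statistics. The sketch below shows how such numbers could be computed in principle; the score and p-value lists are placeholders rather than the paper's data, and scipy/statsmodels are assumed only as convenient tools, not as the authors' actual pipeline.

```python
# Hedged sketch: Mann-Whitney U comparison with a rank-biserial effect size,
# plus Benjamini-Hochberg FDR correction over several per-task comparisons.
# All numeric inputs below are placeholders, NOT the paper's data.
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

# Hypothetical hallucination-free response rates (%) per model.
general_purpose = [76.6, 81.2, 97.3, 74.0, 68.5, 79.1, 72.4]   # 7 models (placeholder values)
medical_specialized = [51.3, 28.6, 61.9, 44.2]                  # 4 models (placeholder values)

u_stat, p_value = mannwhitneyu(general_purpose, medical_specialized,
                               alternative="two-sided")
n1, n2 = len(general_purpose), len(medical_specialized)
# One common convention: rank-biserial r = 2*U/(n1*n2) - 1, with scipy's U
# computed for the first sample.
rank_biserial = 2 * u_stat / (n1 * n2) - 1
print(f"U = {u_stat:.1f}, p = {p_value:.3f}, rank-biserial r = {rank_biserial:.2f}")

# Benjamini-Hochberg FDR correction across many CoT-vs-base comparisons
# (raw p-values are placeholders).
raw_p = [0.001, 0.02, 0.04, 0.30, 0.008]
rejected, q_values, _, _ = multipletests(raw_p, alpha=0.05, method="fdr_bh")
print(list(zip(rejected, q_values.round(3))))
```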
Authors (27)
Yubin Kim
Hyewon Jeong
Shan Chen
Shuyue Stella Li
Chanwoo Park
Mingyu Lu
+21 more
Submitted: February 26, 2025
arXiv Category: cs.CL
arXiv PDF

Key Contributions

The paper defines and evaluates medical hallucinations in foundation models, finding that general-purpose models outperform medical-specialized ones. It attributes hallucinations to autoregressive training objectives that prioritize token likelihood over epistemic accuracy. Augmenting Gemini-2.5 Pro with chain-of-thought prompting raised its hallucination-free accuracy from 87.6% to over 97%; a minimal prompting sketch follows below.
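
To make the chain-of-thought augmentation concrete, here is a minimal sketch of how a medical QA prompt might be wrapped with an explicit reasoning instruction. The function names (build_cot_prompt, answer_with_cot, call_model) and the prompt wording are illustrative assumptions, not the paper's actual template or API.

```python
# Hedged sketch of chain-of-thought (CoT) augmentation for a medical QA item.
# `call_model` is a hypothetical stand-in for whatever LLM client is used
# (e.g. a Gemini-2.5 Pro endpoint); the prompt text is illustrative only.
def build_cot_prompt(question: str, options: list[str]) -> str:
    formatted_options = "\n".join(f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options))
    return (
        "You are answering a medical exam question.\n"
        f"Question: {question}\n"
        f"Options:\n{formatted_options}\n\n"
        # CoT instruction: request explicit reasoning before the final answer,
        # which the abstract credits with enabling self-verification.
        "Think step by step: state the relevant clinical facts, rule out "
        "inconsistent options, then give the final answer as a single letter."
    )

def answer_with_cot(question: str, options: list[str], call_model) -> str:
    """Return the model's reasoning-plus-answer text for one question."""
    return call_model(build_cot_prompt(question, options))

if __name__ == "__main__":
    demo_question = "Which electrolyte abnormality most commonly causes peaked T waves?"
    demo_options = ["Hypokalemia", "Hyperkalemia", "Hypocalcemia", "Hypernatremia"]
    # Echo the prompt instead of calling a real API in this sketch.
    print(answer_with_cot(demo_question, demo_options, call_model=lambda p: p))
```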

Business Value

This work is crucial for ensuring patient safety and trust in AI-driven healthcare applications, because it identifies and helps mitigate the risks posed by inaccurate medical information generated by AI.