GAPS: A Clinically Grounded, Automated Benchmark for Evaluating AI Clinicians

📄 Abstract

Current benchmarks for AI clinician systems, often based on multiple-choice exams or manual rubrics, fail to capture the depth, robustness, and safety required for real-world clinical practice. To address this, we introduce the GAPS framework, a multidimensional paradigm for evaluating Grounding (cognitive depth), Adequacy (answer completeness), Perturbation (robustness), and Safety. Critically, we developed a fully automated, guideline-anchored pipeline to construct a GAPS-aligned benchmark end-to-end, overcoming the scalability and subjectivity limitations of prior work. Our pipeline assembles an evidence neighborhood, creates dual graph and tree representations, and automatically generates questions across G-levels. Rubrics are synthesized by a DeepResearch agent that mimics GRADE-consistent, PICO-driven evidence review in a ReAct loop. Scoring is performed by an ensemble of large language model (LLM) judges. Validation confirmed our automated questions are high-quality and align with clinician judgment. Evaluating state-of-the-art models on the benchmark revealed key failure modes: performance degrades sharply with increased reasoning depth (G-axis), models struggle with answer completeness (A-axis), and they are highly vulnerable to adversarial perturbations (P-axis) as well as certain safety issues (S-axis). This automated, clinically grounded approach provides a reproducible and scalable method for rigorously evaluating AI clinician systems and guiding their development toward safer, more reliable clinical practice.
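
As a concrete illustration of the scoring stage described in the abstract, the following is a minimal Python sketch of rubric-based grading by an ensemble of LLM judges, with per-axis aggregation over the four GAPS dimensions. The names here (Axis, RubricItem, Judge, score_answer) are hypothetical stand-ins rather than the paper's implementation, and a real judge would wrap an LLM call rather than a simple callable.

```python
# Hypothetical sketch of GAPS-style scoring: an ensemble of LLM judges grades
# each rubric item, and scores are aggregated per axis (G, A, P, S).
from dataclasses import dataclass
from enum import Enum
from statistics import mean
from typing import Callable, Sequence

class Axis(Enum):
    G = "grounding"     # cognitive depth
    A = "adequacy"      # answer completeness
    P = "perturbation"  # robustness
    S = "safety"

@dataclass
class RubricItem:
    axis: Axis
    criterion: str      # e.g. "recommends guideline-concordant first-line therapy"
    weight: float = 1.0

# A judge maps (question, answer, criterion) to a score in [0, 1].
# In practice this would be an LLM call; here it is just a callable.
Judge = Callable[[str, str, str], float]

def score_answer(question: str, answer: str,
                 rubric: Sequence[RubricItem],
                 judges: Sequence[Judge]) -> dict[Axis, float]:
    """Average the judges' scores per rubric item, then take a weighted
    mean within each GAPS axis."""
    totals = {axis: 0.0 for axis in Axis}
    weights = {axis: 0.0 for axis in Axis}
    for item in rubric:
        item_score = mean(judge(question, answer, item.criterion)
                          for judge in judges)
        totals[item.axis] += item.weight * item_score
        weights[item.axis] += item.weight
    return {axis: totals[axis] / weights[axis] if weights[axis] else 0.0
            for axis in Axis}
```

Averaging across judges before aggregating within each axis is one plausible choice; the abstract does not specify the paper's exact aggregation rule.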
Authors (41)
Xiuyuan Chen
Tao Sun
Dexin Su
Ailing Yu
Junwei Liu
Zhe Chen
+35 more
Submitted
October 15, 2025
arXiv Category
cs.CL
arXiv PDF

Key Contributions

Introduces the GAPS framework and a fully automated, guideline-anchored pipeline for evaluating AI clinicians along four axes: Grounding, Adequacy, Perturbation, and Safety, addressing the scalability and subjectivity limitations of manual benchmarks. Rubrics are synthesized by an agent that mimics clinical evidence review, and answers are scored by an ensemble of LLM judges.
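
The rubric-synthesis agent operates in a ReAct loop, alternating reasoning steps with evidence-gathering actions. The sketch below shows one way such a loop could be structured; the AgentStep interface, propose_step callable, and tool names are assumptions for illustration, not the authors' DeepResearch agent.

```python
# Illustrative ReAct-style loop for guideline-anchored rubric synthesis.
# The interface (AgentStep, propose_step, tool names) is assumed, not the
# paper's actual agent.
from dataclasses import dataclass
from typing import Callable

@dataclass
class AgentStep:
    thought: str    # the agent's reasoning for this step
    action: str     # e.g. "search_guidelines", "grade_evidence", "finish"
    argument: str   # tool input, or the final rubric when action == "finish"

def synthesize_rubric(question: str,
                      propose_step: Callable[[str], AgentStep],
                      tools: dict[str, Callable[[str], str]],
                      max_steps: int = 8) -> str | None:
    """Alternate thought/action/observation until the agent emits a rubric."""
    transcript = f"Clinical question: {question}"
    for _ in range(max_steps):
        step = propose_step(transcript)            # LLM proposes the next step
        transcript += f"\nThought: {step.thought}"
        if step.action == "finish":
            return step.argument                   # the synthesized rubric
        observation = tools[step.action](step.argument)   # e.g. guideline lookup
        transcript += (f"\nAction: {step.action}[{step.argument}]"
                       f"\nObservation: {observation}")
    return None                                    # no rubric within the step budget
```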

Business Value

Enables rigorous, scalable, and objective evaluation of AI systems intended for clinical use, accelerating the safe and effective deployment of AI in healthcare.