📄 Abstract
Current benchmarks for AI clinician systems, often based on multiple-choice
exams or manual rubrics, fail to capture the depth, robustness, and safety
required for real-world clinical practice. To address this, we introduce the
GAPS framework, a multidimensional paradigm for evaluating Grounding
(cognitive depth), Adequacy (answer completeness), Perturbation
(robustness), and Safety. Critically, we
developed a fully automated, guideline-anchored pipeline to construct a
GAPS-aligned benchmark end-to-end, overcoming the scalability and subjectivity
limitations of prior work. Our pipeline assembles an evidence neighborhood,
creates dual graph and tree representations, and automatically generates
questions across G-levels. Rubrics are synthesized by a DeepResearch agent that
mimics GRADE-consistent, PICO-driven evidence review in a ReAct loop. Scoring
is performed by an ensemble of large language model (LLM) judges. Validation
confirmed that our automated questions are of high quality and align with clinician
judgment. Evaluating state-of-the-art models on the benchmark revealed key
failure modes: performance degrades sharply with increased reasoning depth
(G-axis), models struggle with answer completeness (A-axis), and they are
highly vulnerable to adversarial perturbations (P-axis) as well as certain
safety issues (S-axis). This automated, clinically grounded approach provides a
reproducible and scalable method for rigorously evaluating AI clinician systems
and guiding their development toward safer, more reliable clinical practice.
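The abstract does not spell out the scoring machinery, so the following is a minimal sketch of what an ensemble-of-LLM-judges rubric scorer could look like. All names here (`RubricItem`, `judge_prompt`, `score_answer`), the yes/no prompt format, and the weighted soft-vote aggregation are illustrative assumptions, not the paper's released implementation.

```python
# Hypothetical sketch of GAPS-style rubric scoring with an ensemble of LLM judges.
# Each judge is any callable that takes a prompt string and returns a reply string,
# so a real LLM backend can be dropped in behind the same interface.

from dataclasses import dataclass
from statistics import mean
from typing import Callable, List


@dataclass
class RubricItem:
    criterion: str   # e.g., "Recommends guideline-concordant first-line therapy"
    weight: float    # relative importance of this criterion within the rubric


def judge_prompt(answer: str, item: RubricItem) -> str:
    """Builds a binary grading prompt for one rubric criterion (assumed format)."""
    return (
        "You are grading a clinical answer against one rubric criterion.\n"
        f"Criterion: {item.criterion}\n"
        f"Answer: {answer}\n"
        "Reply with 1 if the criterion is satisfied, else 0."
    )


def score_answer(
    answer: str,
    rubric: List[RubricItem],
    judges: List[Callable[[str], str]],
) -> float:
    """Weighted rubric score in [0, 1], soft-voting each criterion across judges."""
    total, weight_sum = 0.0, 0.0
    for item in rubric:
        votes = [
            1.0 if judge(judge_prompt(answer, item)).strip().startswith("1") else 0.0
            for judge in judges
        ]
        total += item.weight * mean(votes)  # average verdict per criterion
        weight_sum += item.weight
    return total / weight_sum if weight_sum else 0.0


if __name__ == "__main__":
    # Toy deterministic "judges" standing in for real LLM backends.
    judges = [lambda p: "1", lambda p: "1" if "metformin" in p else "0"]
    rubric = [
        RubricItem("Mentions metformin as first-line therapy", 2.0),
        RubricItem("Advises HbA1c monitoring", 1.0),
    ]
    print(score_answer("Start metformin; recheck HbA1c in 3 months.", rubric, judges))
```

Averaging per-criterion verdicts across independent judges is one simple way to damp any single judge's bias; the paper's actual aggregation rule may well differ.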
Authors (41)
Xiuyuan Chen
Tao Sun
Dexin Su
Ailing Yu
Junwei Liu
Zhe Chen
…and 35 more
Submitted
October 15, 2025
Key Contributions
Introduces the GAPS framework and a fully automated pipeline for evaluating AI clinicians, addressing the scalability and subjectivity limitations of manual benchmarks by assessing Grounding, Adequacy, Perturbation, and Safety. The pipeline scores answers with an ensemble of LLM judges and synthesizes rubrics by mimicking clinical evidence review.
Business Value
Enables rigorous, scalable, and objective evaluation of AI systems intended for clinical use, accelerating the safe and effective deployment of AI in healthcare.