📄 Abstract
Current benchmarks for AI clinician systems, often based on multiple-choice
exams or manual rubrics, fail to capture the depth, robustness, and safety
required for real-world clinical practice. To address this, we introduce the
GAPS framework, a multidimensional paradigm for evaluating Grounding
(cognitive depth), Adequacy (answer completeness), Perturbation
(robustness), and Safety. Critically, we
developed a fully automated, guideline-anchored pipeline to construct a
GAPS-aligned benchmark end-to-end, overcoming the scalability and subjectivity
limitations of prior work. Our pipeline assembles an evidence neighborhood,
creates dual graph and tree representations, and automatically generates
questions across G-levels. Rubrics are synthesized by a DeepResearch agent that
mimics GRADE-consistent, PICO-driven evidence review in a ReAct loop. Scoring
is performed by an ensemble of large language model (LLM) judges. Validation
confirmed that our automated questions are of high quality and align with clinician
judgment. Evaluating state-of-the-art models on the benchmark revealed key
failure modes: performance degrades sharply with increased reasoning depth
(G-axis), models struggle with answer completeness (A-axis), and they are
highly vulnerable to adversarial perturbations (P-axis) as well as certain
safety issues (S-axis). This automated, clinically grounded approach provides a
reproducible and scalable method for rigorously evaluating AI clinician systems
and guiding their development toward safer, more reliable clinical practice.
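The abstract does not spell out the scoring machinery, so the following is a minimal sketch of what an ensemble-of-LLM-judges rubric scorer could look like. All names here (`RubricItem`, `judge_prompt`, `score_answer`), the yes/no prompt format, and the weighted soft-vote aggregation are illustrative assumptions, not the paper's released implementation.

```python
# Hypothetical sketch of GAPS-style rubric scoring with an ensemble of LLM judges.
# Each judge is any callable that takes a prompt string and returns a reply string,
# so a real LLM backend can be dropped in behind the same interface.

from dataclasses import dataclass
from statistics import mean
from typing import Callable, List


@dataclass
class RubricItem:
    criterion: str   # e.g., "Recommends guideline-concordant first-line therapy"
    weight: float    # relative importance of this criterion within the rubric


def judge_prompt(answer: str, item: RubricItem) -> str:
    """Builds a binary grading prompt for one rubric criterion (assumed format)."""
    return (
        "You are grading a clinical answer against one rubric criterion.\n"
        f"Criterion: {item.criterion}\n"
        f"Answer: {answer}\n"
        "Reply with 1 if the criterion is satisfied, else 0."
    )


def score_answer(
    answer: str,
    rubric: List[RubricItem],
    judges: List[Callable[[str], str]],
) -> float:
    """Weighted rubric score in [0, 1], soft-voting each criterion across judges."""
    total, weight_sum = 0.0, 0.0
    for item in rubric:
        votes = [
            1.0 if judge(judge_prompt(answer, item)).strip().startswith("1") else 0.0
            for judge in judges
        ]
        total += item.weight * mean(votes)  # average verdict per criterion
        weight_sum += item.weight
    return total / weight_sum if weight_sum else 0.0


if __name__ == "__main__":
    # Toy deterministic "judges" standing in for real LLM backends.
    judges = [lambda p: "1", lambda p: "1" if "metformin" in p else "0"]
    rubric = [
        RubricItem("Mentions metformin as first-line therapy", 2.0),
        RubricItem("Advises HbA1c monitoring", 1.0),
    ]
    print(score_answer("Start metformin; recheck HbA1c in 3 months.", rubric, judges))
```

Averaging per-criterion verdicts across independent judges is one simple way to damp any single judge's bias; the paper's actual aggregation rule may well differ.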
Authors (41)
Xiuyuan Chen
Tao Sun
Dexin Su
Ailing Yu
Junwei Liu
Zhe Chen
…and 35 more
Submitted
October 15, 2025
Key Contributions
Introduces the GAPS framework and a fully automated pipeline for evaluating AI clinicians, addressing the scalability and subjectivity limitations of manual benchmarks by assessing Grounding, Adequacy, Perturbation, and Safety. The pipeline scores answers with an ensemble of LLM judges and synthesizes rubrics by mimicking clinical evidence review.
Business Value
Enables rigorous, scalable, and objective evaluation of AI systems intended for clinical use, accelerating the safe and effective deployment of AI in healthcare.