Abstract
Detecting mental health crisis situations such as suicide ideation, rape,
domestic violence, child abuse, and sexual harassment is a critical yet
underexplored challenge for language models. When such situations arise during
user–model interactions, models must reliably flag them, as failure to do so
can have serious consequences. In this work, we introduce CRADLE BENCH, a
benchmark for multi-faceted crisis detection. Unlike previous efforts that
focus on a limited set of crisis types, our benchmark covers seven types
defined in line with clinical standards and is the first to incorporate
temporal labels. Our benchmark provides 600 clinician-annotated evaluation
examples and 420 development examples, together with a training corpus of
around 4K examples automatically labeled using a majority-vote ensemble of
multiple language models, which significantly outperforms single-model
annotation. We further fine-tune six crisis detection models on subsets defined
by consensus and unanimous ensemble agreement, providing complementary models
trained under different agreement criteria.
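The majority-vote ensemble labeling and the consensus/unanimous agreement subsets mentioned above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name, label strings, and three-annotator setup are assumptions for the example.

```python
from collections import Counter

def majority_vote_label(annotations):
    """Aggregate one example's crisis labels from several LM annotators.

    `annotations` is a list of labels, one per model. Returns the most
    common label together with flags for strict-majority consensus and
    unanimous agreement, mirroring the two training subsets described
    in the abstract.
    """
    counts = Counter(annotations)
    label, votes = counts.most_common(1)[0]
    return {
        "label": label,
        "consensus": votes > len(annotations) / 2,  # strict majority of annotators
        "unanimous": votes == len(annotations),     # all annotators agree
    }

# Hypothetical labels from three model annotators for one conversation
result = majority_vote_label(["suicide_ideation", "suicide_ideation", "none"])
```

Examples where only a strict majority agrees would land in the consensus subset, while examples with full agreement also qualify for the unanimous subset.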
Authors (5)
Grace Byun
Rebecca Lipschutz
Sean T. Minton
Abigail Lott
Jinho D. Choi
Submitted
October 27, 2025
Key Contributions
CRADLE BENCH is introduced as a comprehensive benchmark for multi-faceted mental health crisis and safety risk detection, covering seven clinically defined crisis types with temporal labels. It includes clinician-annotated evaluation data and a large training corpus automatically labeled by a majority-vote ensemble of language models, which significantly outperforms single-model annotation.
Business Value
Enables the development of safer AI systems that can reliably identify and respond to users in distress, crucial for platforms dealing with sensitive user interactions.