Abstract
While Speech Large Language Models (Speech-LLMs) perform strongly in many applications, their robustness to speech disfluency is critically under-tested. Existing evaluations often rely on idealized inputs, overlooking common disfluencies, particularly those associated with conditions like Parkinson's disease. This work investigates whether current Speech-LLMs can maintain performance when interacting with users who have speech impairments. To facilitate this inquiry, we introduce VocalBench-DF, a framework for systematically evaluating disfluency across a multi-dimensional taxonomy. Our evaluation of 22 mainstream Speech-LLMs reveals substantial performance degradation, indicating that their real-world readiness is limited. Further analysis identifies phoneme-level processing and long-context modeling as the primary bottlenecks responsible for these failures. Strengthening the recognition and reasoning capabilities of individual components and pipelines can substantially improve robustness. These findings highlight the urgent need for new methods that improve disfluency handling and build truly inclusive Speech-LLMs.
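To make the evaluation setup concrete, the sketch below illustrates one plausible way a framework like this could inject taxonomy-driven disfluencies into prompts and measure the resulting accuracy drop. All function names, taxonomy dimensions, and injection rates here are illustrative assumptions, not the authors' actual VocalBench-DF implementation, which operates on speech rather than text.

```python
# Hypothetical sketch of a disfluency-robustness evaluation in the spirit
# of VocalBench-DF. The taxonomy dimensions and injection rules below are
# illustrative assumptions, not the paper's implementation.
import random
from typing import Callable

def repeat_words(words: list[str], rate: float, rng: random.Random) -> list[str]:
    """Word repetition, e.g. 'the the cat' (one disfluency dimension)."""
    out = []
    for w in words:
        out.append(w)
        if rng.random() < rate:
            out.append(w)  # duplicate the word to simulate a repetition
    return out

def insert_fillers(words: list[str], rate: float, rng: random.Random) -> list[str]:
    """Filled pauses such as 'um' / 'uh' (another dimension)."""
    out = []
    for w in words:
        if rng.random() < rate:
            out.append(rng.choice(["um", "uh"]))
        out.append(w)
    return out

# Illustrative multi-dimensional taxonomy: each dimension maps to a
# text-level perturbation mimicking a spoken disfluency.
TAXONOMY: dict[str, Callable[[list[str], float, random.Random], list[str]]] = {
    "repetition": repeat_words,
    "filler": insert_fillers,
}

def inject(text: str, kind: str, rate: float = 0.15, seed: int = 0) -> str:
    """Apply one taxonomy dimension to a clean prompt."""
    rng = random.Random(seed)
    return " ".join(TAXONOMY[kind](text.split(), rate, rng))

def degradation(model: Callable[[str], str],
                items: list[tuple[str, str]], kind: str) -> float:
    """Accuracy drop (clean minus disfluent) for one taxonomy dimension."""
    clean = sum(model(q).strip() == a for q, a in items)
    noisy = sum(model(inject(q, kind)).strip() == a for q, a in items)
    return (clean - noisy) / len(items)
```

A real harness would apply the perturbations at the audio level (e.g., via TTS resynthesis or time-domain editing) rather than on transcripts, but the core measurement is the same: the clean-versus-perturbed performance delta per taxonomy dimension.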
Authors (6)
Hongcheng Liu
Yixuan Hou
Heyang Liu
Yuhao Wang
Yanfeng Wang
Yu Wang
Submitted
October 17, 2025
Key Contributions
This paper introduces VocalBench-DF, a benchmark for evaluating the robustness of Speech-LLMs to disfluency, particularly for users with speech impairments such as those caused by Parkinson's disease. Evaluations of 22 mainstream Speech-LLMs reveal substantial performance degradation and identify phoneme-level processing and long-context modeling as key bottlenecks that limit real-world readiness.
Business Value
Ensures that voice-enabled technologies are accessible and reliable for all users, including those with speech impairments, promoting inclusivity and expanding market reach.