Abstract
While Speech Large Language Models (Speech-LLMs) perform strongly in many applications, their robustness to speech disfluency is critically under-tested. Existing evaluations often rely on idealized inputs, overlooking common disfluencies, particularly those associated with conditions like Parkinson's disease. This work investigates whether current Speech-LLMs can maintain performance when interacting with users who have speech impairments. To facilitate this inquiry, we introduce VocalBench-DF, a framework for systematically evaluating disfluency across a multi-dimensional taxonomy. Our evaluation of 22 mainstream Speech-LLMs reveals substantial performance degradation, indicating that their real-world readiness is limited. Further analysis identifies phoneme-level processing and long-context modeling as the primary bottlenecks responsible for these failures. Strengthening the recognition and reasoning capabilities of individual components and pipelines can substantially improve robustness. These findings highlight the urgent need for new methods that improve disfluency handling and build truly inclusive Speech-LLMs.
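To make the evaluation setup concrete, the sketch below illustrates one plausible way a framework like this could inject taxonomy-driven disfluencies into prompts and measure the resulting accuracy drop. All function names, taxonomy dimensions, and injection rates here are illustrative assumptions, not the authors' actual VocalBench-DF implementation, which operates on speech rather than text.

```python
# Hypothetical sketch of a disfluency-robustness evaluation in the spirit
# of VocalBench-DF. The taxonomy dimensions and injection rules below are
# illustrative assumptions, not the paper's implementation.
import random
from typing import Callable

def repeat_words(words: list[str], rate: float, rng: random.Random) -> list[str]:
    """Word repetition, e.g. 'the the cat' (one disfluency dimension)."""
    out = []
    for w in words:
        out.append(w)
        if rng.random() < rate:
            out.append(w)  # duplicate the word to simulate a repetition
    return out

def insert_fillers(words: list[str], rate: float, rng: random.Random) -> list[str]:
    """Filled pauses such as 'um' / 'uh' (another dimension)."""
    out = []
    for w in words:
        if rng.random() < rate:
            out.append(rng.choice(["um", "uh"]))
        out.append(w)
    return out

# Illustrative multi-dimensional taxonomy: each dimension maps to a
# text-level perturbation mimicking a spoken disfluency.
TAXONOMY: dict[str, Callable[[list[str], float, random.Random], list[str]]] = {
    "repetition": repeat_words,
    "filler": insert_fillers,
}

def inject(text: str, kind: str, rate: float = 0.15, seed: int = 0) -> str:
    """Apply one taxonomy dimension to a clean prompt."""
    rng = random.Random(seed)
    return " ".join(TAXONOMY[kind](text.split(), rate, rng))

def degradation(model: Callable[[str], str],
                items: list[tuple[str, str]], kind: str) -> float:
    """Accuracy drop (clean minus disfluent) for one taxonomy dimension."""
    clean = sum(model(q).strip() == a for q, a in items)
    noisy = sum(model(inject(q, kind)).strip() == a for q, a in items)
    return (clean - noisy) / len(items)
```

A real harness would apply the perturbations at the audio level (e.g., via TTS resynthesis or time-domain editing) rather than on transcripts, but the core measurement is the same: the clean-versus-perturbed performance delta per taxonomy dimension.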
Authors (6)
Hongcheng Liu
Yixuan Hou
Heyang Liu
Yuhao Wang
Yanfeng Wang
Yu Wang
Submitted
October 17, 2025
Key Contributions
This paper introduces VocalBench-DF, a benchmark for evaluating the robustness of Speech-LLMs to disfluency, particularly for users with speech impairments such as those caused by Parkinson's disease. Evaluations of 22 mainstream Speech-LLMs reveal substantial performance degradation and identify phoneme-level processing and long-context modeling as key bottlenecks that limit real-world readiness.
Business Value
Ensures that voice-enabled technologies are accessible and reliable for all users, including those with speech impairments, promoting inclusivity and expanding market reach.