📄 Abstract
The evaluation of intelligibility for text-to-speech (TTS) has reached a bottleneck, as
existing assessments rely heavily on word-by-word accuracy metrics such as word
error rate (WER), which fail to capture the complexity of real-world speech or
reflect human comprehension needs. To address this, we propose Spoken-Passage
Multiple-Choice Question Answering (SP-MCQA), a novel subjective approach that
evaluates the accuracy of key information in synthesized speech, and release
SP-MCQA-Eval, an 8.76-hour news-style benchmark dataset for SP-MCQA evaluation.
Our experiments reveal that low WER does not necessarily guarantee high
key-information accuracy, exposing a gap between traditional metrics and
practical intelligibility. SP-MCQA shows that even state-of-the-art (SOTA)
models still lack robust text normalization and phonetic accuracy. This work
underscores the urgent need for higher-level, more life-like evaluation
criteria now that many systems already excel at WER yet may fall short on
real-world intelligibility.
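The gap the abstract describes can be made concrete: WER counts word-level edits against a reference transcript, so a transcription can score very well while still garbling the one word that carries the key information. Below is a minimal, self-contained sketch of the standard WER computation (Levenshtein distance over word sequences); the example strings are invented for illustration and are not from the paper.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word in a ten-word passage yields WER = 0.10, yet if that
# word is the key fact ("twenty twenty five" heard as "twenty thirty five"),
# a listener walks away with the wrong date.
ref = "the meeting was moved to october thirtieth twenty twenty five"
hyp = "the meeting was moved to october thirtieth twenty thirty five"
print(f"WER = {wer(ref, hyp):.2f}")  # low WER, wrong key information
```

This is exactly the failure mode SP-MCQA is designed to surface: the metric rewards surface-level transcription accuracy uniformly, while human comprehension depends disproportionately on a few information-bearing words.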
Authors (4)
Hitomi Jin Ling Tee
Chaoren Wang
Zijie Zhang
Zhizheng Wu
Submitted
October 30, 2025
Key Contributions
This paper introduces SP-MCQA, a novel subjective evaluation approach for TTS intelligibility that measures the accuracy of key information in synthesized speech, moving beyond traditional WER. It releases SP-MCQA-Eval, an 8.76-hour benchmark dataset, and demonstrates that low WER does not guarantee high comprehension, highlighting a critical gap in current TTS evaluation metrics.
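As a rough illustration of how a multiple-choice protocol like this could be scored (the item structure, field names, and example questions here are hypothetical, not taken from the paper): listeners answer comprehension questions after hearing a synthesized passage, and the metric is the fraction of key-information questions answered correctly.

```python
from dataclasses import dataclass

@dataclass
class MCQItem:
    """Hypothetical SP-MCQA item: one question about a synthesized passage."""
    question: str
    choices: list[str]
    answer: int  # index of the correct choice

def key_info_accuracy(items: list[MCQItem], responses: list[int]) -> float:
    """Fraction of questions a listener answered correctly after hearing the audio."""
    correct = sum(1 for item, r in zip(items, responses) if r == item.answer)
    return correct / len(items)

items = [
    MCQItem("When was the meeting moved to?",
            ["October 13", "October 30", "November 3"], answer=1),
    MCQItem("Who announced the change?",
            ["The mayor", "The council", "The governor"], answer=0),
]
print(key_info_accuracy(items, responses=[1, 2]))  # 0.5: one key fact missed
```

Unlike WER, a score like this degrades only when the listener's grasp of the content degrades, which is why the two metrics can diverge on the same audio.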
Business Value
Enables the development of more natural and understandable TTS systems, improving user experience for voice assistants, audio content, and accessibility tools.