📄 Abstract
The evaluation of intelligibility for text-to-speech (TTS) has reached a bottleneck, as
existing assessments rely heavily on word-by-word accuracy metrics such as word
error rate (WER), which fail to capture the complexity of real-world speech or
reflect human comprehension needs. To address this, we propose Spoken-Passage
Multiple-Choice Question Answering (SP-MCQA), a novel subjective approach that
evaluates the accuracy of key information in synthesized speech, and release
SP-MCQA-Eval, an 8.76-hour news-style benchmark dataset for SP-MCQA evaluation.
Our experiments reveal that low WER does not necessarily guarantee high
key-information accuracy, exposing a gap between traditional metrics and
practical intelligibility. SP-MCQA shows that even state-of-the-art (SOTA)
models still lack robust text normalization and phonetic accuracy. This work
underscores the urgent need for higher-level, more life-like evaluation
criteria now that many systems already excel at WER yet may fall short on
real-world intelligibility.
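The gap the abstract describes can be made concrete: WER counts word-level edits against a reference transcript, so a transcription can score very well while still garbling the one word that carries the key information. Below is a minimal, self-contained sketch of the standard WER computation (Levenshtein distance over word sequences); the example strings are invented for illustration and are not from the paper.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word in a ten-word passage yields WER = 0.10, yet if that
# word is the key fact ("twenty twenty five" heard as "twenty thirty five"),
# a listener walks away with the wrong date.
ref = "the meeting was moved to october thirtieth twenty twenty five"
hyp = "the meeting was moved to october thirtieth twenty thirty five"
print(f"WER = {wer(ref, hyp):.2f}")  # low WER, wrong key information
```

This is exactly the failure mode SP-MCQA is designed to surface: the metric rewards surface-level transcription accuracy uniformly, while human comprehension depends disproportionately on a few information-bearing words.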
Authors (4)
Hitomi Jin Ling Tee
Chaoren Wang
Zijie Zhang
Zhizheng Wu
Submitted
October 30, 2025
Key Contributions
This paper introduces SP-MCQA, a novel subjective evaluation approach for TTS intelligibility that measures the accuracy of key information in synthesized speech, moving beyond traditional WER. It releases SP-MCQA-Eval, an 8.76-hour benchmark dataset, and demonstrates that low WER does not guarantee high comprehension, highlighting a critical gap in current TTS evaluation metrics.
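As a rough illustration of how a multiple-choice protocol like this could be scored (the item structure, field names, and example questions here are hypothetical, not taken from the paper): listeners answer comprehension questions after hearing a synthesized passage, and the metric is the fraction of key-information questions answered correctly.

```python
from dataclasses import dataclass

@dataclass
class MCQItem:
    """Hypothetical SP-MCQA item: one question about a synthesized passage."""
    question: str
    choices: list[str]
    answer: int  # index of the correct choice

def key_info_accuracy(items: list[MCQItem], responses: list[int]) -> float:
    """Fraction of questions a listener answered correctly after hearing the audio."""
    correct = sum(1 for item, r in zip(items, responses) if r == item.answer)
    return correct / len(items)

items = [
    MCQItem("When was the meeting moved to?",
            ["October 13", "October 30", "November 3"], answer=1),
    MCQItem("Who announced the change?",
            ["The mayor", "The council", "The governor"], answer=0),
]
print(key_info_accuracy(items, responses=[1, 2]))  # 0.5: one key fact missed
```

Unlike WER, a score like this degrades only when the listener's grasp of the content degrades, which is why the two metrics can diverge on the same audio.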
Business Value
Enables the development of more natural and understandable TTS systems, improving user experience for voice assistants, audio content, and accessibility tools.