Lost in Phonation: Voice Quality Variation as an Evaluation Dimension for Speech Foundation Models

📄 Abstract

Recent advances in speech foundation models (SFMs) have enabled the direct processing of spoken language from raw audio, bypassing intermediate textual representations. This capability allows SFMs to be exposed to, and potentially respond to, rich paralinguistic variations embedded in the input speech signal. One under-explored dimension of paralinguistic variation is voice quality, encompassing phonation types such as creaky and breathy voice. These phonation types are known to influence how listeners infer affective state, stance and social meaning in speech. Existing benchmarks for speech understanding largely rely on multiple-choice question answering (MCQA) formats, which are prone to failure and therefore unreliable in capturing the nuanced ways paralinguistic features influence model behaviour. In this paper, we probe SFMs through open-ended generation tasks and speech emotion recognition, evaluating whether model behaviours are consistent across different phonation inputs. We introduce a new parallel dataset featuring synthesized modifications to voice quality, designed to evaluate SFM responses to creaky and breathy voice. Our work provides the first examination of SFM sensitivity to these particular non-lexical aspects of speech perception.
Authors (4)
Harm Lameris
Shree Harsha Bokkahalli Satish
Joakim Gustafson
Éva Székely
Submitted
October 29, 2025
arXiv Category
eess.AS
arXiv PDF

Key Contributions

This paper proposes voice quality variation (phonation types such as creaky and breathy voice) as an evaluation dimension for speech foundation models (SFMs). It argues that existing MCQA-based benchmarks are unreliable for capturing paralinguistic effects and instead probes SFMs with open-ended generation and speech emotion recognition tasks, using a new parallel dataset of synthesized phonation modifications to assess whether model behavior is consistent across phonation inputs. A sketch of this kind of consistency probe follows below.
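As a concrete illustration of the consistency probe described above, the minimal sketch below runs parallel versions of one utterance, differing only in synthesized phonation, through an off-the-shelf speech emotion recognition model and compares the top predicted labels. The checkpoint name, file paths, and use of the Hugging Face audio-classification pipeline are illustrative assumptions, not part of the paper's released materials.

```python
# Minimal sketch of a phonation-consistency probe: feed parallel versions of
# the same utterance (modal, breathy, creaky) to a speech emotion recognition
# model and check whether the top prediction changes with voice quality.
# NOTE: the checkpoint and file names below are assumptions for illustration.
from transformers import pipeline

# Any public speech emotion recognition checkpoint can stand in here.
ser = pipeline("audio-classification", model="superb/wav2vec2-base-superb-er")

# Hypothetical parallel stimuli differing only in synthesized phonation.
variants = {
    "modal": "utt_001_modal.wav",
    "breathy": "utt_001_breathy.wav",
    "creaky": "utt_001_creaky.wav",
}

top_labels = {}
for phonation, path in variants.items():
    scores = ser(path, top_k=4)          # list of {"label": ..., "score": ...}
    top_labels[phonation] = scores[0]["label"]
    print(phonation, scores)

# A phonation-sensitive model may flip its predicted emotion between variants;
# identical labels across all variants indicate consistency on this utterance.
print("Consistent top label:", len(set(top_labels.values())) == 1)
```

In the paper's setting, this comparison is carried out over the full parallel dataset and extended to open-ended generation, where differences in generated responses across phonation inputs are examined rather than a single classification label.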

Business Value

Leads to the development of more robust and human-like speech technologies that can better understand and respond to the full spectrum of human vocal expression, improving user experience in voice assistants and other applications.