Abstract
Everyday speech conveys far more than words: it reflects who we are, how we
feel, and the circumstances surrounding our interactions. Yet, most existing
speech datasets are acted, limited in scale, and fail to capture the expressive
richness of real-life communication. With the rise of large neural networks,
several large-scale speech corpora have emerged and been widely adopted across
various speech processing tasks. However, the field of voice conversion (VC)
still lacks large-scale, expressive, and real-life speech resources suitable
for modeling natural prosody and emotion. To fill this gap, we release
NaturalVoices (NV), the first large-scale spontaneous podcast dataset
specifically designed for emotion-aware voice conversion. It comprises 5,049
hours of spontaneous podcast recordings with automatic annotations for emotion
(categorical and attribute-based), speech quality, transcripts, speaker
identity, and sound events. The dataset captures expressive emotional variation
across thousands of speakers, diverse topics, and natural speaking styles. We
also provide an open-source pipeline with modular annotation tools and flexible
filtering, enabling researchers to construct customized subsets for a wide
range of VC tasks. Experiments demonstrate that NaturalVoices supports the
development of robust and generalizable VC models capable of producing natural,
expressive speech, while revealing limitations of current architectures when
applied to large-scale spontaneous data. These results suggest that
NaturalVoices is both a valuable resource and a challenging benchmark for
advancing the field of voice conversion. The dataset is available at:
https://huggingface.co/JHU-SmileLab
Authors (7)
Zongyang Du
Shreeram Suresh Chandra
Ismail Rasim Ulgen
Aurosweta Mahapatra
Ali N. Salman
Carlos Busso
+1 more
Submitted
October 31, 2025
Key Contributions
Introduces NaturalVoices (NV), the first large-scale spontaneous podcast dataset (5,049 hours) specifically designed for emotion-aware voice conversion. This dataset fills a critical gap by providing real-life expressive speech with automatic annotations for emotion, speech quality, transcripts, speaker identity, and sound events, enabling the development of more natural and emotionally nuanced synthetic voices.
Business Value
Enables the creation of more engaging and human-like synthetic voices for applications like virtual assistants, audiobooks, and personalized communication tools, improving user experience and accessibility.