
NaturalVoices: A Large-Scale, Spontaneous and Emotional Podcast Dataset for Voice Conversion


📄 Abstract

Everyday speech conveys far more than words; it reflects who we are, how we feel, and the circumstances surrounding our interactions. Yet, most existing speech datasets are acted, limited in scale, and fail to capture the expressive richness of real-life communication. With the rise of large neural networks, several large-scale speech corpora have emerged and been widely adopted across various speech processing tasks. However, the field of voice conversion (VC) still lacks large-scale, expressive, and real-life speech resources suitable for modeling natural prosody and emotion. To fill this gap, we release NaturalVoices (NV), the first large-scale spontaneous podcast dataset specifically designed for emotion-aware voice conversion. It comprises 5,049 hours of spontaneous podcast recordings with automatic annotations for emotion (categorical and attribute-based), speech quality, transcripts, speaker identity, and sound events. The dataset captures expressive emotional variation across thousands of speakers, diverse topics, and natural speaking styles. We also provide an open-source pipeline with modular annotation tools and flexible filtering, enabling researchers to construct customized subsets for a wide range of VC tasks. Experiments demonstrate that NaturalVoices supports the development of robust and generalizable VC models capable of producing natural, expressive speech, while revealing limitations of current architectures when applied to large-scale spontaneous data. These results suggest that NaturalVoices is both a valuable resource and a challenging benchmark for advancing the field of voice conversion. The dataset is available at: https://huggingface.co/JHU-SmileLab

Authors (7)

Zongyang Du

Shreeram Suresh Chandra

Ismail Rasim Ulgen

Aurosweta Mahapatra

Ali N. Salman

Carlos Busso

+1 more

Submitted

October 31, 2025

arXiv Category

eess.AS


Key Contributions

Introduces NaturalVoices (NV), the first large-scale spontaneous podcast dataset (5,049 hours) specifically designed for emotion-aware voice conversion. This dataset fills a critical gap by providing real-life expressive speech with rich annotations, enabling the development of more natural and emotionally nuanced synthetic voices.

Business Value

Enables the creation of more engaging and human-like synthetic voices for applications like virtual assistants, audiobooks, and personalized communication tools, improving user experience and accessibility.

Paper Metadata

Innovation Type

Dataset Creation

Deployment Feasibility

High, as the dataset facilitates research and development of new voice conversion models.

Limitations Addressed

Existing speech datasets are often acted, limited in scale, and lack expressive richness.
Lack of resources for modeling natural prosody and emotion in voice conversion.

Technical Tags

voice conversion, spontaneous speech, emotional speech, podcast dataset, large-scale dataset, prosody modeling, emotion recognition, speech processing, naturalness, expressiveness

Research Topics

Speech Processing, Natural Language Processing, Artificial Intelligence, Human-Computer Interaction, Affective Computing

Applications & Tasks

Speech Synthesis, Voice Acting, Virtual Assistants, Gaming, Accessibility
Lack of large-scale, expressive, real-life speech datasets; Modeling natural prosody and emotion in voice conversion; Creating more natural and emotionally resonant synthetic voices
Voice conversion, Speech synthesis, Emotion-aware speech generation

Datasets & Benchmarks

Datasets

NaturalVoices (NV)

speech quality, emotion annotation, transcripts, speaker identity, sound events
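
The abstract mentions flexible filtering over these annotations to build customized subsets. The sketch below illustrates that idea only; it is not the authors' pipeline. The `Utterance` record, its field names, the quality scale, and the sample rows are all hypothetical and do not reflect the dataset's real schema.

```python
from dataclasses import dataclass, field


@dataclass
class Utterance:
    # Hypothetical record: every field name here is an assumption,
    # not the actual NaturalVoices annotation schema.
    utt_id: str
    speaker_id: str
    emotion: str          # categorical emotion label
    quality: float        # made-up speech-quality score, higher is better
    duration_s: float     # segment length in seconds
    sound_events: list = field(default_factory=list)


def filter_subset(utts, emotions, min_quality, banned_events=()):
    """Keep utterances with a wanted emotion, enough quality, and no banned sound event."""
    wanted = set(emotions)
    banned = set(banned_events)
    return [
        u for u in utts
        if u.emotion in wanted
        and u.quality >= min_quality
        and not banned.intersection(u.sound_events)
    ]


def total_hours(utts):
    """Total duration of a subset, in hours."""
    return sum(u.duration_s for u in utts) / 3600.0


# Toy rows invented for illustration.
SAMPLE = [
    Utterance("u1", "spk1", "happy", 4.2, 1800.0),
    Utterance("u2", "spk1", "angry", 3.9, 900.0, ["music"]),
    Utterance("u3", "spk2", "happy", 2.1, 600.0),
    Utterance("u4", "spk2", "sad", 4.5, 1800.0),
]

subset = filter_subset(SAMPLE, {"happy", "sad"}, min_quality=3.5, banned_events=["music"])
print([u.utt_id for u in subset], total_hours(subset))
```

In practice the thresholds and the set of excluded sound events would depend on the target VC task; the real annotation fields should be taken from the released dataset documentation.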

Related Fields

Speech Technology, Natural Language Processing, Machine Learning, Affective Computing, Computational Linguistics

Keywords

voice conversion, spontaneous speech, emotional speech, podcast dataset, large-scale dataset, prosody, emotion, speech synthesis, naturalness, expressiveness, speech processing, audio dataset

Academic Context

#Speech Processing, #Natural Language Processing, #Artificial Intelligence, #Human-Computer Interaction, #Affective Computing

Commercial Potential

Potential Products

Advanced voice cloning software, Emotionally expressive virtual assistants, Personalized audio content generation tools

Target Industries

Media and Entertainment, Technology, Gaming, Customer Service, Accessibility

Use Case Examples

Creating synthetic voices that convey genuine emotion for characters in games or animations.
Developing virtual assistants that can respond with appropriate emotional tone.
Generating personalized audio content for users with specific emotional needs.

Competitive Edge

Provides a unique and valuable resource that enables advancements beyond current capabilities limited by acted or less expressive speech datasets.

Market Opportunity

Growing market for AI-powered speech synthesis and voice technologies.

Revenue Models

Access to the dataset for research and commercial use (potentially tiered).

Resource Requirements

Compute Needs

Moderate for dataset processing and annotation, high for training advanced voice conversion models.

Data Requirements

Requires large amounts of diverse, spontaneous, and emotionally expressive speech data.

Deployment Constraints

The quality of voice conversion depends heavily on the underlying models trained on this dataset.

Scalability

The dataset itself is large-scale, enabling the training of scalable voice conversion models.

Production Readiness

Maturity Level

Dataset Release

Time to Market

1-3 years for significant impact on voice conversion technologies.

Patent Potential

Low, primarily related to the dataset itself.
