
LibriConvo: Simulating Conversations from Read Literature for ASR and Diarization


📄 Abstract

We introduce LibriConvo, a simulated multi-speaker conversational dataset based on speaker-aware conversation simulation (SASC), designed to support training and evaluation of speaker diarization and automatic speech recognition (ASR) systems. Unlike prior resources that mostly rely on semantically disconnected utterances and implausible temporal gaps, LibriConvo ensures semantic coherence and realistic conversational timing. Our pipeline leverages CallHome with external VAD for reliable boundaries, applies compression to reduce unnaturally long silences, and organizes LibriTTS utterances by book to maintain contextual consistency. Acoustic realism is enhanced via a novel room impulse response selection procedure that ranks speaker-microphone configurations by spatial plausibility, balancing realism and diversity. The dataset comprises 240.1 hours across 1,496 dialogues with 830 unique speakers, split in a speaker-disjoint manner for robust evaluation. Baselines show that the Sortformer model outperforms the pyannote pipeline in diarization, while a fine-tuned Fast Conformer-CTC XLarge with Serialized Output Training achieves 7.29% WER for ASR, surpassing zero-shot Whisper-large-v3. LibriConvo provides a valuable resource for advancing multi-speaker speech processing research with realistic conversational dynamics and controlled experimental conditions.
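The abstract does not say how the speaker-disjoint split is built. As a rough illustration only, the sketch below shows one way such a split could be made for conversational data: speakers who share a dialogue are tied together with union-find, and whole groups are assigned to one side. The function name, the `test_fraction` parameter, and the smallest-groups-first heuristic are all assumptions, not the authors' procedure.

```python
from collections import defaultdict


def speaker_disjoint_split(dialogues, test_fraction=0.2):
    """Split dialogues so that no speaker appears in both train and test.

    `dialogues` maps a dialogue id to a non-empty set of speaker ids.
    (Hypothetical helper, not taken from the paper.)
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Speakers who share a dialogue must end up on the same side.
    for speakers in dialogues.values():
        ordered = sorted(speakers)
        for other in ordered[1:]:
            union(ordered[0], other)

    # Group dialogues by the connected component of their speakers.
    groups = defaultdict(list)
    for dialogue_id, speakers in dialogues.items():
        groups[find(min(speakers))].append(dialogue_id)

    # Fill the test set with the smallest groups first, up to the target size.
    target = test_fraction * len(dialogues)
    train, test = [], []
    for members in sorted(groups.values(), key=lambda m: (len(m), m)):
        if len(test) + len(members) <= target:
            test.extend(members)
        else:
            train.extend(members)
    return train, test
```

If speakers are paired densely, the connected components can grow very large, so a real pipeline would more plausibly partition the speakers first and only then simulate dialogues within each partition.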

Authors (2)

Máté Gedeon

Péter Mihajlik

Submitted

October 27, 2025

arXiv Category

eess.AS


Key Contributions

Introduces LibriConvo, a simulated multi-speaker conversational dataset designed for training and evaluating speaker diarization and ASR systems. It ensures semantic coherence and realistic conversational timing by leveraging existing corpora and employing novel techniques for acoustic realism, such as RIR selection.

Business Value

Provides a valuable resource for developing more robust and accurate speech technologies, enabling better voice assistants, transcription services, and call center analytics.

Paper Metadata

Innovation Type

Dataset Creation

Deployment Feasibility

High, as it's a dataset that enables better model development.

Limitations Addressed

Prior resources often lacked semantic coherence, had implausible temporal gaps, and limited acoustic realism for multi-speaker conversations.

Performance Gains

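In the reported baselines, the Sortformer model outperforms the pyannote pipeline in diarization, and a fine-tuned Fast Conformer-CTC XLarge with Serialized Output Training reaches 7.29% WER, surpassing zero-shot Whisper-large-v3.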

Technical Tags

Conversational Dataset, Speaker Diarization, Automatic Speech Recognition, ASR, Multi-speaker, Semantic Coherence, Acoustic Realism, Room Impulse Response, Speaker Disjoint, Dataset Simulation

Research Topics

Speech Processing, Dataset Creation, Speaker Diarization, Automatic Speech Recognition, Audio Synthesis

Methods & Architectures

Speaker-aware Conversation Simulation (SASC), VAD (Voice Activity Detection), Room Impulse Response (RIR) selection, Dataset Augmentation (compression)
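The abstract mentions compressing unnaturally long silences but does not give the rule used. A minimal sketch, assuming the simplest possible rule (cap every pause at a fixed maximum and shift later segments earlier), might look like this. The function name and the `max_gap` value of 1.0 second are hypothetical.

```python
def compress_silences(segments, max_gap=1.0):
    """Shift segments earlier so no pause between them exceeds `max_gap` seconds.

    `segments` is a list of (start, end) times in seconds, sorted by start.
    Overlapping segments are left as they are. (Illustrative assumption,
    not the paper's compression method.)
    """
    compressed = []
    shift = 0.0      # total silence removed so far
    prev_end = None  # latest end time seen so far, on the original timeline
    for start, end in segments:
        if prev_end is not None:
            gap = start - prev_end
            if gap > max_gap:
                shift += gap - max_gap  # trim the pause down to max_gap
        compressed.append((start - shift, end - shift))
        prev_end = end if prev_end is None else max(prev_end, end)
    return compressed
```

For example, segments at (0.0, 2.0), (2.5, 4.0), and (9.0, 10.0) keep the first 0.5 s pause, while the 5 s pause before the last segment is cut to 1 s.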

Applications & Tasks

Speech Technology, Human-Computer Interaction, Audio Analysis

Lack of realistic conversational datasets, Semantic incoherence in simulated conversations, Unrealistic temporal gaps, Improving ASR and Diarization performance

Training and evaluating speaker diarization systems, Training and evaluating ASR systems, Simulating realistic multi-speaker conversations

Datasets & Benchmarks

Datasets

LibriConvo, CallHome, LibriTTS

Benchmarks

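The abstract reports one specific figure: 7.29% WER for a fine-tuned Fast Conformer-CTC XLarge with Serialized Output Training, which surpasses zero-shot Whisper-large-v3. For diarization it states only that Sortformer outperforms the pyannote pipeline, without giving numbers.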

ASR performance, Diarization performance, Semantic Coherence, Acoustic Realism
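For readers unfamiliar with the WER figure quoted for the ASR baseline, the standard definition is the number of word substitutions, deletions, and insertions divided by the number of reference words. The sketch below computes it with a word-level edit distance. It is a plain single-stream WER; how the paper scores multi-speaker Serialized Output Training transcripts (for example, the handling of speaker-change tokens) is not stated in the abstract, and the function name is illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # consumed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

With the reference "the cat sat on the mat" and the hypothesis "the cat sit on mat", there is one substitution and one deletion over six reference words, so the WER is 2/6.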

Related Fields

Natural Language Processing, Audio Signal Processing, Machine Learning

Keywords

Conversational Dataset, Speaker Diarization, Automatic Speech Recognition, ASR, Multi-speaker, Speech Simulation, Dataset, LibriConvo, Acoustic Realism, Semantic Coherence, CallHome, LibriTTS, Audio

Academic Context

#Speech Processing #Dataset Creation #Speaker Diarization #Automatic Speech Recognition #Audio Synthesis

Commercial Potential

Potential Products

More accurate ASR systems, Improved speaker diarization tools, Realistic conversational AI agents

Target Industries

Technology, Telecommunications, Media, Customer Service

Use Case Examples

Developing better meeting transcription software, Enhancing voice assistants for multi-user environments, Analyzing call center conversations

Competitive Edge

Offers a more realistic and semantically coherent alternative to existing simulated conversational datasets for ASR and diarization.

Market Opportunity

Large market for speech technology and audio processing tools.

Revenue Models

Dataset availability can drive adoption of related research and tools.

Resource Requirements

Compute Needs

Moderate for dataset generation, high for training models on it.

Data Requirements

Requires access to the CallHome and LibriTTS corpora and an external VAD tool.

Deployment Constraints

Dataset size and computational resources for training.

Scalability

Dataset size is substantial (240.1 hours), enabling training of scalable models.

Production Readiness

Maturity Level

Dataset

Time to Market

1-2 years for significant impact on ASR/diarization model development.

Licensing

Likely CC-BY or similar, based on source data.
