
LibriConvo: Simulating Conversations from Read Literature for ASR and Diarization

Abstract

We introduce LibriConvo, a simulated multi-speaker conversational dataset based on speaker-aware conversation simulation (SASC), designed to support training and evaluation of speaker diarization and automatic speech recognition (ASR) systems. Unlike prior resources that mostly rely on semantically disconnected utterances and implausible temporal gaps, LibriConvo ensures semantic coherence and realistic conversational timing. Our pipeline leverages CallHome with external VAD for reliable boundaries, applies compression to reduce unnaturally long silences, and organizes LibriTTS utterances by book to maintain contextual consistency. Acoustic realism is enhanced via a novel room impulse response selection procedure that ranks speaker-microphone configurations by spatial plausibility, balancing realism and diversity. The dataset comprises 240.1 hours across 1,496 dialogues with 830 unique speakers, split in a speaker-disjoint manner for robust evaluation. Baselines show that the Sortformer model outperforms the pyannote pipeline in diarization, while a fine-tuned Fast Conformer-CTC XLarge with Serialized Output Training achieves 7.29% WER for ASR, surpassing zero-shot Whisper-large-v3. LibriConvo provides a valuable resource for advancing multi-speaker speech processing research with realistic conversational dynamics and controlled experimental conditions.
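For readers unfamiliar with the WER figure quoted in the abstract, the following is a minimal sketch of the standard word error rate: the word-level edit distance between a reference and a hypothesis transcript, divided by the number of reference words. The function name `word_error_rate` is hypothetical, and the sketch ignores text normalization and any speaker-aware scoring the authors may have applied to serialized multi-speaker output.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a WER of 7.29% corresponds to roughly 7 word-level errors (substitutions, deletions, or insertions) per 100 reference words.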
Authors (2)
Máté Gedeon
Péter Mihajlik
Submitted
October 27, 2025
arXiv Category
eess.AS

Key Contributions

Introduces LibriConvo, a simulated multi-speaker conversational dataset designed for training and evaluating speaker diarization and ASR systems. It maintains semantic coherence by organizing LibriTTS utterances by book, keeps conversational timing realistic by compressing unnaturally long silences, and improves acoustic realism with a room impulse response (RIR) selection procedure that ranks speaker-microphone configurations by spatial plausibility.
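To illustrate what reducing unnaturally long silences can look like in practice, here is a minimal sketch that caps every gap between consecutive speech segments at a fixed maximum. This is an assumption made for illustration, not the authors' procedure: the function `compress_silences`, the `max_gap` parameter, and the hard-cap rule are all hypothetical, and the summary does not specify the actual compression used.

```python
from typing import List, Optional, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds


def compress_silences(segments: List[Segment], max_gap: float = 1.0) -> List[Segment]:
    """Shift segments earlier so that no silence between them exceeds max_gap.

    Hypothetical hard-cap rule: any gap longer than max_gap is shortened to
    exactly max_gap. Shorter gaps and overlaps are left unchanged.
    """
    compressed: List[Segment] = []
    shift = 0.0                        # total silence removed so far
    prev_end: Optional[float] = None   # latest segment end on the original timeline
    for start, end in sorted(segments):
        if prev_end is not None:
            gap = start - prev_end
            if gap > max_gap:
                shift += gap - max_gap
        compressed.append((start - shift, end - shift))
        prev_end = end if prev_end is None else max(prev_end, end)
    return compressed
```

For example, with `max_gap=1.0` the segments `[(0.0, 2.0), (5.0, 6.0), (6.5, 8.0)]` become `[(0.0, 2.0), (3.0, 4.0), (4.5, 6.0)]`: the 3-second pause shrinks to 1 second, while the 0.5-second pause is preserved.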

Business Value

Provides a valuable resource for developing more robust and accurate speech technologies, enabling better voice assistants, transcription services, and call center analytics.