DELULU: Discriminative Embedding Learning Using Latent Units for Speaker-Aware Self-Supervised Speech Foundational Model

Abstract

Self-supervised speech models have achieved remarkable success on content-driven tasks, yet they remain limited in capturing speaker-discriminative features critical for verification, diarization, and profiling applications. We introduce DELULU, a speaker-aware self-supervised foundational model that addresses this limitation by integrating external supervision into the pseudo-label generation process. DELULU leverages frame-level embeddings from ReDimNet, a state-of-the-art speaker verification model, to guide the k-means clustering step during pre-training, introducing a strong speaker-discriminative inductive bias that aligns representation learning with speaker identity. The model is trained using a dual objective that combines masked prediction and denoising, further enhancing robustness and generalization. DELULU significantly outperforms prior self-supervised learning (SSL) models across a range of speaker-centric tasks, achieving up to 62% relative improvement in equal error rate (EER) for speaker verification and consistent gains on zero-shot profiling tasks such as gender, age, accent, and speaker counting. Our findings demonstrate that DELULU is a strong universal encoder for speaker-aware speech processing, enabling superior performance even without task-specific fine-tuning.
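The headline figure in the abstract is a relative improvement in equal error rate (EER). As a rough illustration of what that metric means, the sketch below approximates EER by sweeping thresholds over verification scores and then computes a relative reduction between two EER values. The function names, the toy scores, and the 5.0% and 1.9% EER values are assumptions chosen only to show the arithmetic; none of them come from the paper.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER by sweeping every observed score as a threshold."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, best_eer = None, None
    for t in thresholds:
        # False rejection: a same-speaker trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


def relative_improvement(baseline_eer, new_eer):
    """Relative EER reduction, expressed as a fraction of the baseline."""
    if baseline_eer <= 0:
        raise ValueError("baseline_eer must be positive")
    return (baseline_eer - new_eer) / baseline_eer


# Toy similarity scores (hypothetical, not from the paper).
same_speaker = [0.9, 0.8, 0.7, 0.4]
different_speaker = [0.6, 0.5, 0.3, 0.2]
print(equal_error_rate(same_speaker, different_speaker))  # 0.25

# Hypothetical EER values (in percent) that would give a 62% relative gain.
print(f"{relative_improvement(5.0, 1.9):.0%}")  # 62%
```

A relative improvement of this kind says how much of the baseline error was removed, so the same percentage can correspond to very different absolute EER values depending on the baseline.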
Authors (3)
Massa Baali
Rita Singh
Bhiksha Raj
Submitted
October 20, 2025
arXiv Category
cs.SD

Key Contributions

DELULU is a speaker-aware self-supervised speech foundational model designed to improve speaker-discriminative features. It integrates external frame-level speaker embeddings (from ReDimNet) into the k-means clustering step that generates pre-training pseudo-labels, introducing a strong speaker-identity inductive bias. Training uses a dual objective that combines masked prediction and denoising. According to the abstract, the model outperforms prior self-supervised models on speaker-centric tasks, with up to 62% relative improvement in equal error rate (EER) for speaker verification.
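The pseudo-label idea can be sketched in a few lines: cluster frame-level embeddings with k-means and treat each frame's cluster index as its training target. The sketch below is a minimal pure-Python illustration, not the authors' implementation. The toy 2-D vectors stand in for ReDimNet frame embeddings, and the function names, the farthest-point initialization, and the iteration count are all assumptions made so the example is deterministic and self-contained.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(frames, k):
    """Deterministic init (an assumption): start at the first frame, then
    repeatedly add the frame farthest from all centroids chosen so far."""
    centroids = [list(frames[0])]
    while len(centroids) < k:
        farthest = max(
            frames,
            key=lambda f: min(squared_distance(f, c) for c in centroids),
        )
        centroids.append(list(farthest))
    return centroids


def assign(frames, centroids):
    """Index of the nearest centroid for every frame."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(f, centroids[c]))
        for f in frames
    ]


def kmeans_pseudo_labels(frames, k, iterations=10):
    """Run plain k-means and return (cluster index per frame, centroids)."""
    centroids = farthest_point_init(frames, k)
    for _ in range(iterations):
        labels = assign(frames, centroids)
        for c in range(k):
            members = [f for f, label in zip(frames, labels) if label == c]
            if members:  # leave an empty cluster's centroid unchanged
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assign(frames, centroids), centroids


# Toy stand-ins for frame-level speaker embeddings: three frames from one
# hypothetical speaker near the origin, three from another near (5, 5).
frames = [
    [0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
    [5.0, 5.1], [5.2, 4.9], [4.9, 5.0],
]
labels, _ = kmeans_pseudo_labels(frames, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Because the clustered vectors come from a speaker verification model in DELULU's setup, frames from the same speaker tend to share cluster indices, which is the speaker-identity bias the summary describes. In a masked-prediction setup these indices would then serve as the targets the encoder learns to predict for masked frames.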

Business Value

DELULU could enable more accurate and robust speaker identification and verification systems, which matter for security applications, personalized user experiences, and call center operations. It may also improve downstream speech tasks that benefit from speaker information.