📄 Abstract
Self-supervised speech models have achieved remarkable success on
content-driven tasks, yet they remain limited in capturing
speaker-discriminative features critical for verification, diarization, and
profiling applications. We introduce DELULU, a speaker-aware self-supervised
foundational model that addresses this limitation by integrating external
supervision into the pseudo-label generation process. DELULU leverages
frame-level embeddings from ReDimNet, a state-of-the-art speaker verification
model, to guide the k-means clustering step during pre-training, introducing a
strong speaker-discriminative inductive bias that aligns representation
learning with speaker identity. The model is trained using a dual objective
that combines masked prediction and denoising, further enhancing robustness and
generalization. DELULU significantly outperforms prior self-supervised learning
(SSL) models across a range of speaker-centric tasks, achieving up to 62%
relative improvement in equal error rate (EER) for speaker verification and
consistent gains on zero-shot profiling tasks such as gender, age, accent, and
speaker counting. Our findings demonstrate that DELULU is a strong universal
encoder for speaker-aware speech processing, enabling superior performance even
without task-specific fine-tuning.
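To make the headline metric concrete, here is a minimal sketch of how an equal error rate and a relative improvement over a baseline can be computed. The function names and the toy scores are illustrative assumptions, not the paper's evaluation code, and the threshold sweep is a simple approximation rather than an interpolated EER.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: sweep thresholds over the observed scores and return
    the point where false-accept and false-reject rates are closest."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # A trial is accepted when its score is >= the threshold.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        frr = sum(s < t for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_improvement(baseline_eer, new_eer):
    """Relative EER reduction, expressed as a fraction of the baseline EER."""
    return (baseline_eer - new_eer) / baseline_eer


# Toy similarity scores (made up for illustration, not from the paper).
same_speaker = [0.9, 0.8, 0.7, 0.4]
different_speaker = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(same_speaker, different_speaker))  # 0.25
```

Under this definition, a hypothetical baseline EER of 2.0% reduced to 0.76% would correspond to a 62% relative improvement; those two percentages are invented solely to illustrate the arithmetic.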
Authors (3)
Massa Baali
Rita Singh
Bhiksha Raj
Submitted
October 20, 2025
Key Contributions
DELULU is a speaker-aware self-supervised speech foundational model that improves speaker-discriminative representations. It integrates frame-level embeddings from an external speaker verification model (ReDimNet) into the k-means clustering step that generates pre-training pseudo-labels, which introduces a strong speaker-identity inductive bias. Trained with a dual objective (masked prediction and denoising), it outperforms prior SSL models on speaker-centric tasks, with up to 62% relative improvement in EER for speaker verification.
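The pseudo-label step described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the frame embeddings are toy 2-D vectors standing in for ReDimNet outputs, the function names are hypothetical, and a deterministic farthest-point initialization is assumed in place of whatever initialization the paper uses.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(frame, centroids):
    return min(range(len(centroids)),
               key=lambda k: squared_distance(frame, centroids[k]))


def kmeans_pseudo_labels(frames, num_clusters, num_iters=20):
    """Cluster frame-level embeddings with Lloyd's k-means and return one
    discrete cluster id per frame, plus the final centroids."""
    # Assumed initialization: start from the first frame, then repeatedly
    # add the frame farthest from the centroids chosen so far.
    centroids = [list(frames[0])]
    while len(centroids) < num_clusters:
        farthest = max(
            frames,
            key=lambda f: min(squared_distance(f, c) for c in centroids),
        )
        centroids.append(list(farthest))

    for _ in range(num_iters):
        labels = [nearest_centroid(f, centroids) for f in frames]
        for k in range(num_clusters):
            members = [f for f, lab in zip(frames, labels) if lab == k]
            if members:  # keep the previous centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]

    labels = [nearest_centroid(f, centroids) for f in frames]
    return labels, centroids


# Toy "speaker embedding" frames: two well-separated groups.
frames = [[0.0, 0.1], [0.1, 0.0], [0.0, 0.0],
          [5.0, 5.1], [5.1, 5.0], [5.0, 5.0]]
labels, _ = kmeans_pseudo_labels(frames, num_clusters=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

In a masked-prediction setup, cluster ids like these would serve as the targets the model predicts for masked frames. Clustering speaker-model embeddings rather than generic acoustic features is what biases those targets toward speaker identity.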
Business Value
Enables more accurate and robust speaker identification and verification systems, which matter for security applications, personalized user experiences, and efficient call center operations. It can also improve downstream speech tasks that benefit from speaker information, such as diarization and speaker profiling.