DELULU: Discriminative Embedding Learning Using Latent Units for Speaker-Aware Self-Supervised Speech Foundational Model

📄 Abstract

Self-supervised speech models have achieved remarkable success on content-driven tasks, yet they remain limited in capturing speaker-discriminative features critical for verification, diarization, and profiling applications. We introduce DELULU, a speaker-aware self-supervised foundational model that addresses this limitation by integrating external supervision into the pseudo-label generation process. DELULU leverages frame-level embeddings from ReDimNet, a state-of-the-art speaker verification model, to guide the k-means clustering step during pre-training, introducing a strong speaker-discriminative inductive bias that aligns representation learning with speaker identity. The model is trained using a dual objective that combines masked prediction and denoising, further enhancing robustness and generalization. DELULU significantly outperforms prior self-supervised learning (SSL) models across a range of speaker-centric tasks, achieving up to 62% relative improvement in equal error rate (EER) for speaker verification and consistent gains on zero-shot profiling tasks such as gender, age, accent, and speaker counting. Our findings demonstrate that DELULU is a strong universal encoder for speaker-aware speech processing, enabling superior performance even without task-specific fine-tuning.
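
The pseudo-label step described in the abstract can be illustrated with a small sketch. It assumes each frame already has a speaker embedding (here, made-up 2-D vectors standing in for ReDimNet frame-level outputs) and runs a plain k-means to give every frame a cluster ID, which would then act as the prediction target during pre-training. The function names, the toy data, and the cluster count are hypothetical and are not taken from the paper.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_centroid(frame, centroids):
    """Index of the centroid closest to the given frame embedding."""
    return min(range(len(centroids)),
               key=lambda c: squared_distance(frame, centroids[c]))

def kmeans_pseudo_labels(frames, k, iters=20, seed=0):
    """Cluster frame embeddings and return (labels, centroids).

    The labels play the role of pseudo-labels: one cluster ID per frame.
    """
    rng = random.Random(seed)
    # Initialise centroids from k distinct frames (an illustrative choice).
    centroids = [list(f) for f in rng.sample(frames, k)]
    labels = [0] * len(frames)
    for _ in range(iters):
        # Assignment step: each frame goes to its nearest centroid.
        labels = [nearest_centroid(f, centroids) for f in frames]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [f for f, lab in zip(frames, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Toy stand-ins for frame-level speaker embeddings from two speakers.
toy_frames = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
labels, centroids = kmeans_pseudo_labels(toy_frames, k=2)
```

Because the clustered vectors come from a speaker verification model rather than from acoustic features, frames from the same speaker tend to share a cluster ID, which is the speaker-discriminative bias the abstract refers to.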

Authors (3)

Massa Baali

Rita Singh

Bhiksha Raj

Submitted

October 20, 2025

arXiv Category

cs.SD

Key Contributions

DELULU is a speaker-aware self-supervised speech foundational model that strengthens speaker-discriminative features. It does so by integrating external frame-level speaker embeddings (from ReDimNet) into the k-means clustering step of pre-training, which introduces a strong speaker-identity inductive bias. The model is trained with a dual objective (masked prediction and denoising) and outperforms prior SSL models on speaker-centric tasks, with up to 62% relative improvement in EER for speaker verification.
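
One way to picture the dual objective is as a sum of two per-frame classification losses over the cluster targets. The sketch below is an assumption-heavy illustration, not the paper's formulation: it assumes the denoising term asks the model to predict the clean utterance's pseudo-labels from a noise-corrupted input, that both terms are cross-entropy losses restricted to masked frames, and that they are combined with a hypothetical weight `denoise_weight`.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy of one frame: log-sum-exp of the logits minus the target logit."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def masked_prediction_loss(frame_logits, pseudo_labels, mask):
    """Mean cross-entropy over masked frames only (0.0 if nothing is masked)."""
    losses = [cross_entropy(logits, target)
              for logits, target, is_masked in zip(frame_logits, pseudo_labels, mask)
              if is_masked]
    return sum(losses) / len(losses) if losses else 0.0

def dual_objective(clean_logits, noisy_logits, pseudo_labels, mask,
                   denoise_weight=1.0):
    """Masked-prediction loss on the clean input plus a weighted denoising loss.

    Assumption: the denoising term reuses the clean pseudo-labels as targets
    for the model's outputs on the corrupted input.
    """
    mask_loss = masked_prediction_loss(clean_logits, pseudo_labels, mask)
    denoise_loss = masked_prediction_loss(noisy_logits, pseudo_labels, mask)
    return mask_loss + denoise_weight * denoise_loss
```

In a real training loop the logits would come from the encoder's prediction head and the pseudo-labels from the clustering step; here they are plain Python lists so the arithmetic stays visible.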

Business Value

Enables more accurate and robust speaker identification and verification systems, crucial for security applications, personalized user experiences, and efficient call center operations. It can also improve the performance of downstream speech tasks that benefit from speaker information.

Paper Metadata

Innovation Type

Algorithmic Improvement

Deployment Feasibility

Moderate to High. Requires integration into speech processing pipelines. The model size and computational requirements for inference need to be considered.

Limitations Addressed

Existing SSL speech models are limited in capturing speaker-discriminative features; poor performance of SSL models on speaker verification, diarization, and profiling

Performance Gains

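Up to 62% relative improvement in equal error rate (EER) for speaker verification compared to prior SSL models, with consistent gains on zero-shot profiling tasks (gender, age, accent, and speaker counting).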

Technical Tags

self-supervised learning, speaker verification, speaker diarization, speaker profiling, foundational model, representation learning, inductive bias, masked prediction, denoising, frame-level embeddings

Research Topics

Speech Processing, Self-Supervised Learning, Speaker Recognition, Representation Learning, Foundational Models

Methods & Architectures

Self-supervised learning, Integration of external supervision (speaker embeddings), K-means clustering, Dual objective (masked prediction + denoising), Frame-level embedding extraction (ReDimNet)

DELULU, ReDimNet

Applications & Tasks

Speech Technology, Biometrics, Security

Improving speaker-discriminative features in SSL models; Enhancing performance on speaker-centric tasks; Developing general-purpose speech foundational models

Speaker verification, Speaker diarization, Speaker profiling, Speech recognition (indirectly)

Datasets & Benchmarks

Benchmarks

Up to 62% relative improvement in equal error rate (EER) for speaker verification tasks.
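
For readers unfamiliar with the metric, the sketch below shows one simple way to estimate EER from verification scores and how a relative improvement figure is derived from two EER values. It uses a basic threshold sweep rather than an interpolated ROC curve, and every score and EER value in it is invented for illustration; none of the numbers come from the paper.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate EER: sweep thresholds and return the error rate where the
    false acceptance rate (FAR) and false rejection rate (FRR) are closest."""
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    best_gap, eer = None, None
    for t in thresholds:
        # Impostor trials accepted at this threshold.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        # Genuine trials rejected at this threshold.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

def relative_improvement(baseline_eer, new_eer):
    """Fraction by which the new EER is lower than the baseline EER."""
    return (baseline_eer - new_eer) / baseline_eer

# Invented similarity scores for same-speaker and different-speaker trials.
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.1, 0.2, 0.3, 0.6]
toy_eer = equal_error_rate(genuine, impostor)

# Invented EER values (in percent) showing what a 62% relative gain looks like.
toy_gain = relative_improvement(baseline_eer=5.0, new_eer=1.9)
```

A relative improvement of 62% therefore means the new EER is 38% of the baseline EER, not that EER dropped by 62 percentage points.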

Equal Error Rate (EER), Speaker verification accuracy, Speaker diarization performance, Speaker profiling performance

Related Fields

Speech Processing, Machine Learning, Signal Processing, Biometrics, Deep Learning

Keywords

self-supervised learning, speech, speaker recognition, foundational model, representation learning, SSL, biometrics, deep learning, audio processing, speaker verification, speaker diarization

Academic Context

#Speech Processing #Self-Supervised Learning #Speaker Recognition #Representation Learning #Foundational Models

Commercial Potential

Potential Products

Speaker verification systems, Voice-based authentication, Meeting transcription and speaker diarization tools, Personalized voice assistants

Target Industries

Technology, Security, Finance, Telecommunications, Customer Service

Use Case Examples

Voice-based login for secure applications; Identifying different speakers in a recorded meeting; Personalizing user experiences based on voice; Detecting deepfakes or voice impersonation

Competitive Edge

Outperforms previous self-supervised speech models on speaker-centric tasks by explicitly incorporating speaker identity information during pre-training.

Market Opportunity

The voice biometrics market is growing rapidly, projected to reach billions of dollars.

Revenue Models

Licensing the model/technology; offering API services for speaker verification.

Resource Requirements

Compute Needs

High for pre-training; moderate for fine-tuning and inference.

Data Requirements

Large, diverse speech datasets for pre-training; speaker-labeled datasets for fine-tuning and evaluation.

Deployment Constraints

Computational resources for inference; latency requirements for real-time applications

Scalability

Scalable to large datasets and various downstream tasks.

Regulatory Considerations

Privacy concerns related to voice data; compliance with data protection regulations (e.g., GDPR).

Production Readiness

Maturity Level

Research/Development

Time to Market

1-3 years for integration into commercial products.

Patent Potential

Moderate, particularly for the novel integration of speaker embeddings into the SSL pre-training process.
