
EchoFake: A Replay-Aware Dataset for Practical Speech Deepfake Detection

Abstract

The growing prevalence of speech deepfakes has raised serious concerns, particularly in real-world scenarios such as telephone fraud and identity theft. While many anti-spoofing systems have demonstrated promising performance on lab-generated synthetic speech, they often fail when confronted with physical replay attacks, a common and low-cost form of attack used in practical settings. Our experiments show that models trained on existing datasets exhibit severe performance degradation, with average accuracy dropping to 59.6% when evaluated on replayed audio. To bridge this gap, we present EchoFake, a comprehensive dataset comprising more than 120 hours of audio from over 13,000 speakers, featuring both cutting-edge zero-shot text-to-speech (TTS) speech and physical replay recordings collected under varied devices and real-world environmental settings. Additionally, we evaluate three baseline detection models and show that models trained on EchoFake achieve lower average EERs across datasets, indicating better generalization. By introducing more practical challenges relevant to real-world deployment, EchoFake offers a more realistic foundation for advancing spoofing detection methods.

Authors (3)

Tong Zhang

Yihuan Huang

Yanzhen Ren

Submitted

October 22, 2025

arXiv Category

eess.AS

Key Contributions

Introduces EchoFake, a comprehensive dataset (120+ hours, 13,000+ speakers) specifically designed for practical speech deepfake detection, including cutting-edge TTS and physical replay recordings. This dataset addresses the severe performance degradation of existing models on replay attacks, a common real-world threat.

Business Value

Supports security in voice-based communication systems by providing data for training and evaluating detectors against sophisticated deepfake attacks, particularly those delivered through physical replay. This is relevant to preventing fraud and identity theft.

Paper Metadata

Innovation Type

Dataset Creation

Deployment Feasibility

Moderate to High, depending on the integration of the detection models into existing communication infrastructure.

Limitations Addressed

Failure of existing anti-spoofing systems on physical replay attacks, performance degradation on replayed audio.

Performance Gains

Models trained on EchoFake achieve lower average EERs across datasets (the abstract does not report specific values).
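
For readers unfamiliar with the metric, the following is a minimal sketch of how an equal error rate (EER) and its average across datasets could be computed. It is not the authors' evaluation code. The function names `equal_error_rate` and `average_eer`, the label convention (1 = bona fide, 0 = spoof, higher score = more likely bona fide), and the toy scores are all assumptions made for illustration; none of the numbers come from the paper.

```python
from typing import Dict, Sequence, Tuple


def equal_error_rate(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Approximate EER by sweeping thresholds over the observed scores.

    Assumed convention (hypothetical, not from the paper):
    label 1 = bona fide, label 0 = spoof, higher score = more bona fide.
    """
    bona = [s for s, y in zip(scores, labels) if y == 1]
    spoof = [s for s, y in zip(scores, labels) if y == 0]
    if not bona or not spoof:
        raise ValueError("need at least one bona fide and one spoof trial")

    # Candidate thresholds: every observed score, plus one above the maximum.
    thresholds = sorted(set(scores)) + [float("inf")]
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False acceptance: spoof trials scored at or above the threshold.
        far = sum(s >= t for s in spoof) / len(spoof)
        # False rejection: bona fide trials scored below the threshold.
        frr = sum(s < t for s in bona) / len(bona)
        gap = abs(far - frr)
        if gap < best_gap:
            # Report the midpoint where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


def average_eer(
    per_dataset: Dict[str, Tuple[Sequence[float], Sequence[int]]]
) -> float:
    """Unweighted mean of per-dataset EERs."""
    eers = [equal_error_rate(s, y) for s, y in per_dataset.values()]
    return sum(eers) / len(eers)


# Toy, made-up scores for two hypothetical evaluation sets.
toy = {
    "set_a": ([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]),  # perfectly separated
    "set_b": ([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]),  # one overlapping pair
}
print(average_eer(toy))
```

The threshold sweep is quadratic in the number of trials, which is acceptable for a sketch but would normally be replaced by a sorted, single-pass computation on real evaluation sets.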

Technical Tags

speech deepfake detection, replay attacks, anti-spoofing systems, zero-shot text-to-speech, dataset, audio analysis, performance degradation, real-world scenarios, telephone fraud, identity theft

Research Topics

Robust Speech Deepfake Detection, Replay Attack Mitigation, Dataset Creation for Practical Scenarios, Audio Forensics, Security in Communication

Methods & Architectures

Dataset collection, Model evaluation on replayed audio, Training on EchoFake dataset

Applications & Tasks

Speech Security, Telecommunications Fraud Detection, Identity Verification, Speech Deepfake Detection, Robustness to Replay Attacks, Speaker Verification

Datasets & Benchmarks

Datasets

EchoFake, existing datasets

Benchmarks

Average accuracy dropping to 59.6% on replayed audio (for models trained on existing datasets)

Accuracy, EER (Equal Error Rate)

Related Fields

Audio Processing, Machine Learning, Cybersecurity, Digital Forensics

Keywords

speech deepfake, detection, replay attack, dataset, EchoFake, anti-spoofing, TTS, audio, security, fraud, identity theft, performance, robustness

Academic Context

#Robust Speech Deepfake Detection, #Replay Attack Mitigation, #Dataset Creation for Practical Scenarios, #Audio Forensics, #Security in Communication

Commercial Potential

Potential Products

Real-time voice authentication systems, Call center fraud detection software, Deepfake detection APIs

Target Industries

Telecommunications, Finance, Customer Service, Security

Use Case Examples

Preventing phone scams, Verifying user identity in financial transactions, Securing voice-controlled systems

Competitive Edge

Addresses a critical gap in current deepfake detection by focusing on practical replay attacks, which are often overlooked by systems trained on lab-generated data.

Resource Requirements

Compute Needs

Moderate (for training detection models)

Data Requirements

Large-scale audio dataset with diverse speakers and replay conditions.

Deployment Constraints

Real-time processing requirements, Integration with existing communication systems

Regulatory Considerations

Data privacy (speaker data)

Production Readiness

Maturity Level

Dataset Development
