arxiv_cv Research Paper Cybersecurity Professionals, Media Forensics Experts, AI Researchers, Content Moderation Teams

Referee: Reference-aware Audiovisual Deepfake Detection


Abstract

Deepfakes produced by advanced generative models pose rapidly growing threats, yet existing audiovisual deepfake detection approaches struggle to generalize to unseen forgeries. We propose Referee, a novel reference-aware audiovisual deepfake detection method. Referee leverages speaker-specific cues from only one-shot reference examples to detect manipulations beyond spatiotemporal artifacts. By matching and aligning identity-related queries from reference and target content into cross-modal features, Referee jointly reasons about audiovisual synchrony and identity consistency. Extensive experiments on FakeAVCeleb, FaceForensics++, and KoDF demonstrate that Referee achieves state-of-the-art performance on cross-dataset and cross-language evaluation protocols. The experimental results highlight the importance of cross-modal identity verification for future deepfake detection. The code is available at https://github.com/ewha-mmai/referee.

Authors (3)

Hyemin Boo

Eunsang Lee

Jiyoung Lee

Submitted

October 31, 2025

arXiv Category

cs.CV


Key Contributions

Referee is a novel reference-aware audiovisual deepfake detection method that leverages speaker-specific cues from one-shot examples. By matching identity-related queries across modalities, it jointly reasons about audiovisual synchrony and identity consistency, achieving state-of-the-art performance on cross-dataset and cross-language evaluations.
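The idea of comparing identity cues between a reference clip and a target clip can be illustrated with a toy example. The sketch below is not the authors' architecture: it assumes a single hand-written identity query, tiny 2-dimensional feature vectors, plain scaled dot-product attention, and cosine similarity as the consistency score. All function names (attend, identity_consistency) are hypothetical and chosen only for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, features):
    """Scaled dot-product attention of one query over a list of feature vectors.

    Returns the attention-weighted average of the feature vectors.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * f for q, f in zip(query, feat)) / scale for feat in features]
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(w * feat[i] for w, feat in zip(weights, features)) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identity_consistency(identity_query, reference_feats, target_feats):
    """Hypothetical score: how similar the identity summary of the target is
    to the identity summary of the one-shot reference (1.0 = identical direction)."""
    ref_summary = attend(identity_query, reference_feats)
    tgt_summary = attend(identity_query, target_feats)
    return cosine(ref_summary, tgt_summary)


# Toy features: the reference speaker lives near the first axis.
query = [1.0, 0.0]
reference = [[1.0, 0.0], [0.9, 0.1]]
same_speaker = [[1.0, 0.0], [0.9, 0.1]]
other_speaker = [[0.0, 1.0], [0.1, 0.9]]

score_same = identity_consistency(query, reference, same_speaker)
score_other = identity_consistency(query, reference, other_speaker)
```

In this toy setup a target that matches the reference scores close to 1, while a target with different identity features scores much lower. A real system would learn the queries and features from audio and video encoders rather than use hand-written vectors.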

Business Value

Enhances the reliability of digital media by providing advanced tools to detect sophisticated audiovisual deepfakes, crucial for combating misinformation and ensuring secure communication.

Paper Metadata

Innovation Type

Algorithmic Improvement

Deployment Feasibility

Moderate: requires processing both audio and video streams and managing reference samples.

Limitations Addressed

Poor generalization of existing audiovisual deepfake detectors to unseen forgeries; Inability to detect manipulations beyond simple spatiotemporal artifacts; Need for robust cross-modal identity verification

Performance Gains

State-of-the-art performance on cross-dataset and cross-language evaluation protocols
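A cross-dataset protocol generally means scoring a detector on datasets it was not trained on. The sketch below shows one way such an evaluation loop could look. The choice of ROC AUC as the metric is an assumption (this page does not state which metrics the paper reports), and the names roc_auc, cross_dataset_eval, and the toy dataset keys are hypothetical.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison: the fraction of (fake, real) pairs
    in which the fake sample receives the higher score; ties count as 0.5.

    labels: 1 for fake, 0 for real. scores: higher means "more likely fake".
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one fake and one real sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cross_dataset_eval(score_fn, held_out_sets):
    """Apply one already-trained scoring function to several unseen datasets.

    held_out_sets maps a dataset name to a list of (sample, label) pairs.
    """
    results = {}
    for name, samples in held_out_sets.items():
        labels = [label for _, label in samples]
        scores = [score_fn(sample) for sample, _ in samples]
        results[name] = roc_auc(labels, scores)
    return results


# Toy stand-in: each "sample" is already a fake-likelihood score.
toy_sets = {
    "unseen_dataset_a": [(0.9, 1), (0.8, 1), (0.2, 0), (0.1, 0)],
    "unseen_dataset_b": [(0.9, 1), (0.8, 0), (0.2, 1), (0.1, 0)],
}
auc_by_dataset = cross_dataset_eval(lambda sample: sample, toy_sets)
```

The pairwise form is quadratic in the number of samples, which is fine for illustration; a rank-based implementation would be preferable for large test sets.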


Technical Tags

audiovisual deepfake detection, reference-aware, speaker-specific cues, one-shot learning, cross-modal features, identity verification, spatiotemporal artifacts, cross-dataset generalization

Research Topics

AI Safety, Deepfake Detection, Robustness, Computer Vision, Speech Processing, Multimodal AI

Methods & Architectures

Referee, Reference-aware Audiovisual Deepfake Detection, Speaker-specific Cue Matching, Cross-modal Feature Alignment, Identity Query Matching

Applications & Tasks

Media Forensics, Cybersecurity, Content Authenticity Verification, Generalization to Unseen Forgeries, Detecting Manipulations Beyond Spatiotemporal Artifacts, Cross-Dataset and Cross-Language Evaluation, Robustness of Deepfake Detectors, Audiovisual Deepfake Detection, Identity Verification, Manipulation Detection

Datasets & Benchmarks

Datasets

FakeAVCeleb, FaceForensics++, KoDF

Related Fields

Digital Forensics, Computer Vision, Speech Processing, Machine Learning, Cybersecurity

Keywords

deepfake detection, audiovisual, reference-aware, speaker recognition, identity verification, cross-modal, robustness, AI safety, forensics, generative AI

Academic Context

Ewha Womans University

Companies & Organizations

Research Institutions

Ewha Womans University

Commercial Potential

Potential Products

Deepfake detection software, Media authentication services, Security solutions for video conferencing

Target Industries

Media & Entertainment, Social Media, Cybersecurity, Government, Finance

Use Case Examples

Verifying the authenticity of video evidence; Detecting manipulated political speeches; Securing online identity verification processes

Competitive Edge

Introduces a novel reference-aware approach that focuses on identity consistency and cross-modal reasoning, offering improved generalization compared to methods relying solely on spatiotemporal artifacts.

Market Opportunity

Large and rapidly growing, due to the increasing threat of deepfakes.

Revenue Models

Licensing of the detection technology; SaaS for media verification.

Resource Requirements

Compute Needs

Moderate: requires a GPU for training and inference.

Data Requirements

Audiovisual datasets with real and deepfake content.

Deployment Constraints

Requires synchronized audio and video streams; latency may be an issue.
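As a rough illustration of what the synchronization requirement involves, the sketch below maps a video frame to its matching audio samples and checks whether two streams cover the same duration. The 25 fps frame rate, 16 kHz sample rate, and 40 ms tolerance are assumptions picked for the example, not values taken from the paper, and both function names are hypothetical.

```python
def audio_span_for_frame(frame_idx, fps=25.0, sample_rate=16000):
    """Return the [start, end) audio sample range that overlaps one video frame."""
    start = round(frame_idx * sample_rate / fps)
    end = round((frame_idx + 1) * sample_rate / fps)
    return start, end


def streams_in_sync(n_frames, n_samples, fps=25.0, sample_rate=16000, tolerance_s=0.04):
    """Check that the video and audio durations agree within a tolerance (seconds)."""
    video_duration = n_frames / fps
    audio_duration = n_samples / sample_rate
    return abs(video_duration - audio_duration) <= tolerance_s


first_span = audio_span_for_frame(0)      # samples covering the first frame
tenth_span = audio_span_for_frame(10)     # samples covering frame index 10
aligned = streams_in_sync(250, 160000)    # 10 s of video, 10 s of audio
drifted = streams_in_sync(250, 150000)    # 10 s of video, 9.375 s of audio
```

A duration check like this only catches gross mismatches; it says nothing about whether lip motion and speech are actually aligned within the clip.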

Scalability

Scalability depends on the efficiency of the cross-modal feature matching and reasoning process.

Regulatory Considerations

Potential for misuse in surveillance; Ethical implications of deepfake detection technology

Production Readiness

Maturity Level

Research

Time to Market

1-2 years for integration into security products.

Patent Potential

High, for the reference-aware detection method.
