Abstract: Existing DeepFake detection techniques primarily focus on facial
manipulations, such as face-swapping or lip-syncing. However, advancements in
text-to-video (T2V) and image-to-video (I2V) generative models now allow fully
AI-generated synthetic content and seamless background alterations, challenging
face-centric detection methods and demanding more versatile approaches.
To address this, we introduce the \underline{U}niversal \underline{N}etwork
for \underline{I}dentifying \underline{T}ampered and synth\underline{E}tic
videos (\texttt{UNITE}) model, which, unlike traditional detectors, captures
full-frame manipulations. \texttt{UNITE} extends detection capabilities to
scenarios without faces, non-human subjects, and complex background
modifications. It leverages a transformer-based architecture that processes
domain-agnostic features extracted from videos via the SigLIP-So400M foundation
model. Since few existing datasets encompass both facial/background alterations
and T2V/I2V content, we integrate task-irrelevant data alongside standard
DeepFake datasets during training. We further mitigate the model's tendency to
over-focus on faces by incorporating an attention-diversity (AD) loss, which
promotes diverse spatial attention across video frames. Combining the AD loss
with cross-entropy loss improves detection performance across varied contexts.
Comparative evaluations demonstrate that \texttt{UNITE} outperforms
state-of-the-art detectors in cross-dataset settings on benchmarks featuring
face/background manipulations and fully synthetic T2V/I2V videos, showcasing
its adaptability and generalizable detection capabilities.
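
The abstract describes extracting domain-agnostic, full-frame features with the SigLIP-So400M foundation model before classification. The sketch below shows one way such per-frame patch tokens could be obtained with the HuggingFace `transformers` SigLIP API; the checkpoint name and model classes are real, but the frame-sampling stand-in and the way the tokens would feed \texttt{UNITE}'s transformer are our assumptions, not the authors' released code.

```python
# Hedged sketch: frame-level SigLIP-So400M feature extraction (assumed pipeline).
import numpy as np
import torch
from PIL import Image
from transformers import AutoImageProcessor, SiglipVisionModel

CKPT = "google/siglip-so400m-patch14-384"
processor = AutoImageProcessor.from_pretrained(CKPT)
backbone = SiglipVisionModel.from_pretrained(CKPT).eval()

# Stand-in for T RGB frames sampled from a video (random images here).
frames = [
    Image.fromarray(np.random.randint(0, 255, (384, 384, 3), dtype=np.uint8))
    for _ in range(8)
]

with torch.no_grad():
    pixel_values = processor(images=frames, return_tensors="pt").pixel_values
    out = backbone(pixel_values=pixel_values)

# Domain-agnostic patch tokens, one sequence per frame: (T, num_patches, hidden).
# These spatial tokens would be passed to UNITE's transformer, which attends
# over both space and time to flag full-frame manipulations.
patch_tokens = out.last_hidden_state
print(patch_tokens.shape)
```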
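
To make the attention-diversity (AD) idea concrete, here is a minimal sketch of one plausible diversity regularizer combined with cross-entropy, in the spirit the abstract describes: it penalizes pairwise similarity between the attention maps different heads place on spatial patches, discouraging all heads from collapsing onto the face region. The exact AD formulation in \texttt{UNITE} may differ, and the `ad_weight` value is illustrative, not taken from the paper.

```python
# Hedged sketch: a possible attention-diversity (AD) loss + cross-entropy.
import torch
import torch.nn.functional as F

def attention_diversity_loss(attn: torch.Tensor) -> torch.Tensor:
    """attn: (batch, heads, num_patches) attention that the classification
    token places on spatial patch tokens, one map per head."""
    maps = F.normalize(attn, dim=-1)                   # unit-norm map per head
    sim = torch.einsum("bhp,bgp->bhg", maps, maps)     # head-to-head cosine sim
    h = attn.shape[1]
    off_diag = sim - torch.eye(h, device=attn.device)  # drop self-similarity
    # Average pairwise similarity; lower means heads look at different regions.
    return off_diag.clamp(min=0).sum(dim=(1, 2)).mean() / (h * (h - 1))

def total_loss(logits, labels, attn, ad_weight: float = 0.1):
    # Cross-entropy on real/fake logits plus the diversity regularizer.
    return F.cross_entropy(logits, labels) + ad_weight * attention_diversity_loss(attn)

# Toy usage: 4 videos, 8 heads, 196 patches, binary real/fake labels.
logits = torch.randn(4, 2)
labels = torch.randint(0, 2, (4,))
attn = torch.rand(4, 8, 196)
print(total_loss(logits, labels, attn, ad_weight=0.1))
```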