Abstract
Recent advances in Artificial Intelligence Generated Content (AIGC) have led to
highly realistic synthetic videos, particularly in human-centric scenarios
involving speech, gestures, and full-body motion, posing serious threats to
information authenticity and public trust. Unlike DeepFake techniques that
focus on localized facial manipulation, human-centric video generation methods
can synthesize entire human bodies with controllable movements, enabling
complex interactions with environments, objects, and even other people.
However, existing detection methods largely overlook the mounting risks posed by
such full-body synthetic content. Meanwhile, a growing body of research has
explored leveraging LLMs for interpretable fake detection, aiming to explain
decisions in natural language. Yet these approaches heavily depend on
supervised fine-tuning, which introduces limitations such as annotation bias,
hallucinated supervision, and weakened generalization. To address these
challenges, we propose AvatarShield, a novel multimodal human-centric synthetic
video detection framework that eliminates the need for dense textual
supervision by adopting Group Relative Policy Optimization (GRPO), enabling LLMs to
develop reasoning capabilities from simple binary labels. Our architecture
combines a discrete vision tower, which captures high-level semantic
inconsistencies, with a residual extractor for fine-grained artifact analysis.
We further introduce
FakeHumanVid, a large-scale benchmark containing 15K real and synthetic videos
across nine state-of-the-art human generation methods driven by text, pose, or
audio. Extensive experiments demonstrate that AvatarShield outperforms existing
methods in both in-domain and cross-domain settings.
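The abstract does not specify the reward design used with GRPO. Below is a minimal, hypothetical sketch of the general idea it describes: sampling a group of reasoning responses per video, scoring each against only the binary real/fake label, and turning those scores into group-relative advantages for a clipped policy update. The functions `binary_reward`, `grpo_advantages`, and `grpo_loss`, and the verdict-parsing heuristic, are illustrative assumptions, not the paper's implementation.

```python
import torch

def binary_reward(response: str, is_fake: bool) -> float:
    """Reward 1.0 iff the model's final verdict matches the binary label.

    Hypothetical parsing: look for the word 'fake' near the end of the
    sampled response; the real reward function is not given in the abstract.
    """
    verdict_fake = "fake" in response.lower().split()[-5:]
    return 1.0 if verdict_fake == is_fake else 0.0

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: normalize each reward within its group,
    so no learned value network is needed."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def grpo_loss(logprobs: torch.Tensor,
              old_logprobs: torch.Tensor,
              advantages: torch.Tensor,
              clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate objective driven by the advantages above."""
    ratio = torch.exp(logprobs - old_logprobs)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

# Example: a group of G=4 sampled responses for one video labeled fake.
rewards = torch.tensor([binary_reward(r, is_fake=True) for r in [
    "...the lighting is inconsistent, so the video is fake",
    "...motion looks natural, the video is real",
    "...hand geometry breaks between frames, fake",
    "...the video is fake",
]])
adv = grpo_advantages(rewards)  # correct verdicts receive positive advantage
```

Because the reward depends only on whether the final verdict matches the ground-truth label, no human-written explanations are needed during training, which is how the framework sidesteps the annotation bias and hallucinated supervision it attributes to fine-tuning approaches.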
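Similarly, the dual-branch design is only named, not detailed. The sketch below shows one plausible reading, assuming a frozen vision tower supplies semantic tokens while a residual branch encodes high-frequency artifacts; all module names, layer sizes, and the blur-subtraction residual are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ResidualExtractor(nn.Module):
    """Isolates fine-grained artifacts by subtracting a blurred copy of each
    frame, then encodes the high-frequency residual with a small CNN."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.blur = nn.AvgPool2d(kernel_size=3, stride=1, padding=1)
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.GELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        residual = frames - self.blur(frames)  # keep only fine detail
        return self.encoder(residual)          # (B, dim)

class DualBranchDetector(nn.Module):
    """Fuses semantic tokens (from a vision tower, stubbed as an input here)
    with residual features before a real/fake head."""
    def __init__(self, sem_dim: int = 768, res_dim: int = 256):
        super().__init__()
        self.residual = ResidualExtractor(res_dim)
        self.fuse = nn.Linear(sem_dim + res_dim, 512)
        self.head = nn.Linear(512, 2)  # real vs. fake logits

    def forward(self, frames: torch.Tensor,
                semantic_tokens: torch.Tensor) -> torch.Tensor:
        sem = semantic_tokens.mean(dim=1)  # pool tower tokens: (B, sem_dim)
        res = self.residual(frames)        # (B, res_dim)
        return self.head(torch.relu(self.fuse(torch.cat([sem, res], dim=-1))))

frames = torch.randn(2, 3, 224, 224)           # two sampled video frames
tokens = torch.randn(2, 196, 768)              # stand-in vision-tower tokens
logits = DualBranchDetector()(frames, tokens)  # (2, 2) real/fake logits
```

The intuition behind such a split is that semantic features flag high-level inconsistencies (implausible motion, broken hand geometry), while the residual branch surfaces low-level generation artifacts that survive in high-frequency detail.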