📄 Abstract
Unified, generalizable semantic control in video generation remains a
critical open challenge. Existing methods either introduce artifacts by
enforcing inappropriate pixel-wise priors from structure-based controls, or
rely on non-generalizable, condition-specific finetuning or task-specific
architectures. We introduce Video-As-Prompt (VAP), a new paradigm that reframes
this problem as in-context generation. VAP leverages a reference video as a
direct semantic prompt, guiding a frozen Video Diffusion Transformer (DiT) via
a plug-and-play Mixture-of-Transformers (MoT) expert. This architecture
prevents catastrophic forgetting and is guided by a temporally biased position
embedding that eliminates spurious mapping priors for robust context retrieval.
To power this approach and catalyze future research, we built VAP-Data, the
largest dataset for semantic-controlled video generation with over 100K paired
videos across 100 semantic conditions. As a single unified model, VAP sets a
new state-of-the-art for open-source methods, achieving a 38.7% user preference
rate that rivals leading condition-specific commercial models. VAP's strong
zero-shot generalization and support for various downstream applications mark a
significant advance toward general-purpose, controllable video generation.
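The abstract does not spell out how the temporally biased position embedding works, so the following is only a minimal sketch under a stated assumption: that the reference video's frames and the target video's frames are given non-overlapping temporal indices, so the model is not nudged toward a frame-by-frame, pixel-aligned mapping between the two clips. The function name `temporal_positions` and the `biased` flag are hypothetical and are not taken from the paper.

```python
def temporal_positions(num_ref_frames, num_target_frames, biased=True):
    """Return (ref_pos, target_pos): temporal indices for the two clips.

    Illustrative assumption, not the paper's confirmed scheme.
    """
    ref_pos = list(range(num_ref_frames))
    if biased:
        # Assumed "temporal bias": shift the target clip past the reference
        # clip so that no two frames share a temporal index.
        target_pos = [t + num_ref_frames for t in range(num_target_frames)]
    else:
        # Naive sharing: frame i of both clips gets index i, which hints at
        # a frame-to-frame correspondence between reference and target.
        target_pos = list(range(num_target_frames))
    return ref_pos, target_pos


ref_pos, tgt_pos = temporal_positions(4, 4, biased=True)
shared_ref, shared_tgt = temporal_positions(4, 4, biased=False)

# Count how many temporal indices the two clips have in common.
overlap_biased = len(set(ref_pos) & set(tgt_pos))
overlap_shared = len(set(shared_ref) & set(shared_tgt))
print(overlap_biased, overlap_shared)  # prints "0 4"
```

With shared indices every reference frame collides with a target frame, whereas the offset variant keeps the two clips separated along the time axis while leaving them in a single sequence that attention can read from.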
Authors (7)
Yuxuan Bian
Xin Chen
Zenan Li
Tiancheng Zhi
Shen Sang
Linjie Luo
+1 more
Submitted
October 23, 2025
Key Contributions
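Introduces Video-As-Prompt (VAP), a paradigm that uses a reference video as a direct semantic prompt, guiding a frozen Video Diffusion Transformer (DiT) through a plug-and-play Mixture-of-Transformers (MoT) expert. This gives unified, generalizable semantic control without condition-specific finetuning and without the artifacts caused by pixel-wise priors from structure-based controls. The work also contributes VAP-Data, a dataset of over 100K paired videos across 100 semantic conditions.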
Business Value
Enables semantic control over generated videos from a reference video with a single unified model, which could make high-quality video production more accessible and efficient across industries.