
BachVid: Training-Free Video Generation with Consistent Background and Character

📄 Abstract

Diffusion Transformers (DiTs) have recently driven significant progress in text-to-video (T2V) generation. However, generating multiple videos with consistent characters and backgrounds remains a significant challenge. Existing methods typically rely on reference images or extensive training, and often only address character consistency, leaving background consistency to image-to-video models. We introduce BachVid, the first training-free method that achieves consistent video generation without needing any reference images. Our approach is based on a systematic analysis of DiT's attention mechanism and intermediate features, revealing its ability to extract foreground masks and identify matching points during the denoising process. Our method leverages this finding by first generating an identity video and caching the intermediate variables, and then injecting these cached variables into corresponding positions in newly generated videos, ensuring both foreground and background consistency across multiple videos. Experimental results demonstrate that BachVid achieves robust consistency in generated videos, offering a novel and efficient solution for consistent video generation without relying on reference images or additional training.
Authors (6)
Han Yan
Xibin Song
Yifu Wang
Hongdong Li
Pan Ji
Chao Ma
Submitted
October 24, 2025
arXiv Category
cs.CV

Key Contributions

BachVid is the first training-free method for generating multiple videos with consistent characters and backgrounds, without requiring reference images. It leverages the attention mechanism and intermediate features of Diffusion Transformers (DiTs) to extract foreground masks and identify matching points, enabling the caching and injection of variables to ensure consistency across generated videos.
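
The cache-and-inject idea described above can be illustrated with a toy sketch. This is not the authors' implementation: the helper names (`foreground_mask`, `match_points`, `inject`), the thresholding rule, the cosine-similarity matching, and the `strength` blending parameter are all assumptions made for illustration, and plain Python lists stand in for DiT token features.

```python
import math


def foreground_mask(attn_to_subject, threshold=0.5):
    """Hypothetical mask rule: min-max normalize per-token attention
    scores toward the subject prompt and threshold them."""
    lo, hi = min(attn_to_subject), max(attn_to_subject)
    span = (hi - lo) or 1.0
    return [(a - lo) / span >= threshold for a in attn_to_subject]


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_points(new_tokens, new_mask, ref_tokens, ref_mask):
    """For each new token, pick the most similar cached token from the
    same region (foreground with foreground, background with background).
    Restricting matches by region is an assumption of this sketch."""
    matches = []
    for tok, is_fg in zip(new_tokens, new_mask):
        candidates = [j for j, m in enumerate(ref_mask) if m == is_fg]
        if not candidates:
            matches.append(None)
            continue
        matches.append(max(candidates, key=lambda j: cosine(tok, ref_tokens[j])))
    return matches


def inject(new_tokens, ref_tokens, matches, strength=1.0):
    """Blend each matched cached feature into the new token's feature.
    strength=1.0 replaces the feature; 0.0 leaves it unchanged."""
    out = []
    for tok, j in zip(new_tokens, matches):
        if j is None:
            out.append(list(tok))
        else:
            ref = ref_tokens[j]
            out.append([(1 - strength) * x + strength * r for x, r in zip(tok, ref)])
    return out


# Pass 1: "identity video". Cache features and mask per (step, layer).
cache = {}
ref_tokens = [[1.0, 0.0], [0.0, 1.0]]
cache[(0, 0)] = (ref_tokens, foreground_mask([0.9, 0.1]))

# Pass 2: a new video at the same step and layer reuses the cache.
new_tokens = [[0.1, 0.9], [0.8, 0.2]]
new_mask = foreground_mask([0.2, 0.7])
cached_tokens, cached_mask = cache[(0, 0)]
matches = match_points(new_tokens, new_mask, cached_tokens, cached_mask)
result = inject(new_tokens, cached_tokens, matches, strength=0.5)
```

Keying the cache by denoising step and layer reflects the abstract's description of injecting cached variables "into corresponding positions"; how BachVid actually selects steps, layers, and variables is not specified in this summary.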

Business Value

By removing the need for reference images and additional training, BachVid could reduce the cost and time of creating consistent video content, enabling faster iteration and production for marketing, entertainment, and social media.