TGT: Text-Grounded Trajectories for Locally Controlled Video Generation

📄 Abstract

Text-to-video generation has advanced rapidly in visual fidelity, whereas standard methods still have limited ability to control the subject composition of generated scenes. Prior work shows that adding localized text control signals, such as bounding boxes or segmentation masks, can help. However, these methods struggle in complex scenarios and degrade in multi-object settings, offering limited precision and lacking a clear correspondence between individual trajectories and visual entities as the number of controllable objects increases. We introduce Text-Grounded Trajectories (TGT), a framework that conditions video generation on trajectories paired with localized text descriptions. We propose Location-Aware Cross-Attention (LACA) to integrate these signals and adopt a dual-CFG scheme to separately modulate local and global text guidance. In addition, we develop a data processing pipeline that produces trajectories with localized descriptions of tracked entities, and we annotate two million high-quality video clips to train TGT. Together, these components enable TGT to use point trajectories as intuitive motion handles, pairing each trajectory with text to control both appearance and motion. Extensive experiments show that TGT achieves higher visual quality, more accurate text alignment, and improved motion controllability compared with prior approaches. Website: https://textgroundedtraj.github.io.
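
To make "trajectories paired with localized text descriptions" concrete, here is a minimal sketch of what such a conditioning signal might look like, and of how a location-dependent mask over video tokens could be derived from it. Everything here is an assumption for illustration: the `TextGroundedTrajectory` container, the `location_mask` helper, the normalized coordinates, and the fixed `radius` are hypothetical and are not taken from the paper, which does not specify LACA at this level of detail in the abstract.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class TextGroundedTrajectory:
    """Hypothetical container: one point trajectory plus its local description."""
    text: str                               # localized description of the tracked entity
    points: Dict[int, Tuple[float, float]]  # frame index -> (x, y), normalized to [0, 1]


def location_mask(traj: TextGroundedTrajectory, num_frames: int,
                  grid_h: int, grid_w: int,
                  radius: float = 0.15) -> List[List[List[bool]]]:
    """Return mask[f][i][j], True where the token at grid cell (i, j) of frame f
    lies within `radius` of the trajectory point in that frame.

    Assumption: a location-aware attention layer could use such a mask to let
    only nearby visual tokens attend to this trajectory's text tokens.
    """
    mask = []
    for f in range(num_frames):
        frame = [[False] * grid_w for _ in range(grid_h)]
        if f in traj.points:  # frames without a tracked point stay all-False
            px, py = traj.points[f]
            for i in range(grid_h):
                for j in range(grid_w):
                    # Centre of grid cell (i, j) in normalized coordinates.
                    cx = (j + 0.5) / grid_w
                    cy = (i + 0.5) / grid_h
                    if (cx - px) ** 2 + (cy - py) ** 2 <= radius ** 2:
                        frame[i][j] = True
        mask.append(frame)
    return mask
```

With one such object per controllable entity, each trajectory carries its own text, which is the one-to-one correspondence the abstract says box- or mask-based methods lose as the number of objects grows.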
Authors (11)
Guofeng Zhang
Angtian Wang
Jacob Zhiyuan Fang
Liming Jiang
Haotian Yang
Bo Liu
+5 more
Submitted
October 16, 2025
arXiv Category
cs.CV

Key Contributions

Text-Grounded Trajectories (TGT) is a framework for text-to-video generation that pairs point trajectories with localized text descriptions, giving control over both the appearance and the motion of individual objects. It integrates these signals through Location-Aware Cross-Attention (LACA) and uses a dual-CFG scheme to modulate local and global text guidance separately, addressing the loss of precision that prior methods show in complex, multi-object scenarios.
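
The summary does not give the dual-CFG formula, so the following is only a sketch of one common way to apply two classifier-free guidance scales: a nested combination of three noise predictions. The function name `dual_cfg`, the argument names, and the specific nesting order are assumptions, not the paper's stated method.

```python
from typing import List


def dual_cfg(eps_uncond: List[float], eps_global: List[float],
             eps_full: List[float], w_global: float,
             w_local: float) -> List[float]:
    """Combine three noise predictions with separate guidance scales.

    eps_uncond: prediction with no text conditioning
    eps_global: prediction with the global prompt only
    eps_full:   prediction with the global prompt plus local trajectory text

    Assumed nested form (not confirmed by the source):
        eps = eps_uncond
              + w_global * (eps_global - eps_uncond)
              + w_local  * (eps_full   - eps_global)
    """
    if not (len(eps_uncond) == len(eps_global) == len(eps_full)):
        raise ValueError("all predictions must have the same length")
    return [u + w_global * (g - u) + w_local * (f - g)
            for u, g, f in zip(eps_uncond, eps_global, eps_full)]
```

In this form, setting both scales to 1 recovers the fully conditioned prediction, and raising `w_local` alone strengthens the per-trajectory text without changing how strongly the global prompt is enforced.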

Business Value

TGT gives creators per-object control over what appears in a generated video and how it moves, which could streamline workflows in advertising, social media, and entertainment production.