
TGT: Text-Grounded Trajectories for Locally Controlled Video Generation


Abstract

Text-to-video generation has advanced rapidly in visual fidelity, whereas standard methods still have limited ability to control the subject composition of generated scenes. Prior work shows that adding localized text control signals, such as bounding boxes or segmentation masks, can help. However, these methods struggle in complex scenarios and degrade in multi-object settings, offering limited precision and lacking a clear correspondence between individual trajectories and visual entities as the number of controllable objects increases. We introduce Text-Grounded Trajectories (TGT), a framework that conditions video generation on trajectories paired with localized text descriptions. We propose Location-Aware Cross-Attention (LACA) to integrate these signals and adopt a dual-CFG scheme to separately modulate local and global text guidance. In addition, we develop a data processing pipeline that produces trajectories with localized descriptions of tracked entities, and we annotate two million high-quality video clips to train TGT. Together, these components enable TGT to use point trajectories as intuitive motion handles, pairing each trajectory with text to control both appearance and motion. Extensive experiments show that TGT achieves higher visual quality, more accurate text alignment, and improved motion controllability compared with prior approaches. Website: https://textgroundedtraj.github.io.
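The abstract names Location-Aware Cross-Attention but does not describe its internals. As a rough illustration only, the sketch below shows one possible reading: a cross-attention mask in which each visual token attends to the global prompt tokens and to the local text tokens of any trajectory whose point lies nearby. The function name, the `radius` parameter, the `-1` owner convention, and the masking rule itself are all assumptions made for this example, not the paper's design.

```python
import math
from typing import List, Sequence, Tuple


def location_masked_attention(
    query_positions: Sequence[Tuple[float, float]],
    scores: Sequence[Sequence[float]],
    token_owner: Sequence[int],
    traj_points: Sequence[Tuple[float, float]],
    radius: float,
) -> List[List[float]]:
    """Softmax attention weights under a location-based mask (hypothetical).

    query_positions: (x, y) of each visual token in the current frame
    scores:          raw attention logits, one row per visual token
    token_owner:     for each text token, the trajectory index it describes,
                     or -1 if it belongs to the global prompt
    traj_points:     (x, y) of each trajectory in the current frame
    radius:          assumed distance within which local text is visible
    """
    weights = []
    for (qx, qy), row in zip(query_positions, scores):
        masked = []
        for logit, owner in zip(row, token_owner):
            if owner == -1:
                visible = True  # global prompt tokens are always visible
            else:
                px, py = traj_points[owner]
                visible = math.hypot(qx - px, qy - py) <= radius
            masked.append(logit if visible else float("-inf"))
        peak = max(masked)
        if peak == float("-inf"):
            # No visible token at all: return zero weights for this query.
            weights.append([0.0] * len(row))
            continue
        exps = [math.exp(value - peak) for value in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Two visual tokens, one global text token, and one text token per trajectory.
attn = location_masked_attention(
    query_positions=[(0.1, 0.1), (0.9, 0.9)],
    scores=[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    token_owner=[-1, 0, 1],
    traj_points=[(0.1, 0.1), (0.9, 0.9)],
    radius=0.2,
)
print(attn)
```

In this toy setup each visual token splits its attention between the global token and the one trajectory that passes through its location, which is the kind of one-to-one correspondence between trajectories and visual entities that the abstract says prior methods lack.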

Authors (11)

Guofeng Zhang

Angtian Wang

Jacob Zhiyuan Fang

Liming Jiang

Haotian Yang

Bo Liu

+5 more

Submitted

October 16, 2025

arXiv Category

cs.CV


Key Contributions

The paper introduces Text-Grounded Trajectories (TGT), a text-to-video generation framework that pairs point trajectories with localized text descriptions to control both the appearance and the motion of individual objects. It uses Location-Aware Cross-Attention (LACA) to integrate these signals and a dual-CFG scheme to modulate local and global text guidance separately, addressing the limitations of prior methods in complex, multi-object scenarios.
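The summary does not give the dual-CFG formula. A minimal sketch of how two guidance scales could be applied separately is shown below, assuming a common two-scale composition of classifier-free guidance over three noise predictions. The function name `dual_cfg`, the choice of three predictions, and the weights `w_global` and `w_local` are assumptions for illustration, not the paper's exact formulation.

```python
from typing import List, Sequence


def dual_cfg(
    eps_uncond: Sequence[float],
    eps_global: Sequence[float],
    eps_full: Sequence[float],
    w_global: float,
    w_local: float,
) -> List[float]:
    """Combine three noise predictions with two guidance scales (assumed form).

    eps_uncond: prediction with no text conditioning
    eps_global: prediction conditioned on the global prompt only
    eps_full:   prediction conditioned on the global prompt and local text
    """
    return [
        u + w_global * (g - u) + w_local * (f - g)
        for u, g, f in zip(eps_uncond, eps_global, eps_full)
    ]


# With both scales at 1 the result is the fully conditioned prediction.
print(dual_cfg([0.0], [1.0], [3.0], w_global=1.0, w_local=1.0))  # [3.0]
# With w_local = 0 it reduces to ordinary single-scale guidance.
print(dual_cfg([0.0], [1.0], [3.0], w_global=2.0, w_local=0.0))  # [2.0]
```

Keeping the two differences separate means the strength of the per-trajectory text can be raised or lowered without changing how strongly the global prompt is enforced.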

Business Value

Enables creators to generate highly specific and controllable video content, streamlining workflows for advertising, social media, and entertainment production.

Paper Metadata

Innovation Type

Algorithmic Improvement

Deployment Feasibility

Moderate to High. Diffusion models are becoming more efficient, but real-time generation for complex scenes can still be computationally intensive.

Limitations Addressed

Limited ability of standard text-to-video methods to control subject composition and individual object trajectories, especially in complex, multi-object settings, and the lack of clear correspondence between visual entities and their movements.

Performance Gains

The authors report higher visual quality, more accurate text alignment, and improved motion controllability than prior approaches.

Technical Tags

text-to-video generation, controlled generation, object trajectories, localized text control, Location-Aware Cross-Attention, dual-CFG, data processing pipeline, multi-object settings

Research Topics

Generative Models, Video Synthesis, Natural Language Processing, Computer Vision, Human-Computer Interaction

Methods & Architectures

Diffusion Models, Cross-Attention Mechanisms, Trajectory Prediction, Data Augmentation/Processing, Transformer

Applications & Tasks

Video Editing, Content Creation, Advertising, Film Production, Gaming, Conditional Video Generation, Controllable Synthesis, Multi-object Tracking and Animation, Generating videos based on text and object trajectories, Controlling specific object movements in generated videos

Related Fields

Generative AI, Computer Vision, Natural Language Processing, Machine Learning

Keywords

text-to-video, video generation, controlled generation, object trajectories, diffusion models, cross-attention, video synthesis, multi-object control, animation, visual storytelling

Academic Context

#Generative Models, #Video Synthesis, #Natural Language Processing, #Computer Vision, #Human-Computer Interaction

Commercial Potential

Potential Products

AI-powered video editing tools, automated animation software, personalized video ad generators

Target Industries

Media and Entertainment, Advertising, Gaming, Social Media

Use Case Examples

Generating a video of a specific car driving along a predefined path; creating animated scenes with multiple characters following distinct trajectories; producing short promotional videos with controlled object movements

Competitive Edge

Offers more precise control over object motion in text-to-video generation than previous methods, particularly for complex scenes with multiple objects.

Market Opportunity

Rapidly growing market for AI-generated video content.

Revenue Models

SaaS for content creation, licensing of models.

Resource Requirements

Compute Needs

Significant GPU resources for training and inference.

Data Requirements

Paired text descriptions and object trajectories from videos.
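To make the shape of such paired data concrete, here is a minimal sketch of one annotated clip, assuming one (x, y) point per frame in normalized coordinates. The class names, field names, and the `linear_path` helper are hypothetical and chosen only for illustration; the summary does not specify the actual annotation format.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class TextGroundedTrajectory:
    """One tracked entity: a local description plus one point per frame."""
    local_text: str
    points: List[Point]  # assumed: (x, y) in [0, 1], one entry per frame


@dataclass
class ClipAnnotation:
    """A clip-level prompt together with its text-grounded trajectories."""
    global_text: str
    num_frames: int
    trajectories: List[TextGroundedTrajectory]

    def validate(self) -> None:
        """Raise ValueError if a trajectory has the wrong length or range."""
        for traj in self.trajectories:
            if len(traj.points) != self.num_frames:
                raise ValueError(f"'{traj.local_text}' needs {self.num_frames} points")
            for x, y in traj.points:
                if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
                    raise ValueError(f"'{traj.local_text}' has a point outside [0, 1]")


def linear_path(start: Point, end: Point, num_frames: int) -> List[Point]:
    """Evenly spaced points from start to end, one per frame."""
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    if num_frames == 1:
        return [start]
    (x0, y0), (x1, y1) = start, end
    steps = num_frames - 1
    return [
        (x0 + (x1 - x0) * i / steps, y0 + (y1 - y0) * i / steps)
        for i in range(num_frames)
    ]


# Two entities with distinct descriptions and distinct paths in one clip.
clip = ClipAnnotation(
    global_text="a city street at dusk",
    num_frames=3,
    trajectories=[
        TextGroundedTrajectory("a red sports car", linear_path((0.0, 0.5), (1.0, 0.5), 3)),
        TextGroundedTrajectory("a cyclist in yellow", linear_path((0.8, 0.9), (0.2, 0.7), 3)),
    ],
)
clip.validate()
```

Storing the description next to the points keeps the correspondence between each trajectory and the entity it controls explicit, even as the number of objects grows.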

Deployment Constraints

Computational cost for generation.

Scalability

Scales to multi-object scenarios and longer video generation.

Production Readiness

Maturity Level

Research Prototype

Time to Market

1-2 years
