Abstract
Despite rapid advances in text-to-video synthesis, generated video quality
remains critically dependent on precise user prompts. Existing test-time
optimization methods, successful in other domains, struggle with the
multi-faceted nature of video. In this work, we introduce VISTA (Video
Iterative Self-improvemenT Agent), a novel multi-agent system that autonomously
improves video generation through refining prompts in an iterative loop. VISTA
first decomposes a user idea into a structured temporal plan. After generation,
the best video is identified through a robust pairwise tournament. This winning
video is then critiqued by a trio of specialized agents focusing on visual,
audio, and contextual fidelity. Finally, a reasoning agent synthesizes this
feedback to introspectively rewrite and enhance the prompt for the next
generation cycle. Experiments on single- and multi-scene video generation
scenarios show that while prior methods yield inconsistent gains, VISTA
consistently improves video quality and alignment with user intent, achieving
up to 60% pairwise win rate against state-of-the-art baselines. Human
evaluators concur, preferring VISTA outputs in 66.4% of comparisons.
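The pairwise tournament mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the tournament is structured or how two videos are compared, so the knockout bracket, the `pairwise_tournament` name, and the score-based `judge` below are all assumptions made for illustration; in practice the judge would presumably be a model-based comparison.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def pairwise_tournament(candidates: List[T], judge: Callable[[T, T], T]) -> T:
    """Hypothetical knockout bracket: compare candidates in pairs until one remains.

    `judge(a, b)` must return whichever of the two candidates it prefers.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    current = list(candidates)
    while len(current) > 1:
        next_round = []
        # Pair off neighbours and keep the winner of each comparison.
        for i in range(0, len(current) - 1, 2):
            next_round.append(judge(current[i], current[i + 1]))
        # With an odd number of candidates, the last one advances unopposed.
        if len(current) % 2 == 1:
            next_round.append(current[-1])
        current = next_round
    return current[0]


# Toy data: the "score" field is a stand-in for a real pairwise comparison.
videos = [
    {"id": "a", "score": 0.4},
    {"id": "b", "score": 0.9},
    {"id": "c", "score": 0.7},
]
best = pairwise_tournament(videos, lambda x, y: x if x["score"] >= y["score"] else y)
print(best["id"])
```

A knockout bracket needs only n - 1 comparisons for n candidates, which matters when each comparison is an expensive model call; a round-robin would be more robust to a single noisy judgment at the cost of more comparisons.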
Authors (6)
Do Xuan Long
Xingchen Wan
Hootan Nakhost
Chen-Yu Lee
Tomas Pfister
Sercan Ö. Arık
Submitted
October 17, 2025
Key Contributions
VISTA is a novel multi-agent system that autonomously improves text-to-video generation at test time through iterative prompt refinement. It decomposes the user's idea into a structured temporal plan, selects the best generated video through a pairwise tournament, critiques that video with three specialized agents (visual, audio, and contextual fidelity), and uses a reasoning agent to rewrite the prompt for the next generation cycle, leading to consistent quality improvements.
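The cycle summarized above (plan, generate, select, critique, rewrite) can be sketched as a plain loop. This is a rough outline under stated assumptions, not the authors' implementation: every name here (`refine_loop`, `Critique`, and the five callables) is hypothetical, and the stubs in the usage example stand in for the video model and the agents.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Critique:
    """Hypothetical container for the three critique dimensions."""
    visual: str
    audio: str
    context: str


def refine_loop(
    idea: str,
    plan: Callable[[str], str],               # user idea -> structured prompt
    generate: Callable[[str], str],           # prompt -> video (a string here)
    select_best: Callable[[List[str]], str],  # candidates -> winning video
    critique: Callable[[str], Critique],      # winning video -> feedback
    rewrite: Callable[[str, Critique], str],  # prompt + feedback -> new prompt
    iterations: int = 3,
    samples: int = 2,
) -> Tuple[str, str]:
    """Run the iterative prompt-refinement cycle; return (last winner, final prompt)."""
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    prompt = plan(idea)
    best = ""
    for _ in range(iterations):
        videos = [generate(prompt) for _ in range(samples)]
        best = select_best(videos)
        feedback = critique(best)
        prompt = rewrite(prompt, feedback)
    return best, prompt


# Toy stubs so the loop runs end to end; none of these reflect real models.
best, final_prompt = refine_loop(
    "a fox in snow",
    plan=lambda idea: f"scene 1: {idea}",
    generate=lambda prompt: f"video({prompt})",
    select_best=lambda videos: videos[0],
    critique=lambda video: Critique("sharper lighting", "add ambient sound", "match the idea"),
    rewrite=lambda prompt, fb: f"{prompt} | {fb.visual}; {fb.audio}; {fb.context}",
    iterations=2,
)
print(final_prompt)
```

Passing each stage in as a callable keeps the loop independent of any particular video model or critic, so individual stages can be swapped or mocked without touching the control flow.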
Business Value
Improves the quality and controllability of AI-generated video while reducing the need for expert prompt engineering and manual iteration, which makes it useful for content creators, marketers, and filmmakers.