Uniform Discrete Diffusion with Metric Path for Video Generation

Abstract

Continuous-space video generation has advanced rapidly, while discrete approaches lag behind due to error accumulation and long-context inconsistency. In this work, we revisit discrete generative modeling and present Uniform discRete diffuSion with metric pAth (URSA), a simple yet powerful framework that bridges the gap with continuous approaches for scalable video generation. At its core, URSA formulates the video generation task as an iterative global refinement of discrete spatiotemporal tokens. It integrates two key designs: a Linearized Metric Path and a Resolution-dependent Timestep Shifting mechanism. These designs enable URSA to scale efficiently to high-resolution image synthesis and long-duration video generation, while requiring significantly fewer inference steps. Additionally, we introduce an asynchronous temporal fine-tuning strategy that unifies versatile tasks within a single model, including interpolation and image-to-video generation. Extensive experiments on challenging video and image generation benchmarks demonstrate that URSA consistently outperforms existing discrete methods and achieves performance comparable to state-of-the-art continuous diffusion methods. Code and models are available at https://github.com/baaivision/URSA
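
To make the phrase "uniform discrete diffusion" concrete, the sketch below shows a generic uniform corruption step on a grid of video tokens: each token is independently replaced by a uniformly random codebook entry with probability t. This is an illustration only. The function name `uniform_corrupt`, the linear replacement probability, the vocabulary size of 1024, and the (frames, height, width) grid are all assumptions; the abstract does not specify URSA's Linearized Metric Path, and this sketch does not implement it.

```python
import numpy as np

def uniform_corrupt(tokens, t, vocab_size, rng):
    """Generic uniform corruption (assumed form, not URSA's metric path).

    Each token is replaced, independently, by a token drawn uniformly
    from the vocabulary with probability t (t=0: clean, t=1: pure noise).
    """
    tokens = np.asarray(tokens)
    replace = rng.random(tokens.shape) < t            # which positions to corrupt
    noise = rng.integers(0, vocab_size, size=tokens.shape)
    return np.where(replace, noise, tokens)

rng = np.random.default_rng(0)
# Hypothetical spatiotemporal token grid: (frames, height, width)
video_tokens = rng.integers(0, 1024, size=(4, 8, 8))
half_noisy = uniform_corrupt(video_tokens, 0.5, 1024, rng)
```

Because every position can be corrupted and later revised, a model trained to invert this process refines all tokens at each step, which matches the "iterative global refinement" framing in the abstract.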
Authors (11): Haoge Deng, Ting Pan, Fan Zhang, Yang Liu, Zhuoyan Luo, Yufeng Cui, +5 more
Submitted: October 28, 2025
arXiv Category: cs.CV

Key Contributions

URSA is a uniform discrete diffusion framework that narrows the gap with continuous approaches for scalable video generation. It introduces a Linearized Metric Path and a Resolution-dependent Timestep Shifting mechanism, which let it scale to high-resolution image synthesis and long-duration video generation with significantly fewer inference steps. It also adds an asynchronous temporal fine-tuning strategy that unifies tasks such as interpolation and image-to-video generation in a single model.
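
As a rough illustration of what a resolution-dependent timestep shift can look like, the sketch below uses the shift form common in continuous flow-matching samplers, where t = 1 denotes pure noise. Whether URSA uses this exact formula is an assumption, since the summary above does not state it. The helper `hypothetical_shift`, its square-root scaling, and the base token count of 1024 are likewise invented for illustration.

```python
import math

def shift_timestep(t, shift):
    """Warp t in [0, 1] toward the noisy end when shift > 1 (assumed form)."""
    return shift * t / (1.0 + (shift - 1.0) * t)

def hypothetical_shift(num_tokens, base_tokens=1024):
    """Hypothetical rule: more tokens (higher resolution) gives a larger shift."""
    return math.sqrt(num_tokens / base_tokens)

low_res = shift_timestep(0.5, hypothetical_shift(1024))   # shift = 1.0, t unchanged
high_res = shift_timestep(0.5, hypothetical_shift(4096))  # shift = 2.0, t moves toward noise
```

Under these assumptions, the endpoints t = 0 and t = 1 are preserved, while intermediate timesteps for larger token grids are pushed toward higher noise levels.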

Business Value

More efficient, higher-quality discrete video generation could lower production costs and make realistic, long-form video content more accessible to create.