Abstract
Masked autoregressive (MAR) models have recently emerged as a powerful paradigm for image and video generation, combining the flexibility of masked modeling with the representational power of continuous tokenizers. However, video MAR models suffer from two major limitations: the slow-start problem, caused by the lack of a structured global prior at early sampling stages, and error accumulation across autoregressive steps in both spatial and temporal dimensions. In this work, we propose CanvasMAR, a novel video MAR model that mitigates these issues by introducing a canvas mechanism: a blurred, global prediction of the next frame that serves as the starting point for masked generation. The canvas provides global structure early in sampling, enabling faster and more coherent frame synthesis. Furthermore, we introduce compositional classifier-free guidance, which jointly strengthens spatial (canvas) and temporal conditioning, and employ noise-based canvas augmentation to enhance robustness. Experiments on the BAIR and Kinetics-600 benchmarks demonstrate that CanvasMAR produces high-quality videos with fewer autoregressive steps. Our approach achieves strong performance among autoregressive models on the Kinetics-600 dataset and rivals diffusion-based methods.
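The abstract describes the canvas mechanism only at a high level. The following PyTorch-style pseudocode is a minimal sketch of how canvas-initialized masked generation of a single frame could look; the callables `canvas_predictor` and `mar_model`, their signatures, and the random reveal schedule are all hypothetical stand-ins for the paper's actual components, which are not specified in this summary.

```python
import torch

def generate_next_frame(mar_model, canvas_predictor, past_frames, num_steps=4):
    """Canvas-initialized masked generation of one frame (illustrative only).

    `mar_model` and `canvas_predictor` are hypothetical callables standing in
    for the paper's networks; their real interfaces are not in the abstract.
    """
    # 1. Predict a blurred, global "canvas" of the next frame from the
    #    temporal context. It supplies coarse global structure up front,
    #    so the earliest unmasking steps do not start from scratch.
    canvas = canvas_predictor(past_frames)                  # (B, N, D) tokens

    tokens = canvas.clone()                                 # start from canvas
    B, N, _ = tokens.shape
    revealed = torch.zeros(B, N, dtype=torch.bool, device=tokens.device)

    # 2. Iteratively unmask continuous tokens, always conditioning on the
    #    canvas (spatial prior) and on past frames (temporal context).
    for step in range(num_steps):
        pred = mar_model(tokens, revealed, canvas, past_frames)

        # Randomly pick which still-masked positions to commit this step;
        # real schedules typically reveal a growing fraction per step.
        quota = int((step + 1) / num_steps * N)
        ranks = torch.rand(B, N, device=tokens.device).argsort(1).argsort(1)
        newly = (ranks < quota) & ~revealed
        tokens = torch.where(newly.unsqueeze(-1), pred, tokens)
        revealed |= newly

    return tokens
```

Because the canvas already encodes a rough layout of the frame, fewer unmasking steps are needed than when sampling begins from a fully uninformative mask, which is the abstract's claimed source of speedup.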
Submitted
October 15, 2025
Key Contributions
This paper introduces CanvasMAR, a video MAR model that addresses the slow-start problem and error accumulation through a canvas mechanism: a blurred, global prediction of the next frame that seeds masked generation. The canvas supplies global structure early, enabling faster and more coherent frame synthesis. The paper further introduces compositional classifier-free guidance and noise-based canvas augmentation for robustness; both are sketched below.
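As a rough illustration of those last two ideas, the snippet below shows one common way to compose two classifier-free guidance terms (one for the canvas, one for temporal conditioning) and a simple noise-based augmentation of the canvas at training time. The `model(x, canvas, past)` signature, the guidance weights, and the noise range are assumptions for illustration; the paper's exact formulation is not given in this summary.

```python
import torch

def compositional_cfg(model, x, canvas, past, w_canvas=1.5, w_temporal=1.5):
    """Combine canvas and temporal guidance terms (one plausible form).

    `model(x, canvas, past)` is a hypothetical signature; passing None for a
    condition stands for the dropped (null) condition used when training CFG.
    """
    base = model(x, None, None)          # fully unconditional prediction
    spatial = model(x, canvas, None)     # canvas-conditioned only
    temporal = model(x, None, past)      # temporally conditioned only
    # Each guidance term independently amplifies one conditioning signal.
    return base + w_canvas * (spatial - base) + w_temporal * (temporal - base)

def augment_canvas(canvas, max_sigma=0.3):
    """Noise-based canvas augmentation (training time): perturbing the canvas
    teaches the model to tolerate imperfect canvas predictions at sampling."""
    sigma = max_sigma * torch.rand(canvas.shape[0], 1, 1, device=canvas.device)
    return canvas + sigma * torch.randn_like(canvas)
```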
Business Value
Enables higher-quality, more efficient video generation tools, benefiting industries such as media, entertainment, advertising, and gaming by reducing the time and cost of producing visual content.