📄 Abstract
While inference-time scaling through search has revolutionized Large Language
Models, translating these gains to image generation has proven difficult.
Recent attempts to apply search strategies to continuous diffusion models show
limited benefits, with simple random sampling often performing best. We
demonstrate that the discrete, sequential nature of visual autoregressive
models enables effective search for image generation. We show that beam search
substantially improves text-to-image generation, enabling a 2B parameter
autoregressive model to outperform a 12B parameter diffusion model across
benchmarks. Systematic ablations show that this advantage comes from the
discrete token space, which allows early pruning and computational reuse, and
our verifier analysis highlights trade-offs between speed and reasoning
capability. These findings suggest that model architecture, not just scale, is
critical for inference-time optimization in visual generation.
Authors (3)
Erik Riise
Mehmet Onurcan Kaya
Dim P. Papadopoulos
Submitted
October 19, 2025
Key Contributions
Demonstrates that visual autoregressive models, due to their discrete token space, are more amenable to inference-time scaling via search strategies like beam search compared to diffusion models. This allows smaller autoregressive models to outperform larger diffusion models in text-to-image generation.
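The mechanism described above can be illustrated with a minimal sketch of beam search over discrete token sequences. This is not the paper's implementation: the `step` and `score` callables are hypothetical stand-ins for the autoregressive model's next-token distribution and the verifier, respectively, and only show how a discrete token space permits pruning weak partial sequences early while reusing the computation spent on surviving beams.

```python
import heapq
from typing import Callable, List, Tuple

def beam_search(
    step: Callable[[List[int]], List[Tuple[int, float]]],
    score: Callable[[List[int]], float],
    beam_width: int,
    max_len: int,
) -> List[int]:
    """Generic beam search over discrete token sequences.

    `step(seq)` returns candidate (token, log_prob) continuations
    (a stand-in for the autoregressive model); `score(seq)` is a
    stand-in verifier that ranks partial sequences so weak beams
    can be pruned before generation completes. Prefix computation
    for surviving beams is reused at every step, which is the
    advantage the paper attributes to discrete token spaces.
    """
    beams: List[Tuple[float, List[int]]] = [(0.0, [])]
    for _ in range(max_len):
        candidates: List[Tuple[float, List[int]]] = []
        for logp, seq in beams:
            # Expand each surviving beam with candidate next tokens.
            for tok, tok_logp in step(seq):
                candidates.append((logp + tok_logp, seq + [tok]))
        # Early pruning: keep only the top-`beam_width` partial
        # sequences according to the verifier.
        beams = heapq.nlargest(
            beam_width, candidates, key=lambda c: score(c[1])
        )
    # Return the highest-scoring completed sequence.
    return max(beams, key=lambda c: score(c[1]))[1]
```

With a toy binary vocabulary and a verifier that simply favors sequences containing more 1-tokens, `beam_search(lambda s: [(0, -0.1), (1, -0.2)], lambda s: float(sum(s)), beam_width=2, max_len=3)` returns `[1, 1, 1]`, since every step the all-ones prefix survives pruning. A diffusion model has no analogous discrete prefix structure to prune over, which is the contrast the paper draws.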
Business Value
Significantly reduces the computational cost and time required for generating high-quality images, making advanced image generation more accessible for various applications.