Scalable Autoregressive Image Generation with Mamba

Abstract

We introduce AiM, an autoregressive (AR) image generative model based on the Mamba architecture. AiM employs Mamba, a novel state-space model characterized by its exceptional performance for long-sequence modeling with linear time complexity, to supplant the commonly utilized Transformers in AR image generation models, aiming to achieve both superior generation quality and enhanced inference speed. Unlike existing methods that adapt Mamba to handle two-dimensional signals via multi-directional scans, AiM directly utilizes the next-token prediction paradigm for autoregressive image generation. This approach circumvents the need for extensive modifications to enable Mamba to learn 2D spatial representations. By implementing straightforward yet strategically targeted modifications for visual generative tasks, we preserve Mamba's core structure, fully exploiting its efficient long-sequence modeling capabilities and scalability. We provide AiM models in various scales, with parameter counts ranging from 148M to 1.3B. On the ImageNet1K 256×256 benchmark, our best AiM model achieves an FID of 2.21, surpassing all existing AR models of comparable parameter counts and demonstrating significant competitiveness against diffusion models, with 2 to 10 times faster inference speed. Code is available at https://github.com/hp-l33/AiM
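
The linear time complexity mentioned in the abstract comes from the recurrent form of a state-space model: each token updates a fixed-size hidden state, so cost grows linearly with sequence length. The sketch below is a deliberately simplified illustration, assuming a scalar input and a diagonal, time-invariant recurrence. It is not Mamba's selective scan (where the parameters depend on the input) and it is not taken from the AiM code; the function name `ssm_scan` and the parameters `a`, `b`, `c` are hypothetical.

```python
def ssm_scan(xs, a, b, c):
    """Run a simplified diagonal linear state-space recurrence.

    For each input x_t (a scalar) and state dimension i:
        h_t[i] = a[i] * h_{t-1}[i] + b[i] * x_t
        y_t    = sum_i c[i] * h_t[i]

    Assumption: a, b, c are fixed lists of equal length. Real Mamba
    layers make these input-dependent; this sketch does not.
    """
    h = [0.0] * len(a)  # hidden state, size independent of sequence length
    ys = []
    for x in xs:  # one constant-cost update per token -> O(len(xs)) overall
        h = [ai * hi + bi * x for ai, hi, bi in zip(a, h, b)]
        ys.append(sum(ci * hi for ci, hi in zip(c, h)))
    return ys


if __name__ == "__main__":
    # An impulse input decays geometrically through the one-dimensional state.
    print(ssm_scan([1.0, 0.0, 0.0], a=[0.5], b=[1.0], c=[2.0]))
```

Because the state `h` has a fixed size, the work per token does not grow with the number of tokens already processed, which is the contrast with attention that the abstract draws.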
Authors (7)
Haopeng Li
Jinyue Yang
Kexin Wang
Xuerui Qiu
Yuhong Chou
Xin Li
+1 more
Submitted
August 22, 2024
arXiv Category
cs.CV

Key Contributions

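AiM is a Mamba-based autoregressive model for image generation. It replaces the Transformer backbone commonly used in AR image models with Mamba, whose linear time complexity on long sequences is intended to deliver both higher generation quality and faster inference. Rather than adapting Mamba to 2D signals through multi-directional scans, AiM applies the standard next-token prediction paradigm directly, preserving Mamba's core structure with only small, targeted modifications. The released models range from 148M to 1.3B parameters, and the best one reaches an FID of 2.21 on ImageNet1K 256×256.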
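
As a rough illustration of next-token prediction applied to images, the sketch below treats an image as a grid of discrete token ids, flattens it in row-major order, and samples one token at a time conditioned on the prefix. Several things here are assumptions rather than details from the summary: that the image is tokenized into discrete ids, that tokens are ordered row by row, and the stand-in predictor `toy_model`, which is a hypothetical placeholder and not a Mamba network.

```python
import random


def flatten_raster(grid):
    """Flatten a 2D grid of token ids into a 1D sequence, row by row."""
    return [tok for row in grid for tok in row]


def unflatten_raster(seq, width):
    """Reshape a 1D token sequence back into rows of the given width."""
    return [seq[i:i + width] for i in range(0, len(seq), width)]


def toy_model(prefix, vocab_size=4):
    """Hypothetical stand-in for a sequence model.

    Returns unnormalized sampling weights for the next token and simply
    favors repeating the previous token. A real model would compute
    these from the whole prefix.
    """
    weights = [1.0] * vocab_size
    if prefix:
        weights[prefix[-1]] += 3.0
    return weights


def generate(next_token_weights, height, width, seed=0):
    """Sample height*width tokens one at a time, then reshape to a grid."""
    rng = random.Random(seed)
    seq = []
    for _ in range(height * width):
        weights = next_token_weights(seq)  # condition on everything so far
        tok = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        seq.append(tok)
    return unflatten_raster(seq, width)


if __name__ == "__main__":
    for row in generate(toy_model, height=2, width=3):
        print(row)
```

The point of the sketch is that the generation loop never refers to 2D structure: the grid shape only matters when flattening and unflattening, which is why a 1D sequence model can be used without multi-directional scanning.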

Business Value

AiM enables faster and more scalable generation of high-quality images, which benefits applications that require rapid content creation or large-scale synthetic data generation.