Abstract
We introduce OneCAT, a unified multimodal model that seamlessly integrates
understanding, generation, and editing within a novel, pure decoder-only
transformer architecture. Our framework uniquely eliminates the need for
external components such as Vision Transformers (ViT) or vision tokenizers
during inference, leading to significant efficiency gains, especially for
high-resolution inputs. This is achieved through a modality-specific
Mixture-of-Experts (MoE) structure trained with a single autoregressive (AR)
objective, which also natively supports dynamic resolutions. Furthermore, we
pioneer a multi-scale visual autoregressive mechanism within the Large Language
Model (LLM) that drastically reduces decoding steps compared to diffusion-based
methods while maintaining state-of-the-art performance. Our findings
demonstrate the powerful potential of pure autoregressive modeling as a
sufficient and elegant foundation for unified multimodal intelligence. As a
result, OneCAT sets a new performance standard, outperforming existing
open-source unified multimodal models across benchmarks for multimodal
generation, editing, and understanding.
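
As a rough illustration of what a modality-specific Mixture-of-Experts layer could look like, the sketch below routes each token to a per-modality feed-forward expert using an explicit modality id. All names, shapes, and the routing-by-id scheme are assumptions for illustration, not the OneCAT implementation.

```python
# Hypothetical sketch of a modality-specific Mixture-of-Experts (MoE) feed-forward
# layer: each token is dispatched to an expert FFN chosen by its modality id
# (e.g. 0 = text, 1 = vision) instead of a learned router. Illustrative only.
import torch
import torch.nn as nn


class ModalityMoEFFN(nn.Module):
    """One FFN expert per modality."""

    def __init__(self, d_model: int, d_hidden: int, num_modalities: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.GELU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(num_modalities)
        )

    def forward(self, x: torch.Tensor, modality_ids: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); modality_ids: (batch, seq_len) integer tags.
        out = torch.zeros_like(x)
        for m, expert in enumerate(self.experts):
            mask = modality_ids == m          # tokens belonging to modality m
            if mask.any():
                out[mask] = expert(x[mask])   # process only those tokens
        return out


# Example: a mixed sequence of 6 text tokens followed by 10 image tokens.
layer = ModalityMoEFFN(d_model=64, d_hidden=256)
tokens = torch.randn(1, 16, 64)
ids = torch.tensor([[0] * 6 + [1] * 10])
print(layer(tokens, ids).shape)  # torch.Size([1, 16, 64])
```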
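
The multi-scale visual autoregressive mechanism mentioned in the abstract can be pictured as next-scale prediction: each decoding step emits all tokens of the next, finer grid conditioned on the coarser scales already generated, which is why far fewer steps are needed than with per-token decoding or iterative diffusion sampling. The `generate_image_tokens` loop and the `model(context, num_new_tokens=...)` interface below are hypothetical; this is a minimal sketch of the general idea, not OneCAT's actual decoder.

```python
# Hypothetical sketch of multi-scale (next-scale) autoregressive image decoding:
# one forward pass per scale predicts all tokens of that scale at once.
import torch


@torch.no_grad()
def generate_image_tokens(model, prompt_tokens: torch.Tensor,
                          scales=(1, 2, 4, 8, 16)) -> torch.Tensor:
    """Return image tokens at the finest scale; one decoding step per scale."""
    context = prompt_tokens                      # (batch, prompt_len) text condition
    image_tokens = None
    for side in scales:
        # Predict logits for all side*side tokens of this scale in one step.
        logits = model(context, num_new_tokens=side * side)   # (batch, side*side, vocab)
        image_tokens = logits.argmax(dim=-1)                  # greedy sampling for brevity
        context = torch.cat([context, image_tokens], dim=1)   # condition the next scale
    return image_tokens


class DummyScaleModel:
    """Stand-in model: returns random logits for the requested new tokens."""

    def __init__(self, vocab_size: int = 4096):
        self.vocab_size = vocab_size

    def __call__(self, context, num_new_tokens):
        return torch.randn(context.shape[0], num_new_tokens, self.vocab_size)


prompt = torch.randint(0, 4096, (1, 12))
tokens = generate_image_tokens(DummyScaleModel(), prompt)
print(tokens.shape)  # torch.Size([1, 256])
```

With scales (1, 2, 4, 8, 16), the loop takes 5 decoding steps, whereas a per-token raster-order decoder would need 256 steps for the same 16x16 token grid.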