Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation

Abstract

In this work, we present a novel direction to build an image tokenizer directly on top of a frozen vision foundation model, which is a largely underexplored area. Specifically, we employ a frozen vision foundation model as the encoder of our tokenizer. To enhance its effectiveness, we introduce two key components: (1) a region-adaptive quantization framework that reduces redundancy in the pre-trained features on regular 2D grids, and (2) a semantic reconstruction objective that aligns the tokenizer's outputs with the foundation model's representations to preserve semantic fidelity. Based on these designs, our proposed image tokenizer, VFMTok, achieves substantial improvements in image reconstruction and generation quality, while also enhancing token efficiency. It further boosts autoregressive (AR) generation, achieving a gFID of 1.36 on ImageNet benchmarks while accelerating model convergence by three times and enabling high-fidelity class-conditional synthesis without the need for classifier-free guidance (CFG). The code is available at https://github.com/CVMI-Lab/VFMTok.
Authors (8)
Anlin Zheng
Xin Wen
Xuanyang Zhang
Chuofan Ma
Tiancai Wang
Gang Yu
+2 more
Submitted
July 11, 2025
arXiv Category
cs.CV

Key Contributions

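Proposes building an image tokenizer directly on top of a frozen vision foundation model, which serves as the tokenizer's encoder. Two components make this effective: a region-adaptive quantization framework that reduces redundancy in the pre-trained features on regular 2D grids, and a semantic reconstruction objective that aligns the tokenizer's outputs with the foundation model's representations. The resulting tokenizer, VFMTok, improves image reconstruction and generation quality and token efficiency, reaches a gFID of 1.36 on ImageNet benchmarks, accelerates AR model convergence by three times, and enables high-fidelity class-conditional synthesis without classifier-free guidance (CFG).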
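
To make the two ideas more concrete, here is a minimal toy sketch in plain Python. It is not the VFMTok implementation: the summary does not specify the loss or the quantizer, so this sketch assumes a cosine-based alignment loss and substitutes ordinary nearest-neighbor vector quantization for the paper's region-adaptive scheme. All function names, the toy features, and the codebook are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def quantize(feature, codebook):
    """Return the index of the codebook entry closest to `feature`.

    Plain nearest-neighbor vector quantization (squared Euclidean distance).
    This stands in for the paper's region-adaptive quantization, which the
    summary does not describe in enough detail to reproduce.
    """
    def sq_dist(code):
        return sum((f - c) ** 2 for f, c in zip(feature, code))

    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def semantic_reconstruction_loss(reconstructed, target):
    """Assumed form of a semantic alignment loss: mean (1 - cosine similarity).

    `target` plays the role of the frozen foundation model's features and
    `reconstructed` the tokenizer's output features.
    """
    sims = [cosine_similarity(r, t) for r, t in zip(reconstructed, target)]
    return 1.0 - sum(sims) / len(sims)


# Toy stand-ins for frozen-encoder features (3 positions, 2 dimensions each).
frozen_features = [[1.0, 0.0], [0.0, 2.0], [0.9, 0.1]]
# Toy codebook with two entries.
codebook = [[1.0, 0.0], [0.0, 1.0]]

# Encode each feature as a discrete token id, then decode via codebook lookup.
token_ids = [quantize(f, codebook) for f in frozen_features]
decoded = [codebook[i] for i in token_ids]

# The loss is small when decoded features point in the same direction
# as the frozen features they came from.
loss = semantic_reconstruction_loss(decoded, frozen_features)
print(token_ids, loss)
```

In this toy setup the first and third features map to the same token, which hints at why redundant features can be represented with fewer distinct codes; how VFMTok actually exploits that redundancy is described in the paper and repository, not here.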

Business Value

Enables faster and higher-quality image generation for applications like digital art, game development, and synthetic data creation. The improved token efficiency can also lead to reduced computational costs.