Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation

📄 Abstract

In this work, we present a novel direction to build an image tokenizer directly on top of a frozen vision foundation model, which is a largely underexplored area. Specifically, we employ a frozen vision foundation model as the encoder of our tokenizer. To enhance its effectiveness, we introduce two key components: (1) a region-adaptive quantization framework that reduces redundancy in the pre-trained features on regular 2D grids, and (2) a semantic reconstruction objective that aligns the tokenizer's outputs with the foundation model's representations to preserve semantic fidelity. Based on these designs, our proposed image tokenizer, VFMTok, achieves substantial improvements in image reconstruction and generation quality, while also enhancing token efficiency. It further boosts autoregressive (AR) generation, achieving a gFID of 1.36 on ImageNet benchmarks, while accelerating model convergence by three times, and enabling high-fidelity class-conditional synthesis without the need for classifier-free guidance (CFG). The code is available at https://github.com/CVMI-Lab/VFMTok.
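To make the quantization step concrete, here is a minimal sketch of plain nearest-neighbor vector quantization applied to features from a frozen encoder. This is the standard VQ lookup, not the paper's region-adaptive scheme, whose details the abstract does not give. The names `quantize`, `frozen_features`, and `codebook` are hypothetical, and the toy 2-D vectors stand in for real encoder outputs.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    features: List[Vector], codebook: List[Vector]
) -> Tuple[List[int], List[Vector]]:
    """Map each feature vector to the index of its nearest codebook entry.

    Returns the discrete token ids and the corresponding codebook vectors.
    """
    token_ids: List[int] = []
    quantized: List[Vector] = []
    for feat in features:
        # Pick the codebook entry with the smallest squared distance.
        idx = min(
            range(len(codebook)),
            key=lambda k: squared_distance(feat, codebook[k]),
        )
        token_ids.append(idx)
        quantized.append(codebook[idx])
    return token_ids, quantized


# Hypothetical stand-ins: in practice these would come from a frozen
# vision foundation model and a learned codebook.
frozen_features = [[0.9, 0.1], [0.1, 0.8], [1.1, -0.1]]
codebook = [[1.0, 0.0], [0.0, 1.0]]

token_ids, quantized = quantize(frozen_features, codebook)
print(token_ids)  # discrete tokens an AR model would be trained to predict
```

The encoder is never updated in this setup; only the codebook (and, in a full tokenizer, the decoder) would be learned.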

Authors (8)

Anlin Zheng

Xin Wen

Xuanyang Zhang

Chuofan Ma

Tiancai Wang

Gang Yu

+2 more

Submitted

July 11, 2025

arXiv Category

cs.CV

Key Contributions

Proposes building image tokenizers directly on top of a frozen vision foundation model, made effective by region-adaptive quantization and a semantic reconstruction objective. The resulting tokenizer, VFMTok, improves image reconstruction and generation quality, improves token efficiency, speeds up AR model convergence by three times, and enables high-fidelity class-conditional synthesis without classifier-free guidance (CFG).

Business Value

Enables faster and higher-quality image generation for applications like digital art, game development, and synthetic data creation. The improved token efficiency can also lead to reduced computational costs.

Paper Metadata

Innovation Type

Methodology/Architecture

Deployment Feasibility

Moderate. Requires integration with existing vision foundation models and autoregressive generation frameworks.

Limitations Addressed

Addresses the underexplored area of building image tokenizers on frozen vision foundation models. It overcomes issues of redundancy in pre-trained features and semantic fidelity loss by introducing region-adaptive quantization and a semantic reconstruction objective, leading to better token efficiency and generation quality.
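As an illustration of what a semantic reconstruction objective can look like, the sketch below penalizes the cosine distance between features reconstructed by the tokenizer and the frozen foundation model's features. The cosine form is an assumption chosen for illustration, since the summary does not state the exact loss, and the function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_reconstruction_loss(
    reconstructed: List[Vector], target: List[Vector]
) -> float:
    """Mean cosine distance (1 - similarity) over paired feature vectors.

    `target` plays the role of the frozen model's features (assumed form).
    """
    if len(reconstructed) != len(target):
        raise ValueError("feature lists must have the same length")
    sims = [cosine_similarity(r, t) for r, t in zip(reconstructed, target)]
    return 1.0 - sum(sims) / len(sims)


# Toy check: identical features give a loss near 0, orthogonal ones give 1.
same = semantic_reconstruction_loss([[1.0, 2.0]], [[1.0, 2.0]])
orthogonal = semantic_reconstruction_loss([[1.0, 0.0]], [[0.0, 1.0]])
print(same, orthogonal)
```

In a real tokenizer this term would typically be added to the pixel-level reconstruction losses, with gradients flowing only into the tokenizer, not the frozen encoder.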

Performance Gains

Achieves a gFID of 1.36 on ImageNet, accelerates model convergence by three times, and enhances token efficiency.
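For context on the headline number, and assuming gFID here means the standard Fréchet Inception Distance computed on generated samples, the metric compares Gaussian fits of Inception features for real images (mean \mu_r, covariance \Sigma_r) and generated images (\mu_g, \Sigma_g). Lower values are better.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```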

Technical Tags

vision foundation models, image tokenizer, autoregressive generation, region-adaptive quantization, semantic reconstruction, token efficiency, frozen encoder, gFID, class-conditional synthesis, feature alignment

Research Topics

Image Generation, Computer Vision, Foundation Models, Tokenization, Generative Models

Methods & Architectures

Using frozen vision foundation models as encoders, Region-adaptive quantization, Semantic reconstruction objective, Autoregressive (AR) generation

Vision Foundation Models, Autoregressive Models, Image Tokenizers

Applications & Tasks

Image Synthesis, Computer Graphics, Content Creation, Medical Imaging

Inefficient image tokenization, Loss of semantic fidelity during tokenization, Slow convergence in autoregressive image generation, Difficulty in achieving high-fidelity synthesis

Building effective image tokenizers from frozen vision models, Improving image reconstruction and generation quality, Enhancing token efficiency, Accelerating autoregressive model convergence

Datasets & Benchmarks

Datasets

ImageNet

Benchmarks

gFID of 1.36 on ImageNet

gFID, Image reconstruction quality, Generation quality, Token efficiency, Convergence speed

Related Fields

Computer Vision, Deep Learning, Generative Models, Natural Language Processing (for tokenization concepts)

Keywords

image generation, vision foundation models, tokenizer, autoregressive, quantization, semantic reconstruction, token efficiency, frozen models, ImageNet, gFID, computer vision, generative AI, diffusion models

Academic Context

#Image Generation, #Computer Vision, #Foundation Models, #Tokenization, #Generative Models

Commercial Potential

Potential Products

High-fidelity image generation tools, Efficient image tokenization libraries, Foundation model-based image synthesis platforms

Target Industries

Media & Entertainment, Gaming, Advertising, E-commerce, Technology

Use Case Examples

Generating realistic product images for e-commerce. Creating diverse assets for video games. Synthesizing training data for other computer vision tasks.

Competitive Edge

Offers a new paradigm for image tokenization by leveraging frozen vision foundation models, aiming for higher quality and efficiency compared to traditional methods.

Market Opportunity

Rapidly growing market for generative AI and image synthesis.

Revenue Models

API access to the tokenizer/generator; licensing of the technology.

Resource Requirements

Compute Needs

High, for training the tokenizer on top of the frozen foundation model and for training the autoregressive generator.

Data Requirements

Large-scale image datasets (e.g., ImageNet) for training and evaluation.

Deployment Constraints

Requires access to powerful vision foundation models and significant computational resources for generation.

Scalability

Scalability depends on the underlying foundation model and the efficiency of the tokenization and generation process.

Production Readiness

Maturity Level

Research

Time to Market

1-3 years for practical applications.

Patent Potential

Moderate, for the novel tokenizer design and training methodology.
