Abstract
We introduce Representation Tokenizer (RepTok), a generative modeling
framework that represents an image using a single continuous latent token
obtained from self-supervised vision transformers. Building on a pre-trained
SSL encoder, we fine-tune only the semantic token embedding and pair it with a
generative decoder trained jointly using a standard flow matching objective.
This adaptation enriches the token with low-level, reconstruction-relevant
details, enabling faithful image reconstruction. To preserve the favorable
geometry of the original SSL space, we add a cosine-similarity loss that
regularizes the adapted token, ensuring the latent space remains smooth and
suitable for generation. Our single-token formulation resolves spatial
redundancies of 2D latent spaces and significantly reduces training costs.
Despite its simplicity and efficiency, RepTok achieves competitive results on
class-conditional ImageNet generation and naturally extends to text-to-image
synthesis, reaching competitive zero-shot performance on MS-COCO under
extremely limited training budgets. Our findings highlight the potential of
fine-tuned SSL representations as compact and effective latent spaces for
efficient generative modeling.
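The objective described above combines a flow-matching reconstruction loss on the decoder with a cosine-similarity regularizer that keeps the adapted token close to the original SSL embedding. The following is a minimal sketch of how such a combined loss could look; the function names, the rectified-flow parameterization, and the weight `lambda_cos` are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch (not the authors' code): a small embedding head adapts
# the SSL token, a decoder is trained with a flow-matching objective
# conditioned on that token, and a cosine term preserves the SSL geometry.

import torch
import torch.nn.functional as F

def training_loss(z_ssl, x, embed, decoder, lambda_cos=0.1):
    """z_ssl: original SSL token (B, D); x: target images/latents (B, C, H, W).
    embed and decoder are trainable modules; lambda_cos is a hypothetical weight."""
    z = embed(z_ssl)                       # adapted single-token latent (B, D)

    # Flow-matching objective (rectified-flow style): linear noise-to-data path
    t = torch.rand(x.size(0), device=x.device).view(-1, 1, 1, 1)
    noise = torch.randn_like(x)
    x_t = (1 - t) * noise + t * x          # point on the interpolation path
    v_target = x - noise                   # constant velocity along the path
    v_pred = decoder(x_t, t.flatten(), z)  # decoder conditioned on the token

    loss_fm = F.mse_loss(v_pred, v_target)
    # Cosine-similarity regularizer: keep the adapted token aligned with
    # the original SSL token so the latent space stays smooth
    loss_cos = 1 - F.cosine_similarity(z, z_ssl, dim=-1).mean()
    return loss_fm + lambda_cos * loss_cos
```

In this sketch, only `embed` and `decoder` would receive gradients; the SSL encoder producing `z_ssl` stays frozen, matching the paper's description of fine-tuning only the semantic token embedding alongside the generative decoder.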
Authors (7)
Ming Gui
Johannes Schusterbauer
Timy Phan
Felix Krause
Josh Susskind
Miguel Angel Bautista
+1 more
Submitted
October 16, 2025
Key Contributions
RepTok is a generative framework that represents images using a single continuous latent token from self-supervised vision transformers. By fine-tuning the token embedding and using flow matching, it enables faithful reconstruction and efficient generation, preserving the SSL space geometry with a cosine-similarity loss.
Business Value
Enables more efficient and cost-effective generation of high-quality images for various applications, including creative tools and data augmentation.