Abstract
We introduce Representation Tokenizer (RepTok), a generative modeling
framework that represents an image using a single continuous latent token
obtained from self-supervised vision transformers. Building on a pre-trained
SSL encoder, we fine-tune only the semantic token embedding and pair it with a
generative decoder trained jointly using a standard flow matching objective.
This adaptation enriches the token with low-level, reconstruction-relevant
details, enabling faithful image reconstruction. To preserve the favorable
geometry of the original SSL space, we add a cosine-similarity loss that
regularizes the adapted token, ensuring the latent space remains smooth and
suitable for generation. Our single-token formulation resolves spatial
redundancies of 2D latent spaces and significantly reduces training costs.
Despite its simplicity and efficiency, RepTok achieves competitive results on
class-conditional ImageNet generation and naturally extends to text-to-image
synthesis, reaching competitive zero-shot performance on MS-COCO under
extremely limited training budgets. Our findings highlight the potential of
fine-tuned SSL representations as compact and effective latent spaces for
efficient generative modeling.
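The objective described above combines a flow-matching reconstruction loss on the decoder with a cosine-similarity regularizer that keeps the adapted token close to the original SSL embedding. The following is a minimal sketch of how such a combined loss could look; the function names, the rectified-flow parameterization, and the weight `lambda_cos` are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch (not the authors' code): a small embedding head adapts
# the SSL token, a decoder is trained with a flow-matching objective
# conditioned on that token, and a cosine term preserves the SSL geometry.

import torch
import torch.nn.functional as F

def training_loss(z_ssl, x, embed, decoder, lambda_cos=0.1):
    """z_ssl: original SSL token (B, D); x: target images/latents (B, C, H, W).
    embed and decoder are trainable modules; lambda_cos is a hypothetical weight."""
    z = embed(z_ssl)                       # adapted single-token latent (B, D)

    # Flow-matching objective (rectified-flow style): linear noise-to-data path
    t = torch.rand(x.size(0), device=x.device).view(-1, 1, 1, 1)
    noise = torch.randn_like(x)
    x_t = (1 - t) * noise + t * x          # point on the interpolation path
    v_target = x - noise                   # constant velocity along the path
    v_pred = decoder(x_t, t.flatten(), z)  # decoder conditioned on the token

    loss_fm = F.mse_loss(v_pred, v_target)
    # Cosine-similarity regularizer: keep the adapted token aligned with
    # the original SSL token so the latent space stays smooth
    loss_cos = 1 - F.cosine_similarity(z, z_ssl, dim=-1).mean()
    return loss_fm + lambda_cos * loss_cos
```

In this sketch, only `embed` and `decoder` would receive gradients; the SSL encoder producing `z_ssl` stays frozen, matching the paper's description of fine-tuning only the semantic token embedding alongside the generative decoder.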
Authors (7)
Ming Gui
Johannes Schusterbauer
Timy Phan
Felix Krause
Josh Susskind
Miguel Angel Bautista
+1 more
Submitted
October 16, 2025
Key Contributions
RepTok is a generative framework that represents images using a single continuous latent token from self-supervised vision transformers. By fine-tuning the token embedding and using flow matching, it enables faithful reconstruction and efficient generation, preserving the SSL space geometry with a cosine-similarity loss.
Business Value
Enables more efficient and cost-effective generation of high-quality images for various applications, including creative tools and data augmentation.