
Latent Diffusion Model without Variational Autoencoder

📄 Abstract

Recent progress in diffusion-based visual generation has largely relied on latent diffusion models with variational autoencoders (VAEs). While effective for high-fidelity synthesis, this VAE+diffusion paradigm suffers from limited training efficiency, slow inference, and poor transferability to broader vision tasks. These issues stem from a key limitation of VAE latent spaces: the lack of clear semantic separation and strong discriminative structure. Our analysis confirms that these properties are crucial not only for perception and understanding tasks, but also for the stable and efficient training of latent diffusion models. Motivated by this insight, we introduce SVG, a novel latent diffusion model without variational autoencoders, which leverages self-supervised representations for visual generation. SVG constructs a feature space with clear semantic discriminability from frozen DINO features, while a lightweight residual branch captures fine-grained details for high-fidelity reconstruction. Diffusion models are trained directly on this semantically structured latent space to facilitate more efficient learning. As a result, SVG enables accelerated diffusion training, supports few-step sampling, and improves generative quality. Experimental results further show that SVG preserves the semantic and discriminative capabilities of the underlying self-supervised representations, providing a principled pathway toward task-general, high-quality visual representations. Code and interpretations are available at https://howlin-wang.github.io/svg/.
Authors (9)
Minglei Shi
Haolin Wang
Wenzhao Zheng
Ziyang Yuan
Xiaoshi Wu
Xintao Wang
+3 more
Submitted
October 17, 2025
arXiv Category
cs.CV
arXiv PDF

Key Contributions

Introduces SVG, a novel latent diffusion model that eliminates the need for VAEs, addressing limitations in training efficiency, inference speed, and transferability. It leverages self-supervised DINO features to create a semantically discriminative latent space, improving stability and performance for visual generation tasks.

Business Value

Enables faster and more efficient generation of high-quality images, potentially lowering costs for content creation, design, and synthetic data generation. Improved transferability could lead to broader applications in various vision tasks.