
Latent Diffusion Model without Variational Autoencoder

📄 Abstract

Recent progress in diffusion-based visual generation has largely relied on latent diffusion models with variational autoencoders (VAEs). While effective for high-fidelity synthesis, this VAE+diffusion paradigm suffers from limited training efficiency, slow inference, and poor transferability to broader vision tasks. These issues stem from a key limitation of VAE latent spaces: the lack of clear semantic separation and strong discriminative structure. Our analysis confirms that these properties are crucial not only for perception and understanding tasks, but also for the stable and efficient training of latent diffusion models. Motivated by this insight, we introduce SVG, a novel latent diffusion model without variational autoencoders, which leverages self-supervised representations for visual generation. SVG constructs a feature space with clear semantic discriminability from frozen DINO features, while a lightweight residual branch captures fine-grained details for high-fidelity reconstruction. Diffusion models are trained directly on this semantically structured latent space to facilitate more efficient learning. As a result, SVG enables accelerated diffusion training, supports few-step sampling, and improves generative quality. Experimental results further show that SVG preserves the semantic and discriminative capabilities of the underlying self-supervised representations, providing a principled pathway toward task-general, high-quality visual representations. Code and interpretations are available at https://howlin-wang.github.io/svg/.
Authors (9)
Minglei Shi
Haolin Wang
Wenzhao Zheng
Ziyang Yuan
Xiaoshi Wu
Xintao Wang
+3 more
Submitted
October 17, 2025
arXiv Category
cs.CV
arXiv PDF

Key Contributions

Introduces SVG, a novel latent diffusion model that eliminates the need for VAEs, addressing limitations in training efficiency, inference speed, and transferability. It leverages self-supervised DINO features to create a semantically discriminative latent space, improving stability and performance for visual generation tasks.

Business Value

Enables faster and more efficient generation of high-quality images, potentially lowering costs for content creation, design, and synthetic data generation. Improved transferability could lead to broader applications in various vision tasks.