REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers

Abstract

In this paper we tackle a fundamental question: "Can we train latent diffusion models together with the variational auto-encoder (VAE) tokenizer in an end-to-end manner?" Traditional deep-learning wisdom dictates that end-to-end training is often preferable when possible. However, for latent diffusion transformers, training the VAE and the diffusion model end-to-end with the standard diffusion loss is observed to be ineffective, even degrading final performance. We show that while the diffusion loss is ineffective for this purpose, end-to-end training can be unlocked through the representation-alignment (REPA) loss, allowing both the VAE and the diffusion model to be jointly tuned during training. Despite its simplicity, the proposed training recipe (REPA-E) shows remarkable performance, speeding up diffusion model training by over 17x and 45x relative to the REPA and vanilla training recipes, respectively. Interestingly, we observe that end-to-end tuning with REPA-E also improves the VAE itself, leading to a better-structured latent space and stronger downstream generation performance. In terms of final performance, our approach sets a new state of the art, achieving FID scores of 1.12 and 1.69 with and without classifier-free guidance on ImageNet 256×256. Code is available at https://end2end-diffusion.github.io.
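The core mechanism lends itself to a short sketch. Below is a minimal PyTorch-style sketch of one REPA-E training step. The module interfaces (`vae.encode`, a diffusion transformer returning `(prediction, features)`, a `projector` head, a frozen DINOv2-style encoder), the linear noising schedule, the velocity-style target, and the loss weight `lam` are all illustrative assumptions, not the paper's exact implementation (see https://end2end-diffusion.github.io). The key idea from the abstract is the gradient routing: the diffusion loss sees only detached latents, so it never tunes the VAE, while the REPA alignment loss backpropagates through the latents into the VAE.

```python
import torch
import torch.nn.functional as F

def repa_e_step(vae, diffusion_model, projector, frozen_encoder, x, opt, lam=0.5):
    """One joint VAE + diffusion-transformer update (illustrative sketch)."""
    z = vae.encode(x)                                  # latents; grads can reach the VAE
    t = torch.rand(x.size(0), 1, 1, 1, device=x.device)
    noise = torch.randn_like(z)

    # Diffusion loss on DETACHED latents: it trains the diffusion model only,
    # since the paper finds this loss is ineffective (even harmful) for the
    # VAE when backpropagated end-to-end.
    z_noisy_det = (1 - t) * z.detach() + t * noise     # simple linear noising (assumed)
    pred, _ = diffusion_model(z_noisy_det, t)
    diff_loss = F.mse_loss(pred, noise - z.detach())   # velocity-style target (assumed)

    # REPA loss on NON-detached latents: alignment gradients reach both the
    # diffusion transformer and the VAE, which is what unlocks end-to-end tuning.
    z_noisy = (1 - t) * z + t * noise
    _, feats = diffusion_model(z_noisy, t)             # intermediate patch features
    with torch.no_grad():
        target = frozen_encoder(x)                     # frozen pretrained features
    repa_loss = -F.cosine_similarity(projector(feats), target, dim=-1).mean()

    loss = diff_loss + lam * repa_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return diff_loss.item(), repa_loss.item()
```

The two forward passes here exist purely to make the gradient routing explicit; a real implementation can share computation within a single pass.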
Authors (6)
Xingjian Leng
Jaskirat Singh
Yunzhong Hou
Zhenchang Xing
Saining Xie
Liang Zheng
Submitted
April 14, 2025
arXiv Category
cs.CV

Key Contributions

This paper introduces REPA-E, a training recipe that unlocks end-to-end training of latent diffusion models (LDMs) together with their VAE tokenizer. Backpropagating the standard diffusion loss into the VAE is ineffective and can degrade performance; REPA-E instead drives the joint tuning with a representation-alignment (REPA) loss. The recipe speeds up diffusion model training by over 17x relative to REPA and 45x relative to the vanilla recipe, and also improves the VAE's latent-space structure and downstream generation quality.
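For reference, the alignment objective in the REPA line of work that this recipe builds on maximizes patch-wise similarity between projected diffusion-transformer features and features from a frozen pretrained encoder (notation here is illustrative, not copied from the paper):

$$
\mathcal{L}_{\text{REPA}} = -\frac{1}{N}\sum_{n=1}^{N} \operatorname{sim}\big(h_\phi(f_n),\, y_n\big)
$$

where $f_n$ is the diffusion transformer's hidden feature at patch $n$, $y_n$ is the matching feature from the frozen encoder (e.g. DINOv2), $h_\phi$ is a learned projection head, and $\operatorname{sim}$ denotes cosine similarity. In REPA-E it is this loss, rather than the diffusion loss, whose gradients reach the VAE.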

Business Value

Drastically reduces the time and computational resources needed to train powerful generative models like LDMs, making them more accessible and faster to iterate on for various applications.