REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers

Abstract

In this paper we tackle a fundamental question: "Can we train latent diffusion models together with the variational auto-encoder (VAE) tokenizer in an end-to-end manner?" Traditional deep-learning wisdom dictates that end-to-end training is often preferable when possible. However, for latent diffusion transformers, training the VAE and the diffusion model end-to-end with the standard diffusion loss is observed to be ineffective, even degrading final performance. We show that while the diffusion loss is ineffective for this purpose, end-to-end training can be unlocked through the representation-alignment (REPA) loss, allowing both the VAE and the diffusion model to be jointly tuned during training. Despite its simplicity, the proposed training recipe (REPA-E) shows remarkable performance, speeding up diffusion model training by over 17x and 45x relative to the REPA and vanilla training recipes, respectively. Interestingly, we observe that end-to-end tuning with REPA-E also improves the VAE itself, leading to a better-structured latent space and stronger downstream generation performance. In terms of final performance, our approach sets a new state of the art, achieving FID scores of 1.12 and 1.69 with and without classifier-free guidance on ImageNet 256×256. Code is available at https://end2end-diffusion.github.io.
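The core mechanism lends itself to a short sketch. Below is a minimal PyTorch-style sketch of one REPA-E training step. The module interfaces (`vae.encode`, a diffusion transformer returning `(prediction, features)`, a `projector` head, a frozen DINOv2-style encoder), the linear noising schedule, the velocity-style target, and the loss weight `lam` are all illustrative assumptions, not the paper's exact implementation (see https://end2end-diffusion.github.io). The key idea from the abstract is the gradient routing: the diffusion loss sees only detached latents, so it never tunes the VAE, while the REPA alignment loss backpropagates through the latents into the VAE.

```python
import torch
import torch.nn.functional as F

def repa_e_step(vae, diffusion_model, projector, frozen_encoder, x, opt, lam=0.5):
    """One joint VAE + diffusion-transformer update (illustrative sketch)."""
    z = vae.encode(x)                                  # latents; grads can reach the VAE
    t = torch.rand(x.size(0), 1, 1, 1, device=x.device)
    noise = torch.randn_like(z)

    # Diffusion loss on DETACHED latents: it trains the diffusion model only,
    # since the paper finds this loss is ineffective (even harmful) for the
    # VAE when backpropagated end-to-end.
    z_noisy_det = (1 - t) * z.detach() + t * noise     # simple linear noising (assumed)
    pred, _ = diffusion_model(z_noisy_det, t)
    diff_loss = F.mse_loss(pred, noise - z.detach())   # velocity-style target (assumed)

    # REPA loss on NON-detached latents: alignment gradients reach both the
    # diffusion transformer and the VAE, which is what unlocks end-to-end tuning.
    z_noisy = (1 - t) * z + t * noise
    _, feats = diffusion_model(z_noisy, t)             # intermediate patch features
    with torch.no_grad():
        target = frozen_encoder(x)                     # frozen pretrained features
    repa_loss = -F.cosine_similarity(projector(feats), target, dim=-1).mean()

    loss = diff_loss + lam * repa_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return diff_loss.item(), repa_loss.item()
```

The two forward passes here exist purely to make the gradient routing explicit; a real implementation can share computation within a single pass.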
Authors (6)
Xingjian Leng
Jaskirat Singh
Yunzhong Hou
Zhenchang Xing
Saining Xie
Liang Zheng
Submitted
April 14, 2025
arXiv Category
cs.CV

Key Contributions

This paper introduces REPA-E, a training recipe that unlocks end-to-end training of latent diffusion models (LDMs) together with their VAE tokenizer. Backpropagating the standard diffusion loss into the VAE is ineffective and can degrade performance; REPA-E instead drives the joint tuning with a representation-alignment (REPA) loss. The recipe speeds up diffusion model training by over 17x relative to REPA and 45x relative to the vanilla recipe, and also improves the VAE's latent-space structure and downstream generation quality.
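For reference, the alignment objective in the REPA line of work that this recipe builds on maximizes patch-wise similarity between projected diffusion-transformer features and features from a frozen pretrained encoder (notation here is illustrative, not copied from the paper):

$$
\mathcal{L}_{\text{REPA}} = -\frac{1}{N}\sum_{n=1}^{N} \operatorname{sim}\big(h_\phi(f_n),\, y_n\big)
$$

where $f_n$ is the diffusion transformer's hidden feature at patch $n$, $y_n$ is the matching feature from the frozen encoder (e.g. DINOv2), $h_\phi$ is a learned projection head, and $\operatorname{sim}$ denotes cosine similarity. In REPA-E it is this loss, rather than the diffusion loss, whose gradients reach the VAE.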

Business Value

Drastically reduces the time and computational resources needed to train powerful generative models like LDMs, making them more accessible and faster to iterate on for various applications.