📄 Abstract
Preference optimization for diffusion models aims to align them with human
preferences for images. Previous methods typically use Vision-Language Models
(VLMs) as pixel-level reward models to approximate human preferences. However,
when used for step-level preference optimization, these models face challenges
in handling noisy images of different timesteps and require complex
transformations into pixel space. In this work, we show that pre-trained
diffusion models are naturally suited for step-level reward modeling in the
noisy latent space, as they are explicitly designed to process latent images at
various noise levels. Accordingly, we propose the Latent Reward Model (LRM),
which repurposes components of the diffusion model to predict preferences of
latent images at arbitrary timesteps. Building on LRM, we introduce Latent
Preference Optimization (LPO), a step-level preference optimization method
conducted directly in the noisy latent space. Experimental results indicate
that LPO significantly improves the model's alignment with general, aesthetic,
and text-image alignment preferences, while achieving a 2.5-28x training
speedup over existing preference optimization methods. Our code and models are
available at https://github.com/Kwai-Kolors/LPO.
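
The core idea of the LRM is that a reward model operating on noisy latents at arbitrary timesteps can be built from a diffusion backbone itself, rather than from a pixel-space VLM. The sketch below is a minimal, illustrative reconstruction of that idea and not the authors' implementation: `ToyLatentBackbone`, `LatentRewardModel`, the latent shapes, and the Bradley-Terry-style pairwise loss are all assumptions chosen for a self-contained example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyLatentBackbone(nn.Module):
    """Stand-in for a (frozen) diffusion backbone that consumes a noisy
    latent together with a timestep embedding. Shapes are hypothetical."""

    def __init__(self, latent_channels=4, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(latent_channels, hidden, 3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, stride=2, padding=1),
            nn.SiLU(),
        )
        self.t_embed = nn.Embedding(1000, hidden)  # discrete timesteps 0..999

    def forward(self, noisy_latent, t):
        h = self.conv(noisy_latent)      # (B, hidden, H/4, W/4)
        h = h.mean(dim=(2, 3))           # global pooling -> (B, hidden)
        return h + self.t_embed(t)       # inject timestep information


class LatentRewardModel(nn.Module):
    """Scalar preference score for a noisy latent at an arbitrary timestep."""

    def __init__(self, backbone, hidden=128):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.SiLU(), nn.Linear(hidden, 1)
        )

    def forward(self, noisy_latent, t):
        return self.head(self.backbone(noisy_latent, t)).squeeze(-1)  # (B,)


def step_preference_loss(lrm, latent_win, latent_lose, t):
    """Bradley-Terry style pairwise loss: the preferred latent should
    receive a higher score than the dispreferred one at the same timestep."""
    r_w = lrm(latent_win, t)
    r_l = lrm(latent_lose, t)
    return -F.logsigmoid(r_w - r_l).mean()


if __name__ == "__main__":
    lrm = LatentRewardModel(ToyLatentBackbone())
    # Two noisy latents for the same prompt and timestep, one preferred.
    lat_w = torch.randn(2, 4, 64, 64)
    lat_l = torch.randn(2, 4, 64, 64)
    t = torch.randint(0, 1000, (2,))
    loss = step_preference_loss(lrm, lat_w, lat_l, t)
    loss.backward()
    print(f"step-level preference loss: {loss.item():.4f}")
```

Because the scorer takes the timestep as an explicit input, the same model can rank latents anywhere along the denoising trajectory, which is what removes the need to decode into pixel space before computing a reward.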
Key Contributions
Proposes a Latent Reward Model (LRM) and Latent Preference Optimization (LPO) method that repurposes diffusion models for step-level preference optimization directly in the noisy latent space. This sidesteps the difficulties pixel-space VLM reward models face with noisy intermediate images and aligns models more effectively with human preferences.
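
The abstract does not spell out the exact LPO objective, but step-level preference optimization is commonly formulated as a DPO-style loss applied per denoising step, with the winning and losing latents chosen by the reward model (here, the LRM). The snippet below is a hedged sketch of one such objective; the log-probability tensors, the reference-policy terms, and `beta` are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F


def step_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO-style objective at a single denoising step: increase the policy's
    log-probability of the LRM-preferred latent relative to a frozen
    reference policy, and decrease it for the dispreferred latent."""
    ratio_w = logp_w - ref_logp_w
    ratio_l = logp_l - ref_logp_l
    return -F.logsigmoid(beta * (ratio_w - ratio_l)).mean()


# Toy usage with random per-step log-probabilities for a batch of 4 pairs.
logp_w, logp_l = torch.randn(4), torch.randn(4)
ref_logp_w, ref_logp_l = torch.randn(4), torch.randn(4)
print(step_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l))
```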
Business Value
Enables the creation of diffusion models that generate images more closely aligned with user expectations and aesthetic preferences, leading to higher quality and more desirable outputs for creative applications.