📄 Abstract
Preference optimization for diffusion models aims to align them with human
preferences for images. Previous methods typically use Vision-Language Models
(VLMs) as pixel-level reward models to approximate human preferences. However,
when used for step-level preference optimization, these models face challenges
in handling noisy images of different timesteps and require complex
transformations into pixel space. In this work, we show that pre-trained
diffusion models are naturally suited for step-level reward modeling in the
noisy latent space, as they are explicitly designed to process latent images at
various noise levels. Accordingly, we propose the Latent Reward Model (LRM),
which repurposes components of the diffusion model to predict preferences of
latent images at arbitrary timesteps. Building on LRM, we introduce Latent
Preference Optimization (LPO), a step-level preference optimization method
conducted directly in the noisy latent space. Experimental results indicate
that LPO significantly improves the model's alignment with general, aesthetic,
and text-image alignment preferences, while achieving a 2.5-28x training
speedup over existing preference optimization methods. Our code and models are
available at https://github.com/Kwai-Kolors/LPO.
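
The core idea of the LRM is that a reward model operating on noisy latents at arbitrary timesteps can be built from a diffusion backbone itself, rather than from a pixel-space VLM. The sketch below is a minimal, illustrative reconstruction of that idea and not the authors' implementation: `ToyLatentBackbone`, `LatentRewardModel`, the latent shapes, and the Bradley-Terry-style pairwise loss are all assumptions chosen for a self-contained example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyLatentBackbone(nn.Module):
    """Stand-in for a (frozen) diffusion backbone that consumes a noisy
    latent together with a timestep embedding. Shapes are hypothetical."""

    def __init__(self, latent_channels=4, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(latent_channels, hidden, 3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, stride=2, padding=1),
            nn.SiLU(),
        )
        self.t_embed = nn.Embedding(1000, hidden)  # discrete timesteps 0..999

    def forward(self, noisy_latent, t):
        h = self.conv(noisy_latent)      # (B, hidden, H/4, W/4)
        h = h.mean(dim=(2, 3))           # global pooling -> (B, hidden)
        return h + self.t_embed(t)       # inject timestep information


class LatentRewardModel(nn.Module):
    """Scalar preference score for a noisy latent at an arbitrary timestep."""

    def __init__(self, backbone, hidden=128):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.SiLU(), nn.Linear(hidden, 1)
        )

    def forward(self, noisy_latent, t):
        return self.head(self.backbone(noisy_latent, t)).squeeze(-1)  # (B,)


def step_preference_loss(lrm, latent_win, latent_lose, t):
    """Bradley-Terry style pairwise loss: the preferred latent should
    receive a higher score than the dispreferred one at the same timestep."""
    r_w = lrm(latent_win, t)
    r_l = lrm(latent_lose, t)
    return -F.logsigmoid(r_w - r_l).mean()


if __name__ == "__main__":
    lrm = LatentRewardModel(ToyLatentBackbone())
    # Two noisy latents for the same prompt and timestep, one preferred.
    lat_w = torch.randn(2, 4, 64, 64)
    lat_l = torch.randn(2, 4, 64, 64)
    t = torch.randint(0, 1000, (2,))
    loss = step_preference_loss(lrm, lat_w, lat_l, t)
    loss.backward()
    print(f"step-level preference loss: {loss.item():.4f}")
```

Because the scorer takes the timestep as an explicit input, the same model can rank latents anywhere along the denoising trajectory, which is what removes the need to decode into pixel space before computing a reward.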
Key Contributions
Proposes a Latent Reward Model (LRM) and Latent Preference Optimization (LPO) method that repurposes diffusion models for step-level preference optimization directly in the noisy latent space. This sidesteps the difficulties pixel-space VLM reward models face with noisy intermediate images and aligns models more effectively with human preferences.
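
The abstract does not spell out the exact LPO objective, but step-level preference optimization is commonly formulated as a DPO-style loss applied per denoising step, with the winning and losing latents chosen by the reward model (here, the LRM). The snippet below is a hedged sketch of one such objective; the log-probability tensors, the reference-policy terms, and `beta` are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F


def step_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO-style objective at a single denoising step: increase the policy's
    log-probability of the LRM-preferred latent relative to a frozen
    reference policy, and decrease it for the dispreferred latent."""
    ratio_w = logp_w - ref_logp_w
    ratio_l = logp_l - ref_logp_l
    return -F.logsigmoid(beta * (ratio_w - ratio_l)).mean()


# Toy usage with random per-step log-probabilities for a batch of 4 pairs.
logp_w, logp_l = torch.randn(4), torch.randn(4)
ref_logp_w, ref_logp_l = torch.randn(4), torch.randn(4)
print(step_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l))
```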
Business Value
Enables the creation of diffusion models that generate images more closely aligned with user expectations and aesthetic preferences, leading to higher quality and more desirable outputs for creative applications.