MIRO: MultI-Reward cOnditioned pretraining improves T2I quality and efficiency

📄 Abstract

Current text-to-image generative models are trained on large uncurated datasets to enable diverse generation capabilities. However, this does not align well with user preferences. Recently, reward models have been specifically designed to perform post-hoc selection of generated images and align them to a reward, typically user preference. This discarding of informative data together with the optimizing for a single reward tend to harm diversity, semantic fidelity and efficiency. Instead of this post-processing, we propose to condition the model on multiple reward models during training to let the model learn user preferences directly. We show that this not only dramatically improves the visual quality of the generated images but it also significantly speeds up the training. Our proposed method, called MIRO, achieves state-of-the-art performances on the GenEval compositional benchmark and user-preference scores (PickAScore, ImageReward, HPSv2).

Authors (5)

Nicolas Dufour

Lucas Degeorge

Arijit Ghosh

Vicky Kalogeiton

David Picard

Submitted

October 29, 2025

arXiv Category

cs.CV

Key Contributions

Proposes MIRO, a method that conditions text-to-image models on multiple reward models during training, rather than using post-hoc selection. This approach significantly improves visual quality and training efficiency, aligning generated images better with user preferences and achieving state-of-the-art performance on compositional benchmarks.
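The idea of conditioning on several reward models at training time can be illustrated with a small data-preparation sketch. This is not the authors' implementation: the summary does not say how reward scores are encoded, so the rank-based binning, the `num_bins` parameter, the helper names, and the two toy reward functions below are all assumptions made for illustration. The point it shows is that every training sample keeps one conditioning value per reward model, which a generator could then receive alongside the caption.

```python
from bisect import bisect_right


def quantile_bins(scores, num_bins):
    """Map each raw score to an integer bin in [0, num_bins - 1] by rank.

    Assumption: rank-based binning is one simple way to put rewards with
    different scales on a common footing; the paper may do this differently.
    """
    ordered = sorted(scores)
    n = len(ordered)
    bins = []
    for s in scores:
        rank = bisect_right(ordered, s) - 1  # position of s in sorted order
        bins.append(min(num_bins - 1, rank * num_bins // n))
    return bins


def build_reward_conditioning(samples, reward_fns, num_bins=4):
    """Attach one binned score per reward model to every training sample."""
    per_reward_bins = {
        name: quantile_bins([fn(s) for s in samples], num_bins)
        for name, fn in reward_fns.items()
    }
    names = sorted(reward_fns)  # fixed order so the vector layout is stable
    return [
        {
            "caption": s["caption"],
            "reward_bins": [per_reward_bins[name][i] for name in names],
        }
        for i, s in enumerate(samples)
    ]


# Toy data: the numeric fields stand in for real reward-model outputs.
samples = [
    {"caption": "a red cube", "sharpness": 0.2, "text_match": 0.9},
    {"caption": "a blue sphere", "sharpness": 0.8, "text_match": 0.4},
    {"caption": "a green cone", "sharpness": 0.5, "text_match": 0.7},
    {"caption": "a yellow car", "sharpness": 0.9, "text_match": 0.1},
]

# Hypothetical reward functions; real ones would score the image itself.
reward_fns = {
    "aesthetic": lambda s: s["sharpness"],
    "alignment": lambda s: s["text_match"],
}

conditioned = build_reward_conditioning(samples, reward_fns)
for row in conditioned:
    print(row["caption"], row["reward_bins"])
```

In this sketch no sample is dropped: a low-scoring image still contributes to training, only with low conditioning values. A model trained this way could presumably be asked for high reward values at sampling time, though that usage is an assumption rather than something stated in this summary.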

Business Value

Enables the creation of more aesthetically pleasing and contextually relevant images, accelerating creative workflows and improving user engagement in applications relying on image generation.

Paper Metadata

Innovation Type

Training Methodology

Deployment Feasibility

High, as it's a training methodology that can be applied to existing diffusion model architectures.

Limitations Addressed

Misalignment between large uncurated datasets and user preferences; harm to diversity, semantic fidelity, and efficiency caused by post-hoc selection based on a single reward model.
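The cost of selecting on a single reward can be made concrete with a toy comparison. The sketch below is an illustration under assumptions, not a reproduction of any pipeline from the paper: the sample fields, the 0.7 and 0.8 thresholds, and the function names are hypothetical. It contrasts thresholding on one reward, which discards samples, with annotating every sample with all rewards, which discards none.

```python
def filter_by_single_reward(samples, reward_fn, threshold):
    """Selection on one reward: keep only samples scoring at or above threshold."""
    return [s for s in samples if reward_fn(s) >= threshold]


def annotate_with_rewards(samples, reward_fns):
    """Multi-reward annotation: keep every sample and record all of its scores."""
    return [
        {**s, "rewards": {name: fn(s) for name, fn in reward_fns.items()}}
        for s in samples
    ]


# Toy scores standing in for real reward-model outputs (hypothetical values).
samples = [
    {"id": 0, "aesthetic": 0.9, "text_match": 0.3},
    {"id": 1, "aesthetic": 0.4, "text_match": 0.95},
    {"id": 2, "aesthetic": 0.8, "text_match": 0.6},
    {"id": 3, "aesthetic": 0.2, "text_match": 0.85},
    {"id": 4, "aesthetic": 0.5, "text_match": 0.5},
]

reward_fns = {
    "aesthetic": lambda s: s["aesthetic"],
    "text_match": lambda s: s["text_match"],
}

# Selecting on aesthetics alone throws away three of the five samples.
kept = filter_by_single_reward(samples, reward_fns["aesthetic"], threshold=0.7)
kept_ids = {s["id"] for s in kept}

# Some of the discarded samples are strong on the reward that was ignored.
discarded_but_well_aligned = [
    s["id"]
    for s in samples
    if s["id"] not in kept_ids and s["text_match"] >= 0.8
]

# Annotation keeps everything, so that information stays available.
annotated = annotate_with_rewards(samples, reward_fns)

print("kept by filter:", sorted(kept_ids))
print("discarded despite good text match:", discarded_but_well_aligned)
print("kept by annotation:", len(annotated))
```

Here the filter removes samples 1 and 3 even though they match their text well, which is the kind of lost signal the limitation describes.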

Performance Gains

Dramatically improves visual quality • Significantly speeds up training • Achieves state-of-the-art performance on GenEval, PickAScore, ImageReward, and HPSv2

Technical Tags

Text-to-Image Generation • Generative Models • Reward Models • User Preferences • Multi-Reward Conditioning • Training Efficiency • Compositional Generation • Image Quality

Research Topics

Generative AI • Computer Vision • Deep Learning • Image Generation • Human-AI Interaction

Methods & Architectures

Multi-Reward Conditioned Pretraining (MIRO) • Conditioning on multiple reward models during training • Diffusion Models • Generative Models

Applications & Tasks

Image Generation • Creative Arts • Content Creation • Design • Aligning Text-to-Image Models with User Preferences • Improving Image Quality and Training Efficiency • Text-to-Image Synthesis • Generating High-Fidelity Images • Improving Model Training

Datasets & Benchmarks

Benchmarks

GenEval compositional benchmark • PickAScore • ImageReward • HPSv2

Visual quality • Training speed • Compositional benchmark performance • User-preference scores

Related Fields

Generative AI • Computer Vision • Deep Learning • Reinforcement Learning (from Human Feedback) • Human-Computer Interaction

Keywords

Text-to-Image • Diffusion Models • Generative AI • Reward Models • User Preferences • MIRO • Training Efficiency • Image Quality • Compositional Generation • Alignment • Deep Learning

Academic Context

#Generative AI • #Computer Vision • #Deep Learning • #Image Generation • #Human-AI Interaction

Commercial Potential

Potential Products

Advanced text-to-image generation services • Creative tools for artists and designers • Personalized content generation platforms

Target Industries

Media & Entertainment • Advertising • Gaming • Design • E-commerce

Use Case Examples

Generating marketing visuals from text descriptions • Creating concept art for games and films • Personalizing image content for users

Competitive Edge

Improves upon existing text-to-image models by integrating user preference alignment directly into the training process, leading to better quality and efficiency.

Market Opportunity

Rapidly growing market for generative AI and creative tools.

Revenue Models

API access • SaaS platforms • Licensing of models

Resource Requirements

Compute Needs

Very High, for training large diffusion models.

Data Requirements

Large datasets of image-text pairs, and datasets for training reward models.

Deployment Constraints

Requires significant computational resources for inference.

Scalability

Scalable with distributed training infrastructure.

Production Readiness

Maturity Level

Research

Time to Market

1-2 years for improved commercial models.

Patent Potential

Moderate
