Abstract
We propose Flow-GRPO, the first method to integrate online policy gradient
reinforcement learning (RL) into flow matching models. Our approach uses two
key strategies: (1) an ODE-to-SDE conversion that transforms a deterministic
Ordinary Differential Equation (ODE) into an equivalent Stochastic Differential
Equation (SDE) that matches the original model's marginal distribution at all
timesteps, enabling statistical sampling for RL exploration; and (2) a
Denoising Reduction strategy that reduces training denoising steps while
retaining the original number of inference steps, significantly improving
sampling efficiency without sacrificing performance. Empirically, Flow-GRPO is
effective across multiple text-to-image tasks. For compositional generation,
RL-tuned SD3.5-M generates nearly perfect object counts, spatial relations, and
fine-grained attributes, increasing GenEval accuracy from $63\%$ to $95\%$. In
visual text rendering, accuracy improves from $59\%$ to $92\%$, greatly
enhancing text generation. Flow-GRPO also achieves substantial gains in human
preference alignment. Notably, we observe very little reward hacking: rewards
do not increase at the cost of appreciable degradation in image quality or
diversity.
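
To make the ODE-to-SDE conversion concrete, the sketch below contrasts a deterministic Euler step of the flow-matching ODE with a stochastic Euler–Maruyama step that, under the rectified-flow interpolation $x_t = (1-t)x_0 + t\epsilon$, shares the same marginals by adding a score-correction term to the drift. This is a minimal illustration under those assumptions, not the paper's exact sampler: the `velocity_model` interface, the noise level `sigma`, and the step sizes are placeholders.

```python
import torch

@torch.no_grad()
def ode_step(velocity_model, x, t, dt):
    """Deterministic Euler step of the flow-matching ODE dx = v(x, t) dt,
    integrating from the noise side (t near 1) toward the data side (t = 0)."""
    v = velocity_model(x, t)
    return x - dt * v

@torch.no_grad()
def sde_step(velocity_model, x, t, dt, sigma):
    """Stochastic Euler-Maruyama step whose marginals match the ODE above,
    assuming x_t = (1 - t) * x_0 + t * eps, in which case the score can be
    expressed through the velocity as score = -(x + (1 - t) * v) / t.
    sigma controls exploration noise; sigma = 0 recovers the ODE step.
    Note the 1/t factor: in practice the noise level is kept small (or
    annealed) near t = 0 to avoid the singular score term."""
    v = velocity_model(x, t)
    score = -(x + (1.0 - t) * v) / t
    drift = v - 0.5 * sigma ** 2 * score          # score-corrected drift
    noise = sigma * (dt ** 0.5) * torch.randn_like(x)
    return x - dt * drift + noise

# Usage sketch of Denoising Reduction: run RL rollouts with a reduced number
# of SDE steps (e.g. 10 calls to sde_step), while keeping the full step count
# (e.g. 40 calls to ode_step) for inference. Step counts here are illustrative.
```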
Authors (9)
Jie Liu
Gongye Liu
Jiajun Liang
Yangguang Li
Jiaheng Liu
Xintao Wang
+3 more
Key Contributions
Flow-GRPO is the first method to integrate online policy gradient RL into flow matching models. It uses an ODE-to-SDE conversion for RL exploration and a denoising reduction strategy for sampling efficiency, significantly improving generation quality and control, especially for compositional tasks.
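
As a rough illustration of the GRPO component, the sketch below computes a group-relative, PPO-style clipped objective over the per-step log-probabilities of an SDE denoising trajectory: several images are sampled per prompt, rewards are normalized within the group to form advantages, and a clipped ratio loss is applied at each denoising step. The clip range, the optional KL penalty, and the tensor shapes are illustrative assumptions, not the paper's exact formulation or hyperparameters.

```python
import torch

def flow_grpo_loss(logp_new, logp_old, rewards,
                   clip_eps=0.2, kl_coef=0.0, logp_ref=None):
    """Group-relative clipped policy-gradient loss (sketch).

    logp_new, logp_old: (G, T) per-step log-probs of each sampled denoising
        trajectory under the current and the sampling (old) policy;
        G = group size per prompt, T = number of training denoising steps.
    rewards: (G,) scalar reward per generated image for one prompt.
    """
    # Group-relative advantage: normalize rewards within the prompt's group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # (G,)
    adv = adv[:, None]                                          # broadcast over steps

    # PPO-style clipped surrogate on the per-step likelihood ratio.
    ratio = torch.exp(logp_new - logp_old)                      # (G, T)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    loss = -torch.minimum(unclipped, clipped).mean()

    # Optional KL penalty toward a frozen reference policy
    # (rough estimator; assumed form, not necessarily the paper's).
    if kl_coef > 0.0 and logp_ref is not None:
        loss = loss + kl_coef * (logp_new - logp_ref).mean()

    return loss
```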
Business Value
Enables the creation of more controllable and higher-fidelity generative models for applications like graphic design, advertising, and personalized content creation.