Abstract: Text-to-image models can produce high-quality images from text prompts, but crafting effective prompts often requires specialized vocabulary. To address this, existing methods train rewriting models with supervision from large amounts of manually annotated data and from trained aesthetic assessment models. To alleviate the dependence on data scale and the biases introduced by such trained models, we propose a novel prompt optimization framework that rephrases a simple user prompt into a sophisticated prompt for a text-to-image model. Specifically, we employ large vision-language models (LVLMs) as the solver to rewrite the user prompt, and concurrently use an LVLM as a reward model to score the aesthetics and alignment of the images generated from the optimized prompt. Instead of relying on laborious human feedback, we exploit the prior knowledge of the LVLM to provide rewards, i.e., AI feedback. Furthermore, the solver and the reward model are unified into a single model and iterated with reinforcement learning, achieving self-improvement by proposing a solution and judging it itself. Results on two popular datasets demonstrate that our method outperforms other strong competitors.
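
As an illustration of the self-improvement loop described in the abstract, here is a minimal sketch in Python. The function names (`lvlm_generate`, `t2i_generate`, `lvlm_score`), the rewrite instruction, and the iteration count are assumptions for illustration only, not the paper's actual implementation or API.

```python
# Sketch of the unified solver/reward loop: one LVLM both rewrites the prompt
# (solver role) and scores the generated image (reward role). The callables
# passed in are hypothetical stand-ins for the real model backends.

from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Trajectory:
    user_prompt: str
    rewritten_prompt: str
    reward: float


def collect_trajectories(
    user_prompt: str,
    lvlm_generate: Callable[[str], str],        # LVLM as solver: instruction -> rewritten prompt
    t2i_generate: Callable[[str], Any],         # text-to-image model: prompt -> image
    lvlm_score: Callable[[Any, str], float],    # LVLM as reward: (image, user prompt) -> score
    num_samples: int = 4,
) -> List[Trajectory]:
    """Propose rewrites and judge them with AI feedback from the same LVLM."""
    trajectories: List[Trajectory] = []
    for _ in range(num_samples):
        # 1. Solver role: rewrite the simple user prompt into a detailed prompt.
        rewritten = lvlm_generate(
            "Rewrite this prompt for a text-to-image model, "
            f"adding style and visual detail: {user_prompt}"
        )
        # 2. Render an image from the rewritten prompt.
        image = t2i_generate(rewritten)
        # 3. Reward role: score aesthetics and alignment with the original intent.
        reward = lvlm_score(image, user_prompt)
        trajectories.append(Trajectory(user_prompt, rewritten, reward))
    # The collected (prompt, rewrite, reward) tuples would then drive a
    # reinforcement-learning update (e.g., a policy-gradient step) on the
    # shared LVLM, so that solving and judging improve together.
    return trajectories
```

In this sketch the reward signal comes entirely from the LVLM's own judgment (AI feedback) rather than from human annotation, which mirrors the design choice the abstract describes.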