📄 Abstract
User prompts for generative AI models are often underspecified, leading to a
misalignment between the user intent and models' understanding. As a result,
users commonly have to painstakingly refine their prompts. We study this
alignment problem in text-to-image (T2I) generation and propose a prototype for
proactive T2I agents equipped with an interface to (1) actively ask
clarification questions when uncertain, and (2) present their uncertainty about
user intent as an understandable and editable belief graph. We build simple
prototypes for such agents and propose a new scalable and automated evaluation
approach using two agents, one with a ground truth intent (an image) while the
other tries to ask as few questions as possible to align with the ground truth.
We experiment over three image-text datasets: ImageInWords (Garg et al., 2024),
COCO (Lin et al., 2014) and DesignBench, a benchmark we curated with strong
artistic and design elements. Experiments over the three datasets demonstrate
the proposed T2I agents' ability to ask informative questions and elicit
crucial information to achieve successful alignment with at least 2 times
higher VQAScore (Lin et al., 2024) than the standard T2I generation. Moreover,
we conducted human studies and observed that at least 90% of human subjects
found these agents and their belief graphs helpful for their T2I workflow,
highlighting the effectiveness of our approach. Code and DesignBench can be
found at https://github.com/google-deepmind/proactive_t2i_agents.
Authors (7)
Meera Hahn
Wenjun Zeng
Nithish Kannen
Rich Galt
Kartikeya Badola
Been Kim
+1 more
Submitted
December 9, 2024
International Conference on Machine Learning, 2025
Key Contributions
Proposes proactive text-to-image (T2I) agents that ask clarification questions when a user prompt is underspecified and that expose their uncertainty about user intent as an editable belief graph. Also introduces a scalable, automated evaluation approach for such interactive systems, in which one agent holds a ground-truth intent (an image) and the other tries to align with it using as few questions as possible.
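The two-agent evaluation idea described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper or its code release: the belief graph is flattened to a per-attribute probability table, the questioning policy picks the highest-entropy attribute, and the names `SimulatedUser`, `ProactiveAgent`, `run_dialogue`, and `match_fraction` are hypothetical. No image is generated and VQAScore is not computed; a simple attribute match stands in for the alignment metric.

```python
import math


def entropy(dist):
    """Shannon entropy (in bits) of a {value: probability} table."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


class SimulatedUser:
    """Hypothetical stand-in for the agent that holds the ground-truth intent."""

    def __init__(self, intent):
        self.intent = intent  # attribute -> true value

    def answer(self, attribute):
        return self.intent[attribute]


class ProactiveAgent:
    """Toy agent; its 'belief graph' is simplified to attribute -> distribution."""

    def __init__(self, belief):
        # Copy so the caller's belief table is not mutated.
        self.belief = {attr: dict(dist) for attr, dist in belief.items()}

    def most_uncertain(self):
        """Return the highest-entropy attribute, or None if nothing is uncertain."""
        attr = max(self.belief, key=lambda a: entropy(self.belief[a]))
        return attr if entropy(self.belief[attr]) > 0 else None

    def update(self, attribute, value):
        """Collapse the belief for one attribute after the user answers."""
        self.belief[attribute] = {value: 1.0}

    def best_guess(self):
        """Most probable value per attribute (ties go to the first listed value)."""
        return {a: max(d, key=d.get) for a, d in self.belief.items()}


def run_dialogue(agent, user, max_questions):
    """Ask about the most uncertain attribute until resolved or out of budget."""
    asked = []
    for _ in range(max_questions):
        attr = agent.most_uncertain()
        if attr is None:
            break
        agent.update(attr, user.answer(attr))
        asked.append(attr)
    return asked, agent.best_guess()


def match_fraction(guess, intent):
    """Fraction of attributes guessed correctly (a stand-in alignment score)."""
    return sum(guess[a] == intent[a] for a in intent) / len(intent)


# Assumed example: the prompt fixes the subject but leaves breed and setting open.
belief = {
    "subject": {"dog": 1.0},
    "breed": {"corgi": 0.4, "poodle": 0.3, "beagle": 0.3},
    "setting": {"park": 0.5, "beach": 0.5},
}
intent = {"subject": "dog", "breed": "beagle", "setting": "beach"}

if __name__ == "__main__":
    asked, guess = run_dialogue(ProactiveAgent(belief), SimulatedUser(intent), 5)
    print(asked, guess, match_fraction(guess, intent))
```

In this sketch the agent stops asking once every attribute has zero entropy, so the number of questions asked can be compared against the alignment reached, which mirrors the "as few questions as possible" objective in spirit only.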
Business Value
Could improve the experience and efficiency of creative professionals and casual users who generate images from text by reducing the manual prompt refinement needed to reach the intended image.