Proactive Agents for Multi-Turn Text-to-Image Generation Under Uncertainty

Abstract

User prompts for generative AI models are often underspecified, leading to a misalignment between the user intent and models' understanding. As a result, users commonly have to painstakingly refine their prompts. We study this alignment problem in text-to-image (T2I) generation and propose a prototype for proactive T2I agents equipped with an interface to (1) actively ask clarification questions when uncertain, and (2) present their uncertainty about user intent as an understandable and editable belief graph. We build simple prototypes for such agents and propose a new scalable and automated evaluation approach using two agents, one with a ground truth intent (an image) while the other tries to ask as few questions as possible to align with the ground truth. We experiment over three image-text datasets: ImageInWords (Garg et al., 2024), COCO (Lin et al., 2014) and DesignBench, a benchmark we curated with strong artistic and design elements. Experiments over the three datasets demonstrate the proposed T2I agents' ability to ask informative questions and elicit crucial information to achieve successful alignment with at least 2 times higher VQAScore (Lin et al., 2024) than the standard T2I generation. Moreover, we conducted human studies and observed that at least 90% of human subjects found these agents and their belief graphs helpful for their T2I workflow, highlighting the effectiveness of our approach. Code and DesignBench can be found at https://github.com/google-deepmind/proactive_t2i_agents.
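The two-agent evaluation described in the abstract can be pictured with a small self-play loop. The sketch below is an illustration only, not the authors' implementation: the class names, `run_self_play`, and the toy `alignment` function are all hypothetical, and a dictionary of attributes stands in for the ground-truth image. In the paper, alignment is measured with VQAScore on generated images; here it is replaced by a simple fraction of recovered attributes.

```python
from dataclasses import dataclass, field


@dataclass
class SimulatedUser:
    """Hypothetical user agent holding the ground-truth intent.

    A dict of attributes stands in for the ground-truth image.
    """
    intent: dict

    def answer(self, attribute: str) -> str:
        return self.intent.get(attribute, "no preference")


@dataclass
class ProactiveAgent:
    """Hypothetical T2I agent that asks about attributes it has not resolved."""
    prompt: str
    open_attributes: list
    known: dict = field(default_factory=dict)

    def next_question(self):
        # Assumption: ask in list order; a real agent would rank by uncertainty.
        return self.open_attributes[0] if self.open_attributes else None

    def update(self, attribute: str, value: str) -> None:
        self.known[attribute] = value
        self.open_attributes.remove(attribute)


def alignment(known: dict, intent: dict) -> float:
    """Toy stand-in for an alignment metric: fraction of intent attributes recovered."""
    if not intent:
        return 1.0
    hits = sum(1 for key, value in intent.items() if known.get(key) == value)
    return hits / len(intent)


def run_self_play(agent, user, max_turns=5, target=1.0):
    """Ask questions until the target alignment or the turn budget is reached."""
    turns = 0
    while turns < max_turns and alignment(agent.known, user.intent) < target:
        attribute = agent.next_question()
        if attribute is None:
            break
        agent.update(attribute, user.answer(attribute))
        turns += 1
    return turns, alignment(agent.known, user.intent)
```

Calling `run_self_play(ProactiveAgent("a bird", ["style", "color", "mood"]), SimulatedUser({"style": "watercolor", "color": "red"}))` returns the number of questions asked and the final toy alignment, which mirrors the two quantities the evaluation tracks: question count and alignment with the ground truth.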

Authors (7)

Meera Hahn

Wenjun Zeng

Nithish Kannen

Rich Galt

Kartikeya Badola

Been Kim

+1 more

Submitted

December 9, 2024

arXiv Category

cs.AI

International Conference on Machine Learning, 2025

Key Contributions

Proposes proactive text-to-image (T2I) agents that ask clarification questions when user prompts are underspecified and represent their uncertainty about user intent as editable belief graphs. Also introduces a scalable, automated evaluation approach for such interactive systems.
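One way to picture an editable belief graph is as a set of entities whose attributes each carry a probability distribution over candidate values. The sketch below is an assumption made for illustration: this summary does not specify the paper's data structure, so `AttributeBelief`, `BeliefGraph`, and the entropy-based choice of what to ask about are hypothetical, and relations between entities are omitted for brevity.

```python
import math
from dataclasses import dataclass, field


@dataclass
class AttributeBelief:
    """Candidate values for one attribute with the agent's probability for each."""
    candidates: dict  # value -> probability, assumed to sum to 1

    def entropy(self) -> float:
        # Shannon entropy in bits; higher means the agent is less certain.
        return -sum(p * math.log2(p) for p in self.candidates.values() if p > 0)

    def set_value(self, value: str) -> None:
        # A user edit collapses the belief onto a single value.
        self.candidates = {value: 1.0}


@dataclass
class BeliefGraph:
    # entity name -> attribute name -> AttributeBelief
    entities: dict = field(default_factory=dict)

    def most_uncertain(self):
        """Return the (entity, attribute) pair with the highest entropy, or None."""
        best, best_entropy = None, 0.0
        for entity, attributes in self.entities.items():
            for name, belief in attributes.items():
                h = belief.entropy()
                if h > best_entropy:
                    best, best_entropy = (entity, name), h
        return best
```

Under these assumptions, the agent would ask about the pair returned by `most_uncertain()`, and either the user's answer or a direct edit to the graph would call `set_value` on that attribute, removing it from future questions.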

Business Value

Enhances user experience and efficiency for creative professionals and casual users generating images from text, leading to faster content creation and better results.

Paper Metadata

Innovation Type

System Design/Methodological

Deployment Feasibility

Moderate. Requires integrating interactive components and uncertainty representation into T2I pipelines, which adds complexity but is feasible.

Limitations Addressed

Addresses the common problem of prompt underspecification in T2I generation, which leads to misalignment between user intent and model output, requiring users to iteratively refine prompts.

Performance Gains

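At least 2 times higher VQAScore than standard T2I generation across the three datasets, with fewer prompt refinement iterations. In human studies, at least 90% of subjects found the agents and their belief graphs helpful for their T2I workflow.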

Technical Tags

text-to-image generation, proactive agents, uncertainty modeling, clarification questions, belief graphs, automated evaluation, prompt underspecification, user intent alignment

Research Topics

Multimodal AI, Human-AI Interaction, Generative Models, Natural Language Understanding, Evaluation Methodologies

Methods & Architectures

Proactive agent design, Belief graph representation, Automated evaluation framework, Interactive clarification, Text-to-Image Models

Applications & Tasks

Creative Design, Content Creation, User Interface Design, Image Generation, User Intent Alignment, Prompt Engineering, Interactive Generation, Evaluation, Improving text-to-image generation by resolving prompt underspecification, Developing agents that ask clarifying questions, Evaluating interactive T2I systems

Datasets & Benchmarks

Datasets

ImageInWords, COCO, DesignBench

Number of clarification questions, Alignment score (VQAScore)

Related Fields

Generative AI, Human-Computer Interaction, Natural Language Processing, Computer Vision

Keywords

text-to-image, generative AI, proactive agents, uncertainty, clarification, belief graph, prompt engineering, user intent, evaluation, multimodal, interactive AI

Academic Context

Multimodal AI, Human-AI Interaction, Generative Models, Natural Language Understanding, Evaluation Methodologies

Commercial Potential

Potential Products

Intelligent image generation assistants, Creative design software with interactive AI, Prompt optimization tools

Target Industries

Advertising, Media & Entertainment, Design, E-commerce

Use Case Examples

An AI assistant asking for details about the desired art style or mood when generating a logo; a tool helping users refine vague descriptions for character concept art.

Competitive Edge

Moves beyond passive T2I generation by introducing proactive interaction and uncertainty modeling, offering a more collaborative and effective user experience.

Market Opportunity

Rapidly growing market for generative AI tools, especially in creative fields.

Revenue Models

Subscription services for AI creative assistants; licensing of the technology to platform providers.

Resource Requirements

Compute Needs

High for T2I model inference; moderate for agent logic and uncertainty tracking.

Data Requirements

Requires large-scale image-text datasets (such as COCO and ImageInWords) and potentially curated design datasets (such as DesignBench).

Deployment Constraints

Latency in interactive dialogue; complexity of managing user state and uncertainty.

Scalability

The automated evaluation approach is designed for scalability.

Production Readiness

Maturity Level

Research

Time to Market

Medium

Patent Potential

Moderate
