📄 Abstract
Abstract: Despite their impressive realism, modern text-to-image models still struggle
with compositionality, often failing to render accurate object counts,
attributes, and spatial relations. To address this challenge, we present a
training-free framework that combines an object-centric approach with
self-refinement to improve layout faithfulness while preserving aesthetic
quality. Specifically, we leverage large language models (LLMs) to synthesize
explicit layouts from input prompts and inject these layouts into the image
generation process, where an object-centric vision-language model (VLM) judge
iteratively reranks multiple candidates to select the most prompt-aligned outcome.
By unifying explicit layout grounding with self-refinement-based
inference-time scaling, our framework achieves stronger scene alignment with
prompts than recent text-to-image models. The code is available at
https://github.com/gcl-inha/ReFocus.
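
The sketch below illustrates the pipeline the abstract describes: an LLM proposes an explicit layout, several layout-conditioned candidates are generated, and an object-centric VLM judge reranks them over a few refinement rounds. It is a minimal illustration only; the function names (`propose_layout`, `generate_with_layout`, `vlm_object_score`) are hypothetical stubs, not the authors' API, so consult the ReFocus repository for the actual implementation.

```python
# Minimal sketch of the layout-grounded generate-and-rerank loop described in the
# abstract. All helper functions below are hypothetical placeholders (assumptions),
# not the paper's actual interface.

from dataclasses import dataclass


@dataclass
class Box:
    """One object box in the layout, with normalized [0, 1] coordinates."""
    label: str
    x0: float
    y0: float
    x1: float
    y1: float


def propose_layout(prompt: str) -> list[Box]:
    """Ask an LLM to turn the prompt into explicit object boxes (stubbed here)."""
    return [Box("cat", 0.05, 0.4, 0.45, 0.9), Box("dog", 0.55, 0.4, 0.95, 0.9)]


def generate_with_layout(prompt: str, layout: list[Box], seed: int):
    """Layout-conditioned image generation (stub); returns an image handle."""
    return {"prompt": prompt, "layout": layout, "seed": seed}


def vlm_object_score(image, prompt: str, layout: list[Box]) -> float:
    """Object-centric VLM judge: score per-object count/attribute/relation fidelity (stub)."""
    return 0.0


def refocus(prompt: str, rounds: int = 3, candidates_per_round: int = 4):
    """Select the most prompt-aligned candidate via inference-time scaling."""
    layout = propose_layout(prompt)
    best_img, best_score = None, float("-inf")
    for r in range(rounds):
        # Sample several candidates under the same explicit layout.
        for seed in range(r * candidates_per_round, (r + 1) * candidates_per_round):
            img = generate_with_layout(prompt, layout, seed)
            score = vlm_object_score(img, prompt, layout)
            if score > best_score:
                best_img, best_score = img, score
        # Self-refinement: the judge's feedback could also revise the layout
        # before the next round (omitted in this stub).
    return best_img


if __name__ == "__main__":
    print(refocus("a cat to the left of a dog"))
```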
Authors (3)
Minsuk Ji
Sanghyeok Lee
Namhyuk Ahn
Submitted
October 28, 2025
Key Contributions
This paper presents a training-free framework to improve compositionality in text-to-image synthesis. It uses LLMs to generate explicit layouts from prompts, injects these layouts into the generation process, and employs an object-centric VLM to rerank candidates, achieving stronger scene alignment with prompts through inference-time scaling and self-refinement.
Business Value
Enables the creation of more precise and controllable visual content from text descriptions, benefiting graphic designers, advertisers, and content creators who require accurate scene composition.