Compositional Image Synthesis with Inference-Time Scaling

Abstract

Despite their impressive realism, modern text-to-image models still struggle with compositionality, often failing to render accurate object counts, attributes, and spatial relations. To address this challenge, we present a training-free framework that combines an object-centric approach with self-refinement to improve layout faithfulness while preserving aesthetic quality. Specifically, we leverage large language models (LLMs) to synthesize explicit layouts from input prompts and inject these layouts into the image generation process, where an object-centric vision-language model (VLM) judge iteratively reranks multiple candidates to select the most prompt-aligned outcome. By unifying explicit layout grounding with self-refine-based inference-time scaling, our framework achieves stronger scene alignment with prompts than recent text-to-image models. The code is available at https://github.com/gcl-inha/ReFocus.
Authors (3)
Minsuk Ji
Sanghyeok Lee
Namhyuk Ahn
Submitted
October 28, 2025
arXiv Category
cs.CV

Key Contributions

This paper presents a training-free framework to improve compositionality in text-to-image synthesis. It uses LLMs to generate explicit layouts from prompts, injects these layouts into the generation process, and employs an object-centric VLM to rerank candidates, achieving stronger scene alignment with prompts through inference-time scaling and self-refinement.
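The description above covers the pipeline only at a high level; the sketch below shows how such a loop could fit together in code. It is a minimal illustration under stated assumptions, not the authors' implementation: every helper (synthesize_layout, generate_with_layout, vlm_judge_score, refine_layout) is a hypothetical placeholder standing in for the LLM layout synthesis, layout-conditioned generation, object-centric VLM judging, and self-refinement steps named in the abstract.

```python
# Hedged sketch of a layout-grounded, self-refining text-to-image loop.
# All helper functions below are hypothetical placeholders, not an actual API.

def compose_image(prompt, num_candidates=4, num_rounds=3):
    """Iteratively generate layout-grounded candidates and keep the best-aligned one."""
    layout = synthesize_layout(prompt)      # LLM turns the prompt into an explicit layout
    best_image, best_score = None, float("-inf")

    for _ in range(num_rounds):             # inference-time scaling via repeated refinement
        # Generate several candidates conditioned on the same explicit layout
        candidates = [generate_with_layout(prompt, layout) for _ in range(num_candidates)]

        # Object-centric VLM judge scores prompt alignment (counts, attributes, relations)
        scores = [vlm_judge_score(prompt, layout, img) for img in candidates]
        best_idx = max(range(len(candidates)), key=lambda i: scores[i])

        if scores[best_idx] > best_score:
            best_image, best_score = candidates[best_idx], scores[best_idx]

        # Self-refinement: feed the judge's verdict back into the layout for the next round
        layout = refine_layout(prompt, layout, candidates[best_idx], scores[best_idx])

    return best_image
```

Raising num_candidates or num_rounds trades extra inference-time compute for better prompt alignment, which is the scaling behavior the paper's title refers to.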

Business Value

Enables the creation of more precise and controllable visual content from text descriptions, benefiting graphic designers, advertisers, and content creators who require accurate scene composition.
