LayerCraft: Enhancing Text-to-Image Generation with CoT Reasoning and Layered Object Integration

Abstract

Text-to-image (T2I) generation has made remarkable progress, yet existing systems still lack intuitive control over spatial composition, object consistency, and multi-step editing. We present LayerCraft, a modular framework that uses large language models (LLMs) as autonomous agents to orchestrate structured, layered image generation and editing. LayerCraft supports two key capabilities: (1) structured generation from simple prompts via chain-of-thought (CoT) reasoning, enabling it to decompose scenes, reason about object placement, and guide composition in a controllable, interpretable manner; and (2) layered object integration, allowing users to insert and customize objects, such as characters or props, across diverse images or scenes while preserving identity, context, and style. The system comprises a coordinator agent, the ChainArchitect for CoT-driven layout planning, and the Object Integration Network (OIN) for seamless image editing using off-the-shelf T2I models without retraining. Through applications like batch collage editing and narrative scene generation, LayerCraft empowers non-experts to iteratively design, customize, and refine visual content with minimal manual effort. Code will be released at https://github.com/PeterYYZhang/LayerCraft.
Authors: Yuyao Zhang, Jinghao Li, Yu-Wing Tai
Submitted: March 25, 2025
arXiv Category: cs.LG

Key Contributions

LayerCraft is a modular framework that uses LLMs as agents to orchestrate structured, layered text-to-image generation and editing. It supports two capabilities: structured generation, in which CoT reasoning decomposes a scene and plans object placement; and layered object integration, which inserts objects into images while preserving their identity, context, and style.
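
To make the idea of a layered scene plan concrete, here is a minimal Python sketch of the kind of data structure a layout planner could hand to an image-editing stage. It is an illustration only and is not taken from the LayerCraft code: the names `LayerSpec`, `ScenePlan`, `paint_order`, and `box_overlap`, the normalized bounding-box format, and the integer depth convention are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class LayerSpec:
    """One planned object (hypothetical schema, not from the paper)."""
    name: str
    box: tuple  # (x0, y0, x1, y1), normalized to [0, 1]
    depth: int  # assumed convention: larger value = farther from the viewer


@dataclass
class ScenePlan:
    """A background description plus the objects to place on top of it."""
    background: str
    layers: list = field(default_factory=list)

    def add(self, name, box, depth):
        # Reject boxes that are empty, inverted, or outside the unit square.
        x0, y0, x1, y1 = box
        if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
            raise ValueError(f"invalid box for {name!r}: {box}")
        self.layers.append(LayerSpec(name, box, depth))

    def paint_order(self):
        # Back-to-front order: the farthest object is integrated first,
        # so nearer objects can occlude it.
        ordered = sorted(self.layers, key=lambda layer: -layer.depth)
        return [layer.name for layer in ordered]


def box_overlap(a, b):
    """Intersection area of two (x0, y0, x1, y1) boxes; 0.0 if disjoint."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


plan = ScenePlan(background="a sunlit living room")
plan.add("floor lamp", (0.70, 0.10, 0.85, 0.80), depth=3)
plan.add("sofa", (0.10, 0.45, 0.75, 0.90), depth=2)
plan.add("cat", (0.30, 0.60, 0.50, 0.85), depth=1)
print(plan.paint_order())
```

Keeping each object as a separate entry with its own box and depth is what would let a later step insert, move, or replace one object without regenerating the rest of the scene. How LayerCraft itself represents layouts is not specified in the abstract.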

Business Value

Gives creators and designers more intuitive control over generating and editing visual content, with potential applications in digital art, advertising, and media production.