Abstract
Text-to-image (T2I) generation has made remarkable progress, yet existing
systems still lack intuitive control over spatial composition, object
consistency, and multi-step editing. We present $\textbf{LayerCraft}$, a
modular framework that uses large language models (LLMs) as autonomous agents
to orchestrate structured, layered image generation and editing. LayerCraft
supports two key capabilities: (1) $\textit{structured generation}$ from simple
prompts via chain-of-thought (CoT) reasoning, enabling it to decompose scenes,
reason about object placement, and guide composition in a controllable,
interpretable manner; and (2) $\textit{layered object integration}$, allowing
users to insert and customize objects -- such as characters or props -- across
diverse images or scenes while preserving identity, context, and style. The
system comprises a coordinator agent, the $\textbf{ChainArchitect}$ for
CoT-driven layout planning, and the $\textbf{Object Integration Network (OIN)}$
for seamless image editing using off-the-shelf T2I models without retraining.
Through applications like batch collage editing and narrative scene generation,
LayerCraft empowers non-experts to iteratively design, customize, and refine
visual content with minimal manual effort. Code will be released at
https://github.com/PeterYYZhang/LayerCraft.
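The abstract describes a two-stage flow: the ChainArchitect plans a layered layout via CoT reasoning, and the generation/editing stage realizes each layer with an off-the-shelf T2I model. Purely as an illustration of that flow, here is a minimal Python sketch. The helper names (`llm_complete`, `t2i_generate`, `chain_architect`, `layercraft_pipeline`) and the JSON layout schema are our assumptions for this sketch, not the authors' API; both backends are stubbed so the example runs end to end.

```python
import json

# Hypothetical LLM call; in practice this would wrap any chat-completion API.
def llm_complete(prompt: str) -> str:
    # Stub: return a fixed layered layout so the sketch runs end to end.
    return json.dumps({
        "background": "a sunlit park",
        "layers": [
            {"object": "wooden bench", "bbox": [0.3, 0.6, 0.7, 0.9]},
            {"object": "golden retriever", "bbox": [0.4, 0.5, 0.6, 0.8]},
        ],
    })

# Hypothetical frozen T2I backbone standing in for any off-the-shelf model.
def t2i_generate(prompt: str, base: str | None = None) -> str:
    # Stub: a real backbone would condition on `base` when editing, and would
    # use the planned bounding boxes to place each object.
    return f"<{base or 'blank canvas'} + {prompt}>"

def chain_architect(user_prompt: str) -> dict:
    """CoT-style layout planning: decompose the scene into background + layers."""
    plan_prompt = (
        "Decompose this scene into a background and a list of objects with "
        f"normalized bounding boxes, as JSON:\n{user_prompt}"
    )
    return json.loads(llm_complete(plan_prompt))

def layercraft_pipeline(user_prompt: str) -> str:
    layout = chain_architect(user_prompt)        # step 1: structured planning
    image = t2i_generate(layout["background"])   # step 2: base image
    for layer in layout["layers"]:               # step 3: layered integration
        image = t2i_generate(layer["object"], base=image)
    return image

print(layercraft_pipeline("a dog resting on a bench in a park"))
```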
Authors
Yuyao Zhang
Jinghao Li
Yu-Wing Tai
Key Contributions
LayerCraft is a modular framework that uses LLMs as agents to orchestrate structured, layered text-to-image generation and editing. It enables structured generation via CoT reasoning for scene decomposition and object placement, and layered object integration that preserves object identity, context, and style.
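The internals of the Object Integration Network are not described in this summary; purely as an illustration of what "layered object integration" means at the pixel level, here is a compositing sketch using Pillow. The `integrate_layer` helper and the dummy images are ours, not the paper's, and a real OIN would additionally harmonize identity and style rather than just alpha-composite.

```python
from PIL import Image

def integrate_layer(base: Image.Image, obj: Image.Image,
                    box: tuple[int, int]) -> Image.Image:
    """Paste an RGBA object layer onto a base image at `box`,
    using the object's alpha channel so the background shows through."""
    canvas = Image.new("RGBA", base.size, (0, 0, 0, 0))
    canvas.paste(obj, box)
    return Image.alpha_composite(base.convert("RGBA"), canvas)

# Dummy stand-ins for a generated background and an extracted object layer.
background = Image.new("RGBA", (512, 512), (120, 180, 240, 255))
subject = Image.new("RGBA", (128, 128), (250, 200, 60, 200))

composed = integrate_layer(background, subject, (192, 256))
composed.save("composed.png")
```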
Business Value
Empowers creators and designers with more intuitive and powerful tools for generating and manipulating visual content, potentially revolutionizing digital art, advertising, and media production.