
RichControl: Structure- and Appearance-Rich Training-Free Spatial Control for Text-to-Image Generation

📄 Abstract

Text-to-image (T2I) diffusion models have shown remarkable success in generating high-quality images from text prompts. Recent efforts extend these models to incorporate condition images (e.g., Canny edge maps) for fine-grained spatial control. Among them, feature injection methods have emerged as a training-free alternative to traditional fine-tuning-based approaches. However, they often suffer from structural misalignment, condition leakage, and visual artifacts, especially when the condition image diverges significantly from natural RGB distributions. Through an empirical analysis of existing methods, we identify a key limitation: the sampling schedule of condition features, previously unexplored, fails to account for the evolving interplay between structure preservation and domain alignment across diffusion steps. Motivated by this observation, we propose a flexible training-free framework that decouples the sampling schedule of condition features from the denoising process, and we systematically investigate the spectrum of feature injection schedules for higher-quality structure guidance in feature space. Specifically, we find that condition features sampled from a single timestep are sufficient, yielding a simple yet efficient schedule that balances structure alignment and appearance quality. We further enhance the sampling process with a restart refinement schedule and improve visual quality through an appearance-rich prompting strategy. Together, these designs enable training-free generation that is both structure-rich and appearance-rich. Extensive experiments show that our approach achieves state-of-the-art results across diverse zero-shot conditioning scenarios.
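The core idea, sampling condition features at a single timestep and injecting them throughout denoising, can be made concrete with a toy sketch. This is not the authors' implementation: the names (ToyDenoiser, features, denoise_with_single_step_condition) and the update rule are hypothetical stand-ins for a real T2I UNet and sampler.

```python
# Minimal sketch, assuming a UNet-like denoiser whose intermediate
# features can be replaced by features extracted from a condition image.
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Stand-in for a T2I diffusion UNet with an injectable mid-block."""
    def __init__(self, dim=64):
        super().__init__()
        self.encoder = nn.Conv2d(3, dim, 3, padding=1)
        self.mid = nn.Conv2d(dim, dim, 3, padding=1)
        self.decoder = nn.Conv2d(dim, 3, 3, padding=1)

    def forward(self, x, t, injected_feat=None):
        # Timestep t is unused in this toy; a real UNet conditions on it.
        h = self.mid(self.encoder(x))
        if injected_feat is not None:
            # Feature injection: replace intermediate features with those
            # computed from the condition image.
            h = injected_feat
        return self.decoder(h)

    def features(self, x, t):
        """Extract the injectable mid-block features for input x at timestep t."""
        return self.mid(self.encoder(x))

@torch.no_grad()
def denoise_with_single_step_condition(model, cond_img, steps=50, t_cond=25):
    # Decoupled schedule: condition features are sampled once, at a single
    # timestep t_cond, rather than re-extracted at every denoising step.
    feat = model.features(cond_img, t_cond)
    x = torch.randn_like(cond_img)
    for t in reversed(range(steps)):
        eps = model(x, t, injected_feat=feat)
        x = x - eps / steps  # toy Euler-style update, not a real sampler
    return x

model = ToyDenoiser()
cond = torch.randn(1, 3, 64, 64)  # stands in for e.g. a Canny edge map
out = denoise_with_single_step_condition(model, cond)
print(out.shape)
```

Sampling the condition features once also avoids re-running the condition branch at every step, which is consistent with the abstract's claim that single-timestep features yield a simple yet efficient schedule.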

Key Contributions

Proposes RichControl, a flexible training-free framework for spatial control in text-to-image diffusion models. It identifies and addresses limitations of existing feature injection methods by decoupling the sampling schedule of condition features from the denoising process, improving both structure preservation and domain alignment; a sketch of the restart refinement component follows below.
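The abstract also mentions a restart refinement schedule. The sketch below illustrates restart-style refinement under the assumption that it re-noises a partially denoised sample and denoises it again; restart_refine and the toy update rule are hypothetical, not the paper's exact procedure.

```python
import torch

@torch.no_grad()
def restart_refine(denoise_fn, x, noise_level=0.5, steps=10):
    # Re-inject noise into a partially denoised sample, then denoise
    # again so structure and appearance can re-align.
    x = x + noise_level * torch.randn_like(x)
    for t in reversed(range(steps)):
        eps = denoise_fn(x, t)
        x = x - eps / steps  # toy Euler-style update, not a real sampler
    return x

# Usage with a placeholder denoiser; any callable taking (x, t) works.
refined = restart_refine(lambda x, t: 0.1 * x, torch.randn(1, 3, 64, 64))
print(refined.shape)
```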

Business Value

Enables more precise and reliable control over image generation for creative professionals, designers, and content creators, yielding higher-quality, more targeted visual outputs.