Abstract
Text-to-image (T2I) diffusion models have shown remarkable success in
generating high-quality images from text prompts. Recent efforts extend these
models to incorporate conditional images (e.g., canny edge) for fine-grained
spatial control. Among them, feature injection methods have emerged as a
training-free alternative to traditional fine-tuning-based approaches. However,
they often suffer from structural misalignment, condition leakage, and visual
artifacts, especially when the condition image diverges significantly from
natural RGB distributions. Through an empirical analysis of existing methods,
we identify a key limitation: the sampling schedule of condition features,
previously unexplored, fails to account for the evolving interplay between
structure preservation and domain alignment throughout diffusion steps.
Inspired by this observation, we propose a flexible training-free framework
that decouples the sampling schedule of condition features from the denoising
process, and systematically investigate the spectrum of feature injection
schedules for higher-quality structure guidance in the feature space.
Specifically, we find that condition features sampled from a single timestep
are sufficient, yielding a simple yet efficient schedule that balances
structure alignment and appearance quality. We further enhance the sampling
process by introducing a restart refinement schedule, and improve the visual
quality with an appearance-rich prompting strategy. Together, these designs
enable training-free generation that is both structure-rich and
appearance-rich. Extensive experiments show that our approach achieves
state-of-the-art results across diverse zero-shot conditioning scenarios.
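The core idea of the abstract — decoupling the condition-feature sampling schedule from the denoising schedule, reusing features from a single timestep, and replaying late steps in a restart refinement pass — can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's actual implementation: the function names, the fixed-timestep choice, and the restart logic are assumptions made for exposition.

```python
def condition_feature_schedule(num_denoise_steps, cond_timestep):
    """Map each denoising step to the timestep whose condition features
    are injected there.

    A coupled schedule would inject the features of timestep t at
    denoising step t; here every denoising step reuses features sampled
    once at a single fixed timestep, the simple schedule the abstract
    reports as sufficient for balancing structure and appearance.
    """
    return [cond_timestep] * num_denoise_steps


def restart_refinement(schedule, restart_at):
    """Hypothetical restart refinement: after reaching `restart_at`,
    replay the remaining steps once more to further refine structure
    alignment (an assumption about the restart mechanism)."""
    head, tail = schedule[:restart_at], schedule[restart_at:]
    return head + tail + tail


# Example: a 10-step denoising run that always injects features
# sampled at timestep 3, then replays the last 4 steps once.
schedule = condition_feature_schedule(num_denoise_steps=10, cond_timestep=3)
refined = restart_refinement(schedule, restart_at=6)
```

The point of the sketch is that the injection schedule is an independent knob: changing `cond_timestep` or the restart point alters structure guidance without touching the denoiser itself.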
Key Contributions
Proposes RichControl, a flexible training-free framework for spatial control in text-to-image diffusion models. It identifies and addresses limitations in existing feature injection methods by decoupling the sampling schedule of condition features from the denoising process, leading to better structure preservation and domain alignment.
Business Value
Enables more precise and reliable control over image generation for creative professionals, designers, and content creators, leading to higher quality and more targeted visual outputs.