📄 Abstract
Enabling image generation models to be spatially controlled is an important
area of research, empowering users to better generate images according to their
own fine-grained specifications via, e.g., edge maps or poses. Although this task
has seen impressive improvements in recent times, a focus on rapidly producing
stronger models has come at the cost of detailed and fair scientific
comparison. Differing training data, model architectures and generation
paradigms make it difficult to disentangle the factors contributing to
performance. Meanwhile, the motivations and nuances of certain approaches
become lost in the literature. In this work, we aim to provide clear takeaways
across generation paradigms for practitioners wishing to develop
transformer-based systems for spatially-controlled generation, clarifying the
literature and addressing knowledge gaps. We perform controlled experiments on
ImageNet across diffusion-based/flow-based and autoregressive (AR) models.
First, we establish control token prefilling as a simple, general and
performant baseline approach for transformers. We then investigate previously
underexplored sampling-time enhancements, showing that both extending
classifier-free guidance to control and truncating the softmax have a
strong impact on control-generation consistency. Finally, we clarify the
motivation of adapter-based approaches, demonstrating that they mitigate
"forgetting" and maintain generation quality when trained on limited downstream
data, but underperform full training in terms of control-generation
consistency.
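
The two sampling-time enhancements named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the toy logits, the guidance scale, and the choice of top-k as the truncation rule are all assumptions made for illustration. The guidance step uses the standard classifier-free guidance extrapolation, here with the control input as the condition that is dropped. For diffusion or flow models the same extrapolation would presumably be applied to the predicted noise or velocity rather than to logits.

```python
import math

def guided_logits(cond, uncond, scale):
    """Classifier-free guidance: extrapolate from the prediction made
    without the control input towards the one made with it.
    scale=1 returns cond unchanged; larger values push further."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]

def truncated_softmax(logits, top_k):
    """Softmax restricted to the top_k largest logits (assumed truncation
    rule); every other token gets probability 0."""
    if not 1 <= top_k <= len(logits):
        raise ValueError("top_k must be between 1 and len(logits)")
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    peak = max(logits[i] for i in keep)  # subtract the max for numerical stability
    weights = {i: math.exp(logits[i] - peak) for i in keep}
    total = sum(weights.values())
    return [weights.get(i, 0.0) / total for i in range(len(logits))]

# Toy next-token logits over a 4-token vocabulary (made-up numbers).
with_control = [2.0, 1.0, 0.5, -1.0]    # model is given the control input
without_control = [1.0, 1.0, 1.0, 0.0]  # control input dropped
logits = guided_logits(with_control, without_control, scale=2.0)
probs = truncated_softmax(logits, top_k=2)
```

Both steps concentrate probability on tokens that the control input favours, which is one plausible reading of why they would improve control-generation consistency.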
Key Contributions
Investigates and clarifies spatially-controlled image generation with transformers through controlled experiments on ImageNet across diffusion-based, flow-based, and autoregressive (AR) models. The work aims to disentangle the factors behind performance and to address knowledge gaps for practitioners developing such systems.
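
Control token prefilling, which the abstract establishes as the baseline, can be pictured for the autoregressive case roughly as follows. This is a hedged sketch and not the authors' code: it assumes the control image has already been converted to discrete tokens, that a class token follows the control tokens, and that `next_token` stands in for a trained transformer. All names and values are hypothetical.

```python
from typing import Callable, List, Sequence

def prefill(control_tokens: Sequence[int], class_token: int) -> List[int]:
    """Build the prefix the transformer attends to before generating:
    control tokens first, then the class token (assumed ordering)."""
    return list(control_tokens) + [class_token]

def generate(next_token: Callable[[List[int]], int],
             control_tokens: Sequence[int],
             class_token: int,
             n_image_tokens: int) -> List[int]:
    """Autoregressively append image tokens after the prefilled prefix
    and return only the newly generated image tokens."""
    seq = prefill(control_tokens, class_token)
    prefix_len = len(seq)
    for _ in range(n_image_tokens):
        seq.append(next_token(seq))  # each step sees the control tokens
    return seq[prefix_len:]

def toy_next_token(seq: List[int]) -> int:
    """Stand-in for a trained model: a deterministic function of the context."""
    return sum(seq) % 10

image_tokens = generate(toy_next_token, control_tokens=[3, 1, 4],
                        class_token=7, n_image_tokens=3)
```

Because the control tokens sit in the context like any other prefix, this scheme needs no extra architecture beyond the transformer itself, which fits the abstract's description of the approach as simple and general.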
Business Value
Finer spatial control could help designers, artists, and developers create specific, customized images more efficiently, with potential applications in advertising, game development, virtual reality, and product design workflows.