Abstract: Recent text-to-image diffusion models can generate striking visuals from text
prompts, but they often fail to maintain subject consistency across generations
and contexts. One major limitation of current fine-tuning approaches is the
inherent trade-off between quality and efficiency. Fine-tuning large models
improves fidelity but is computationally expensive, while fine-tuning
lightweight models improves efficiency but compromises image fidelity.
Moreover, fine-tuning pre-trained models on a small set of images of the
subject can damage the existing priors, degrading the quality of the generated images. To
this end, we present Stencil, a novel framework that jointly employs two
diffusion models during inference. Stencil efficiently fine-tunes a lightweight
model on images of the subject, while a large frozen pre-trained model provides
contextual guidance during inference, injecting rich priors to enhance
generation with minimal overhead. Stencil excels at generating high-fidelity,
novel renditions of the subject in under a minute, setting a new
state of the art in subject-driven generation.
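
To make the two-model inference idea concrete, the following is a minimal, hypothetical sketch (not the paper's released code) of one way a lightweight subject-fine-tuned denoiser and a large frozen denoiser could be combined at each sampling step. The class names, the linear blending rule, the guidance weight, and the DDPM-style noise schedule are all assumptions for illustration; Stencil's actual guidance mechanism may differ.

```python
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Stand-in for the lightweight model fine-tuned on subject images."""
    def __init__(self, ch=3):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(ch, 32, 3, padding=1), nn.SiLU(),
                                 nn.Conv2d(32, ch, 3, padding=1))
    def forward(self, x, t):
        return self.net(x)

class LargeDenoiser(nn.Module):
    """Stand-in for the large frozen pre-trained model supplying rich priors."""
    def __init__(self, ch=3):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(ch, 128, 3, padding=1), nn.SiLU(),
                                 nn.Conv2d(128, ch, 3, padding=1))
    def forward(self, x, t):
        return self.net(x)

@torch.no_grad()
def sample(light, large, steps=50, guidance=0.5, shape=(1, 3, 64, 64)):
    """Toy DDPM-style ancestral sampler that blends the two noise predictions
    at every denoising step (the blend is an assumed guidance rule)."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)
    for t in reversed(range(steps)):
        eps_light = light(x, t)   # subject fidelity from the fine-tuned model
        eps_large = large(x, t)   # contextual prior from the frozen model
        eps = (1 - guidance) * eps_light + guidance * eps_large
        a, ab = alphas[t], alpha_bars[t]
        x = (x - (1 - a) / torch.sqrt(1 - ab) * eps) / torch.sqrt(a)
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x

light = TinyDenoiser()            # would be fine-tuned on the subject's images
large = LargeDenoiser().eval()    # kept frozen; only queried during inference
for p in large.parameters():
    p.requires_grad_(False)
print(sample(light, large).shape)  # torch.Size([1, 3, 64, 64])
```

The key design point the sketch reflects is that only the small network would ever be trained, while the large network's priors enter purely through its inference-time predictions, keeping the per-subject fine-tuning cost low.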