Abstract
For large language models (LLMs), sparse autoencoders (SAEs) have been shown
to decompose intermediate representations that are often not directly interpretable
into sparse sums of interpretable features, facilitating better
control and subsequent analysis. However, similar analyses and approaches have
been lacking for text-to-image models. We investigate the possibility of using
SAEs to learn interpretable features for SDXL Turbo, a few-step text-to-image
diffusion model. To this end, we train SAEs on the updates performed by
transformer blocks within SDXL Turbo's denoising U-net in its 1-step setting.
Interestingly, we find that they generalize to 4-step SDXL Turbo and even to
the multi-step SDXL base model (i.e., a different model) without additional
training. In addition, we show that their learned features are interpretable,
causally influence the generation process, and reveal specialization among the
blocks. We do so by creating RIEBench, a representation-based image editing
benchmark in which images are edited during generation by turning individual
SAE features on and off. This allows us to track which transformer blocks'
features are the most impactful depending on the edit category. Our work is the
first investigation of SAEs for interpretability in text-to-image diffusion
models and our results establish SAEs as a promising approach for understanding
and manipulating the internal mechanisms of text-to-image models.
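To make the setup concrete, the sketch below shows a minimal sparse autoencoder trained on "block updates", i.e., the change a transformer block applies to its input inside the denoising U-net. This is an illustrative sketch, not the authors' code: the architecture (linear encoder/decoder with ReLU and an L1 sparsity penalty), the dimensions, and the hyperparameters are assumptions for demonstration only.

```python
# Minimal sketch of an SAE trained on transformer-block updates (assumed setup).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))   # sparse, non-negative feature activations
        x_hat = self.decoder(f)           # reconstruction of the block update
        return x_hat, f

# Hypothetical sizes: d_model = channel width of the block, n_features = dictionary size.
d_model, n_features = 1280, 5120
sae = SparseAutoencoder(d_model, n_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3  # assumed sparsity strength

def train_step(block_update: torch.Tensor) -> float:
    """One optimization step on a batch of block-update vectors."""
    x_hat, f = sae(block_update)
    loss = ((x_hat - block_update) ** 2).mean() + l1_coeff * f.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Random stand-in for real block updates collected during 1-step generation.
batch = torch.randn(64, d_model)
print(train_step(batch))
```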
Authors (8)
Viacheslav Surkov
Chris Wendler
Antonio Mari
Mikhail Terekhov
Justin Deschenaux
Robert West
+2 more
Submitted
October 28, 2024
Key Contributions
This work investigates the use of Sparse Autoencoders (SAEs) to learn interpretable features within text-to-image diffusion models like SDXL Turbo. It demonstrates that these learned features are interpretable, causally influence generation, generalize across model steps and even to different models, and reveal specialization among network blocks.
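The causal-influence and editing claims can be illustrated with a small intervention sketch: a block update is re-expressed through the SAE, one chosen feature is switched off (or amplified), and the recomposed update replaces the original one during generation. The interface below is assumed for illustration (it reuses the hypothetical SAE from the sketch above, and the feature index is arbitrary), not the authors' RIEBench implementation.

```python
# Sketch of editing an image mid-generation by toggling one SAE feature (assumed interface).
import torch

@torch.no_grad()
def edit_block_update(block_update, sae, feature_idx: int, scale: float = 0.0):
    """Return a modified block update with `feature_idx` scaled by `scale`
    (0.0 turns the feature off; values > 1.0 amplify it)."""
    x_hat, f = sae(block_update)        # decompose the update into sparse features
    error = block_update - x_hat        # keep the part the SAE does not capture
    f[..., feature_idx] *= scale        # intervene on a single feature
    return sae.decoder(f) + error       # recompose the edited update

# Usage with the SAE defined earlier (feature index 42 is purely hypothetical):
update = torch.randn(1, 1280)
edited = edit_block_update(update, sae, feature_idx=42, scale=0.0)
```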
Business Value
Enhances control and understanding of powerful text-to-image models, enabling more precise creative applications and potentially leading to more efficient and controllable generative AI systems.