📄 Abstract
How do vision transformers (ViTs) represent and process the world? This paper
addresses this long-standing question through the first systematic analysis of
6.6K features extracted across all layers via sparse autoencoders, and by
introducing the residual replacement model, which replaces ViT computations
with interpretable features in the residual stream. Our analysis reveals not
only a feature evolution from low-level patterns to high-level semantics, but
also how ViTs encode curves and spatial positions through specialized feature
types. By significantly simplifying the original computations, the residual
replacement model scalably produces a faithful yet parsimonious circuit suited
to human-scale interpretability. As a result, this framework enables an
intuitive understanding of ViT mechanisms. Finally, we demonstrate the utility
of our framework in debiasing spurious correlations.
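
To make the core idea concrete, here is a minimal sketch of how a residual-replacement-style forward pass could look: each block's residual-stream update is swapped for its sparse-autoencoder reconstruction, so downstream computation runs only on interpretable features. This is an illustrative assumption, not the authors' implementation; all names (`SparseAutoencoder`, `vit_blocks`, `d_model`, `d_sae`) are hypothetical.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy SAE: encodes a residual vector into sparse features, then decodes back."""
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)

    def forward(self, resid: torch.Tensor) -> torch.Tensor:
        feats = torch.relu(self.enc(resid))   # sparse, interpretable feature activations
        return self.dec(feats)                # reconstruction in residual-stream space

def replaced_forward(x: torch.Tensor, vit_blocks, saes) -> torch.Tensor:
    """Run a ViT-style residual stream, replacing each block's output with
    its SAE reconstruction (the 'residual replacement' idea, sketched)."""
    resid = x
    for block, sae in zip(vit_blocks, saes):
        resid = resid + block(resid)          # original residual update
        resid = sae(resid)                    # replace with interpretable features
    return resid
```

In such a setup, faithfulness can be checked by comparing the outputs of `replaced_forward` against the unmodified model, while parsimony comes from pruning SAE features that contribute little to the reconstruction.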