📄 Abstract
Motivated by the hypothesis that neural network representations encode
abstract, interpretable features as linearly accessible, approximately
orthogonal directions, sparse autoencoders (SAEs) have become a popular tool in
interpretability. However, recent work has demonstrated phenomenology of model
representations that lies outside the scope of this hypothesis, showing
signatures of hierarchical, nonlinear, and multi-dimensional features. This
raises the question: do SAEs represent features that possess structure at odds
with their motivating hypothesis? If not, does avoiding this mismatch help
identify said features and gain further insights into neural network
representations? To answer these questions, we take a construction-based
approach and re-contextualize the popular matching pursuits (MP) algorithm from
sparse coding to design MP-SAE -- an SAE that unrolls its encoder into a
sequence of residual-guided steps, allowing it to capture hierarchical and
nonlinearly accessible features. Comparing this architecture with existing SAEs
on a mixture of synthetic and natural data settings, we show: (i) hierarchical
concepts induce conditionally orthogonal features, which existing SAEs are
unable to faithfully capture, and (ii) the nonlinear encoding step of MP-SAE
recovers highly meaningful features, helping us unravel shared structure in the
seemingly dichotomous representation spaces of different modalities in a
vision-language model, hence demonstrating that the assumption that useful
features are solely linearly accessible is insufficient. We also show that the
sequential encoder principle of MP-SAE affords an additional benefit of
adaptive sparsity at inference time, which may be of independent interest.
Overall, we argue that our results lend credence to the idea that
interpretability should begin with the phenomenology of representations, with
methods emerging from assumptions that fit it.
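The residual-guided encoding described in the abstract can be illustrated with classical matching pursuit from sparse coding. The block below is a minimal sketch under stated assumptions, not the authors' MP-SAE implementation: the dictionary `D` is random and fixed rather than learned, its atoms are unit-norm rows, atoms are picked by absolute correlation, and the names `mp_encode`, `max_steps`, and `tol` are hypothetical. The abstract does not specify how MP-SAE trains its dictionary or which selection rule it uses.

```python
import numpy as np

def mp_encode(x, D, max_steps, tol=0.0):
    """Greedy matching pursuit of x over dictionary D.

    Assumes (not from the paper) that rows of D are unit-norm atoms.
    Returns the sparse codes, the final residual, and the steps taken.
    """
    residual = x.astype(float).copy()
    codes = np.zeros(D.shape[0])
    steps = 0
    for _ in range(max_steps):
        # Stop early once the residual is small enough.
        if np.linalg.norm(residual) <= tol:
            break
        # Correlate every atom with the current residual, not with x itself.
        scores = D @ residual
        j = int(np.argmax(np.abs(scores)))
        # Record the coefficient and subtract the selected atom's contribution.
        codes[j] += scores[j]
        residual -= scores[j] * D[j]
        steps += 1
    return codes, residual, steps

# Toy usage: an overcomplete random dictionary and a signal built from two atoms.
rng = np.random.default_rng(0)
D = rng.normal(size=(32, 8))
D /= np.linalg.norm(D, axis=1, keepdims=True)
x = 2.0 * D[3] - 1.5 * D[17]

codes, residual, steps = mp_encode(x, D, max_steps=50, tol=1e-6)
x_hat = codes @ D  # linear decoding from the sparse codes
```

Because each step depends on the residual left by the previous ones, the codes are a nonlinear function of `x` even though decoding stays linear. In this sketch the number of active atoms is governed by `max_steps` and `tol`, which can be changed per input without touching `D`; this is one plausible reading of the adaptive inference-time sparsity mentioned in the abstract.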
Key Contributions
This paper proposes MP-SAE, an SAE variant inspired by matching pursuit (MP), to better capture hierarchical and nonlinear features in neural network representations, moving beyond the assumptions of linear accessibility and approximate orthogonality that motivate standard SAEs. By re-contextualizing MP as an unrolled, residual-guided encoder, it aims to identify features whose structure is at odds with those assumptions, providing further insight into network representations.
Business Value
Enhanced understanding of complex neural network representations can lead to more robust, reliable, and interpretable AI systems, facilitating debugging and trust.