Abstract
Sparse autoencoders (SAEs) have recently become central tools for
interpretability, leveraging dictionary learning principles to extract sparse,
interpretable features from neural representations whose underlying structure
is typically unknown. This paper evaluates SAEs in a controlled setting using
MNIST and shows that current shallow architectures implicitly rely on a
quasi-orthogonality assumption, which limits their ability to extract
correlated features. To move beyond this limitation, we compare them with an
iterative SAE that unrolls Matching Pursuit (MP-SAE). MP-SAE enables
residual-guided extraction of the correlated features that arise in
hierarchical settings such as handwritten digit generation, while guaranteeing
that the reconstruction improves monotonically as more atoms are selected.
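To make the mechanism concrete, below is a minimal sketch of the greedy Matching Pursuit encoding step that MP-SAE unrolls. The function name, the numpy setup, and the fixed selection budget k are illustrative assumptions, not the paper's implementation. Each iteration correlates the current residual with every dictionary atom, selects the best match, and subtracts its contribution, which is why the reconstruction error is non-increasing as atoms are added.

```python
import numpy as np

def matching_pursuit_encode(x, D, k):
    """Greedy Matching Pursuit encoding of x against dictionary D.

    x : (d,)   input vector
    D : (d, m) dictionary with unit-norm columns (atoms)
    k : number of greedy selection steps

    Returns a sparse code z of shape (m,) with D @ z ~ x. Each step
    removes the best single-atom approximation of the current residual,
    so the residual norm never increases: the reconstruction improves
    monotonically as more atoms are selected.
    """
    residual = x.astype(float).copy()
    z = np.zeros(D.shape[1])
    for _ in range(k):
        scores = D.T @ residual             # correlate residual with every atom
        j = int(np.argmax(np.abs(scores)))  # pick the best-matching atom
        z[j] += scores[j]                   # accumulate its coefficient
        residual -= scores[j] * D[:, j]     # explain away its contribution
    return z

# Tiny demo on random data (shapes are arbitrary, chosen for illustration).
rng = np.random.default_rng(0)
D = rng.normal(size=(784, 512))
D /= np.linalg.norm(D, axis=0)              # unit-norm atoms
x = rng.normal(size=784)
for k in (4, 16, 64):
    z = matching_pursuit_encode(x, D, k)
    print(k, np.linalg.norm(x - D @ z))     # error shrinks as k grows
```

In MP-SAE, these selection steps are unrolled into the encoder and the dictionary D is learned; the sketch above only shows the inference-time greedy pass.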
Key Contributions
This paper evaluates Sparse Autoencoders (SAEs) for interpretability in a controlled MNIST setting, revealing that shallow architectures are limited in capturing correlated features by an implicit quasi-orthogonality assumption. It then compares them with an iterative SAE that unrolls Matching Pursuit (MP-SAE), enabling residual-guided extraction of correlated and hierarchical features while guaranteeing monotonic improvement of the reconstruction.
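As a toy illustration of the quasi-orthogonality limitation (our construction, not the paper's): with two highly correlated atoms and a one-shot tied encoding that reads codes off as D^T x (a common shallow-SAE simplification, with biases and nonlinearities omitted), the shared component of the atoms is counted twice, whereas the residual-guided steps of Matching Pursuit explain it only once.

```python
import numpy as np  # reuses matching_pursuit_encode from the sketch above

# Two nearly parallel unit-norm atoms (~11 degrees apart).
d1 = np.array([1.0, 0.0])
d2 = np.array([np.cos(0.2), np.sin(0.2)])
D = np.stack([d1, d2], axis=1)

x = d1 + d2  # input genuinely composed of both correlated features

# One-shot tied encoding: both scores absorb the atoms' overlap,
# so the shared direction is reconstructed twice.
z_oneshot = D.T @ x
print(np.linalg.norm(x - D @ z_oneshot))   # large residual

# Matching Pursuit re-correlates against the residual after each
# selection, so later atoms only explain what is still unexplained.
z_mp = matching_pursuit_encode(x, D, k=2)
print(np.linalg.norm(x - D @ z_mp))        # much smaller residual
```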
Business Value
Improved interpretability of neural networks can lead to more trustworthy AI systems, better debugging, and deeper insight into model decision-making, all of which are crucial for high-stakes applications.