
Evaluating Sparse Autoencoders: From Shallow Design to Matching Pursuit

Abstract

Sparse autoencoders (SAEs) have recently become central tools for interpretability, leveraging dictionary learning principles to extract sparse, interpretable features from neural representations whose underlying structure is typically unknown. This paper evaluates SAEs in a controlled setting on MNIST and shows that current shallow architectures implicitly rely on a quasi-orthogonality assumption, which limits their ability to extract correlated features. To move beyond this, we compare them with an iterative SAE that unrolls Matching Pursuit (MP-SAE). Its residual-guided, sequential feature extraction recovers the correlated features that arise in hierarchical settings such as handwritten digit generation, while guaranteeing a monotonic improvement of the reconstruction as more atoms are selected.
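For reference, here is a minimal sketch of the one-shot encoding used by a standard shallow SAE; it is not the paper's code, and the weights, dimensions, and names are illustrative assumptions. A single affine map followed by a ReLU produces the entire sparse code at once, so atoms compete only through the learned weights and bias, which works best when the decoder atoms are close to orthogonal.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 784, 1024                               # input dim (e.g. flattened MNIST), dictionary size
W_enc = rng.normal(size=(m, d)) / np.sqrt(d)   # hypothetical encoder weights
b_enc = np.zeros(m)                            # hypothetical encoder bias
D = rng.normal(size=(d, m)) / np.sqrt(d)       # hypothetical decoder dictionary

def shallow_sae_encode(x):
    """One-shot sparse code: z = ReLU(W_enc @ x + b_enc)."""
    return np.maximum(W_enc @ x + b_enc, 0.0)

x = rng.normal(size=d)      # stand-in for a flattened digit image
z = shallow_sae_encode(x)   # all features are selected in a single pass
x_hat = D @ z               # reconstruction from the single-pass code
```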

Key Contributions

This paper evaluates sparse autoencoders (SAEs) for interpretability and identifies a limitation of shallow architectures: an implicit quasi-orthogonality assumption prevents them from capturing correlated features. It then compares them with an iterative SAE (MP-SAE) that unrolls Matching Pursuit, enabling residual-guided extraction of correlated, hierarchical features while guaranteeing monotonic improvement of the reconstruction.
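For contrast with the one-shot encoder above, the sketch below illustrates the matching-pursuit-style encoding that MP-SAE unrolls. It is an assumption-laden illustration rather than the authors' implementation: unit-norm dictionary columns and a fixed number of greedy steps are assumed. Each iteration selects the atom most correlated with the current residual and subtracts its contribution, so correlated atoms can be picked in sequence and the reconstruction error is non-increasing.

```python
import numpy as np

def matching_pursuit_encode(x, D, n_steps=20):
    """Greedy sparse coding. x: (d,) signal; D: (d, m) dictionary with unit-norm columns."""
    residual = x.copy()
    coeffs = np.zeros(D.shape[1])
    for _ in range(n_steps):
        scores = D.T @ residual                      # correlation of each atom with the residual
        k = np.argmax(np.abs(scores))                # greedy selection of the best-matching atom
        coeffs[k] += scores[k]
        residual = residual - scores[k] * D[:, k]    # residual norm is non-increasing
    return coeffs, residual

rng = np.random.default_rng(0)
d, m = 784, 1024
D = rng.normal(size=(d, m))
D /= np.linalg.norm(D, axis=0)                       # unit-norm atoms
x = rng.normal(size=d)
z, r = matching_pursuit_encode(x, D)
x_hat = D @ z                                        # x_hat = x - r by construction
```

Because each step removes the selected atom's projection from the residual, the squared reconstruction error drops by that projection's squared magnitude, which is the monotonic-improvement property highlighted in the paper.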

Business Value

Improved interpretability of neural networks can lead to more trustworthy AI systems, better debugging, and insights into model decision-making, crucial for high-stakes applications.