Towards Interpretability Without Sacrifice: Faithful Dense Layer Decomposition with Mixture of Decoders

📄 Abstract

Multilayer perceptrons (MLPs) are an integral part of large language models, yet their dense representations render them difficult to understand, edit, and steer. Recent methods learn interpretable approximations via neuron-level sparsity, yet fail to faithfully reconstruct the original mapping, significantly increasing the model's next-token cross-entropy loss. In this paper, we advocate for moving to layer-level sparsity to overcome the accuracy trade-off in sparse layer approximation. Under this paradigm, we introduce Mixture of Decoders (MxDs). MxDs generalize MLPs and Gated Linear Units, expanding pre-trained dense layers into tens of thousands of specialized sublayers. Through a flexible form of tensor factorization, each sparsely activating MxD sublayer implements a linear transformation with full-rank weights, preserving the original decoders' expressive capacity even under heavy sparsity. Experimentally, we show that MxDs significantly outperform state-of-the-art methods (e.g., Transcoders) on the sparsity-accuracy frontier in language models with up to 3B parameters. Further evaluations on sparse probing and feature steering demonstrate that MxDs learn similarly specialized features of natural language, opening a promising new avenue for designing interpretable yet faithful decompositions. Our code is available at: https://github.com/james-oldfield/MxD/.
Authors (6)
James Oldfield
Shawn Im
Sharon Li
Mihalis A. Nicolaou
Ioannis Patras
Grigorios G Chrysos
Submitted
May 27, 2025
arXiv Category
cs.LG
arXiv PDF

Key Contributions

Introduces Mixture of Decoders (MxDs), which achieve faithful dense-layer decomposition via layer-level sparsity, overcoming the accuracy trade-off of neuron-level sparsity methods. MxDs generalize MLPs and Gated Linear Units, expanding pre-trained dense layers into tens of thousands of specialized sublayers through a flexible tensor factorization; each sparsely activating sublayer implements a full-rank linear map, preserving the original layer's expressive capacity even under heavy sparsity, as illustrated in the sketch below. This enables interpretable decompositions without sacrificing performance.
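
To make the layer-level sparsity idea concrete, here is a minimal PyTorch sketch of a sparsely gated mixture-of-decoders layer. This is an illustration under simplifying assumptions, not the paper's reference implementation (see the linked repository for that): the class name MxDLayer, the top-k gating, and the particular shared-factor parameterization (a shared up/down projection pair with a per-expert elementwise scale, so each expert's effective weight stays full-rank) are all illustrative choices.

```python
# A minimal sketch of a sparsely gated mixture-of-decoders layer, assuming a
# simplified parameterization. Names (MxDLayer, num_experts, k) are
# illustrative, not the authors' reference implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MxDLayer(nn.Module):
    def __init__(self, d_in: int, d_hidden: int, d_out: int,
                 num_experts: int, k: int):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_in, num_experts)   # routes tokens to sublayers
        self.up = nn.Linear(d_in, d_hidden)        # shared full-rank factor
        # Per-expert elementwise scale over the shared hidden factor: expert n's
        # effective weight is down.weight @ diag(scale[n]) @ up.weight, which is
        # generically full-rank despite the heavy parameter sharing.
        self.scale = nn.Parameter(torch.randn(num_experts, d_hidden) * 0.02)
        self.down = nn.Linear(d_hidden, d_out)     # shared decoder factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Layer-level sparsity: only k of num_experts sublayers fire per token.
        logits = self.gate(x)                                # (B, num_experts)
        topv, topi = logits.topk(self.k, dim=-1)             # (B, k)
        gates = torch.zeros_like(logits)
        gates.scatter_(-1, topi, F.softmax(topv, dim=-1))    # sparse mixture
        mixed_scale = gates @ self.scale                     # (B, d_hidden)
        return self.down(mixed_scale * self.up(x))           # (B, d_out)

# Usage: a drop-in replacement for a dense layer of matching dimensions.
layer = MxDLayer(d_in=256, d_hidden=256, d_out=256, num_experts=1024, k=4)
y = layer(torch.randn(8, 256))  # -> shape (8, 256)
```

In the actual method, faithfulness would come from training the expanded layer to reconstruct the pre-trained dense layer's input-output mapping; consult the authors' repository for the exact factorization and training objective.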

Business Value

Enables better understanding and control over LLMs, leading to more reliable, debuggable, and steerable AI systems, crucial for high-stakes applications.