Towards Interpretability Without Sacrifice: Faithful Dense Layer Decomposition with Mixture of Decoders

📄 Abstract

Multilayer perceptrons (MLPs) are an integral part of large language models, yet their dense representations render them difficult to understand, edit, and steer. Recent methods learn interpretable approximations via neuron-level sparsity, yet fail to faithfully reconstruct the original mapping, significantly increasing the model's next-token cross-entropy loss. In this paper, we advocate for moving to layer-level sparsity to overcome the accuracy trade-off in sparse layer approximation. Under this paradigm, we introduce Mixture of Decoders (MxDs). MxDs generalize MLPs and Gated Linear Units, expanding pre-trained dense layers into tens of thousands of specialized sublayers. Through a flexible form of tensor factorization, each sparsely activating MxD sublayer implements a linear transformation with full-rank weights, preserving the original decoders' expressive capacity even under heavy sparsity. Experimentally, we show that MxDs significantly outperform state-of-the-art methods (e.g., Transcoders) on the sparsity-accuracy frontier in language models with up to 3B parameters. Further evaluations on sparse probing and feature steering demonstrate that MxDs learn similarly specialized features of natural language, opening a promising new avenue for designing interpretable yet faithful decompositions. Our code is available at: https://github.com/james-oldfield/MxD/.
Authors (6)
James Oldfield
Shawn Im
Sharon Li
Mihalis A. Nicolaou
Ioannis Patras
Grigorios G Chrysos
Submitted
May 27, 2025
arXiv Category
cs.LG
arXiv PDF

Key Contributions

Introduces Mixture of Decoders (MxDs), which achieve faithful dense-layer decomposition via layer-level sparsity, overcoming the accuracy trade-off of neuron-level sparsity methods. MxDs generalize MLPs and Gated Linear Units, expanding pre-trained dense layers into tens of thousands of specialized sublayers through a flexible tensor factorization; each sparsely activating sublayer implements a full-rank linear map, preserving the original layer's expressive capacity even under heavy sparsity, as illustrated in the sketch below. This enables interpretable decompositions without sacrificing performance.
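
To make the layer-level sparsity idea concrete, here is a minimal PyTorch sketch of a sparsely gated mixture-of-decoders layer. This is an illustration under simplifying assumptions, not the paper's reference implementation (see the linked repository for that): the class name MxDLayer, the top-k gating, and the particular shared-factor parameterization (a shared up/down projection pair with a per-expert elementwise scale, so each expert's effective weight stays full-rank) are all illustrative choices.

```python
# A minimal sketch of a sparsely gated mixture-of-decoders layer, assuming a
# simplified parameterization. Names (MxDLayer, num_experts, k) are
# illustrative, not the authors' reference implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MxDLayer(nn.Module):
    def __init__(self, d_in: int, d_hidden: int, d_out: int,
                 num_experts: int, k: int):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_in, num_experts)   # routes tokens to sublayers
        self.up = nn.Linear(d_in, d_hidden)        # shared full-rank factor
        # Per-expert elementwise scale over the shared hidden factor: expert n's
        # effective weight is down.weight @ diag(scale[n]) @ up.weight, which is
        # generically full-rank despite the heavy parameter sharing.
        self.scale = nn.Parameter(torch.randn(num_experts, d_hidden) * 0.02)
        self.down = nn.Linear(d_hidden, d_out)     # shared decoder factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Layer-level sparsity: only k of num_experts sublayers fire per token.
        logits = self.gate(x)                                # (B, num_experts)
        topv, topi = logits.topk(self.k, dim=-1)             # (B, k)
        gates = torch.zeros_like(logits)
        gates.scatter_(-1, topi, F.softmax(topv, dim=-1))    # sparse mixture
        mixed_scale = gates @ self.scale                     # (B, d_hidden)
        return self.down(mixed_scale * self.up(x))           # (B, d_out)

# Usage: a drop-in replacement for a dense layer of matching dimensions.
layer = MxDLayer(d_in=256, d_hidden=256, d_out=256, num_experts=1024, k=4)
y = layer(torch.randn(8, 256))  # -> shape (8, 256)
```

In the actual method, faithfulness would come from training the expanded layer to reconstruct the pre-trained dense layer's input-output mapping; consult the authors' repository for the exact factorization and training objective.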

Business Value

Enables better understanding and control over LLMs, leading to more reliable, debuggable, and steerable AI systems, crucial for high-stakes applications.