Abstract
The high inference demands of transformer-based Large Language Models (LLMs) pose substantial challenges in their deployment. To this end, we introduce Neural Block Linearization (NBL), a novel framework for accelerating transformer model inference by replacing self-attention layers with linear approximations derived from Linear Minimum Mean Squared Error (LMMSE) estimators. NBL leverages Canonical Correlation Analysis (CCA) to compute a theoretical upper bound on the approximation error. Then, we use this bound as a criterion for substitution, selecting the LLM layers with the lowest linearization error. NBL can be efficiently applied to pre-trained LLMs without the need for fine-tuning. In experiments, NBL achieves notable computational speed-ups while preserving competitive accuracy on multiple reasoning benchmarks. For instance, applying NBL to 12 self-attention layers in DeepSeek-R1-Distill-Llama-8B increases the inference speed by 32% with less than a 1% accuracy trade-off, making it a flexible and promising solution to improve the inference efficiency of LLMs. The implementation is available at: https://github.com/LIONS-EPFL/NBL.
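The replacement step described in the abstract can be sketched concretely. Given paired calibration activations X (inputs to a block) and Y (the block's outputs), the LMMSE estimator is the affine map minimizing the expected squared error, with the standard closed form W = Cov(X)^{-1} Cov(X, Y) and b = E[Y] - E[X] W. The snippet below is an illustrative sketch under these assumptions, not the released NBL implementation; the function name `fit_lmmse_linear` and the regularization term are placeholders introduced here.

```python
# Minimal sketch (not the official NBL code): fit an LMMSE affine replacement
# for a transformer block from paired input/output activations.
import torch

def fit_lmmse_linear(X: torch.Tensor, Y: torch.Tensor, eps: float = 1e-5):
    """Closed-form LMMSE fit: minimize E||Y - (X @ W + b)||^2.

    X: (n_samples, d_in)  inputs to the block (e.g., hidden states)
    Y: (n_samples, d_out) outputs of the block on the same samples
    Returns (W, b) such that X @ W + b approximates Y.
    """
    mu_x, mu_y = X.mean(0, keepdim=True), Y.mean(0, keepdim=True)
    Xc, Yc = X - mu_x, Y - mu_y
    # Regularized input covariance for numerical stability (eps is an assumption).
    cov_xx = Xc.T @ Xc / X.shape[0] + eps * torch.eye(
        X.shape[1], dtype=X.dtype, device=X.device
    )
    cov_xy = Xc.T @ Yc / X.shape[0]
    W = torch.linalg.solve(cov_xx, cov_xy)   # (d_in, d_out)
    b = (mu_y - mu_x @ W).squeeze(0)         # (d_out,)
    return W, b

# Usage idea: collect (X, Y) for a candidate self-attention layer on a small
# calibration set, fit (W, b), then replace that layer with x -> x @ W + b.
```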
Authors (3)
Mete Erdogan
Francesco Tonin
Volkan Cevher
Key Contributions
Introduces Neural Block Linearization (NBL), a framework that accelerates transformer LLM inference by replacing self-attention layers with linear approximations derived from LMMSE estimation. NBL uses CCA to compute an upper bound on the approximation error and substitutes the layers with the lowest linearization error, achieving significant speed-ups (e.g., 32% when 12 self-attention layers of DeepSeek-R1-Distill-Llama-8B are linearized) with minimal accuracy loss (<1%). Crucially, it applies to pre-trained LLMs without any fine-tuning.
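To make the CCA-based selection criterion concrete, the sketch below computes the canonical correlations between a block's inputs and outputs and turns them into a simple ranking score. The paper's actual error upper bound is not reproduced here; the mean canonical correlation used as `linearization_score` is a hypothetical proxy, and all function names are illustrative assumptions.

```python
# Illustrative sketch only: rank layers by how linearly predictable their
# outputs are from their inputs, via Canonical Correlation Analysis (CCA).
import torch

def canonical_correlations(X: torch.Tensor, Y: torch.Tensor, eps: float = 1e-5):
    """Canonical correlations between samples X (n, dx) and Y (n, dy)."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    n = X.shape[0]
    cxx = Xc.T @ Xc / n + eps * torch.eye(X.shape[1], dtype=X.dtype, device=X.device)
    cyy = Yc.T @ Yc / n + eps * torch.eye(Y.shape[1], dtype=Y.dtype, device=Y.device)
    cxy = Xc.T @ Yc / n
    # Whiten both sides with Cholesky factors; the singular values of the
    # whitened cross-covariance are the canonical correlations.
    lx = torch.linalg.cholesky(cxx)
    ly = torch.linalg.cholesky(cyy)
    m = torch.linalg.solve(lx, cxy) @ torch.linalg.inv(ly).T
    return torch.linalg.svdvals(m).clamp(0.0, 1.0)

def linearization_score(X: torch.Tensor, Y: torch.Tensor) -> float:
    """Hypothetical proxy score: higher mean correlation suggests the block is
    closer to linear, i.e., a safer candidate for NBL-style replacement."""
    return canonical_correlations(X, Y).mean().item()
```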
Business Value
Significantly reduces the cost and latency of deploying LLMs, making them more accessible for real-time applications and on devices with limited computational power. This broadens the market for LLM-powered services.