Efficient Large Language Model Inference with Neural Block Linearization

Abstract

The high inference demands of transformer-based Large Language Models (LLMs) pose substantial challenges to their deployment. To this end, we introduce Neural Block Linearization (NBL), a novel framework for accelerating transformer model inference by replacing self-attention layers with linear approximations derived from Linear Minimum Mean Squared Error (LMMSE) estimators. NBL leverages Canonical Correlation Analysis (CCA) to compute a theoretical upper bound on the approximation error. Then, we use this bound as a criterion for substitution, selecting the LLM layers with the lowest linearization error. NBL can be efficiently applied to pre-trained LLMs without the need for fine-tuning. In experiments, NBL achieves notable computational speed-ups while preserving competitive accuracy on multiple reasoning benchmarks. For instance, applying NBL to 12 self-attention layers in DeepSeek-R1-Distill-Llama-8B increases the inference speed by 32% with less than 1% accuracy trade-off, making it a flexible and promising solution to improve the inference efficiency of LLMs. The implementation is available at: https://github.com/LIONS-EPFL/NBL.
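The substitution step described above admits a compact sketch: given paired input/output activations of a self-attention block collected on calibration data, the LMMSE-optimal linear replacement is a closed-form regression on the activation covariances. The sketch below is illustrative only; the function name, tensor shapes, and the covariance regularization are our assumptions, not the paper's code (the official implementation is in the linked repository).

```python
# Minimal sketch of LMMSE linearization of one block, assuming paired
# calibration activations X (block inputs) and Y (block outputs).
import torch

def lmmse_linear_map(X: torch.Tensor, Y: torch.Tensor, eps: float = 1e-5):
    """Fit Y ≈ X @ W + b in the LMMSE sense.

    X: (n, d_in) inputs to the attention block on calibration data.
    Y: (n, d_out) corresponding block outputs.
    """
    mu_x, mu_y = X.mean(0), Y.mean(0)
    Xc, Yc = X - mu_x, Y - mu_y
    n = X.shape[0]
    # Regularized input covariance and input-output cross-covariance.
    cov_xx = Xc.T @ Xc / n + eps * torch.eye(X.shape[1])
    cov_xy = Xc.T @ Yc / n
    # LMMSE solution: W = Σ_xx^{-1} Σ_xy, with an intercept matching the means.
    W = torch.linalg.solve(cov_xx, cov_xy)
    b = mu_y - mu_x @ W
    return W, b
```

Once fitted, the attention block is replaced by the single affine map `x @ W + b`, which is where the inference speed-up comes from.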
Authors (3): Mete Erdogan, Francesco Tonin, Volkan Cevher
Submitted: May 27, 2025
arXiv Category: cs.LG

Key Contributions

Introduces Neural Block Linearization (NBL), a novel framework that accelerates transformer LLM inference by replacing self-attention layers with linear approximations derived from LMMSE estimation. NBL uses CCA to compute an upper bound on the approximation error and selects the layers with the lowest bound for substitution, achieving significant speed-ups (e.g., 32% faster inference when linearizing 12 self-attention layers of DeepSeek-R1-Distill-Llama-8B) with minimal accuracy loss (<1%). Crucially, it works on pre-trained LLMs without fine-tuning.
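The selection criterion can also be sketched: the canonical correlations between a block's inputs and outputs measure how well any linear map can explain the block, so blocks can be ranked by a score derived from them. The exact error bound is given in the paper; the score below (the sum of 1 − ρ_i²) is only an illustrative proxy, and all names here are hypothetical.

```python
# Hedged sketch of CCA-based layer ranking: lower score -> more linear block
# -> better candidate for substitution. Not the paper's exact bound.
import torch

def canonical_correlations(X, Y, eps=1e-5):
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    n = X.shape[0]
    Sxx = Xc.T @ Xc / n + eps * torch.eye(X.shape[1])
    Syy = Yc.T @ Yc / n + eps * torch.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / n
    # Whiten both sides via Cholesky factors; the singular values of the
    # whitened cross-covariance are the canonical correlations.
    Lx = torch.linalg.cholesky(Sxx)
    Ly = torch.linalg.cholesky(Syy)
    M = torch.linalg.solve(Lx, Sxy) @ torch.linalg.inv(Ly).T
    return torch.linalg.svdvals(M).clamp(0.0, 1.0)

def linearization_score(X, Y):
    # Illustrative proxy: residual variance not captured by any linear map.
    rho = canonical_correlations(X, Y)
    return float((1.0 - rho**2).sum())
```

Ranking all blocks by this kind of score and linearizing the lowest-scoring ones mirrors the paper's substitution criterion at a sketch level.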

Business Value

Significantly reduces the cost and latency of deploying LLMs, making them more accessible for real-time applications and on devices with limited computational power. This broadens the market for LLM-powered services.