Redirecting to original paper in 30 seconds...
Click below to go immediately or wait for automatic redirect
📄 Abstract
Abstract: The deployment of Large Language Models (LLMs) on CPU-based edge devices is
crucial for enabling on-device intelligence and expanding AI accessibility.
However, it remains challenging due to limited memory and computational
resources. During edge inference, memory usage and latency are the primary
bottlenecks. Although weight quantization can effectively reduce memory
consumption, existing hardware-friendly approaches often rely on uniform
quantization, which poorly fits weight distributions and incurs high
dequantization overhead at low bit widths. To address these limitations, we
propose ELUTQ, an efficient quantization framework introducing a novel
quantization format, Hierarchical Linear Quantization (HLQ). HLQ better
captures the statistical characteristics of weights without increasing the
computational cost of Bit-serial LUT-based GEMM operations, thereby eliminating
dequantization overhead. It is orthogonal to existing quantization algorithms
and can be seamlessly integrated into various quantization pipelines. For
efficient on-device deployment, ELUTQ provides optimized CPU kernels for
end-to-end inference. Experiments show that for LLaMA3-8B, HLQ reduces
perplexity by about 8% at 3-bit and 85% at 2-bit precision under post-training
quantization, completing quantization within one hour. With efficient
finetuning, HLQ further improves 2-bit performance within two hours. In terms
of inference efficiency, our 2-bit LLaMA2-7B achieves over 25 tokens/s on an
Apple M2 chip (4 threads, batch size = 1).
Authors (5)
Xin Nie
Liang Dong
HaiCheng Zhang
JiaWang Xiao
G. Sun
Submitted
October 22, 2025
Key Contributions
ELUTQ proposes Hierarchical Linear Quantization (HLQ) to address the limitations of uniform quantization for LLMs on edge devices. HLQ better captures weight distributions without increasing computational cost, eliminating dequantization overhead and enabling efficient deployment.
Business Value
Enables running powerful LLMs directly on edge devices, reducing reliance on cloud infrastructure and improving privacy and responsiveness for applications like smart assistants and local data processing.