📄 Abstract
In the era of large language models (LLMs), weight-activation quantization
helps fit models on edge devices by reducing memory and compute bit-widths.
However, three challenges persist on energy-constrained hardware: (1) even
after quantization, multiply-accumulate (MAC) operations remain unavoidable and
continue to dominate energy consumption; (2) dequantization (or
per-tensor/channel rescaling) introduces extra arithmetic and data movement,
increasing latency and energy; (3) uniform parameter bit-widths clip salient
values, while intra-channel mixed precision is generally impractical on current
matrix hardware and memory. In contrast, brain-inspired Spiking Neural Networks
(SNNs), owing to their binary spike-based information representation and the
Integrate-and-Fire (IF) paradigm, naturally support mixed-precision storage and
energy-efficient computation by replacing complex MACs with temporal accumulate
(ACC) operations. Motivated by this property, we propose SpikeQuant, which
selectively applies mixed-precision quantization to activations with salient
values and re-encodes them into binary spike counts, thereby enabling dynamic
mixed storage of different bit-widths. Furthermore, by embedding the
quantization scale into the threshold of the IF mechanism, our approach
performs energy-efficient linear transformations on weights and activations
while avoiding explicit dequantization. Experimental results demonstrate that
SpikeQuant consistently achieves near-FP16 perplexity under W4A4 quantization
while reducing energy cost by up to 4.6× compared to existing methods,
highlighting its effectiveness for accurate and energy-efficient LLM
deployment.
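
To make the abstract's mechanism concrete, below is a minimal NumPy sketch of the two ideas it describes: re-encoding quantized activations as binary spike counts via an IF neuron, and absorbing the quantization scale into the firing threshold so that the linear transform needs only accumulates and a single deferred rescale. The function names (`if_encode`, `acc_linear`), the soft-reset IF variant, and the parameter choices are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def if_encode(x, threshold, T):
    """Encode activations x as binary spikes over T timesteps.

    Sketch assumption: with the threshold set to the quantization scale,
    the total spike count per neuron equals the integer quantized value
    floor(x / scale), so the spike train itself is the quantized form.
    """
    v = x.astype(np.float64).copy()          # membrane potential, injected once
    spikes = np.zeros((T,) + x.shape, dtype=np.int8)
    for t in range(T):
        fired = v >= threshold - 1e-9        # small tolerance for FP round-off
        spikes[t] = fired
        v -= fired * threshold               # soft reset: subtract the threshold
    return spikes

def acc_linear(spikes, W, scale):
    """Linear layer using only accumulates (ACCs) on the activation side.

    Each timestep adds the weight columns of the neurons that spiked; one
    scalar multiply by the scale at the very end stands in for per-element
    dequantization of the activations.
    """
    out = np.zeros(W.shape[0])
    for spikes_t in spikes:
        active = np.nonzero(spikes_t)[0]
        out += W[:, active].sum(axis=1)      # pure additions, no MACs
    return out * scale                       # single deferred rescale

# Usage: 4-bit activations (q in [0, 15]) over T = 15 timesteps.
rng = np.random.default_rng(0)
T, scale = 15, 0.05
q = rng.integers(0, T + 1, size=8)           # integer quantized activations
x = q * scale                                # their real values
W = rng.standard_normal((4, 8))
y_dense = W @ x                              # reference dense matmul
y_spike = acc_linear(if_encode(x, scale, T), W, scale)
assert np.allclose(y_dense, y_spike)
```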
Authors (5)
Chenyu Wang
Zhanglu Yan
Zhi Zhou
Xu Chen
Weng-Fai Wong
Submitted
October 22, 2025
Key Contributions
Proposes SpikeQuant, a method that leverages Spiking Neural Networks (SNNs) to achieve energy-efficient and dequantization-free LLM inference on edge devices. SNNs naturally support mixed precision and replace MAC operations with temporal accumulation, mitigating salient value issues.
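
A toy sketch of why spike counts make mixed precision easy to store: a b-bit quantized value is just up to 2**b − 1 binary spikes, so salient activations can simply be given a larger spike budget (more timesteps) per element, with no packed mixed-width integer format. The magnitude-based saliency test and the 4-/8-bit budgets below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def mixed_precision_counts(x, scale, salient_mask, base_bits=4, salient_bits=8):
    """Quantize x to spike counts, widening the range for salient entries."""
    bits = np.where(salient_mask, salient_bits, base_bits)
    max_count = 2 ** bits - 1                  # per-element spike budget
    return np.clip(np.round(x / scale), 0, max_count).astype(np.int32)

x = np.array([0.12, 3.90, 0.05, 0.22])         # one outlier ("salient") value
salient = np.abs(x) > 1.0                      # toy magnitude-based criterion
print(mixed_precision_counts(x, scale=0.02, salient_mask=salient))
# -> [  6 195   2  11]; at a uniform 4 bits the 3.90 entry would clip to 15
```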
Business Value
Significantly reduces the energy footprint of LLMs on edge devices, enabling longer battery life and more complex AI functionalities in power-constrained environments.