Research Paper. Audience: ML Researchers, Hardware Engineers, AI System Designers

Energy-Efficient and Dequantization-Free Q-LLMs: A Spiking Neural Network Approach to Salient Value Mitigation


📄 Abstract

In the era of large language models (LLMs), weight-activation quantization helps fit models on edge devices by reducing memory and compute bit-widths. However, three challenges persist for energy-constrained hardware: (1) even after quantization, multiply-accumulate (MAC) operations remain unavoidable and continue to dominate energy consumption; (2) dequantization (or per-tensor/per-channel rescaling) introduces extra arithmetic and data movement, increasing latency and energy; (3) uniform parameter bit-widths clip salient values, while intra-channel mixed precision is generally impractical on current matrix hardware and memory. In contrast, brain-inspired Spiking Neural Networks (SNNs), owing to their binary spike-based information representation and the Integrate-and-Fire (IF) paradigm, naturally support mixed-precision storage and energy-efficient computation by replacing complex MACs with temporal accumulate (ACC) operations. Motivated by this property, we propose SpikeQuant, which selectively applies mixed-precision quantization to activations with salient values and re-encodes them into binary spike counts, thereby enabling dynamic mixed storage of different bit-widths. Furthermore, by embedding the quantization scale into the threshold of the IF mechanism, our approach performs energy-efficient linear transformations on weights and activations while avoiding explicit dequantization. Experimental results demonstrate that SpikeQuant consistently achieves near-FP16 perplexity under W4A4 quantization while reducing energy cost by up to 4.6 times compared to existing methods, highlighting its effectiveness for accurate and energy-efficient LLM deployment.
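The idea of embedding the quantization scale into the IF threshold can be illustrated with a small sketch. This is not the authors' implementation: the function names, the scale values, and the soft-reset firing rule are assumptions chosen for illustration. It only shows that, for a non-negative integer accumulator, counting spikes against a precomputed threshold gives the same result as dequantizing and then requantizing.

```python
def integrate_and_fire(acc, threshold):
    """Count the spikes an IF neuron emits for a non-negative integer input.

    The membrane potential starts at `acc`; each spike subtracts `threshold`
    (soft reset) until the potential drops below the threshold.
    """
    potential = acc
    spikes = 0
    while potential >= threshold:
        potential -= threshold
        spikes += 1
    return spikes


def dequantize_then_requantize(acc, s_w, s_a, s_out):
    """Conventional path: rescale the integer accumulator, then requantize."""
    real_value = acc * s_w * s_a        # explicit dequantization
    return int(real_value // s_out)     # requantize onto the output grid


# Hypothetical scales (powers of two, so the float arithmetic is exact).
S_W, S_A, S_OUT = 0.5, 0.25, 1.0
# Assumption: the scales are folded into the threshold once, offline.
THRESHOLD = S_OUT / (S_W * S_A)

acc = 51  # integer accumulator from a quantized dot product
spike_count = integrate_and_fire(acc, THRESHOLD)
reference = dequantize_then_requantize(acc, S_W, S_A, S_OUT)
```

Here both `spike_count` and `reference` come out to the same integer, but the IF path never multiplies by a scale at inference time. Negative accumulators and non-power-of-two scales would need extra care that this sketch leaves out.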

Authors (5)

Chenyu Wang

Zhanglu Yan

Zhi Zhou

Xu Chen

Weng-Fai Wong

Submitted

October 22, 2025

arXiv Category

cs.LG


Key Contributions

Proposes SpikeQuant, a method that leverages Spiking Neural Networks (SNNs) to achieve energy-efficient and dequantization-free LLM inference on edge devices. SNNs naturally support mixed-precision storage and replace MAC operations with temporal accumulation, which lets salient activation values keep a higher bit-width instead of being clipped.
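As a rough illustration of how temporal accumulation can stand in for MAC operations, the sketch below encodes each non-negative integer activation as a spike train and adds the corresponding weight whenever a spike arrives. The simple count coding, the function names, and the sample values are assumptions made for this example; the summary does not specify the paper's actual encoding.

```python
def encode_spike_train(value, timesteps):
    """Count-code a non-negative integer as `value` ones followed by zeros."""
    if not 0 <= value <= timesteps:
        raise ValueError("value must fit in the available timesteps")
    return [1] * value + [0] * (timesteps - value)


def accumulate_only_dot(weights, activations, timesteps):
    """Dot product using additions only: add w_i whenever input i spikes."""
    trains = [encode_spike_train(a, timesteps) for a in activations]
    potential = 0
    for t in range(timesteps):
        for w, train in zip(weights, trains):
            if train[t]:            # a binary spike gates an accumulate
                potential += w      # no multiplication is performed
    return potential


weights = [3, -2, 7, 1]
activations = [5, 4, 6, 2]          # 4-bit values need at most 15 timesteps
acc = accumulate_only_dot(weights, activations, timesteps=15)
mac = sum(w * a for w, a in zip(weights, activations))  # conventional MAC result
```

Both `acc` and `mac` hold the same value. The trade-off is that the multiply is exchanged for up to `timesteps` conditional additions, so the benefit depends on hardware where an accumulate is much cheaper than a MAC.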

Business Value

Significantly reduces the energy footprint of LLMs on edge devices, enabling longer battery life and more complex AI functionalities in power-constrained environments.

Paper Metadata

Innovation Type

Algorithmic Approach

Deployment Feasibility

Moderate to High, dependent on the availability and maturity of SNN hardware accelerators.

Limitations Addressed

High energy consumption of MAC operations in quantized LLMs, Latency and energy overhead from dequantization, Clipping of salient values by uniform quantization parameters, Impracticality of intra-channel mixed precision on current hardware
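The clipping limitation can be made concrete with a toy example. The values, the scale, and the choice of 8 bits for the salient entry are all hypothetical; how SpikeQuant actually selects salient values and bit-widths is not described on this page.

```python
def quantize(values, scale, bits):
    """Symmetric uniform quantization: round to the grid, then clip to range."""
    qmax = 2 ** (bits - 1) - 1
    qmin = -qmax - 1
    return [max(qmin, min(qmax, round(v / scale))) for v in values]


def dequantize(codes, scale):
    """Map integer codes back to real values."""
    return [q * scale for q in codes]


activations = [0.5, -1.0, 1.5, 12.0]    # 12.0 plays the role of a salient value
SCALE = 0.25                             # hypothetical scale tuned for the bulk

# Uniform 4-bit: the salient value saturates at the top of the 4-bit range.
uniform_4bit = dequantize(quantize(activations, SCALE, bits=4), SCALE)
# Mixed precision: same scale, but the salient entry is given 8 bits.
salient_8bit = dequantize(quantize(activations[3:], SCALE, bits=8), SCALE)
```

With a single 4-bit width, the small values survive but 12.0 is clipped to 1.75; giving only the salient entry a wider code recovers it exactly while the rest stay at 4 bits.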

Technical Tags

Spiking Neural Networks (SNNs), Quantization, LLM, Edge Devices, Energy Efficiency, Dequantization-Free, Salient Value Mitigation, Mixed Precision, MAC operations, Integrate-and-Fire (IF)

Research Topics

Energy-Efficient AI, Spiking Neural Networks, LLM Quantization, Edge Computing

Methods & Architectures

Spiking Neural Networks (SNNs), Integrate-and-Fire (IF) paradigm, SpikeQuant (proposed method), Large Language Models (LLMs)

Applications & Tasks

Edge Computing, Neuromorphic Computing, Energy Consumption, Latency, Dequantization Overhead, Salient Value Clipping, Energy-efficient LLM inference on edge devices

Related Fields

Neuromorphic Engineering, Artificial Neural Networks, Hardware Acceleration, Model Compression

Keywords

LLM, Quantization, Spiking Neural Network, SNN, Edge AI, Energy Efficiency, Dequantization-Free, Mixed Precision, Neuromorphic, MAC, Salient Values, IF Neuron

Academic Context

#Energy-Efficient AI, #Spiking Neural Networks, #LLM Quantization, #Edge Computing

Commercial Potential

Potential Products

Ultra-low power AI chips for edge devices, Energy-efficient LLM inference engines, Neuromorphic hardware accelerators

Target Industries

Consumer Electronics, IoT, Robotics, Wearables

Use Case Examples

Always-on AI features in smartwatches, Low-power natural language understanding in embedded sensors, Robots with extended operational times

Competitive Edge

Offers a novel approach using SNNs to overcome fundamental energy and quantization challenges in LLM deployment, distinct from traditional quantization methods.

Market Opportunity

Significant growth in edge AI and demand for energy-efficient computation.

Revenue Models

Licensing of SNN-based LLM inference technology, development of specialized hardware.

Resource Requirements

Compute Needs

Moderate for training SNNs, potentially very low for inference on specialized hardware.

Data Requirements

Standard LLM training datasets, potentially adapted for SNN training.

Deployment Constraints

Hardware support for SNNs, Training complexity of SNNs

Scalability

Potential for high scalability due to inherent efficiency of SNNs, but dependent on hardware.

Production Readiness

Maturity Level

Research

Time to Market

2-5 years

Patent Potential

High, for the SpikeQuant method and its application.
