📄 Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities but
typically require extensive computational resources and memory for inference.
Post-training quantization (PTQ) can effectively reduce these demands by
storing weights in lower bit-width formats. However, standard uniform
quantization often leads to notable performance degradation, particularly in
low-bit scenarios. In this work, we introduce a Grouped Lattice Vector
Quantization (GLVQ) framework that assigns each group of weights a customized
lattice codebook, defined by a learnable generation matrix. To address the
non-differentiability of the quantization process, we adopt Babai rounding to
approximate nearest-lattice-point search during training, which enables stable
optimization of the generation matrices. Once trained, decoding reduces to a
simple matrix-vector multiplication, yielding an efficient and practical
quantization pipeline. Experiments on multiple benchmarks show that our
approach achieves a better trade-off between model size and accuracy compared
to existing post-training quantization baselines, highlighting its
effectiveness in deploying large models under stringent resource constraints.
Our source code is available in the GitHub repository:
https://github.com/xzhang9308/GLVQ.
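To make the core idea concrete, below is a minimal sketch of how a group of weights could be quantized against a learnable generation (basis) matrix using Babai's rounding-off approximation to nearest-lattice-point search. The function name `babai_round_quantize`, the use of PyTorch, and the toy dimensions are assumptions for illustration; the paper's actual implementation is in the linked repository and may differ in detail.

```python
import torch

def babai_round_quantize(w_group: torch.Tensor, B: torch.Tensor):
    """Quantize one group of weights with Babai's rounding technique (sketch).

    w_group: (d,) vector of weights in one group.
    B:       (d, d) generation matrix defining the group's lattice codebook.

    Returns the integer lattice coordinates z and the reconstruction B @ z.
    """
    # Map the weights into the lattice's coordinate system and round each
    # coordinate to the nearest integer (Babai's rounding-off step).
    z = torch.round(torch.linalg.solve(B, w_group))
    # Decoding is a single matrix-vector product with the generation matrix.
    w_hat = B @ z
    return z, w_hat

# Toy usage (dimensions and scaling chosen arbitrarily for illustration):
d = 4
B = 0.05 * torch.eye(d) + 0.01 * torch.randn(d, d)
w = torch.randn(d)
z, w_hat = babai_round_quantize(w, B)
```

Because decoding is only `B @ z`, the stored representation per group is the integer vector `z` plus the (shared, low-overhead) generation matrix, which is what keeps inference-time dequantization cheap.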
Authors (4)
Xi Zhang
Xiaolin Wu
Jiamang Wang
Weisi Lin
Submitted
October 23, 2025
Key Contributions
This paper introduces Grouped Lattice Vector Quantization (GLVQ) for low-bit LLM compression, which assigns each group of weights a customized lattice codebook defined by a learnable generation matrix. By employing Babai rounding to handle the non-differentiable nearest-lattice-point search during training, and by reducing decoding to an efficient matrix-vector multiplication, it significantly lowers the resource demands of LLM inference without substantial performance loss.
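The contribution hinges on making the rounding step trainable so the generation matrices can be optimized. A common way to do this is a straight-through estimator that rounds in the forward pass but passes gradients through unchanged; the sketch below illustrates that pattern, though the paper's exact gradient treatment may differ. The helper name `babai_round_ste` is hypothetical.

```python
import torch

def babai_round_ste(w_group: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    """Quantize-dequantize with a straight-through estimator so gradients
    can flow back to the learnable generation matrix B (illustrative only)."""
    coords = torch.linalg.solve(B, w_group)                # continuous lattice coordinates
    z = coords + (torch.round(coords) - coords).detach()   # round forward, identity backward
    return B @ z                                           # reconstruction used in the training loss
```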
Business Value
Enables the deployment of powerful LLMs on resource-constrained devices or at a lower cost, expanding their accessibility for various applications and reducing operational expenses for inference.