Learning Grouped Lattice Vector Quantizers for Low-Bit LLM Compression

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities but typically require extensive computational resources and memory for inference. Post-training quantization (PTQ) can effectively reduce these demands by storing weights in lower bit-width formats. However, standard uniform quantization often leads to notable performance degradation, particularly in low-bit scenarios. In this work, we introduce a Grouped Lattice Vector Quantization (GLVQ) framework that assigns each group of weights a customized lattice codebook, defined by a learnable generation matrix. To address the non-differentiability of the quantization process, we adopt Babai rounding to approximate nearest-lattice-point search during training, which enables stable optimization of the generation matrices. Once trained, decoding reduces to a simple matrix-vector multiplication, yielding an efficient and practical quantization pipeline. Experiments on multiple benchmarks show that our approach achieves a better trade-off between model size and accuracy compared to existing post-training quantization baselines, highlighting its effectiveness in deploying large models under stringent resource constraints. Our source code is available on GitHub: https://github.com/xzhang9308/GLVQ.
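
As a rough illustration of the mechanics the abstract describes (not the authors' released code), Babai rounding can be read as: express a weight group in the coordinates of its lattice generation matrix, round to the nearest integers, and decode with a single matrix-vector product. Everything below, including the toy generation matrix and the names `babai_round`, `B`, and `w`, is an assumption for illustration.

```python
import numpy as np

def babai_round(B: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Babai's rounding-off approximation to nearest-lattice-point search:
    solve B @ z = x for real-valued z, then round each coordinate to an
    integer. The integer vector is the stored code; B @ z decodes it."""
    return np.rint(np.linalg.solve(B, x))

# Illustrative usage on a single 4-dimensional weight group (toy data).
rng = np.random.default_rng(0)
d = 4
B = np.eye(d) + 0.1 * rng.standard_normal((d, d))  # stand-in generation matrix
w = rng.standard_normal(d)                          # one group of weights
z = babai_round(B, w)                               # integer code vector
w_hat = B @ z                                       # decoding: one mat-vec product
print(np.linalg.norm(w - w_hat))                    # group quantization error
```

Exact nearest-lattice-point search is NP-hard in general; the appeal of the rounding step is that it is cheap, and with a learnable `B` the residual error can be driven down during training.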
Authors: Xi Zhang, Xiaolin Wu, Jiamang Wang, Weisi Lin
Submitted: October 23, 2025
arXiv Category: cs.LG

Key Contributions

This paper introduces Grouped Lattice Vector Quantization (GLVQ) for low-bit LLM compression: each group of weights is assigned a customized lattice codebook defined by a learnable generation matrix. Babai rounding approximates the non-differentiable nearest-lattice-point search, enabling stable optimization of the generation matrices during training, and decoding then reduces to a single matrix-vector multiplication per group (see the sketch below). The result is a significant reduction in the resource demands of LLM inference without substantial performance loss.
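
A hedged end-to-end sketch of that grouped pipeline, under stated assumptions: the group size, the int8 code type, the requirement that the weight count be divisible by the group size, and the names `quantize_glvq`, `decode_glvq`, and `gen_mats` are illustrative choices, not the released implementation.

```python
import numpy as np

def quantize_glvq(w_flat: np.ndarray, gen_mats: list, d: int) -> list:
    """Split a flattened weight tensor into groups of d (assumes w_flat.size
    is a multiple of d) and Babai-round each group against its own
    generation matrix gen_mats[g], yielding one integer code per group."""
    codes = []
    for g, start in enumerate(range(0, w_flat.size, d)):
        group = w_flat[start:start + d]
        z = np.rint(np.linalg.solve(gen_mats[g], group))
        codes.append(z.astype(np.int8))  # low-bit code type is an assumption
    return codes

def decode_glvq(codes: list, gen_mats: list) -> np.ndarray:
    """Decoding is one matrix-vector product per group, as the paper notes."""
    return np.concatenate([B @ z for B, z in zip(gen_mats, codes)])
```

In a full PTQ pipeline the generation matrices would themselves be optimized (the paper trains them through the Babai rounding step); here they are treated as fixed inputs to keep the sketch self-contained.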

Business Value

Enables the deployment of powerful LLMs on resource-constrained devices or at a lower cost, expanding their accessibility for various applications and reducing operational expenses for inference.