📄 Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities but
typically require extensive computational resources and memory for inference.
Post-training quantization (PTQ) can effectively reduce these demands by
storing weights in lower bit-width formats. However, standard uniform
quantization often leads to notable performance degradation, particularly in
low-bit scenarios. In this work, we introduce a Grouped Lattice Vector
Quantization (GLVQ) framework that assigns each group of weights a customized
lattice codebook, defined by a learnable generation matrix. To address the
non-differentiability of the quantization process, we adopt Babai rounding to
approximate nearest-lattice-point search during training, which enables stable
optimization of the generation matrices. Once trained, decoding reduces to a
simple matrix-vector multiplication, yielding an efficient and practical
quantization pipeline. Experiments on multiple benchmarks show that our
approach achieves a better trade-off between model size and accuracy compared
to existing post-training quantization baselines, highlighting its
effectiveness in deploying large models under stringent resource constraints.
Our source code is available in the GitHub repository:
https://github.com/xzhang9308/GLVQ.
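To make the core idea concrete, below is a minimal sketch of how a group of weights could be quantized against a learnable generation (basis) matrix using Babai's rounding-off approximation to nearest-lattice-point search. The function name `babai_round_quantize`, the use of PyTorch, and the toy dimensions are assumptions for illustration; the paper's actual implementation is in the linked repository and may differ in detail.

```python
import torch

def babai_round_quantize(w_group: torch.Tensor, B: torch.Tensor):
    """Quantize one group of weights with Babai's rounding technique (sketch).

    w_group: (d,) vector of weights in one group.
    B:       (d, d) generation matrix defining the group's lattice codebook.

    Returns the integer lattice coordinates z and the reconstruction B @ z.
    """
    # Map the weights into the lattice's coordinate system and round each
    # coordinate to the nearest integer (Babai's rounding-off step).
    z = torch.round(torch.linalg.solve(B, w_group))
    # Decoding is a single matrix-vector product with the generation matrix.
    w_hat = B @ z
    return z, w_hat

# Toy usage (dimensions and scaling chosen arbitrarily for illustration):
d = 4
B = 0.05 * torch.eye(d) + 0.01 * torch.randn(d, d)
w = torch.randn(d)
z, w_hat = babai_round_quantize(w, B)
```

Because decoding is only `B @ z`, the stored representation per group is the integer vector `z` plus the (shared, low-overhead) generation matrix, which is what keeps inference-time dequantization cheap.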
Authors (4)
Xi Zhang
Xiaolin Wu
Jiamang Wang
Weisi Lin
Submitted
October 23, 2025
Key Contributions
This paper introduces Grouped Lattice Vector Quantization (GLVQ) for low-bit LLM compression, which assigns each group of weights a customized lattice codebook defined by a learnable generation matrix. By employing Babai rounding to handle the non-differentiable nearest-lattice-point search during training, and by reducing decoding to an efficient matrix-vector multiplication, it significantly lowers the resource demands of LLM inference without substantial performance loss.
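The contribution hinges on making the rounding step trainable so the generation matrices can be optimized. A common way to do this is a straight-through estimator that rounds in the forward pass but passes gradients through unchanged; the sketch below illustrates that pattern, though the paper's exact gradient treatment may differ. The helper name `babai_round_ste` is hypothetical.

```python
import torch

def babai_round_ste(w_group: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    """Quantize-dequantize with a straight-through estimator so gradients
    can flow back to the learnable generation matrix B (illustrative only)."""
    coords = torch.linalg.solve(B, w_group)                # continuous lattice coordinates
    z = coords + (torch.round(coords) - coords).detach()   # round forward, identity backward
    return B @ z                                           # reconstruction used in the training loss
```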
Business Value
Enables the deployment of powerful LLMs on resource-constrained devices or at a lower cost, expanding their accessibility for various applications and reducing operational expenses for inference.