
LittleBit: Ultra Low-Bit Quantization via Latent Factorization

Abstract

Deploying large language models (LLMs) often faces challenges from substantial memory and computational costs. Quantization offers a solution, yet preserving performance in the sub-1-bit regime remains particularly difficult. This paper introduces LittleBit, a novel method for extreme LLM compression. It targets levels like 0.1 bits per weight (BPW), achieving nearly 31× memory reduction, e.g., Llama2-13B to under 0.9 GB. LittleBit represents weights in a low-rank form using latent matrix factorization, then binarizes these factors. To counteract the information loss from this extreme precision, it integrates a multi-scale compensation mechanism: row and column scales plus an additional latent-dimension scale that learns per-rank importance. Two key contributions enable effective training: Dual Sign-Value-Independent Decomposition (Dual-SVID) for quantization-aware training (QAT) initialization, and integrated Residual Compensation to mitigate errors. Extensive experiments confirm LittleBit's superiority in sub-1-bit quantization: e.g., its 0.1 BPW performance on Llama2-7B surpasses the leading method's at 0.7 BPW. LittleBit establishes a new, viable size-performance trade-off, unlocking a potential 11.6× speedup over FP16 at the kernel level, and makes powerful LLMs practical for resource-constrained environments.
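The representation the abstract describes (binarized low-rank factors with row, column, and per-rank scales) can be sketched in NumPy. Everything below is illustrative, not the paper's implementation: a plain SVD stands in for the Dual-SVID initialization, simple magnitude-based scales stand in for the learned ones, and the names `g_row`, `g_col`, and `h` are assumed notation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 4               # toy layer shape and latent rank

W = rng.standard_normal((d_out, d_in))   # stand-in full-precision weight

# Low-rank factorization via SVD (a stand-in for the paper's Dual-SVID init).
U, S, Vt = np.linalg.svd(W, full_matrices=False)
U_r = U[:, :r] * np.sqrt(S[:r])          # (d_out, r) latent factor
V_r = Vt[:r, :].T * np.sqrt(S[:r])       # (d_in, r)  latent factor

# Binarize the latent factors to {-1, +1}.
B_u = np.where(U_r >= 0, 1.0, -1.0)
B_v = np.where(V_r >= 0, 1.0, -1.0)

# Multi-scale compensation: per-row, per-column, and per-rank (latent) scales.
# These magnitude-based choices are illustrative; LittleBit learns the scales
# jointly during quantization-aware training.
g_row = np.abs(U_r).mean(axis=1)         # (d_out,) per-row scale
g_col = np.abs(V_r).mean(axis=1)         # (d_in,)  per-column scale
h = np.ones(r)                           # (r,)     per-rank importance

# Reconstruct the weight from 1-bit factors and the three scale vectors.
W_hat = g_row[:, None] * (B_u @ np.diag(h) @ B_v.T) * g_col[None, :]

# Effective bits per weight: 1-bit factors plus FP16 scale-vector overhead.
binary_bits = r * (d_out + d_in)
scale_bits = 16 * (d_out + d_in + r)
bpw = (binary_bits + scale_bits) / (d_out * d_in)
print(f"reconstruction shape: {W_hat.shape}, effective BPW: {bpw:.3f}")
```

Even in this toy setting, storing only the sign bits of the factors plus three small scale vectors lands well under 1 BPW; pushing toward 0.1 BPW is a matter of shrinking the rank `r` relative to the layer dimensions.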
Authors (4)
Banseok Lee
Dongkyu Kim
Youngcheon You
Youngmin Kim
Submitted
May 30, 2025
arXiv Category
cs.LG
arXiv PDF

Key Contributions

Introduces LittleBit, a novel method for extreme LLM compression targeting sub-1-bit quantization (e.g., 0.1 BPW) via latent matrix factorization and binarization. It employs multi-scale compensation and specialized QAT techniques (Dual-SVID, Residual Compensation) to mitigate performance loss.

Business Value

Enables the deployment of powerful LLMs on devices with limited memory and computational power (e.g., mobile phones, edge devices), democratizing access to advanced AI capabilities and reducing operational costs.