Abstract
Low-Rank Adaptation (LoRA) has become a popular technique for parameter-efficient fine-tuning of large language models (LLMs). In many real-world scenarios, multiple adapters are loaded simultaneously to enable LLM customization for personalized user experiences or to support a diverse range of tasks. Although each adapter is lightweight in isolation, their aggregate cost becomes substantial at scale. To address this, we propose LoRAQuant, a mixed-precision post-training quantization method tailored to LoRA. Specifically, LoRAQuant reparameterizes each adapter by singular value decomposition (SVD) to concentrate the most important information into specific rows and columns. This makes it possible to quantize the important components to higher precision, while quantizing the rest to ultra-low bitwidth. We conduct comprehensive experiments with LLaMA 2-7B, LLaMA 2-13B, and Mistral 7B models on mathematical reasoning, coding, and summarization tasks. Results show that LoRAQuant uses significantly fewer bits than other quantization methods, while achieving comparable or even higher performance.
Authors (4)
Amir Reza Mirzaei
Yuqiao Wen
Yanshuai Cao
Lili Mou
Submitted
October 30, 2025
Key Contributions
LoRAQuant proposes a novel mixed-precision post-training quantization method for LoRA adapters by reparameterizing them with SVD. This allows concentrating important information into higher precision components while quantizing the rest to ultra-low bitwidths, significantly reducing the aggregate cost of multiple adapters without substantial performance degradation.
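The sketch below illustrates the idea described above: factor the low-rank update with SVD so that importance is ordered by singular value, then quantize the leading components at higher precision and the remaining ones at ultra-low bitwidth. This is not the authors' implementation; the function name `lora_svd_mixed_precision`, the split point `k`, the bitwidths, and the simple per-tensor uniform quantizer are illustrative assumptions.

```python
# Minimal NumPy sketch of SVD-based mixed-precision quantization for a LoRA adapter.
# Assumes the adapter update is W ≈ B @ A with B in R^{d_out x r} and A in R^{r x d_in}.
import numpy as np

def lora_svd_mixed_precision(B, A, k, hi_bits=8, lo_bits=2):
    """Reparameterize a LoRA adapter via SVD of B @ A, then quantize the top-k
    singular components at hi_bits and the remaining components at lo_bits.
    (k, hi_bits, lo_bits are illustrative, not values from the paper.)"""
    r = B.shape[1]
    # SVD of the rank-r update; a real implementation would avoid forming the
    # full d_out x d_in product (e.g., via QR of the factors), but this keeps
    # the sketch simple. Only the first r singular values are nonzero.
    U, S, Vt = np.linalg.svd(B @ A, full_matrices=False)
    U, S, Vt = U[:, :r], S[:r], Vt[:r, :]

    # Fold singular values symmetrically into both factors, so that column i of
    # B_new and row i of A_new carry the i-th (importance-ranked) component.
    B_new = U * np.sqrt(S)               # (d_out, r)
    A_new = np.sqrt(S)[:, None] * Vt     # (r, d_in)

    def quantize(x, bits):
        # Simple symmetric uniform quantization (per-tensor, illustrative).
        scale = np.abs(x).max() / (2 ** (bits - 1) - 1) + 1e-12
        q = np.round(x / scale).clip(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
        return q * scale

    # Important components (largest singular values) get higher precision;
    # the rest are quantized to ultra-low bitwidth.
    B_q = np.concatenate([quantize(B_new[:, :k], hi_bits),
                          quantize(B_new[:, k:], lo_bits)], axis=1)
    A_q = np.concatenate([quantize(A_new[:k, :], hi_bits),
                          quantize(A_new[k:, :], lo_bits)], axis=0)
    return B_q, A_q

# Example (hypothetical sizes): a rank-16 adapter for a 1024x1024 layer,
# keeping the top 4 components at 8 bits and the rest at 2 bits.
B = np.random.randn(1024, 16) * 0.01
A = np.random.randn(16, 1024) * 0.01
B_q, A_q = lora_svd_mixed_precision(B, A, k=4)
```

At inference, the quantized adapter is applied exactly like a standard LoRA update (y += B_q @ (A_q @ x)), so the mixed-precision storage reduces the aggregate memory of many co-loaded adapters without changing the serving path.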
Business Value
Enables more efficient deployment and scaling of personalized LLM services by reducing the memory and computational footprint of fine-tuned adapters, leading to cost savings and improved user experience.