
BitNet Distillation

📄 Abstract

In this paper, we present BitNet Distillation (BitDistill), a lightweight pipeline that fine-tunes off-the-shelf full-precision LLMs (e.g., Qwen) into 1.58-bit precision (i.e., ternary weights {-1, 0, 1}) for specific downstream tasks, achieving strong task-specific performance with minimal computational cost. Specifically, BitDistill incorporates three key techniques: the SubLN module, as introduced in BitNet; multi-head attention distillation, based on MiniLM; and continual pre-training, which serves as a crucial warm-up step to mitigate the scalability issue, i.e., the performance gap between fine-tuned full-precision and 1.58-bit LLMs on specific tasks. Experimental results show that BitDistill achieves performance comparable to its full-precision counterpart models across model sizes, while enabling up to 10x memory savings and 2.65x faster inference on CPUs. Code is available at https://github.com/microsoft/BitNet.
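
For intuition, the sketch below shows how full-precision weights can be mapped to the ternary set {-1, 0, 1} with a per-tensor absmean scale, in the style of BitNet b1.58. It is a minimal illustration, not the released implementation: the class and function names are made up here, and activation quantization and the SubLN module are omitted.

```python
import torch
import torch.nn.functional as F

def ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    """Map full-precision weights to {-1, 0, 1} with an absmean scale
    (BitNet b1.58-style). Illustrative sketch only."""
    scale = w.abs().mean().clamp(min=eps)    # per-tensor scaling factor
    w_q = (w / scale).round().clamp_(-1, 1)  # ternary weights
    return w_q, scale

class BitLinearSketch(torch.nn.Linear):
    """Linear layer that runs on ternary weights in the forward pass while
    keeping the latent full-precision weights trainable via a
    straight-through estimator."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_q, scale = ternary_quantize(self.weight)
        # Forward uses the quantized weights; backward sees the identity,
        # so gradients update the latent full-precision weights.
        w = self.weight + (w_q * scale - self.weight).detach()
        return F.linear(x, w, self.bias)
```

Swapping the linear layers of a pretrained model for such quantized layers (together with the SubLN normalization the paper inserts) is, roughly, the structural change made before distillation and fine-tuning.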
Authors (7)
Xun Wu
Shaohan Huang
Wenhui Wang
Ting Song
Li Dong
Yan Xia
+1 more
Submitted
October 15, 2025
arXiv Category
cs.LG

Key Contributions

Presents BitNet Distillation (BitDistill), a lightweight pipeline that fine-tunes off-the-shelf full-precision LLMs into 1.58-bit precision (ternary weights {-1, 0, 1}) for specific downstream tasks. It combines the SubLN module from BitNet, MiniLM-style multi-head attention distillation, and a continual pre-training warm-up to mitigate the performance gap to full-precision fine-tuning, achieving strong task-specific performance with significant memory and compute savings.
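
To make the distillation component concrete, here is a minimal sketch of a MiniLM-style attention distillation term: the student's attention distributions are pulled toward the teacher's with a KL divergence. The function and argument names (attention_distill_loss, student_logits, teacher_logits) are illustrative assumptions, and the paper's full recipe additionally includes the continual pre-training warm-up and the downstream task loss.

```python
import torch
import torch.nn.functional as F

def attention_distill_loss(student_logits: torch.Tensor,
                           teacher_logits: torch.Tensor) -> torch.Tensor:
    """KL(teacher || student) over attention distributions.

    Both tensors hold pre-softmax attention scores of shape
    (batch, num_heads, query_len, key_len). Illustrative sketch only.
    """
    log_p_s = F.log_softmax(student_logits, dim=-1)
    p_t = F.softmax(teacher_logits, dim=-1)
    # KL divergence per (head, query position), then averaged.
    kl = (p_t * (torch.log(p_t.clamp_min(1e-9)) - log_p_s)).sum(dim=-1)
    return kl.mean()

# Hypothetical use inside a fine-tuning step:
# loss = task_loss + lambda_attn * attention_distill_loss(s_scores, t_scores)
```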

Business Value

Makes powerful LLMs practical on resource-constrained hardware such as CPUs by drastically reducing memory and compute requirements (the paper reports up to 10x memory savings and 2.65x faster CPU inference), enabling broader, lower-cost deployment of task-specific models.
