Abstract
The typical Selective State-Space Model (SSM) used in Mamba addresses several
limitations of Transformers, such as the quadratic computational complexity
with respect to sequence length and the significant memory requirements during
inference due to the key-value (KV) cache. However, the increasing size of
Mamba models continues to pose challenges for training and deployment,
particularly due to their substantial computational demands during both
training and inference. In this work, we introduce $\texttt{Bi-Mamba}$, a
scalable and powerful 1-bit Mamba architecture designed to enable more
efficient large language models (LLMs), with model sizes of 780M, 1.3B, and
2.7B parameters. $\texttt{Bi-Mamba}$ models are trained from scratch on a
standard LLM-scale dataset using an autoregressive distillation loss. Extensive
experiments on language modeling benchmarks demonstrate that
$\texttt{Bi-Mamba}$ achieves performance comparable to its full-precision (FP16
or BF16) counterparts, while outperforming post-training binarization (PTB)
Mamba and binarization-aware training (BAT) Transformer baselines. Moreover,
$\texttt{Bi-Mamba}$ drastically reduces memory usage and computational cost
compared to the original Mamba. Our work pioneers a new line of
linear-complexity LLMs under low-bit representation and paves the way for
the design of specialized hardware optimized for efficient 1-bit Mamba-based
models. Code and the pre-trained weights are available at
https://github.com/Tangshengku/Bi-Mamba.
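To make the two training ingredients named above concrete, here is a minimal sketch of (a) a 1-bit linear layer trained with a straight-through estimator and (b) a token-level autoregressive distillation loss against a full-precision teacher. All names and details (per-row scaling, the KL form of the loss) are illustrative assumptions for exposition, not the released Bi-Mamba implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinaryLinear(nn.Module):
    """Linear layer whose weights are binarized to {-1, +1} times a
    per-output-row scale in the forward pass, while full-precision
    latent weights receive the gradients (straight-through estimator)."""

    def __init__(self, in_features: int, out_features: int, bias: bool = False):
        super().__init__()
        # Latent full-precision weights; only their sign + scale are
        # used in the forward pass.
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        scale = w.abs().mean(dim=1, keepdim=True)   # per-row FP scale
        w_bin = torch.sign(w) * scale               # 1-bit weights x scale
        # Straight-through estimator: forward uses w_bin, backward flows
        # through the identity to the latent full-precision weights.
        w_ste = w + (w_bin - w).detach()
        return F.linear(x, w_ste, self.bias)

def autoregressive_distillation_loss(student_logits: torch.Tensor,
                                     teacher_logits: torch.Tensor,
                                     temperature: float = 1.0) -> torch.Tensor:
    """Token-level KL divergence between the teacher's and the student's
    next-token distributions, averaged over all positions.
    Logits have shape (batch, seq_len, vocab)."""
    t = temperature
    s_logp = F.log_softmax(student_logits / t, dim=-1).flatten(0, -2)
    t_prob = F.softmax(teacher_logits / t, dim=-1).flatten(0, -2)
    return F.kl_div(s_logp, t_prob, reduction="batchmean") * (t * t)
```

In a setup like this, the latent full-precision weights exist only during training; at inference only the signs and the per-row scales need to be stored, which is where the memory reduction over FP16/BF16 weights comes from.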
Authors
Shengkun Tang
Liqun Ma
Haonan Li
Mingjie Sun
Zhiqiang Shen
Submitted
November 18, 2024
Key Contributions
Introduces Bi-Mamba, a scalable 1-bit Mamba architecture designed for efficient large language models. Bi-Mamba models achieve performance comparable to full-precision counterparts while significantly reducing computational demands and memory requirements, enabling more efficient LLM training and deployment.
Business Value
Makes powerful LLMs more accessible and cost-effective to train and deploy by drastically reducing their computational footprint and memory usage.