📄 Abstract
Training large language models (LLMs) is often bottlenecked by extreme memory
demands, with optimizer states dominating the footprint. Recent work mitigates
this cost by projecting gradients into low-dimensional subspaces using
sophisticated update strategies. In this paper, we analyze the dynamics of the
gradient space and its underlying subspaces. We find that while a small core
subspace captures most of the gradient energy, a significant portion still resides in
the residual bulk; moreover, the influence of the core subspace diminishes over
time and in deeper layers. We also observe that the gradient space exhibits
near-flat curvature, calling for algorithms that explicitly account for this
geometry. Motivated by these insights, we introduce a suite of randomized
algorithms, GrassWalk and GrassJump, which exploit gradient subspace structure and achieve
state-of-the-art memory savings while improving performance on LLaMA-1B and
LLaMA-7B pretraining.
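A minimal sketch of the kind of gradient-subspace analysis the abstract describes: measuring how much of a layer's gradient energy falls in its top-k singular subspace versus the residual bulk. This is illustrative only and not the paper's code; the function name and matrix sizes are assumptions.

```python
import numpy as np

def subspace_energy_fraction(grad: np.ndarray, k: int) -> float:
    """Fraction of the squared Frobenius norm of `grad` captured by its top-k singular directions."""
    s = np.linalg.svd(grad, compute_uv=False)  # singular values, descending
    return float(np.sum(s[:k] ** 2) / np.sum(s ** 2))

# Example on a random 512 x 512 "gradient" matrix; real layer gradients are far
# more structured, so a small k would typically capture a much larger fraction.
rng = np.random.default_rng(0)
g = rng.standard_normal((512, 512))
print(f"top-16 subspace captures {subspace_energy_fraction(g, 16):.1%} of gradient energy")
```

Tracking this fraction across training steps and layers is one way to observe the trends the abstract reports: the core subspace's share of energy shrinking over time and in deeper layers.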
Key Contributions
This paper introduces GrassWalk and GrassJump, randomized algorithms that exploit gradient subspace structure for efficient LLM training, achieving significant memory savings while improving performance on LLaMA-1B and LLaMA-7B pretraining. It also provides new insights into the dynamics and curvature of the gradient space. A generic sketch of the memory-saving pattern these methods build on follows below.
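The sketch below is not the GrassWalk/GrassJump update rule, which is not specified on this page; it only illustrates the general pattern such methods build on: keep optimizer state in a rank-r subspace and periodically resample the projection. All names (GaussianProjector, refresh interval, learning rate) are hypothetical.

```python
import numpy as np

class GaussianProjector:
    """Maintains a random rank-r orthonormal basis P (d x r) for gradient projection."""
    def __init__(self, dim: int, rank: int, seed: int = 0):
        self.rng = np.random.default_rng(seed)
        self.dim, self.rank = dim, rank
        self.resample()

    def resample(self):
        # Draw a fresh random subspace: orthonormalize a Gaussian matrix via QR.
        q, _ = np.linalg.qr(self.rng.standard_normal((self.dim, self.rank)))
        self.P = q  # d x r, orthonormal columns

    def down(self, grad):
        # Full gradient -> r-dim coordinates; optimizer state lives at this size.
        return self.P.T @ grad

    def up(self, update):
        # r-dim optimizer update -> full parameter space.
        return self.P @ update

# Usage: optimizer state is kept at rank r instead of full dimension d
# (here 128 vs. 4096), and the subspace is resampled every 200 steps (hypothetical).
proj = GaussianProjector(dim=4096, rank=128)
for step in range(1, 1001):
    grad = np.random.standard_normal(4096)   # stand-in for a layer gradient
    low = proj.down(grad)                     # r-dim projected gradient
    update = -1e-3 * low                      # placeholder for the low-rank optimizer step
    full_update = proj.up(update)             # lifted back to parameter space
    if step % 200 == 0:
        proj.resample()
```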
Business Value
Enables training of larger and more capable language models on existing hardware, reducing computational costs and democratizing access to advanced AI.