
Randomized Gradient Subspaces for Efficient Large Language Model Training

Abstract

Training large language models (LLMs) is often bottlenecked by extreme memory demands, with optimizer states dominating the footprint. Recent work mitigates this cost by projecting gradients into low-dimensional subspaces using sophisticated update strategies. In this paper, we analyze the dynamics of the gradient space and its underlying subspaces. We find that while a small subspace captures most of the gradient energy, a significant portion still resides in the residual bulk; moreover, the influence of the core subspace diminishes over time and in deeper layers. We also observe that the gradient space exhibits near-flat curvature, calling for algorithms that explicitly account for this geometry. Motivated by these insights, we introduce a suite of randomized algorithms, GrassWalk and GrassJump, which exploit gradient subspaces and achieve state-of-the-art memory savings while improving performance on LLaMA-1B and LLaMA-7B pretraining.
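
To make the memory mechanism concrete, below is a minimal PyTorch sketch of generic low-rank gradient projection with periodic random re-drawing of the subspace, in the spirit of the randomized approach described above. It is only an illustrative sketch, not the paper's GrassWalk or GrassJump algorithms: the names `random_subspace` and `LowRankAdamSketch`, the chosen rank, and the refresh interval are assumptions introduced here for exposition.

```python
import torch


def random_subspace(dim, rank, device=None, dtype=torch.float32):
    """Draw an orthonormal basis for a random rank-`rank` subspace of R^dim."""
    # QR of a Gaussian matrix yields a uniformly random orthonormal basis.
    gauss = torch.randn(dim, rank, device=device, dtype=dtype)
    q, _ = torch.linalg.qr(gauss)
    return q  # shape: (dim, rank)


class LowRankAdamSketch:
    """Adam-style update on gradients projected into a random low-rank subspace.

    Optimizer state (first/second moments) lives in the rank-r projected space,
    so its memory is O(r * m) instead of O(n * m) for an (n, m) weight matrix.
    Hypothetical sketch; not the paper's GrassWalk/GrassJump update rules.
    """

    def __init__(self, param, rank=64, refresh_every=200,
                 lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        self.param = param                    # (n, m) weight matrix with .grad populated
        self.rank = rank
        self.refresh_every = refresh_every    # how often to jump to a new random subspace
        self.lr, self.betas, self.eps = lr, betas, eps
        self.step_count = 0
        self.since_refresh = 0
        n, m = param.shape
        self.basis = random_subspace(n, rank, device=param.device, dtype=param.dtype)
        # Moments are stored in the projected (rank, m) space -- the memory saving.
        self.m = torch.zeros(rank, m, device=param.device, dtype=param.dtype)
        self.v = torch.zeros_like(self.m)

    @torch.no_grad()
    def step(self):
        grad = self.param.grad                # full (n, m) gradient from backprop
        self.step_count += 1

        # Periodically re-draw the subspace at random and reset its moments.
        if self.step_count % self.refresh_every == 0:
            n, _ = self.param.shape
            self.basis = random_subspace(n, self.rank,
                                         device=self.param.device,
                                         dtype=self.param.dtype)
            self.m.zero_()
            self.v.zero_()
            self.since_refresh = 0
        self.since_refresh += 1

        # Project the gradient into the subspace: (rank, m).
        g_low = self.basis.T @ grad

        # Standard Adam moment updates, but on the low-rank representation.
        b1, b2 = self.betas
        self.m.mul_(b1).add_(g_low, alpha=1 - b1)
        self.v.mul_(b2).addcmul_(g_low, g_low, value=1 - b2)
        m_hat = self.m / (1 - b1 ** self.since_refresh)
        v_hat = self.v / (1 - b2 ** self.since_refresh)
        update_low = m_hat / (v_hat.sqrt() + self.eps)

        # Lift the update back to the full parameter space and apply it.
        self.param.add_(self.basis @ update_low, alpha=-self.lr)
```

A hypothetical usage would construct one such object per weight matrix and call step() after loss.backward(); the roughly n/r reduction in moment memory is the general mechanism this line of work targets, while the specific subspace schedules of GrassWalk and GrassJump are described in the paper itself.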

Key Contributions

This paper introduces GrassWalk and GrassJump, randomized algorithms that exploit gradient subspaces for efficient LLM training, achieving significant memory savings and improving performance on LLaMA models. It provides novel insights into gradient space dynamics and curvature.

Business Value

Enables training of larger and more capable language models on existing hardware, reducing computational costs and democratizing access to advanced AI.
