📄 Abstract
Scaling laws play a central role in the success of Large Language Models (LLMs), enabling the prediction of model performance relative to compute budgets prior to training. While Transformers have been the dominant architecture, recent alternatives such as xLSTM offer linear complexity with respect to context length while remaining competitive in the billion-parameter regime. We conduct a comparative investigation of the scaling behavior of Transformers and xLSTM along the following lines, providing insights to guide future model design and deployment. First, we study the scaling behavior of xLSTM in compute-optimal and over-training regimes using both IsoFLOP and parametric-fit approaches across a wide range of model sizes (80M–7B) and numbers of training tokens (2B–2T). Second, we examine how the optimal model size depends on context length, a pivotal aspect largely ignored in previous work. Finally, we analyze inference-time scaling characteristics. Our findings reveal that in typical LLM training and inference scenarios, xLSTM scales favorably compared to Transformers. Importantly, xLSTM's advantage widens as training and inference contexts grow.
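To make the estimation strategies named in the abstract concrete, here is a minimal sketch of the parametric-fit approach, assuming a Chinchilla-style loss surface L(N, D) = E + A/N^α + B/D^β and the common FLOP approximation C ≈ 6·N·D. Everything in the snippet, including the coefficient values, the synthetic observations, and helper names such as `loss_surface` and `N_opt`, is an illustrative assumption rather than the paper's actual data or fits.

```python
# Illustrative sketch only: fit a Chinchilla-style parametric scaling law
# L(N, D) = E + A / N**alpha + B / D**beta to (model size, tokens, loss)
# observations, then find the compute-optimal model size for a FLOP budget
# under the approximation C ~ 6 * N * D. All numbers below are made up.
import numpy as np
from scipy.optimize import curve_fit, minimize_scalar


def loss_surface(ND, E, A, B, alpha, beta):
    """Predicted loss for N parameters and D training tokens."""
    N, D = ND
    return E + A / N**alpha + B / D**beta


# Synthetic observations generated from assumed "true" coefficients, just to
# keep the sketch self-contained; a real study would fit measured losses from
# many training runs spanning a grid of model sizes and token counts.
rng = np.random.default_rng(0)
N_obs = np.tile([80e6, 160e6, 400e6, 1.3e9, 3e9, 7e9], 3)
D_obs = np.repeat([2e9, 20e9, 200e9], 6)
true_coeffs = dict(E=1.7, A=400.0, B=410.0, alpha=0.34, beta=0.28)
L_obs = loss_surface((N_obs, D_obs), **true_coeffs) + rng.normal(0.0, 0.01, N_obs.size)

# Parametric fit: recover (E, A, B, alpha, beta) from the observations.
params, _ = curve_fit(
    loss_surface, (N_obs, D_obs), L_obs,
    p0=[1.5, 300.0, 300.0, 0.3, 0.3], maxfev=20000,
)
E, A, B, alpha, beta = params

# Compute-optimal allocation: for a fixed FLOP budget C, choose N minimizing
# the fitted loss with D = C / (6 * N). Optimize in log-space for conditioning.
C = 1e21  # FLOP budget (illustrative)
res = minimize_scalar(
    lambda logN: loss_surface((np.exp(logN), C / (6.0 * np.exp(logN))), *params),
    bounds=(np.log(1e7), np.log(1e11)),
    method="bounded",
)
N_opt = np.exp(res.x)
D_opt = C / (6.0 * N_opt)

print(f"fitted exponents: alpha={alpha:.3f}, beta={beta:.3f}")
print(f"compute-optimal at C={C:.0e} FLOPs: N~{N_opt:.2e} params, D~{D_opt:.2e} tokens")
```

The IsoFLOP approach, the other method the abstract mentions, instead trains several model sizes at each fixed compute budget and reads off the loss minimum along each IsoFLOP curve; both methods yield an estimate of the compute-optimal model size as a function of budget.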
Key Contributions
This paper investigates the scaling laws of xLSTM in comparison with Transformers, showing that xLSTM scales favorably while retaining linear time complexity in context length. It characterizes compute-optimal and over-training regimes, examines how the optimal model size depends on context length, and analyzes inference-time scaling, offering guidance for future LLM design and deployment.
Business Value
Enables more efficient training and deployment of large language models by making performance scaling predictable, potentially reducing computational costs and enabling faster inference, particularly for long-context applications.