📄 Abstract
The success of today's large language models (LLMs) depends on the
observation that larger models perform better. However, the origin of this
neural scaling law, that loss decreases as a power law with model size, remains
unclear. We propose that representation superposition, in which LLMs
represent more features than they have dimensions, can be a key contributor to
loss and a driver of neural scaling. Based on Anthropic's toy model, we use weight
decay to control the degree of superposition, allowing us to systematically
study how loss scales with model size. When superposition is weak, the loss
follows a power law only if data feature frequencies are power-law distributed.
In contrast, under strong superposition, the loss generically scales inversely
with model dimension across a broad class of frequency distributions, due to
geometric overlaps between representation vectors. We confirm that
open-source LLMs operate in the strong superposition regime, with loss
scaling as one over the model dimension, and that the Chinchilla scaling laws
are also consistent with this behavior. Our results identify representation
superposition as a central driver of neural scaling laws, providing insights
into questions like when neural scaling laws can be improved and when they will
break down.
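As a concrete illustration of the setup described in the abstract (an Anthropic-style toy model in which weight decay controls the degree of superposition), here is a minimal PyTorch sketch. It is not the authors' code: the feature count, model dimension, feature frequencies, learning rate, and weight-decay value are all illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of an Anthropic-style toy model of
# superposition, where weight decay is used as a knob on the degree of
# superposition. All settings below are illustrative assumptions.
import torch

n_features, m_dims, batch = 512, 64, 1024

# Assumed Zipf-like feature frequencies p_i ~ 1/i; the paper also studies
# non-power-law frequency distributions.
freqs = 1.0 / torch.arange(1, n_features + 1, dtype=torch.float32)

W = torch.nn.Parameter(torch.randn(m_dims, n_features) / m_dims**0.5)
b = torch.nn.Parameter(torch.zeros(n_features))
# Weight decay is the control knob: larger values suppress superposition,
# smaller values let many features share the m_dims available dimensions.
opt = torch.optim.AdamW([W, b], lr=1e-3, weight_decay=1e-2)

def sample_batch():
    # Feature i is active with probability freqs[i]; active values are uniform in [0, 1].
    active = (torch.rand(batch, n_features) < freqs).float()
    return active * torch.rand(batch, n_features)

for step in range(2000):
    x = sample_batch()
    h = x @ W.T                      # compress n_features -> m_dims
    x_hat = torch.relu(h @ W + b)    # reconstruct with the tied weights
    loss = ((x - x_hat) ** 2 * freqs).sum(dim=-1).mean()  # frequency-weighted MSE
    opt.zero_grad()
    loss.backward()
    opt.step()

# Rough proxy for how many features the model represents: total squared
# column norm of W relative to its largest column, compared against m_dims.
with torch.no_grad():
    col_sq = W.norm(dim=0) ** 2
    n_represented = (col_sq.sum() / col_sq.max()).item()
print(f"final loss {loss.item():.4f}; ~{n_represented:.0f} features in {m_dims} dimensions")
```

Sweeping `m_dims` and `weight_decay` in a sketch like this is one way to probe how the reconstruction loss scales with model dimension in the weak versus strong superposition regimes.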
Authors (3)
Yizhou Liu
Ziming Liu
Jeff Gore
Key Contributions
Proposes that representation superposition (LLMs representing more features than they have dimensions) is a key contributor to neural scaling laws. Shows that under strong superposition, loss generically scales inversely with model dimension due to geometric overlaps between representation vectors, explaining why larger models perform better across a broad class of data distributions, and confirms that this behavior holds for open-source LLMs.
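Schematically, the two regimes described above can be written as follows; the exponent and prefactors are placeholders for illustration, not values reported in the paper.

```latex
% Schematic scaling regimes; \gamma and the prefactors are placeholders.
\begin{aligned}
  \text{weak superposition:}   &\quad L(m) \;\propto\; m^{-\gamma}
      \quad \text{only if feature frequencies follow a power law,} \\
  \text{strong superposition:} &\quad L(m) \;\propto\; \frac{1}{m}
      \quad \text{for a broad class of frequency distributions,}
\end{aligned}
```

where $m$ denotes the model dimension.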
Business Value
A deeper theoretical understanding of scaling laws can guide more efficient model design and training strategies, potentially leading to better performance with fewer resources or enabling predictable performance improvements.