📄 Abstract
The success of today's large language models (LLMs) depends on the
observation that larger models perform better. However, the origin of this
neural scaling law, that loss decreases as a power law with model size, remains
unclear. We propose that representation superposition, in which LLMs
represent more features than they have dimensions, can be a key contributor to
loss and a driver of neural scaling. Based on Anthropic's toy model, we use weight
decay to control the degree of superposition, allowing us to systematically
study how loss scales with model size. When superposition is weak, the loss
follows a power law only if data feature frequencies are power-law distributed.
In contrast, under strong superposition, the loss generically scales inversely
with model dimension across a broad class of frequency distributions, due to
geometric overlaps between representation vectors. We confirm that
open-source LLMs operate in the strong superposition regime, with loss
scaling as one over the model dimension, and that the Chinchilla scaling laws
are also consistent with this behavior. Our results identify representation
superposition as a central driver of neural scaling laws, providing insights
into questions like when neural scaling laws can be improved and when they will
break down.
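As a concrete illustration of the setup described in the abstract (an Anthropic-style toy model in which weight decay controls the degree of superposition), here is a minimal PyTorch sketch. It is not the authors' code: the feature count, model dimension, feature frequencies, learning rate, and weight-decay value are all illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of an Anthropic-style toy model of
# superposition, where weight decay is used as a knob on the degree of
# superposition. All settings below are illustrative assumptions.
import torch

n_features, m_dims, batch = 512, 64, 1024

# Assumed Zipf-like feature frequencies p_i ~ 1/i; the paper also studies
# non-power-law frequency distributions.
freqs = 1.0 / torch.arange(1, n_features + 1, dtype=torch.float32)

W = torch.nn.Parameter(torch.randn(m_dims, n_features) / m_dims**0.5)
b = torch.nn.Parameter(torch.zeros(n_features))
# Weight decay is the control knob: larger values suppress superposition,
# smaller values let many features share the m_dims available dimensions.
opt = torch.optim.AdamW([W, b], lr=1e-3, weight_decay=1e-2)

def sample_batch():
    # Feature i is active with probability freqs[i]; active values are uniform in [0, 1].
    active = (torch.rand(batch, n_features) < freqs).float()
    return active * torch.rand(batch, n_features)

for step in range(2000):
    x = sample_batch()
    h = x @ W.T                      # compress n_features -> m_dims
    x_hat = torch.relu(h @ W + b)    # reconstruct with the tied weights
    loss = ((x - x_hat) ** 2 * freqs).sum(dim=-1).mean()  # frequency-weighted MSE
    opt.zero_grad()
    loss.backward()
    opt.step()

# Rough proxy for how many features the model represents: total squared
# column norm of W relative to its largest column, compared against m_dims.
with torch.no_grad():
    col_sq = W.norm(dim=0) ** 2
    n_represented = (col_sq.sum() / col_sq.max()).item()
print(f"final loss {loss.item():.4f}; ~{n_represented:.0f} features in {m_dims} dimensions")
```

Sweeping `m_dims` and `weight_decay` in a sketch like this is one way to probe how the reconstruction loss scales with model dimension in the weak versus strong superposition regimes.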
Authors (3)
Yizhou Liu
Ziming Liu
Jeff Gore
Key Contributions
Proposes that representation superposition (LLMs representing more features than they have dimensions) is a key contributor to neural scaling laws. Shows that under strong superposition, loss generically scales inversely with model dimension due to geometric overlaps between representation vectors, explaining why larger models perform better across a broad class of data distributions, and confirms that this behavior holds for open-source LLMs.
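Schematically, the two regimes described above can be written as follows; the exponent and prefactors are placeholders for illustration, not values reported in the paper.

```latex
% Schematic scaling regimes; \gamma and the prefactors are placeholders.
\begin{aligned}
  \text{weak superposition:}   &\quad L(m) \;\propto\; m^{-\gamma}
      \quad \text{only if feature frequencies follow a power law,} \\
  \text{strong superposition:} &\quad L(m) \;\propto\; \frac{1}{m}
      \quad \text{for a broad class of frequency distributions,}
\end{aligned}
```

where $m$ denotes the model dimension.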
Business Value
A deeper theoretical understanding of scaling laws can guide more efficient model design and training strategies, potentially leading to better performance with fewer resources or enabling predictable performance improvements.