Research Paper for ML Researchers, Deep Learning Engineers, Theoretical ML Scientists

Memorization-Compression Cycles Improve Generalization


📄 Abstract

We prove theoretically that generalization improves not only through data scaling but also by compressing internal representations. To operationalize this insight, we introduce the Information Bottleneck Language Modeling (IBLM) objective, which reframes language modeling as a constrained optimization problem: minimizing representation entropy subject to optimal prediction performance. Empirically, we observe an emergent memorization-compression cycle during LLM pretraining, evidenced by gradient alignment between cross-entropy and Matrix-Based Entropy (MBE), a measure of representation entropy, that oscillates between positive and negative. This pattern closely mirrors the predictive-compressive trade-off prescribed by IBLM and also parallels the biological alternation between awake learning and sleep consolidation. Motivated by this observation, we propose Gated Phase Transition (GAPT), a training algorithm that adaptively switches between memorization and compression phases. When applied to GPT-2 pretraining on the FineWeb dataset, GAPT reduces MBE by 50% and improves cross-entropy by 4.8%. GAPT improves OOD generalization by 35% in a pretraining task on arithmetic multiplication. In a setting designed to simulate catastrophic forgetting, GAPT reduces interference by compressing and separating representations, achieving a 97% improvement in separation, paralleling the functional role of sleep consolidation.
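The two quantities the abstract relies on, representation entropy and gradient alignment, can be illustrated with a minimal NumPy sketch. It assumes a common Rényi-style matrix-based entropy computed from the trace-normalized Gram matrix of a batch of representations, and cosine similarity as the alignment measure. The abstract does not state the paper's exact kernel, entropy order, or normalization, so the function names, the linear kernel, and the default `alpha` are illustrative assumptions rather than the author's implementation.

```python
import numpy as np


def matrix_based_entropy(reps, alpha=1.0, eps=1e-12):
    """Entropy of the eigenvalue spectrum of a trace-normalized Gram matrix.

    reps:  array of shape (n_samples, hidden_dim), one representation per row.
    alpha: entropy order (assumed); alpha=1 gives the von Neumann (Shannon) limit.
    """
    reps = np.asarray(reps, dtype=np.float64)
    gram = reps @ reps.T                      # linear-kernel Gram matrix (assumption)
    trace = np.trace(gram)
    if trace <= eps:                          # all-zero representations carry no entropy
        return 0.0
    eigvals = np.linalg.eigvalsh(gram / trace)
    eigvals = np.clip(eigvals, 0.0, None)     # remove tiny negative round-off
    eigvals = eigvals / eigvals.sum()         # eigenvalues now form a distribution
    if abs(alpha - 1.0) < 1e-8:
        nonzero = eigvals[eigvals > eps]
        return float(-(nonzero * np.log(nonzero)).sum())
    return float(np.log((eigvals ** alpha).sum()) / (1.0 - alpha))


def gradient_alignment(grad_ce, grad_mbe, eps=1e-12):
    """Cosine similarity between two flattened gradients.

    Positive values mean a step that lowers cross-entropy also lowers entropy;
    negative values mean the two objectives pull in opposite directions.
    """
    a = np.ravel(np.asarray(grad_ce, dtype=np.float64))
    b = np.ravel(np.asarray(grad_mbe, dtype=np.float64))
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > eps else 0.0
```

Under these assumptions, a batch of identical representations has entropy near zero (fully compressed), while mutually orthogonal representations reach the maximum, the logarithm of the batch size.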

Authors (1)

Fangyuan Yu

Submitted

May 13, 2025

arXiv Category

cs.LG


Key Contributions

Theoretically proves that generalization improves via representation compression and introduces the IBLM objective. Empirically observes an emergent memorization-compression cycle during LLM pretraining and proposes GAPT, a training algorithm that adaptively switches between these phases to improve generalization.
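The idea of adaptively switching between a memorization phase and a compression phase can be sketched as a small state machine. The summary above does not give GAPT's actual gating rule, so the following is only a hypothetical patience-based gate: it stays in the current phase while that phase's tracked metric keeps improving and flips once it stalls. The class name `PhaseGate`, the `patience` and `min_delta` parameters, and the choice of metric per phase are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PhaseGate:
    """Hypothetical patience-based gate; not the paper's exact rule."""

    patience: int = 3          # stalled updates tolerated before switching phase
    min_delta: float = 1e-3    # minimum decrease that counts as an improvement
    phase: str = "memorization"
    best: float = float("inf")
    stale: int = 0

    def update(self, metric: float) -> str:
        """Record the current phase's metric and return the phase to use next.

        Assumed convention: pass cross-entropy during memorization and
        representation entropy (e.g. MBE) during compression.
        """
        if metric < self.best - self.min_delta:
            self.best = metric
            self.stale = 0
        else:
            self.stale += 1
        if self.stale >= self.patience:
            # Metric has stalled: flip phase and reset tracking for the new metric.
            self.phase = (
                "compression" if self.phase == "memorization" else "memorization"
            )
            self.best = float("inf")
            self.stale = 0
        return self.phase
```

In a training loop, one might optimize cross-entropy alone while the gate reports `"memorization"` and add an entropy penalty while it reports `"compression"`; how the paper weights or combines the two terms is not stated here.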

Business Value

Leads to more robust and reliable AI models that generalize better to unseen data, reducing the need for massive datasets and improving performance in real-world applications.

Paper Metadata

Innovation Type

Algorithmic Improvement

Deployment Feasibility

Moderate. GAPT is a training algorithm, requiring modification of training pipelines.

Limitations Addressed

Over-reliance on data scaling for generalization, lack of theoretical understanding and practical methods for improving generalization via compression

Technical Tags

generalization, memorization, representation compression, Information Bottleneck, language modeling, LLM pretraining, training algorithms, emergent phenomena, gradient alignment, predictive-compressive trade-off

Research Topics

Machine Learning Theory, Deep Learning, Language Modeling, Generalization Theory

Methods & Architectures

Information Bottleneck Language Modeling (IBLM) objective, Matrix-Based Entropy (MBE), Gated Phase Transition (GAPT) training algorithm, Large Language Models (LLMs)
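As described in the abstract, IBLM minimizes representation entropy subject to optimal prediction performance. One way to write that constraint down is shown below. The notation is an assumption made for illustration: $\theta$ denotes the model parameters, $R_\theta$ the internal representations, $H$ a representation entropy such as MBE, and $\mathrm{CE}$ the cross-entropy loss. The paper's own formulation may differ, for example by summing entropy over layers.

```latex
\min_{\theta} \; H(R_\theta)
\quad \text{subject to} \quad
\mathrm{CE}(\theta) = \min_{\theta'} \mathrm{CE}(\theta')
```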

Applications & Tasks

AI Model Training, Machine Learning Theory, Improving generalization in LLMs, Understanding the interplay between memorization and compression, Enhancing LLM generalization through representation compression, Developing novel training algorithms for better generalization

Related Fields

Machine Learning Theory, Information Theory, Deep Learning, Cognitive Science

Keywords

generalization, memorization, representation compression, Information Bottleneck, language modeling, LLM pretraining, training algorithms, emergent phenomena, gradient alignment, predictive-compressive trade-off, GAPT, IBLM

Academic Context

#Machine Learning Theory, #Deep Learning, #Language Modeling, #Generalization Theory

Commercial Potential

Potential Products

More efficient and generalizable LLM training frameworks, AI models with improved robustness

Target Industries

Technology, AI Research, Software Development

Use Case Examples

Training LLMs that require less data for good generalization, developing AI models that are less prone to overfitting

Competitive Edge

Offers a theoretical and algorithmic framework (IBLM, GAPT) to improve LLM generalization beyond data scaling, by actively managing the memorization-compression trade-off during training.

Market Opportunity

Fundamental research impacting the entire LLM market.

Revenue Models

Licensing of training methodologies, consulting on model optimization.

Resource Requirements

Compute Needs

High (for LLM pretraining)

Data Requirements

Large text corpora for LLM pretraining

Deployment Constraints

Complexity of implementing the GAPT training algorithm, significant computational resources required for training

Scalability

Aims to improve generalization, which indirectly aids scalability by reducing data needs.

Production Readiness

Maturity Level

Research

Time to Market

2-4 years

Patent Potential

Moderate (novel training algorithm)
