
Alternating Gradient Flows: A Theory of Feature Learning in Two-layer Neural Networks

Abstract

What features neural networks learn, and how, remains an open question. In this paper, we introduce Alternating Gradient Flows (AGF), an algorithmic framework that describes the dynamics of feature learning in two-layer networks trained from small initialization. Prior works have shown that gradient flow in this regime exhibits a staircase-like loss curve, alternating between plateaus where neurons slowly align to useful directions and sharp drops where neurons rapidly grow in norm. AGF approximates this behavior as an alternating two-step process: maximizing a utility function over dormant neurons and minimizing a cost function over active ones. AGF begins with all neurons dormant. At each iteration, a dormant neuron activates, triggering the acquisition of a feature and a drop in the loss. AGF quantifies the order, timing, and magnitude of these drops, matching experiments across several commonly studied architectures. We show that AGF unifies and extends existing saddle-to-saddle analyses in fully connected linear networks and attention-only linear transformers, where the learned features are singular modes and principal components, respectively. In diagonal linear networks, we prove AGF converges to gradient flow in the limit of vanishing initialization. Applying AGF to quadratic networks trained to perform modular addition, we give the first complete characterization of the training dynamics, revealing that networks learn Fourier features in decreasing order of coefficient magnitude. Altogether, AGF offers a promising step towards understanding feature learning in neural networks.
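
The staircase-like loss curve mentioned above is easy to reproduce in the simplest setting the abstract names, a diagonal linear network. The following is a minimal NumPy sketch (not from the paper), assuming a sparse-regression task and a hypothetical small-initialization scale `alpha`; with a tiny init, coordinates activate one at a time, in decreasing order of target magnitude, and the loss drops in steps between plateaus.

```python
import numpy as np

# Illustrative demo (not the authors' code): a diagonal linear network
# f(x) = <u * v, x> trained by gradient descent from small initialization
# on sparse regression. Coordinates activate sequentially, giving a
# staircase-like loss curve.
rng = np.random.default_rng(0)
d, n = 5, 400
beta = np.array([3.0, 2.0, 1.0, 0.0, 0.0])   # sparse ground-truth weights
X = rng.standard_normal((n, d))
y = X @ beta

alpha = 1e-4                   # small initialization scale (assumed value)
u = alpha * np.ones(d)
v = alpha * np.ones(d)
lr = 1e-2

for step in range(3001):
    r = X @ (u * v) - y                    # residuals of f(x) = <u*v, x>
    g = X.T @ r / n                        # gradient of the MSE wrt (u * v)
    u, v = u - lr * g * v, v - lr * g * u  # product rule: grads wrt u and v
    if step % 300 == 0:
        print(f"step {step:5d}  loss {0.5 * np.mean(r**2):.4f}")
```

Printed losses hold steady on plateaus and then drop as each coordinate of `u * v` escapes the origin, with larger-magnitude targets learned first.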
Authors (8)
Daniel Kunin
Giovanni Luca Marchetti
Feng Chen
Dhruva Karkada
James B. Simon
Michael R. DeWeese
+2 more
Submitted: June 6, 2025
arXiv Category: cs.LG

Key Contributions

Introduces Alternating Gradient Flows (AGF), a theoretical framework that models feature learning dynamics in two-layer neural networks. AGF approximates the observed staircase-like loss curve as an alternating process of maximizing utility for dormant neurons and minimizing cost for active ones, quantifying the order, timing, and magnitude of loss drops.
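
To make the alternation concrete, here is a hedged sketch of the two-step iteration described above. It is not the authors' implementation: `utility(w)` and `cost(active)` are hypothetical stand-ins for the paper's utility and cost functions, and the finite-difference gradient loops and activation rule are illustrative assumptions.

```python
import numpy as np

def grad(f, w, eps=1e-5):
    # Central finite-difference gradient of a scalar function f at w.
    g = np.zeros_like(w)
    for i in range(w.size):
        d = np.zeros_like(w)
        d[i] = eps
        g[i] = (f(w + d) - f(w - d)) / (2 * eps)
    return g

def agf(dormant, utility, cost, rounds=2, lr=0.05, inner=300):
    """Sketch of AGF: alternate utility ascent over dormant directions
    with cost descent over active neurons. All names are placeholders."""
    active = []
    for _ in range(rounds):
        # Step 1 (plateau): dormant unit-norm directions climb the utility.
        for w in dormant:
            for _ in range(inner):
                w += lr * grad(utility, w)
                w /= np.linalg.norm(w)
        # The highest-utility dormant neuron activates: it acquires a
        # feature, its norm grows, and the loss drops.
        k = max(range(len(dormant)), key=lambda i: utility(dormant[i]))
        active.append(dormant.pop(k))
        # Step 2 (drop): active neurons jointly descend the cost.
        for _ in range(inner):
            for i in range(len(active)):
                f = lambda w, i=i: cost(active[:i] + [w] + active[i + 1:])
                active[i] = active[i] - lr * grad(f, active[i])
    return active
```

Each outer round corresponds to one plateau-then-drop event on the loss curve, so the order in which `dormant.pop` fires is the order in which features are acquired.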

Business Value

Provides a deeper theoretical understanding of how neural networks learn features, which could inform more interpretable models and more efficient training strategies.