📄 Abstract
Adaptive gradient methods such as Adam and Adagrad are widely used in machine
learning, yet their effect on the generalization of learned models -- relative
to methods like gradient descent -- remains poorly understood. Prior work on
binary classification suggests that Adam exhibits a "richness bias," which
can help it learn nonlinear decision boundaries closer to the Bayes-optimal
decision boundary relative to gradient descent. However, the coordinate-wise
preconditioning scheme employed by Adam renders the overall method sensitive to
orthogonal transformations of feature space. We show that this sensitivity can
manifest as a reversal of Adam's competitive advantage: even small rotations of
the underlying data distribution can make Adam forfeit its richness bias and
converge to a linear decision boundary that is farther from the Bayes-optimal
decision boundary than the one learned by gradient descent. To alleviate this
issue, we show that a recently proposed reparameterization method -- which
applies an orthogonal transformation to the optimization objective -- endows
any first-order method with equivariance to data rotations, and we empirically
demonstrate its ability to restore Adam's bias towards rich decision
boundaries.
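The rotation sensitivity described above can be reproduced in a few lines. The NumPy sketch below is not taken from the paper; the quadratic objective, matrix dimension, step sizes, and iteration counts are illustrative assumptions. It compares gradient descent and Adam on an objective f(w) and on its rotated counterpart g(v) = f(Qᵀv) for a random orthogonal Q: gradient descent's iterates on the rotated problem are exactly Q times its iterates on the original problem, while Adam's coordinate-wise second-moment normalization breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = rng.normal(size=(d, d))
A = A @ A.T + np.eye(d)                        # symmetric positive definite quadratic term
b = rng.normal(size=d)
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # random orthogonal "rotation" of feature space

grad_f = lambda w: A @ w - b                   # gradient of f(w) = 0.5 w'Aw - b'w
grad_g = lambda v: Q @ grad_f(Q.T @ v)         # gradient of the rotated objective g(v) = f(Q'v)

def gd_step(x, g, lr=0.01):
    return x - lr * g

def adam_step(x, g, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2               # coordinate-wise second moment
    x = x - lr * (m / (1 - b1**t)) / (np.sqrt(v / (1 - b2**t)) + eps)
    return x, m, v

w0 = rng.normal(size=d)

# Gradient descent: iterates on the rotated problem are exactly Q times the originals.
w, v = w0.copy(), Q @ w0
for _ in range(100):
    w, v = gd_step(w, grad_f(w)), gd_step(v, grad_g(v))
print("GD   equivariance gap:", np.linalg.norm(v - Q @ w))   # stays at machine precision

# Adam: the coordinate-wise normalization by sqrt(v) breaks the equivalence.
w, v = w0.copy(), Q @ w0
mw, vw = np.zeros(d), np.zeros(d)
mv, vv = np.zeros(d), np.zeros(d)
for t in range(1, 101):
    w, mw, vw = adam_step(w, grad_f(w), mw, vw, t)
    v, mv, vv = adam_step(v, grad_g(v), mv, vv, t)
print("Adam equivariance gap:", np.linalg.norm(v - Q @ w))   # generically nonzero
```

The first printed gap stays at machine precision, while the Adam gap is generically nonzero: the two optimizers see the rotated problem differently, which is the sensitivity the abstract describes.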
Authors (3)
Adela DePavia
Vasileios Charisopoulos
Rebecca Willett
Submitted
October 27, 2025
Key Contributions
Demonstrates that Adam's coordinate-wise preconditioning makes it sensitive to orthogonal transformations of the feature space, which can reverse its generalization advantage over gradient descent: even small rotations of the data can cause Adam to forfeit its richness bias and converge to a linear decision boundary farther from the Bayes-optimal boundary than the one learned by gradient descent. The paper shows that a recently proposed reparameterization method, which composes the objective with an orthogonal transformation, alleviates this issue and restores Adam's bias towards rich decision boundaries. A sketch of the reparameterization mechanics follows below.
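The following sketch illustrates how such a reparameterization could be wired in. It is not the paper's implementation: the helper `reparameterized_adam`, the random choice of R, and all hyperparameters are illustrative assumptions; in the paper, the orthogonal transformation is chosen so that the overall procedure becomes equivariant to data rotations, whereas a random fixed R here only demonstrates the mechanics of optimizing the pulled-back objective h(u) = f(R u) and mapping the result back via w = R u.

```python
import numpy as np

def adam(grad, x0, steps=300, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    # Plain Adam loop used as the "base" first-order method.
    x = np.array(x0, dtype=float)
    m, v = np.zeros_like(x), np.zeros_like(x)
    for t in range(1, steps + 1):
        g = grad(x)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g**2
        x = x - lr * (m / (1 - b1**t)) / (np.sqrt(v / (1 - b2**t)) + eps)
    return x

def reparameterized_adam(grad_f, w0, R, **adam_kwargs):
    # Optimize h(u) = f(R u) instead of f(w) itself, then map back with w = R u.
    # The equivariance properties depend on how the orthogonal R is chosen;
    # here R is supplied by the caller purely to show the mechanics.
    grad_h = lambda u: R.T @ grad_f(R @ u)          # chain rule: grad h(u) = R' grad f(R u)
    u_hat = adam(grad_h, R.T @ w0, **adam_kwargs)   # pull the initial point back as well
    return R @ u_hat

# Toy usage on a quadratic; R is a random orthogonal matrix purely for illustration.
rng = np.random.default_rng(1)
d = 10
A = rng.normal(size=(d, d))
A = A @ A.T + np.eye(d)
b = rng.normal(size=d)
grad_f = lambda w: A @ w - b
R, _ = np.linalg.qr(rng.normal(size=(d, d)))
w_hat = reparameterized_adam(grad_f, np.zeros(d), R)
print("gradient norm at returned point:", np.linalg.norm(grad_f(w_hat)))
```

With R equal to the identity this reduces to ordinary Adam; per the abstract, a suitably chosen orthogonal transformation makes the composite method equivariant to data rotations and restores Adam's bias towards rich decision boundaries.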
Business Value
Clarifies when adaptive optimizers such as Adam generalize worse than gradient descent and how a reparameterization can mitigate this, informing optimizer selection and data preprocessing choices for more reliable deep learning models in practice.