On the Convergence of Adam and Beyond

Sashank J. Reddi, Satyen Kale, Sanjiv Kumar

Published 2018 in International Conference on Learning Representations

ABSTRACT

Several recently proposed stochastic optimization methods that have been successfully used in training deep networks, such as RMSProp, Adam, Adadelta, and Nadam, are based on using gradient updates scaled by square roots of exponential moving averages of squared past gradients. In many applications, e.g., learning with large output spaces, it has been empirically observed that these algorithms fail to converge to an optimal solution (or a critical point in nonconvex settings). We show that one cause for such failures is the exponential moving average used in the algorithms. We provide an explicit example of a simple convex optimization setting where Adam does not converge to the optimal solution, and describe the precise problems with the previous analysis of the Adam algorithm. Our analysis suggests that the convergence issues can be fixed by endowing such algorithms with "long-term memory" of past gradients, and we propose new variants of the Adam algorithm which not only fix the convergence issues but often also lead to improved empirical performance.
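
The failure mode and the fix described in the abstract can be sketched on a one-dimensional online problem. The setup below is modeled on the paper's counterexample as best recalled (the paper body is not reproduced in this record): linear losses f_t(x) = C*x when t mod 3 = 1 and f_t(x) = -x otherwise, on the interval [-1, 1], so the best fixed point is x = -1 whenever C > 2. The function name run, the constants C = 3 and alpha = 0.5, the 3000 steps, and the starting point x0 = 0 are illustrative assumptions, and bias correction and the epsilon term are left out for simplicity. The "long-term memory" variant shown keeps a running maximum of the second-moment estimate, which is the idea behind the AMSGrad variant.

```python
import math


def run(optimizer, C=3.0, alpha=0.5, steps=3000, x0=0.0):
    """Projected Adam or AMSGrad on f_t(x) = C*x if t % 3 == 1 else -x, x in [-1, 1].

    All constants are illustrative assumptions, not values taken from this record.
    """
    beta2 = 1.0 / (1.0 + C * C)  # decay rate of the squared-gradient average
    x, v, v_hat = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = C if t % 3 == 1 else -1.0          # gradient of the linear loss f_t
        v = beta2 * v + (1.0 - beta2) * g * g  # exponential moving average of g^2
        v_hat = max(v_hat, v)                  # running maximum: never forgets a large g
        denom = math.sqrt(v if optimizer == "adam" else v_hat)
        # beta1 = 0, so the first-moment estimate is just the current gradient.
        x -= (alpha / math.sqrt(t)) * g / denom
        x = max(-1.0, min(1.0, x))             # projection onto [-1, 1]
    return x


x_adam = run("adam")
x_amsgrad = run("amsgrad")
print("Adam final iterate:   ", x_adam)
print("AMSGrad final iterate:", x_amsgrad)
```

Under these assumptions, the moving-average update drifts toward x = +1, the worst point in the interval, because the large gradient C is scaled down when it occurs and then quickly forgotten, while the running-maximum update moves toward the optimum at x = -1.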

PUBLICATION RECORD

Publication year: 2018
Venue: International Conference on Learning Representations
Publication date: 2018-02-15
Fields of study: Mathematics, Computer Science
Identifiers: arXiv 1904.09237
Source metadata: Semantic Scholar
