Towards Understanding Convergence and Generalization of AdamW
Pan Zhou, Xingyu Xie, Zhouchen Lin, Shuicheng Yan
Published 2024 in IEEE Transactions on Pattern Analysis and Machine Intelligence
ABSTRACT
AdamW modifies Adam by adding a decoupled weight decay that shrinks the network weights at each training iteration. For adaptive algorithms, this decoupled weight decay does not alter the optimization steps themselves, and so differs from the widely used $\ell_2$-regularizer, which changes the steps by changing the first- and second-order gradient moments. Despite AdamW's great practical success, its convergence behavior and its generalization improvement over Adam and $\ell_2$-regularized Adam ($\ell_2$-Adam) have remained unexplained. To close this gap, we prove the convergence of AdamW and justify its generalization advantages over Adam and $\ell_2$-Adam. Specifically, AdamW provably converges, but it minimizes a dynamically regularized loss that combines the vanilla loss with a dynamical regularization induced by the decoupled weight decay, and thus behaves differently from Adam and $\ell_2$-Adam. Moreover, on both general nonconvex problems and PŁ-conditioned problems, we establish the stochastic gradient complexity of AdamW for finding a stationary point. This complexity also applies to Adam and $\ell_2$-Adam and improves their previously known complexity, especially for over-parametrized networks. Furthermore, we prove that AdamW enjoys smaller generalization errors than Adam and $\ell_2$-Adam from the Bayesian-posterior perspective. This result, for the first time, explicitly reveals the benefits of the decoupled weight decay in AdamW. Experimental results validate our theory.
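The distinction the abstract draws between decoupled weight decay and $\ell_2$-regularization can be made concrete in a few lines. Below is a minimal NumPy sketch of a single parameter update, assuming standard Adam hyperparameters; the `decoupled` flag is a hypothetical switch added here for illustration and is not part of the paper's pseudocode. With `decoupled=False` the decay term is folded into the gradient and therefore flows into the moment estimates ($\ell_2$-Adam); with `decoupled=True` it is applied directly to the weights, leaving the adaptive steps untouched (AdamW).

```python
import numpy as np

def adam_style_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                    eps=1e-8, wd=1e-2, decoupled=True):
    """One update step. decoupled=True gives AdamW;
    decoupled=False gives l2-regularized Adam (l2-Adam)."""
    if not decoupled:
        # l2-Adam: the decay term joins the gradient, so it enters
        # the first- and second-order moment estimates below.
        g = g + wd * w
    m = beta1 * m + (1 - beta1) * g        # first-order moment
    v = beta2 * v + (1 - beta2) * g * g    # second-order moment
    m_hat = m / (1 - beta1 ** t)           # bias corrections
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    if decoupled:
        # AdamW: decay the weights directly, outside the adaptive step.
        w = w - lr * wd * w
    return w, m, v

# Tiny usage example on the toy loss 0.5 * ||w||^2, whose gradient is w.
w, m, v = np.ones(3), np.zeros(3), np.zeros(3)
for t in range(1, 101):
    w, m, v = adam_style_step(w, w.copy(), m, v, t, decoupled=True)
```

Note how the decoupled decay never touches `m` or `v`, which is exactly why the paper analyzes AdamW as minimizing a dynamically regularized loss rather than the $\ell_2$-penalized one.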
PUBLICATION RECORD
- Publication year
2024
- Venue
IEEE Transactions on Pattern Analysis and Machine Intelligence
- Publication date
2024-03-27
- Fields of study
Mathematics, Computer Science, Medicine
- Source metadata
Semantic Scholar, PubMed