Towards Understanding Convergence and Generalization of AdamW
Pan Zhou, Xingyu Xie, Zhouchen Lin, Shuicheng Yan
Published 2024 in IEEE Transactions on Pattern Analysis and Machine Intelligence
ABSTRACT
AdamW modifies Adam by adding a decoupled weight decay that shrinks the network weights at each training iteration. For adaptive algorithms, this decoupled weight decay does not alter the optimization steps themselves, and so differs from the widely used $\ell_2$-regularizer, which changes the steps by changing the first- and second-order gradient moments. Despite AdamW's great practical success, its convergence behavior and its generalization improvement over Adam and $\ell_2$-regularized Adam ($\ell_2$-Adam) have remained unexplained. To close this gap, we prove the convergence of AdamW and justify its generalization advantages over Adam and $\ell_2$-Adam. Specifically, AdamW provably converges, but it minimizes a dynamically regularized loss that combines the vanilla loss with a dynamical regularization induced by the decoupled weight decay, and thus behaves differently from Adam and $\ell_2$-Adam. Moreover, on both general nonconvex problems and PŁ-conditioned problems, we establish the stochastic gradient complexity of AdamW for finding a stationary point. This complexity also applies to Adam and $\ell_2$-Adam and improves their previously known complexity, especially for over-parametrized networks. Furthermore, we prove that AdamW enjoys smaller generalization errors than Adam and $\ell_2$-Adam from the Bayesian-posterior perspective. This result, for the first time, explicitly reveals the benefits of the decoupled weight decay in AdamW. Experimental results validate our theory.
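The distinction the abstract draws between decoupled weight decay and $\ell_2$-regularization can be made concrete in a few lines. Below is a minimal NumPy sketch of a single parameter update, assuming standard Adam hyperparameters; the `decoupled` flag is a hypothetical switch added here for illustration and is not part of the paper's pseudocode. With `decoupled=False` the decay term is folded into the gradient and therefore flows into the moment estimates ($\ell_2$-Adam); with `decoupled=True` it is applied directly to the weights, leaving the adaptive steps untouched (AdamW).

```python
import numpy as np

def adam_style_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                    eps=1e-8, wd=1e-2, decoupled=True):
    """One update step. decoupled=True gives AdamW;
    decoupled=False gives l2-regularized Adam (l2-Adam)."""
    if not decoupled:
        # l2-Adam: the decay term joins the gradient, so it enters
        # the first- and second-order moment estimates below.
        g = g + wd * w
    m = beta1 * m + (1 - beta1) * g        # first-order moment
    v = beta2 * v + (1 - beta2) * g * g    # second-order moment
    m_hat = m / (1 - beta1 ** t)           # bias corrections
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    if decoupled:
        # AdamW: decay the weights directly, outside the adaptive step.
        w = w - lr * wd * w
    return w, m, v

# Tiny usage example on the toy loss 0.5 * ||w||^2, whose gradient is w.
w, m, v = np.ones(3), np.zeros(3), np.zeros(3)
for t in range(1, 101):
    w, m, v = adam_style_step(w, w.copy(), m, v, t, decoupled=True)
```

Note how the decoupled decay never touches `m` or `v`, which is exactly why the paper analyzes AdamW as minimizing a dynamically regularized loss rather than the $\ell_2$-penalized one.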
PUBLICATION RECORD
- Publication year
2024
- Venue
IEEE Transactions on Pattern Analysis and Machine Intelligence
- Publication date
2024-03-27
- Fields of study
Mathematics, Computer Science, Medicine
- Source metadata
Semantic Scholar, PubMed