Lookahead Optimizer: k steps forward, 1 step back

Michael Ruogu Zhang, James Lucas, Geoffrey E. Hinton, Jimmy Ba

Published 2019 in Neural Information Processing Systems

ABSTRACT

The vast majority of successful deep neural networks are trained using variants of stochastic gradient descent (SGD) algorithms. Recent attempts to improve SGD can be broadly categorized into two approaches: (1) adaptive learning rate schemes, such as AdaGrad and Adam, and (2) accelerated schemes, such as heavy-ball and Nesterov momentum. In this paper, we propose a new optimization algorithm, Lookahead, that is orthogonal to these previous approaches and iteratively updates two sets of weights. Intuitively, the algorithm chooses a search direction by looking ahead at the sequence of fast weights generated by another optimizer. We show that Lookahead improves the learning stability and lowers the variance of its inner optimizer with negligible computation and memory cost. We empirically demonstrate Lookahead can significantly improve the performance of SGD and Adam, even with their default hyperparameter settings on ImageNet, CIFAR-10/100, neural machine translation, and Penn Treebank.
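The two-weight scheme described in the abstract can be illustrated with a small sketch. The abstract itself does not spell out the update rule, so the code below assumes the form suggested by the title: the fast weights take k steps with an inner optimizer, then the slow weights move part of the way toward them. The function name `lookahead_sgd`, the parameters `k`, `alpha`, `inner_lr`, and `outer_steps`, the use of plain gradient descent as the inner optimizer, and the toy quadratic objective are all assumptions made for illustration, not details taken from this record.

```python
import numpy as np


def lookahead_sgd(grad_fn, phi0, k=5, alpha=0.5, inner_lr=0.1, outer_steps=100):
    """Illustrative Lookahead loop around plain gradient descent (assumed form)."""
    phi = np.asarray(phi0, dtype=float).copy()  # slow weights
    for _ in range(outer_steps):
        theta = phi.copy()  # fast weights restart from the slow weights
        for _ in range(k):  # "k steps forward" with the inner optimizer
            theta -= inner_lr * grad_fn(theta)
        # "1 step back": move the slow weights a fraction alpha toward the fast weights
        phi += alpha * (theta - phi)
    return phi


# Hypothetical toy objective: f(w) = 0.5 * sum(h * w**2), with gradient h * w.
h = np.array([1.0, 10.0])


def grad(w):
    return h * w


w = lookahead_sgd(grad, [1.0, 1.0])
print(np.linalg.norm(w))  # distance from the minimizer at the origin
```

On this toy problem both coordinates shrink toward the minimizer at the origin. With `k=1` and `alpha=1.0` the loop reduces to ordinary gradient descent, which is a quick way to sanity-check the sketch.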

PUBLICATION RECORD

Publication year: 2019
Venue: Neural Information Processing Systems
Publication date: 2019-07-19
Fields of study: Mathematics, Computer Science
Identifiers: arXiv 1907.08610
External record: Semantic Scholar
Source metadata: Semantic Scholar

REFERENCES

Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model (2019, cited by this paper)
Nonlinear Acceleration of Deep Neural Networks (2018, cited by this paper)
On the Ineffectiveness of Variance Reduced Optimization for Deep Learning (2018, cited by this paper)
Reptile: a Scalable Metalearning Algorithm (2018, cited by this paper)
Improving Language Understanding by Generative Pre-Training (2018, cited by this paper)
The Unusual Effectiveness of Averaging in GAN Training (2018, cited by this paper)
Nonlinear Acceleration of CNNs (2018, cited by this paper)
Second-order Optimization Method for Large Mini-batch: Training ResNet-50 on ImageNet in 35 Epochs (2018, cited by this paper)
Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes (2018, cited by this paper)
Understanding Short-Horizon Bias in Stochastic Meta-Optimization (2018, cited by this paper)
Averaging Weights Leads to Wider Optima and Better Generalization (2018, cited by this paper)
Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs (2018, cited by this paper)
Aggregated Momentum: Stability Through Passive Damping (2018, cited by this paper)
Adafactor: Adaptive Learning Rates with Sublinear Memory Cost (2018, cited by this paper)
On First-Order Meta-Learning Algorithms (2018, cited by this paper)
Decoupled Weight Decay Regularization (2017, cited by this paper)
Regularizing and Optimizing LSTM Language Models (2017, cited by this paper)
Fixing Weight Decay Regularization in Adam (2017, cited by this paper)
Why Momentum Really Works (2017, cited by this paper)
Attention is All you Need (2017, influential reference)
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour (2017, cited by this paper)
ImageNet Training in Minutes (2017, cited by this paper)
Improved Regularization of Convolutional Neural Networks with Cutout (2017, cited by this paper)
Katyusha: the first direct acceleration of stochastic gradient methods (2016, cited by this paper)
Wide Residual Networks (2016, cited by this paper)
Optimizing Neural Networks with Kronecker-factored Approximate Curvature (2015, cited by this paper)
Deep Residual Learning for Image Recognition (2015, influential reference)
On Using Very Large Target Vocabulary for Neural Machine Translation (2014, cited by this paper)
New perspectives on the natural gradient method (2014, cited by this paper)
Adam: A Method for Stochastic Optimization (2014, influential reference)
New Insights and Perspectives on the Natural Gradient Method (2014, cited by this paper)
Analysis and Design of Optimization Algorithms via Integral Quadratic Constraints (2014, cited by this paper)
On the importance of initialization and momentum in deep learning (2013, cited by this paper)
Accelerating Stochastic Gradient Descent using Predictive Variance Reduction (2013, influential reference)
Adaptive Restart for Accelerated Gradient Schemes (2012, cited by this paper)
No more pesky learning rates (2012, cited by this paper)
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization (2011, cited by this paper)
Sharpness in rates of convergence for the symmetric Lanczos method (2010, cited by this paper)
ImageNet: A large-scale hierarchical image database (2009, influential reference)
Learning Multiple Layers of Features from Tiny Images (2009, influential reference)
05-01 Sharpness in Rates of Convergence For CG and Symmetric Lanczos Methods (2005, cited by this paper)
Neural Networks: Tricks of the Trade (2002, cited by this paper)
Long Short-Term Memory (1997, cited by this paper)
Extrapolation methods: theory and practice (1993, cited by this paper)
Building a Large Annotated Corpus of English: The Penn Treebank (1993, cited by this paper)
Acceleration of stochastic approximation by averaging (1992, cited by this paper)
Efficient Estimations from a Slowly Convergent Robbins-Monro Process (1988, cited by this paper)
Using fast weights to deblur old memories (1987, cited by this paper)
A method for solving the convex programming problem with convergence rate O(1/k^2) (1983, cited by this paper)
Iterative Procedures for Nonlinear Integral Equations (1965, cited by this paper)
Some methods of speeding up the convergence of iteration methods (1964, cited by this paper)

CITED BY

TrasMuon: Trust-Region Adaptive Scaling for Orthogonalized Momentum Optimizers (2026, cites this paper)
Unsupervised Modular Adaptive Region Growing and RegionMix Classification for Wind Turbine Segmentation (2026, cites this paper)
Dual-branch time-frequency network with channel masking for automated insect sound monitoring toward field environments (2026, cites this paper)
Enabling Progressive Whole-slide Image Analysis with Multi-scale Pyramidal Network (2026, cites this paper)
EMA Policy Gradient: Taming Reinforcement Learning for LLMs with EMA Anchor and Top-k KL (2026, cites this paper)
Adversarial Example Generation for Infrared Images (2026, cites this paper)
Fractional-order gradient descent method based on fractional-order term exponential decay and its application in artificial neural networks (2026, cites this paper)
Parent-Guided Adaptive Reliability (PGAR): A Behavioural Meta-Learning Framework for Stable and Trustworthy AI (2026, cites this paper)
Neighborhood and Global Perturbations Supported Sharpness-Aware Minimization in Federated Learning: From Local Tweaks to Global Awareness (2026, cites this paper)
LipNeXt: Scaling up Lipschitz-based Certified Robustness to Billion-parameter Models (2026, cites this paper)
Optimization Algorithms for Brain MRI-Based Alzheimer’s Disease Classification: A Comprehensive Review and Methodological Framework (2026, cites this paper)
Training Memory in Deep Neural Networks: Mechanisms, Evidence, and Measurement Gaps (2026, cites this paper)
Self-Supervised Continual Learning for SAR-ATR: A Local Feature Adaptation Framework (2026, cites this paper)
Leap+Verify: Regime-Adaptive Speculative Weight Prediction for Accelerating Neural Network Training (2026, cites this paper)
Multi-level perception cross-modal fusion framework for multimodal sentiment analysis (2026, influential citation)
Automatic Stability and Recovery for Neural Network Training (2026, cites this paper)
TruKAN: Towards More Efficient Kolmogorov-Arnold Networks Using Truncated Power Functions (2026, cites this paper)
Frequency-Based Hyperparameter Selection in Games (2026, influential citation)
Multitask learning via task embeddings for glass property prediction with improved sample efficiency (2026, cites this paper)
Open-Vocabulary Semantic Segmentation in Remote Sensing via Hierarchical Attention Masking and Model Composition (2026, cites this paper)
TriTrackNet: A dual-channel time series forecasting model with multi-path interaction and perturbation optimization (2025, cites this paper)
Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs (2025, cites this paper)
Improved GAN's with Game Theory Training for Audio Generation (2025, cites this paper)
Unifying Sign and Magnitude for Optimizing Deep Vision Networks via ThermoLion (2025, cites this paper)
A Risk-Neutral Neural Operator for Arbitrage-Free SPX-VIX Term Structures (2025, cites this paper)
Beyond Accuracy: The Role of Calibration in Computational Pathology (2025, cites this paper)
Beyond Adam: Disentangling Optimizer Effects in the Fine-Tuning of Atomistic Foundation Models (2025, cites this paper)
Semantic image transmission via GAN inversion-driven channel-joint semantic communication (2025, influential citation)
Classifying long legal documents using short random chunks (2025, cites this paper)
AgriFormer: Advancing 3D LiDAR-based Biomass Prediction through Hierarchical Feature Learning (2025, cites this paper)
BDS-Adam optimizer integrating adaptive variance rectification with semi-adaptive gradient smoothing (2025, cites this paper)
Spectral momentum integration: hybrid optimization of frequency and time domain gradients (2025, cites this paper)
A benchmark study of optimizers for short-term solar PV power forecasting using neural networks under real-world constraints (2025, cites this paper)
MemristiveAdamW: An Optimization Algorithm for Spiking Neural Networks Incorporating Memristive Effects (2025, cites this paper)
Gradient Descent with Provably Tuned Learning-rate Schedules (2025, cites this paper)
Enhanced Interpretable Neural Network Approach for Unified Batch Effect Mitigation and Disease Classification Using Cross-Cohort Microbiome Profiles (2025, cites this paper)
Improving adversarial transferability via adaptive ensemble attack with post-optimization (2025, cites this paper)
StyleDemorpher: high-quality face demorphing via StyleGAN2’s latent space (2025, cites this paper)
PrimeNet: rational design of Prime editing pegRNAs by deep learning (2025, cites this paper)
NoLoCo: No-all-reduce Low Communication Training Method for Large Models (2025, influential citation)
A neural network based on back-propagation and cooperative co-evolution (2025, cites this paper)
Towards Understanding The Calibration Benefits of Sharpness-Aware Minimization (2025, cites this paper)
The Price Equation Reveals a Universal Force–Metric–Bias Law of Algorithmic Learning and Natural Selection (2025, cites this paper)
Differential Mamba (2025, cites this paper)
Transformer spectral optimization: From gradient frequency analysis to adaptive spectral integration (2025, cites this paper)
Neural Scaling Laws Surpass Chemical Accuracy for the Many-Electron Schrödinger Equation (2025, cites this paper)
Short-Term Solar PV Power Forecasting: A Comparative Analysis of Neural Network Optimization Techniques (2025, cites this paper)
Development of Deep Learning Optimizers: Approaches, Concepts, and Update Rules (2025, cites this paper)
Slow-Fast Policy Optimization: Reposition-Before-Update for LLM Reasoning (2025, cites this paper)
G2LFormer: Global-to-Local Query Enhancement for Robust Table Structure Recognition (2025, cites this paper)
Gradient Flow Matching for Learning Update Dynamics in Neural Network Training (2025, cites this paper)
WarpGAN: Warping-Guided 3D GAN Inversion with Style-Based Novel View Inpainting (2025, cites this paper)
FracGrad: A Discretized Riemann–Liouville Fractional Integral Approach to Gradient Accumulation for Deep Learning (2025, cites this paper)
Stride conversion algorithms for convolutional layers and its application to sampling-frequency-independent deep neural networks (2025, cites this paper)
Enhancing next token prediction based pre-training for jet foundation models (2025, cites this paper)
Investigating Mask-aware Prototype Learning for Tabular Anomaly Detection (2025, cites this paper)
Hindsight-Guided Momentum (HGM) Optimizer: An Approach to Adaptive Learning Rate (2025, influential citation)
Adaptive exploration and temporal attention in reinforcement learning for autonomous air combat decision making (2025, cites this paper)
FedEve: On Bridging the Client Drift and Period Drift for Cross-device Federated Learning (2025, cites this paper)
Accelerating Learned Image Compression Through Modeling Neural Training Dynamics (2025, influential citation)
Optimizing on-demand ride-hailing services in two-sided coupled markets with impatient riders (2025, cites this paper)
A novel Neural-ODE model for the state of health estimation of lithium-ion battery using charging curve (2025, influential citation)
A Deep Learning-Based Approach for Cell Segmentation in Phase-Contrast Images (2025, cites this paper)
PanSpace: Fast and Scalable Indexing for Massive Bacterial Databases (2025, cites this paper)
Analyzing and Optimizing Perturbation of DP-SGD Geometrically (2025, cites this paper)
Dynamic bound adaptive gradient methods with belief in observed gradients (2025, cites this paper)
Federated Learning via Meta-Variational Dropout (2025, cites this paper)
A lightweight coal-gangue detection model based on parallel deep residual networks (2025, cites this paper)
Hierarchical Semantic Compression for Consistent Image Semantic Restoration (2025, influential citation)
Actions Speak Louder Than Words: Rate-Reward Trade-off in Markov Decision Processes (2025, cites this paper)
Frankenstein Optimizer: Harnessing the Potential by Revisiting Optimization Tricks (2025, cites this paper)
Communication-Efficient Language Model Training Scales Reliably and Robustly: Scaling Laws for DiLoCo (2025, cites this paper)
DGSAM: Domain Generalization via Individual Sharpness-Aware Minimization (2025, cites this paper)
BSMatch: Boundary Segmentation and Matching for Lipid Droplet Quantification in Diagnosis of Non-Alcoholic Fatty Liver Disease (2025, cites this paper)
Semi-supervised semantic segmentation of cell nuclei with diffusion model and collaborative learning (2025, cites this paper)
HVAdam: A Full-Dimension Adaptive Optimizer (2025, cites this paper)
Bridging Domain Gaps in Computational Pathology: A Comparative Study of Adaptation Strategies (2025, cites this paper)
A computer vision-based approach for identification of non-metallic inclusions in the steel industry products (2025, cites this paper)
A Unified Gradient-based Framework for Task-agnostic Continual Learning-Unlearning (2025, cites this paper)
Enhancing Certified Robustness via Block Reflector Orthogonal Layers and Logit Annealing Loss (2025, cites this paper)
Input normalized stochastic gradient descent for language tasks (2025, cites this paper)
MuLoCo: Muon is a practical inner optimizer for DiLoCo (2025, influential citation)
Local Equivariance Error-Based Metrics for Evaluating Sampling-Frequency-Independent Property of Neural Network (2025, cites this paper)
SASFNet: Soft-edge awareness and spatial-attention feedback deep network for blind image deblurring (2025, cites this paper)
Quantum-Inspired Differentiable Integral Neural Networks (QIDINNs): A Feynman-Based Architecture for Continuous Learning Over Streaming Data (2025, cites this paper)
MSMMIL: Multi-scan Mamba-based Multiple Instance Learning for whole slide image classification (2025, cites this paper)
Low-Complexity Semantic Packet Aggregation for Token Communication via Lookahead Search (2025, cites this paper)
Heart rate and respiratory rate prediction from noisy real-world smartphone based on Deep Learning methods (2025, cites this paper)
Recent Advances in Optimization Methods for Machine Learning: A Systematic Review (2025, influential citation)
Predicting Flow-Induced Vibration in Isolated and Tandem Cylinders Using Hypergraph Neural Networks (2025, cites this paper)
Cracking Instance Jigsaw Puzzles: An Alternative to Multiple Instance Learning for Whole Slide Image Analysis (2025, cites this paper)
FedSWA: Improving Generalization in Federated Learning with Highly Heterogeneous Data via Momentum-Based Stochastic Controlled Weight Averaging (2025, cites this paper)
Revisiting Replay and Gradient Alignment for Continual Pre-Training of Large Language Models (2025, cites this paper)
Welcome New Doctor: Continual Learning with Expert Consultation and Autoregressive Inference for Whole Slide Image Analysis (2025, cites this paper)
LadderMIL: Multiple Instance Learning with Coarse-to-Fine Self-Distillation (2025, cites this paper)
A transient stability assessment method for power systems incorporating residual networks and BiGRU-Attention (2025, cites this paper)
Generalization and Optimization of SGD with Lookahead (2025, influential citation)
AdaR: An Adaptive Gradient Method with Cyclical Restarting of Moment Estimations (2025, influential citation)
OmniJet-α_C: learning point cloud calorimeter simulations using generative transformers (2025, cites this paper)
Sine and cosine based learning rate for gradient descent method (2025, influential citation)