Train faster, generalize better: Stability of stochastic gradient descent

Published 2015 in International Conference on Machine Learning

ABSTRACT

We show that parametric models trained by a stochastic gradient method (SGM) with few iterations have vanishing generalization error. We prove our results by arguing that SGM is algorithmically stable in the sense of Bousquet and Elisseeff. Our analysis only employs elementary tools from convex and continuous optimization. We derive stability bounds for both convex and non-convex optimization under standard Lipschitz and smoothness assumptions. Applying our results to the convex case, we provide new insights for why multiple epochs of stochastic gradient methods generalize well in practice. In the non-convex case, we give a new interpretation of common practices in neural networks, and formally show that popular techniques for training large deep models are indeed stability-promoting. Our findings conceptually underscore the importance of reducing training time beyond its obvious benefit.
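
The stability notion the abstract refers to can be probed numerically. The sketch below is an illustration only, not the paper's experiments. It runs a constant-step stochastic gradient method on two synthetic datasets that differ in a single example, feeds both runs the same sequence of sampled indices, and reports how far apart the two parameter vectors end up. Everything in it is an assumption made for the example: the logistic loss, the unit-norm random features, the step size of 0.1, and the hypothetical helper names `logistic_grad`, `sgd`, and `parameter_gap`. With unit-norm features this loss is convex, 1-Lipschitz, and smooth, so with this step size the gap can grow by at most 2 * step each time the swapped example is drawn and cannot grow on any other step.

```python
import math
import random


def logistic_grad(w, x, y):
    """Gradient of log(1 + exp(-y * <w, x>)) with respect to w."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    # coeff = sigmoid(-margin), computed without overflowing exp().
    if margin >= 0:
        e = math.exp(-margin)
        coeff = e / (1.0 + e)
    else:
        coeff = 1.0 / (1.0 + math.exp(margin))
    return [-y * coeff * xi for xi in x]


def unit_vector(rng, dim):
    """Random direction scaled to unit Euclidean norm."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(vi * vi for vi in v))
    return [vi / norm for vi in v]


def random_example(rng, dim):
    """One synthetic (features, label) pair with a label in {-1, +1}."""
    return unit_vector(rng, dim), rng.choice([-1.0, 1.0])


def sgd(data, indices, step, dim):
    """Constant-step SGM started at zero, visiting examples in the given order."""
    w = [0.0] * dim
    for i in indices:
        x, y = data[i]
        g = logistic_grad(w, x, y)
        w = [wi - step * gi for wi, gi in zip(w, g)]
    return w


def parameter_gap(n=200, dim=5, steps=1000, step=0.1, seed=0):
    """Train on S and on a neighbor S' (one example replaced); return (gap, hits)."""
    rng = random.Random(seed)
    s = [random_example(rng, dim) for _ in range(n)]
    swapped = rng.randrange(n)
    s_prime = list(s)
    s_prime[swapped] = random_example(rng, dim)
    # Both runs share the same sampled indices, so only the data differ.
    indices = [rng.randrange(n) for _ in range(steps)]
    w = sgd(s, indices, step, dim)
    w_prime = sgd(s_prime, indices, step, dim)
    gap = math.sqrt(sum((a - b) ** 2 for a, b in zip(w, w_prime)))
    hits = indices.count(swapped)  # how often the differing example was drawn
    return gap, hits


if __name__ == "__main__":
    for t in (100, 1000, 10000):
        gap, hits = parameter_gap(steps=t)
        print(f"steps={t:6d}  hits={hits:3d}  gap={gap:.6f}  cap={2 * 0.1 * hits:.6f}")
```

Running the script prints the observed gap next to the cap of 2 * step * hits for a few iteration counts. Because the expected number of hits is steps / n, the cap shrinks as the run gets shorter or the dataset gets larger, which is the qualitative message of the abstract.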

PUBLICATION RECORD

Publication year
2015
Venue
International Conference on Machine Learning
Publication date
2015-09-03
Fields of study
Mathematics, Computer Science
Identifiers
arXiv 1509.01240
Source metadata
Semantic Scholar

REFERENCES

Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods (2017, cited by this paper)
Introduction to Optimization (2016, cited by this paper)
On the Generalization Properties of Differential Privacy (2015, cited by this paper)
Generalization Bounds for Neural Networks through Tensor Factorization (2015, cited by this paper)
Simple, Efficient, and Neural Algorithms for Sparse Coding (2015, cited by this paper)
Non-stochastic Best Arm Identification and Hyperparameter Optimization (2015, cited by this paper)
Introductory Lectures on Convex Optimization - A Basic Course (2014, cited by this paper)
Analysis and Design of Optimization Algorithms via Integral Quadratic Constraints (2014, cited by this paper)
On the Computational Efficiency of Training Neural Networks (2014, cited by this paper)
Competing with the Empirical Risk Minimizer in a Single Pass (2014, cited by this paper)
In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning (2014, cited by this paper)
Dropout: a simple way to prevent neural networks from overfitting (2014, cited by this paper)
Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization (2014, cited by this paper)
Learning with Incremental Iterative Regularization (2014, cited by this paper)
Going deeper with convolutions (2014, cited by this paper)
Recurrent Neural Network Regularization (2014, cited by this paper)
First-order methods of smooth convex optimization with inexact oracle (2013, cited by this paper)
Proximal Algorithms (2013, influential reference)
Stochastic Approximation approach to Stochastic Programming (2013, cited by this paper)
Stochastic First- and Zeroth-Order Methods for Nonconvex Stochastic Programming (2013, cited by this paper)
Almost Optimal Exploration in Multi-Armed Bandits (2013, cited by this paper)
ImageNet classification with deep convolutional neural networks (2012, cited by this paper)
An optimal method for stochastic composite optimization (2011, cited by this paper)
Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization (2011, cited by this paper)
Learnability, Stability and Uniform Convergence (2010, cited by this paper)
The Tradeoffs of Large Scale Learning (2007, cited by this paper)
Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization (2006, cited by this paper)
Logarithmic regret algorithms for online convex optimization (2006, cited by this paper)
A simple recursive numerical method for Bermudan option pricing under Lévy processes (2006, cited by this paper)
Learnability (2005, cited by this paper)
Signal Recovery by Proximal Forward-Backward Splitting (2005, cited by this paper)
Stability of Randomized Learning Algorithms (2005, cited by this paper)
Stochastic Approximation and Recursive Algorithms and Applications (2003, cited by this paper)
Introductory Lectures on Convex Optimization (2003, cited by this paper)
Stability and Generalization (2002, cited by this paper)
Large Margin Classification Using the Perceptron Algorithm (1998, cited by this paper)
Algorithmic Stability and Sanity-Check Bounds for Leave-One-Out Cross-Validation (1997, cited by this paper)
Weak convergence and local stability properties of fixed step size recursive algorithms (1993, cited by this paper)
Building a Large Annotated Corpus of English: The Penn Treebank (1993, influential reference)
A Simple Weight Decay Can Improve Generalization (1991, cited by this paper)
Adaptive Algorithms and Stochastic Approximations (1990, cited by this paper)
Introduction to optimization (1987, influential reference)
Learning internal representations by error propagation (1986, cited by this paper)
Problem Complexity and Method Efficiency in Optimization (1983, cited by this paper)
Distribution-free performance bounds for potential function rules (1979, influential reference)
Monotone Operators and the Proximal Point Algorithm (1976, influential reference)

CITED BY

Sequential Subspace Noise Injection Prevents Accuracy Collapse in Certified Unlearning (2026, cites this paper)
What Do Learned Models Measure? (2026, cites this paper)
HomeAdam: Adam and AdamW Algorithms Sometimes Go Home to Obtain Better Provable Generalization (2026, influential citation)
Towards A Unified PAC-Bayesian Framework for Norm-based Generalization Bounds (2026, cites this paper)
Weight Decay Improves Language Model Plasticity (2026, cites this paper)
Training Memory in Deep Neural Networks: Mechanisms, Evidence, and Measurement Gaps (2026, cites this paper)
Adaptive Dual-Weighting Framework for Federated Learning via Out-of-Distribution Detection (2026, cites this paper)
Generalization Bounds of Stochastic Gradient Descent in Homogeneous Neural Networks (2026, influential citation)
Supervised Learning as Lossy Compression: Characterizing Generalization and Sample Complexity via Finite Blocklength Analysis (2026, influential citation)
Penalizing Localized Dirichlet Energies in Low Rank Tensor Products (2026, cites this paper)
Stability and Generalization of Nonconvex Optimization with Heavy-Tailed Noise (2026, cites this paper)
TRACE: Theoretical Risk Attribution under Covariate-shift Effects (2026, cites this paper)
Stability and Generalization of Push-Sum Based Decentralized Optimization over Directed Graphs (2026, influential citation)
Membership Inference Attacks from Causal Principles (2026, cites this paper)
Towards a Theoretical Understanding to the Generalization of RLHF (2026, cites this paper)
Understanding Model Merging: A Unified Generalization Framework for Heterogeneous Experts (2026, cites this paper)
All ERMs Can Fail in Stochastic Convex Optimization Lower Bounds in Linear Dimension (2026, cites this paper)
Model Agreement via Anchoring (2026, cites this paper)
On the Geometric Coherence of Global Aggregation in Federated GNN (2026, cites this paper)
Pool-based Active Learning as Noisy Lossy Compression: Characterizing Label Complexity via Finite Blocklength Analysis (2026, influential citation)
Quantum-stable robust principal component analysis: theory and evidence from NISQ regimes (2026, cites this paper)
Stable Source Coding (2026, cites this paper)
SGD-Based Knowledge Distillation with Bayesian Teachers: Theory and Guidelines (2026, cites this paper)
HeterCSI: Channel-Adaptive Heterogeneous CSI Pretraining Framework for Generalized Wireless Foundation Models (2026, cites this paper)
A Unified Matrix-Spectral Framework for Stability and Interpretability in Deep Learning (2026, cites this paper)
A Function-Space Stability Boundary for Generalization in Interpolating Learning Systems (2026, influential citation)
Conformal Risk Control for Non-Monotonic Losses (2026, cites this paper)
Projected Hessian Learning: Fast Curvature Supervision for Accurate Machine-Learning Interatomic Potentials (2026, cites this paper)
Sufficient Conditions for Stability of Minimum-Norm Interpolating Deep ReLU Networks (2026, cites this paper)
On the Superlinear Relationship between SGD Noise Covariance and Loss Landscape Curvature (2026, cites this paper)
Learning from Mistakes: Negative Reasoning Samples Enhance Out-of-Domain Generalization (2026, cites this paper)
Learning to Optimize by Differentiable Programming (2026, cites this paper)
On the Theory of Continual Learning with Gradient Descent for Neural Networks (2025, cites this paper)
On The Statistical Limits of Self-Improving Agents (2025, cites this paper)
Beyond Real Data: Synthetic Data through the Lens of Regularization (2025, cites this paper)
Feature Dynamics as Implicit Data Augmentation: A Depth-Decomposed View on Deep Neural Network Generalization (2025, cites this paper)
Stability and Generalization for Stochastic (Compositional) Optimizations (2025, influential citation)
OrthAlign: Orthogonal Subspace Decomposition for Non-Interfering Multi-Objective Alignment (2025, cites this paper)
Unveiling the Power of Multiple Gossip Steps: A Stability-Based Generalization Analysis in Decentralized Training (2025, influential citation)
RL-Guided Data Selection for Language Model Finetuning (2025, cites this paper)
Approximation, Estimation and Optimization Errors for a Deep Neural Network (2025, cites this paper)
Beyond Ordinary Lipschitz Constraints: Differentially Private Stochastic Optimization with Tsybakov Noise Condition (2025, cites this paper)
Convergence and Generalization of Anti-Regularization for Parametric Models (2025, cites this paper)
Is Exchangeability better than I.I.D to handle Data Distribution Shifts while Pooling Data for Data-scarce Medical image segmentation? (2025, cites this paper)
Effective Sample Size and Generalization Bounds for Temporal Networks (2025, cites this paper)
Feature loop consistency optimization for enhanced control precision in text-to-image generation (2025, cites this paper)
Heart Disease Prediction: A Comparative Study of Optimisers Performance in Deep Neural Networks (2025, cites this paper)
The Relative Instability of Model Comparison with Cross-validation (2025, cites this paper)
Decoding 3D Geometry: Deep Networks for Mesh Classification (2025, cites this paper)
Frozen in Time: Parameter-Efficient Time Series Transformers via Reservoir-Induced Feature Expansion and Fixed Random Dynamics (2025, cites this paper)
Generalization Bound for a General Class of Neural Ordinary Differential Equations (2025, cites this paper)
Stability and Generalization for Bellman Residuals (2025, cites this paper)
PAC–Bayes Guarantees for Data-Adaptive Pairwise Learning (2025, cites this paper)
On the MIA Vulnerability Gap Between Private GANs and Diffusion Models (2025, cites this paper)
Deep learning for simulating the evolution of condensed matter systems at the continuum scale: methods and applications (2025, cites this paper)
Generalization and Optimization of SGD with Lookahead (2025, cites this paper)
Adversarial Training for Graph Convolutional Networks: Stability and Generalization Analysis (2025, influential citation)
Stability and Generalization of Adversarial Diffusion Training (2025, cites this paper)
Structured light meets machine intelligence (2025, cites this paper)
Optimal Rates for Generalization of Gradient Descent for Deep ReLU Classification (2025, cites this paper)
Spectral Thresholds for Identifiability and Stability: Finite-Sample Phase Transitions in High-Dimensional Learning (2025, cites this paper)
Calculation of entanglement entropy of transverse-field Ising model from neural network quantum state based on a restricted Boltzmann machine (2025, cites this paper)
On the Alignment Between Supervised and Self-Supervised Contrastive Learning (2025, cites this paper)
Investigating the Role of Weight Decay in Enhancing Nonconvex SGD (2025, cites this paper)
Rapid Overfitting of Multi-Pass SGD in Stochastic Convex Optimization (2025, influential citation)
Stability, Complexity and Data-Dependent Worst-Case Generalization Bounds (2025, influential citation)
Byzantine Failures Harm the Generalization of Robust Distributed Learning Algorithms More Than Data Poisoning (2025, influential citation)
AlphaDecay: Module-wise Weight Decay for Heavy-Tailed Balancing in LLMs (2025, cites this paper)
ICP: Immediate Compensation Pruning for Mid-to-high Sparsity (2025, cites this paper)
Implicit Regularisation in Diffusion Models: An Algorithm-Dependent Generalisation Analysis (2025, influential citation)
Stochastic optimization of large-scale parametrized dynamical systems (2025, cites this paper)
Faithful Group Shapley Value (2025, cites this paper)
Optimal Rates in Continual Linear Regression via Increasing Regularization (2025, cites this paper)
Convergence of Adam in Deep ReLU Networks via Directional Complexity and Kakeya Bounds (2025, cites this paper)
Gradient Descent as a Shrinkage Operator for Spectral Bias (2025, cites this paper)
Rapid Overfitting of Multi-Pass Stochastic Gradient Descent in Stochastic Convex Optimization (2025, influential citation)
A tighter generalization error bound for wide GCN based on loss landscape (2025, cites this paper)
Generalization Error Analysis for Attack-Free and Byzantine-Resilient Decentralized Learning with Data Heterogeneity (2025, influential citation)
DOME: Improving Signal-to-Noise in Stochastic Gradient Descent via Sharp-Direction Subspace Filtering (2025, cites this paper)
Stability Regularized Cross-Validation (2025, cites this paper)
Online Learning and Unlearning (2025, cites this paper)
Accelerating Natural Gradient Descent for PINNs with Randomized Numerical Linear Algebra (2025, cites this paper)
An Approach to Finding a Robust Deep Learning Model (2025, cites this paper)
Temperature is All You Need for Generalization in Langevin Dynamics and other Markov Processes (2025, cites this paper)
Towards Understanding The Calibration Benefits of Sharpness-Aware Minimization (2025, cites this paper)
Privacy Amplification in Differentially Private Zeroth-Order Optimization with Hidden States (2025, cites this paper)
EasyViT: An Adaptive Collaborative Edge Computing Framework for Vision Transformer (2025, cites this paper)
Generalization Bound of Gradient Flow through Training Trajectory and Data-dependent Kernel (2025, influential citation)
Improving Sample Efficiency Through Stability Enhancement in Deep-Reinforcement Learning (2025, influential citation)
SynDaCaTE: A Synthetic Dataset For Evaluating Part-Whole Hierarchical Inference (2025, cites this paper)
Tight Generalization Error Bounds for Stochastic Gradient Descent in Non-convex Learning (2025, cites this paper)
Recent Advances in Optimization Methods for Machine Learning: A Systematic Review (2025, cites this paper)
Improved monocular depth prediction using distance transform over pre-semantic contours with self-supervised neural networks (2025, cites this paper)
Adversarial Data Augmentation for Single Domain Generalization via Lyapunov Exponent-Guided Optimization (2025, cites this paper)
Learning Latent Graph Geometry via Fixed-Point Schrödinger-Type Activation: A Theoretical Study (2025, cites this paper)
FedSWA: Improving Generalization in Federated Learning with Highly Heterogeneous Data via Momentum-Based Stochastic Controlled Weight Averaging (2025, influential citation)
Cognitively-plausible reinforcement learning in epidemiological agent-based simulations (2025, cites this paper)
UniERF: A Uniform Embedding-based Retrieval Framework for E-commerce Search (2025, cites this paper)
Asymptotic Consistency and Generalization in Hybrid Models of Regularized Selection and Nonlinear Learning (2025, cites this paper)
From Continual Learning to SGD and Back: Better Rates for Continual Linear Models (2025, influential citation)