A Variational Analysis of Stochastic Gradient Algorithms

Published 2016 in International Conference on Machine Learning

ABSTRACT

Stochastic Gradient Descent (SGD) is an important algorithm in machine learning. With constant learning rates, it is a stochastic process that, after an initial phase of convergence, generates samples from a stationary distribution. We show that SGD with constant rates can be effectively used as an approximate posterior inference algorithm for probabilistic modeling. Specifically, we show how to adjust the tuning parameters of SGD such as to match the resulting stationary distribution to the posterior. This analysis rests on interpreting SGD as a continuous-time stochastic process and then minimizing the Kullback-Leibler divergence between its stationary distribution and the target posterior. (This is in the spirit of variational inference.) In more detail, we model SGD as a multivariate Ornstein-Uhlenbeck process and then use properties of this process to derive the optimal parameters. This theoretical framework also connects SGD to modern scalable inference algorithms; we analyze the recently proposed stochastic gradient Fisher scoring under this perspective. We demonstrate that SGD with properly chosen constant rates gives a new way to optimize hyperparameters in probabilistic models.
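
The stationary-distribution claim in the abstract can be illustrated on a toy problem. The sketch below is not the paper's experiment: it assumes a one-dimensional quadratic loss 0.5 * a * theta^2 and additive Gaussian gradient noise with standard deviation sigma, and the names simulate_constant_rate_sgd, exact_stationary_variance, and ou_stationary_variance are hypothetical helpers introduced only for this illustration. Under these assumptions the constant-rate SGD iterates form an AR(1) process whose stationary variance is eps * sigma^2 / (a * (2 - eps * a)), and the small-step Ornstein-Uhlenbeck approximation gives eps * sigma^2 / (2 * a).

```python
import random


def simulate_constant_rate_sgd(a=1.0, sigma=1.0, eps=0.1,
                               n_steps=200_000, burn_in=1_000, seed=0):
    """Run constant-rate SGD on the toy loss 0.5 * a * theta**2.

    The gradient is corrupted by Gaussian noise (an assumption of this
    sketch). Returns the empirical variance of the iterates after burn-in.
    """
    rng = random.Random(seed)
    theta = 5.0  # start far from the optimum at 0
    samples = []
    for step in range(n_steps):
        noisy_grad = a * theta + rng.gauss(0.0, sigma)
        theta -= eps * noisy_grad  # constant learning rate, never decayed
        if step >= burn_in:
            samples.append(theta)
    mean = sum(samples) / len(samples)
    return sum((s - mean) ** 2 for s in samples) / len(samples)


def exact_stationary_variance(a, sigma, eps):
    # Fixed point of v = (1 - eps*a)**2 * v + eps**2 * sigma**2 (AR(1) process).
    return eps * sigma ** 2 / (a * (2.0 - eps * a))


def ou_stationary_variance(a, sigma, eps):
    # Small-step (continuous-time Ornstein-Uhlenbeck) approximation.
    return eps * sigma ** 2 / (2.0 * a)


empirical = simulate_constant_rate_sgd()
exact = exact_stationary_variance(1.0, 1.0, 0.1)
approx = ou_stationary_variance(1.0, 1.0, 0.1)
print(f"empirical={empirical:.4f} exact={exact:.4f} ou_approx={approx:.4f}")
```

In this toy setting the learning rate eps directly scales the spread of the stationary distribution, which is the kind of tuning knob the abstract refers to when it speaks of matching the stationary distribution to a posterior.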

PUBLICATION RECORD

Publication year
2016
Venue
International Conference on Machine Learning
Publication date
2016-02-08
Fields of study
Mathematics, Computer Science
Identifiers
arXiv 1602.02666
Source metadata
Semantic Scholar

REFERENCES

Pattern Recognition And Machine Learning
2016, influential reference
Towards Stability and Optimality in Stochastic Gradient Descent
2015, cited by this paper
From Averaging to Acceleration, There is Only a Step-size
2015, cited by this paper
A Complete Recipe for Stochastic Gradient MCMC
2015, influential reference
On the Convergence of Stochastic Gradient MCMC Algorithms with High-Order Integrators
2015, cited by this paper
Stochastic Modified Equations and Adaptive Stochastic Gradient Algorithms
2015, cited by this paper
Bridging the Gap between Stochastic Gradient MCMC and Stochastic Optimization
2015, cited by this paper
Automatic Variational Inference in Stan
2015, influential reference
Early Stopping is Nonparametric Variational Inference
2015, cited by this paper
Dynamics of Stochastic Gradient Algorithms
2015, cited by this paper
Averaged Least-Mean-Squares: Bias-Variance Trade-offs and Optimal Sampling Distributions
2015, cited by this paper
Statistical analysis of stochastic gradient methods for generalized linear models
2014, cited by this paper
Bayesian Sampling Using Stochastic Gradient Thermostats
2014, cited by this paper
Approximation Analysis of Stochastic Gradient Langevin Dynamics by using Fokker-Planck Equation and Ito Process
2014, cited by this paper
Stochastic Gradient Hamiltonian Monte Carlo
2014, cited by this paper
Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n)
2013, influential reference
Bayesian Posterior Sampling via Stochastic Gradient Fisher Scoring
2012, influential reference
Bayesian Learning via Stochastic Gradient Langevin Dynamics
2011, influential reference
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
2011, influential reference
Design and Analysis of Algorithms
2009, cited by this paper
Pattern Recognition and Machine Learning
2006, influential reference
Solving large scale linear prediction problems using stochastic gradient descent algorithms
2004, cited by this paper
Stochastic Approximation and Recursive Algorithms and Applications
2003, cited by this paper
An Introduction to Variational Methods for Graphical Models
1999, cited by this paper
On-line learning and stochastic approximations
1999, cited by this paper
Online Learning and Stochastic Approximations
1998, cited by this paper
Stochastic approximation and optimization of random systems
1992, cited by this paper
A fast scoring algorithm for maximum likelihood estimation in unbalanced mixed models with nested random effects
1987, cited by this paper
Handbook of Stochastic Methods
1983, cited by this paper
On the theory of Brownian motion
1973, influential reference
Efficient recursive estimation; application to estimating the parameters of a covariance function
1965, cited by this paper
A Stochastic Approximation Method
1951, cited by this paper
Brownian motion in a field of force and the diffusion model of chemical reactions
1940, cited by this paper

CITED BY

Neural Networks as Entropic Systems: Applications in Digital Pathology
2026, cites this paper
Adaptive Methods Are Preferable in High Privacy Settings: An SDE Perspective
2026, influential citation
Tail behavior of Markov-modulated generalized Ornstein-Uhlenbeck processes
2026, cites this paper
GRADSTOP: Early Stopping of Gradient Descent via Posterior Sampling
2025, cites this paper
Trajectory-Dependent Generalization Bounds for Pairwise Learning with φ-mixing Samples
2025, cites this paper
FedAdamW: A Communication-Efficient Optimizer with Convergence and Generalization Guarantees for Federated Large Models
2025, cites this paper
TPV: Parameter Perturbations Through the Lens of Test Prediction Variance
2025, cites this paper
Stochastic Variational Inference with Tuneable Stochastic Annealing
2025, cites this paper
Optimization on multifractal loss landscapes explains a diverse range of geometrical and dynamical properties of deep learning
2025, influential citation
Unbiased and Sign Compression in Distributed Learning: Comparing Noise Resilience via SDEs
2025, cites this paper
Generalization Bounds for Markov Algorithms through Entropy Flow Computations
2025, cites this paper
Models of Heavy-Tailed Mechanistic Universality
2025, cites this paper
SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training
2025, cites this paper
Stochastic gradient descent based variational inference for infinite-dimensional inverse problems
2025, influential citation
Algorithm- and Data-Dependent Generalization Bounds for Diffusion Models
2025, cites this paper
Adaptive stepsize algorithms for Langevin dynamics
2024, cites this paper
Variation Due to Regularization Tractably Recovers Bayesian Deep Learning
2024, cites this paper
Emergence of heavy tails in homogenized stochastic gradient descent
2024, cites this paper
SGD vs GD: Rank Deficiency in Linear Networks
2024, cites this paper
Soft Condorcet Optimization for Ranking of General Agents
2024, cites this paper
Identifying Drift, Diffusion, and Causal Structure from Temporal Snapshots
2024, cites this paper
Noise-Aware Differentially Private Variational Inference
2024, influential citation
Distributed Stochastic Optimization with Random Communication and Computational Delays: Optimal Policies and Performance Analysis
2024, cites this paper
An SDE Perspective on Stochastic Inertial Gradient Dynamics with Time-Dependent Viscosity and Geometric Damping
2024, cites this paper
To Clip or not to Clip: the Dynamics of SGD with Gradient Clipping in High-Dimensions
2024, cites this paper
Distributed Stochastic Gradient Descent With Staleness: A Stochastic Delay Differential Equation Based Framework
2024, influential citation
A Comparative Study of Classification Models for Cyberbullying Detection
2024, cites this paper
Surge Phenomenon in Optimal Learning Rate and Batch Size Scaling
2024, cites this paper
Autism spectrum disorder diagnosis using fractal and non-fractal-based functional connectivity analysis and machine learning methods
2024, cites this paper
Towards Understanding Convergence and Generalization of AdamW
2024, cites this paper
Stochastic Inertial Dynamics via Time Scaling and Averaging
2024, cites this paper
Analysing heavy-tail properties of Stochastic Gradient Descent by means of Stochastic Recurrence Equations
2024, cites this paper
Stochastic Gradient Flow Dynamics of Test Risk and its Exact Solution for Weak Features
2024, cites this paper
Understanding the Generalization Benefits of Late Learning Rate Decay
2024, cites this paper
Revisiting the Noise Model of Stochastic Gradient Descent
2023, cites this paper
Generalization Bounds using Data-Dependent Fractal Dimensions
2023, cites this paper
Stochastic collapse: how gradient noise attracts SGD dynamics towards simpler subnetworks
2023, cites this paper
(S)GD over Diagonal Linear Networks: Implicit bias, Large Stepsizes and Edge of Stability
2023, cites this paper
PCDP-SGD: Improving the Convergence of Differentially Private SGD via Projection in Advance
2023, cites this paper
Weight fluctuations in (deep) linear neural networks and a derivation of the inverse-variance flatness relation
2023, cites this paper
The Anytime Convergence of Stochastic Gradient Descent with Momentum: From a Continuous-Time Perspective
2023, cites this paper
Revisiting Logistic-softmax Likelihood in Bayesian Meta-Learning for Few-Shot Classification
2023, influential citation
RPCGB Method for Large-Scale Global Optimization Problems
2023, cites this paper
A new characterization of the edge of stability based on a sharpness measure aware of batch gradient distribution
2023, cites this paper
Dynamical convergence analysis for nonconvex linearized proximal ADMM algorithms
2023, cites this paper
Generalization Bounds with Data-dependent Fractal Dimensions
2023, cites this paper
Implicit Jacobian regularization weighted with impurity of probability output
2023, cites this paper
Implicit Bias of (Stochastic) Gradient Descent for Rank-1 Linear Neural Network
2023, cites this paper
(S)GD over Diagonal Linear Networks: Implicit Regularisation, Large Stepsizes and Edge of Stability
2023, cites this paper
Robust Meta-learning with Sampling Noise and Label Noise via Eigen-Reptile
2022, cites this paper
On Generalization Bounds for Deep Networks Based on Loss Surface Implicit Regularization
2022, cites this paper
Unifying supervised learning and VAEs - automating statistical inference in (astro-)particle physics with amortized conditional normalizing flows
2022, cites this paper
Convergence Rates for Stochastic Approximation on a Boundary
2022, cites this paper
Why does SGD prefer flat minima?: Through the lens of dynamical systems
2022, cites this paper
A Review of Data-Driven Discovery for Dynamic Systems
2022, cites this paper
Improving information retention in large scale online continual learning
2022, cites this paper
A Bayesian Approach for Spatio-Temporal Data-Driven Dynamic Equation Discovery
2022, influential citation
A Bayesian Approach for Data-Driven Dynamic Equation Discovery
2022, cites this paper
Distributed Learning with Strategic Users: A Repeated Game Approach
2022, cites this paper
Implicit Regularization or Implicit Conditioning? Exact Risk Trajectories of SGD in High Dimensions
2022, cites this paper
A Modified Stein Variational Inference Algorithm with Bayesian and Gradient Descent Techniques
2022, cites this paper
Deep neural networks with dependent weights: Gaussian Process mixture limit, heavy tails, sparsity and compressibility
2022, cites this paper
An automatic learning rate decay strategy for stochastic gradient descent optimization methods in neural networks
2022, cites this paper
Stochastic Gradient Descent with Noise of Machine Learning Type Part II: Continuous Time Analysis
2021, cites this paper
Rethinking the limiting dynamics of SGD: modified loss, phase space oscillations, and anomalous diffusion
2021, influential citation
Exponential escape efficiency of SGD from sharp minima in non-stationary regime
2021, cites this paper
FedCor: Correlation-Based Active Client Selection Strategy for Heterogeneous Federated Learning
2021, cites this paper
Dynamics of Stochastic Momentum Methods on Large-scale, Quadratic Models
2021, cites this paper
A Continuous-time Stochastic Gradient Descent Method for Continuous Data
2021, cites this paper
An Adaptive Learning Rate Schedule for SIGNSGD Optimizer in Neural Networks
2021, cites this paper
Quasi-potential theory for escape problem: Quantitative sharpness effect on SGD's escape from local minima
2021, cites this paper
Stationary probability distributions of stochastic gradient descent and the success and failure of the diffusion approximation
2021, cites this paper
On the Hyperparameters in Stochastic Gradient Descent with Momentum
2021, cites this paper
Implicit Bias of SGD for Diagonal Linear Networks: a Provable Benefit of Stochasticity
2021, cites this paper
Generalization Properties of Stochastic Optimizers via Trajectory Analysis
2021, cites this paper
Communication-Efficient Federated Learning via Predictive Coding
2021, cites this paper
The Limiting Dynamics of SGD: Modified Loss, Phase-Space Oscillations, and Anomalous Diffusion
2021, influential citation
Advanced Free-rider Attacks in Federated Learning
2021, influential citation
FedGP: Correlation-Based Active Client Selection for Heterogeneous Federated Learning
2021, influential citation
Structured Stochastic Gradient MCMC
2021, cites this paper
SGD in the Large: Average-case Analysis, Asymptotics, and Stepsize Criticality
2021, cites this paper
Asymmetric Heavy Tails and Implicit Bias in Gaussian Noise Injections
2021, influential citation
FedGP: Correlation-Based Active Client Selection Strategy for Heterogeneous Federated Learning
2021, cites this paper
Generalization Bounds using Lower Tail Exponents in Stochastic Optimizers
2021, cites this paper
Sampling Sparse Representations with Randomized Measurement Langevin Dynamics
2021, cites this paper
Combining resampling and reweighting for faithful stochastic optimization
2021, cites this paper
Unifying supervised learning and VAEs: coverage, systematics and goodness-of-fit in normalizing-flow based neural network models for astro-particle reconstructions
2020, cites this paper
Fractional Underdamped Langevin Dynamics: Retargeting SGD with Momentum under Heavy-Tailed Gradient Noise
2020, cites this paper
Online Learning in Contextual Bandits using Gated Linear Networks
2020, cites this paper
On Learning Rates and Schrödinger Operators
2020, cites this paper
Analysis of stochastic gradient descent in continuous time
2020, cites this paper
Inherent Noise in Gradient Based Methods
2020, cites this paper
The Heavy-Tail Phenomenon in SGD
2020, cites this paper
Hausdorff Dimension, Stochastic Differential Equations, and Generalization in Neural Networks
2020, influential citation
Communication-Efficient Federated Learning via Optimal Client Sampling
2020, influential citation
Unifying supervised learning and VAEs - automating statistical inference in high-energy physics
2020, cites this paper
Towards Theoretically Understanding Why SGD Generalizes Better Than ADAM in Deep Learning
2020, influential citation
Hausdorff dimension, heavy tails, and generalization in neural networks
2020, cites this paper
SpHMC: Spectral Hamiltonian Monte Carlo
2019, cites this paper
Stochastic Gradient and Langevin Processes
2019, cites this paper