Towards Stability and Optimality in Stochastic Gradient Descent

Published 2015 in International Conference on Artificial Intelligence and Statistics

ABSTRACT

Iterative procedures for parameter estimation based on stochastic gradient descent allow the estimation to scale to massive data sets. However, in both theory and practice, they suffer from numerical instability. Moreover, they are statistically inefficient as estimators of the true parameter value. To address these two issues, we propose a new iterative procedure termed averaged implicit SGD (AI-SGD). For statistical efficiency, AI-SGD employs averaging of the iterates, which achieves the optimal Cramér-Rao bound under strong convexity, i.e., it is an optimal unbiased estimator of the true parameter value. For numerical stability, AI-SGD employs an implicit update at each iteration, which is related to proximal operators in optimization. In practice, AI-SGD achieves competitive performance with other state-of-the-art procedures. Furthermore, it is more stable than averaging procedures that do not employ proximal updates, and is simple to implement as it requires fewer tunable hyperparameters than procedures that do employ proximal updates.
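
ILLUSTRATIVE SKETCH

The two ingredients named in the abstract, an implicit update and averaging of the iterates, can be illustrated on a squared-error loss, where the implicit equation happens to have a closed-form solution. The following is a minimal sketch under stated assumptions, not the paper's implementation: the function name ai_sgd_least_squares, the step-size schedule gamma0 * n^(-decay), the default values, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np


def ai_sgd_least_squares(X, y, gamma0=1.0, decay=0.6):
    """Averaged implicit SGD sketch for the loss 0.5 * (y_i - x_i @ theta)**2.

    Hypothetical helper: the name, the schedule and the defaults are
    illustrative assumptions, not taken from the paper.
    """
    n, p = X.shape
    theta = np.zeros(p)      # current implicit SGD iterate
    theta_bar = np.zeros(p)  # running average of the iterates
    for i in range(n):
        x_i, y_i = X[i], y[i]
        gamma = gamma0 * (i + 1) ** (-decay)  # assumed step-size schedule
        # Implicit update: theta_new = theta + gamma * (y_i - x_i @ theta_new) * x_i.
        # The new iterate appears on both sides; for squared loss, solving for it
        # shrinks the usual SGD step by the factor 1 / (1 + gamma * ||x_i||^2).
        residual = y_i - x_i @ theta
        theta = theta + gamma / (1.0 + gamma * (x_i @ x_i)) * residual * x_i
        # Averaging: incremental form of the mean of all iterates so far.
        theta_bar += (theta - theta_bar) / (i + 1)
    return theta_bar


# Usage on synthetic data (all values below are made up for the example).
rng = np.random.default_rng(0)
theta_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(5000, 3))
y = X @ theta_true + 0.1 * rng.normal(size=5000)

theta_hat = ai_sgd_least_squares(X, y)
# A deliberately oversized initial step, to probe the stability of the update.
theta_big_step = ai_sgd_least_squares(X, y, gamma0=100.0)
```

Because the shrinkage factor keeps the effective step along x_i below 1 / ||x_i||^2, the update in this special case stays bounded even when gamma0 is chosen far too large. For losses without a closed-form solution, the implicit equation generally has to be solved numerically at each iteration.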

PUBLICATION RECORD

Publication year
2015
Venue
International Conference on Artificial Intelligence and Statistics
Publication date
2015-05-10
Fields of study
Mathematics, Computer Science
Identifiers
arXiv 1505.02417
Source metadata
Semantic Scholar

REFERENCES

Implicit stochastic approximation
2015, cited by this paper
Averaged Least-Mean-Squares: Bias-Variance Trade-offs and Optimal Sampling Distributions
2015, cited by this paper
Scalable estimation strategies based on stochastic approximations: classical results and new insights
2015, cited by this paper
Statistical analysis of stochastic gradient methods for generalized linear models
2014, influential reference
Introductory Lectures on Convex Optimization - A Basic Course
2014, cited by this paper
Implicit stochastic gradient descent
2014, influential reference
A Proximal Stochastic Gradient Method with Progressive Variance Reduction
2014, cited by this paper
Adam: A Method for Stochastic Optimization
2014, cited by this paper
Asymptotic and finite-sample properties of estimators based on stochastic gradients
2014, cited by this paper
Minimizing finite sums with the stochastic average gradient
2013, cited by this paper
Stochastic Approximation approach to Stochastic Programming
2013, influential reference
Proximal Algorithms
2013, cited by this paper
Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n)
2013, influential reference
Neural Networks: Tricks of the Trade
2012, cited by this paper
Stochastic Gradient Descent Tricks
2012, cited by this paper
Stochastic Gradient Descent for Non-smooth Optimization: Convergence Results and Optimal Averaging Schemes
2012, cited by this paper
Towards Optimal One Pass Large Scale Learning with Averaged Stochastic Gradient Descent
2011, influential reference
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
2011, cited by this paper
Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning
2011, influential reference
Adaptive Bound Optimization for Online Convex Optimization
2010, influential reference
Online Importance Weight Aware Updates
2010, cited by this paper
Importance Weight Aware Gradient Updates
2010, cited by this paper
Implicit Online Learning
2010, cited by this paper
Efficient Learning using Forward-Backward Splitting
2009, cited by this paper
A Geometric View of Non-Linear On-Line Stochastic Gradient Descent
2007, cited by this paper
Implicit Online Learning with Kernels
2006, cited by this paper
Solving large scale linear prediction problems using stochastic gradient descent algorithms
2004, influential reference
RCV1: A New Benchmark Collection for Text Categorization Research
2004, cited by this paper
Stochastic Learning
2003, cited by this paper
Introductory Lectures on Convex Optimization
2003, cited by this paper
The Elements of Statistical Learning: Data Mining, Inference, and Prediction
2001, cited by this paper
Comparison of neural networks and discriminant analysis in predicting forest cover types
1998, cited by this paper
Gradient-based learning applied to document recognition
1998, cited by this paper
Self-Consistency: A Fundamental Concept in Statistics
1996, cited by this paper
Stochastic approximation and optimization of random systems
1992, cited by this paper
Acceleration of stochastic approximation by averaging
1992, cited by this paper
Adaptive Algorithms and Stochastic Approximations
1990, influential reference
Efficient Estimations from a Slowly Convergent Robbins-Monro Process
1988, influential reference
Splitting Algorithms for the Sum of Two Nonlinear Operators
1979, cited by this paper
Almost Sure Approximation of the Robbins-Monro Process by Sums of Independent Random Variables
1977, influential reference
Monotone Operators and the Proximal Point Algorithm
1976, cited by this paper
Stochastic Approximation and Recursive Estimation
1976, cited by this paper
A learning method for system identification
1967, cited by this paper
Efficient recursive estimation; application to estimating the parameters of a covariance function
1965, cited by this paper
A Stochastic Approximation Method
1951, cited by this paper
Incremental Proximal Methods for Large Scale Convex Optimization
Year unknown, cited by this paper
Incremental proximal methods for large scale convex optimization
Year unknown, influential reference
The p-Norm Generalization of the LMS Algorithm for Adaptive Filtering
Year unknown, cited by this paper

CITED BY

Implicit Q-Learning and SARSA: Liberating Policy Control from Step-Size Calibration
2026, cites this paper
Exponentially weighted estimands and the exponential family: filtering, prediction and smoothing
2025, cites this paper
Bregman Stochastic Proximal Point Algorithm with Variance Reduction
2025, cites this paper
Revisiting Stochastic Proximal Point Methods: Generalized Smoothness and Similarity
2025, cites this paper
Enhancing Stochastic Optimization for Statistical Efficiency Using ROOT-SGD with Diminishing Stepsize
2024, cites this paper
Online estimation of the inverse of the Hessian for stochastic optimization with application to universal stochastic Newton algorithms
2024, cites this paper
Stochastic Proximal Point Methods for Monotone Inclusions under Expected Similarity
2024, cites this paper
Forward-backward Gaussian variational inference via JKO in the Bures-Wasserstein Space
2023, cites this paper
Sharper Analysis for Minibatch Stochastic Proximal Point Methods: Stability, Smoothness, and Deviation
2023, cites this paper
Stable Nonconvex-Nonconcave Training via Linear Interpolation
2023, cites this paper
Variance Reduction Techniques for Stochastic Proximal Point Algorithms
2023, cites this paper
"Plus/minus the learning rate": Easy and Scalable Statistical Inference with SGD
2023, cites this paper
Formwork pressure prediction in cast-in-place self-compacting concrete using deep learning
2023, cites this paper
Implicit Stochastic Gradient Descent for Training Physics-informed Neural Networks
2023, cites this paper
A PDE-based Explanation of Extreme Numerical Sensitivities and Edge of Stability in Training Neural Networks
2022, cites this paper
Surprising Instabilities in Training Deep Networks and a Theoretical Analysis
2022, cites this paper
Revisiting Deep Fisher Vectors: Using Fisher Information to Improve Object Classification
2022, cites this paper
Robust Observation-Driven Models Using Proximal-Parameter Updates
2022, cites this paper
Statistical inference with implicit SGD: proximal Robbins-Monro vs. Polyak-Ruppert
2022, cites this paper
A Theoretical Framework for Inference Learning
2022, cites this paper
The Dynamics of Functional Diversity throughout Neural Network Training
2021, cites this paper
Sub-linear convergence of a tamed stochastic gradient descent method in Hilbert space
2021, cites this paper
Stochastic Proximal Gradient Algorithm with Minibatches. Application to Large Scale Learning Models
2020, cites this paper
Stochastic Gradient Population Monte Carlo
2020, cites this paper
Sparse Convolutional Neural Networks for Genome-Wide Prediction
2020, cites this paper
General Convergence Analysis of Stochastic First-Order Methods for Composite Optimization
2020, cites this paper
Laguerre Polynomials and Gradient Descent Approach for Linear Quadratic Optimal Control
2020, cites this paper
New nonasymptotic convergence rates of stochastic proximal point algorithm for stochastic convex optimization
2020, cites this paper
Bellman filtering and smoothing for state–space models
2020, cites this paper
Stochastic SPG with Minibatches
2020, cites this paper
Sub-linear convergence of a stochastic proximal iteration method in Hilbert space
2020, cites this paper
Bellman filtering for state-space models
2020, cites this paper
Stochastic proximal splitting algorithm for stochastic composite minimization
2019, influential citation
Random gradient algorithms for convex minimization over intersection of simple sets
2019, cites this paper
Advances in Bayesian inference and stable optimization for large-scale machine learning problems
2019, cites this paper
New nonasymptotic convergence rates of stochastic proximal point algorithm for convex optimization problems with many constraints
2019, influential citation
On convergence rate of stochastic proximal point algorithm without strong convexity, smoothness or bounded gradients
2019, influential citation
A Generic Acceleration Framework for Stochastic Composite Optimization
2019, cites this paper
Stochastic proximal splitting algorithm for composite minimization
2019, influential citation
Scalable statistical inference for averaged implicit stochastic gradient descent
2019, cites this paper
Astro for Derivative-Based Stochastic Optimization: Algorithm Description & Numerical Experiments
2019, cites this paper
Stochastic (Approximate) Proximal Point Methods: Convergence, Optimality, and Adaptivity
2018, cites this paper
Robust Implicit Backpropagation
2018, cites this paper
Unbiased scalable softmax optimization
2018, influential citation
Convergence diagnostics for stochastic gradient descent with constant learning rate
2018, cites this paper
GPU_MF_SGD: A Novel GPU-Based Stochastic Gradient Descent Method for Matrix Factorization
2018, cites this paper
OR-SAGA: Over-relaxed stochastic average gradient mapping algorithms for finite sum minimization
2018, cites this paper
Fully Implicit Online Learning
2018, cites this paper
Convergence diagnostics for stochastic gradient descent with constant step size
2017, cites this paper
Nonasymptotic convergence of stochastic proximal point algorithms for constrained convex optimization
2017, influential citation
Stochastic Gradient Descent as Approximate Bayesian Inference
2017, influential citation
Nonasymptotic convergence of stochastic proximal point methods for constrained convex optimization
2017, influential citation
Stabilized Sparse Online Learning for Sparse Data
2016, cites this paper
Dynamical Behavior of a Stochastic Forward–Backward Algorithm Using Random Monotone Operators
2016, cites this paper
A Variational Analysis of Stochastic Gradient Algorithms
2016, cites this paper
Stochastic Gradient Methods for Principled Estimation with Large Data Sets
2015, influential citation
Implicit stochastic approximation
2015, cites this paper
Dynamical Behavior of a Stochastic Forward–Backward Algorithm Using Random Monotone Operators
2015, cites this paper
The proximal Robbins–Monro method
2015, cites this paper
Stochastic gradient descent methods for estimation with large data sets
2015, influential citation
Asymptotic and finite-sample properties of estimators based on stochastic gradients
2014, cites this paper