Convergence of Gradient Descent on Separable Data

M. S. Nacson, J. Lee, Suriya Gunasekar, N. Srebro, Daniel Soudry

Published 2018 in International Conference on Artificial Intelligence and Statistics

ABSTRACT

We provide a detailed study of the implicit bias of gradient descent when optimizing loss functions with strictly monotone tails, such as the logistic loss, over separable datasets. We address two basic questions: (a) under what conditions on the tail of the loss function does gradient descent converge in the direction of the $L_2$ maximum-margin separator? (b) how does the rate of margin convergence depend on the tail of the loss function and the choice of the step size? We show that for a large family of super-polynomial tailed losses, gradient descent iterates on linear networks of any depth converge in the direction of the $L_2$ maximum-margin solution, while this does not hold for losses with heavier tails. Within this family, for simple linear models we show that the optimal rate with a fixed step size is indeed attained by the commonly used exponentially tailed losses such as the logistic loss. However, with a fixed step size the optimal convergence rate is extremely slow, namely $1/\log(t)$, as also proved in Soudry et al. (2018). For linear models with exponential loss, we further prove that the convergence rate can be improved to $\log (t) /\sqrt{t}$ by using aggressive step sizes that compensate for the rapidly vanishing gradients. Numerical results suggest this method might be useful for deep networks.
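
The contrast between the two step-size regimes in the abstract can be illustrated with a small numerical sketch. Everything below is an assumption made for illustration rather than the authors' code: the four-point toy dataset, the helper names (`run_fixed_step`, `run_normalized`, `normalized_margin`), the fixed step size 0.01, and in particular the reading of "aggressive step sizes" as gradient normalization with a $1/\sqrt{t+1}$ schedule. The data are chosen so that the hard-margin solution is known in closed form, $\hat{w} = (0.5, 1)$, which lets the margin gap be measured directly.

```python
import numpy as np

# Hypothetical toy data: linearly separable, labels in {-1, +1}.
X = np.array([[2.0, 0.0], [0.0, -1.0], [2.0, 2.0], [-1.0, -3.0]])
y = np.array([1.0, -1.0, 1.0, -1.0])
Z = X * y[:, None]  # rows z_n = y_n * x_n, so the margin of point n is z_n . w

# For this data the hard-margin SVM solution (no bias term) is w_hat = (0.5, 1):
# the constraints 2*w1 >= 1 and w2 >= 1 are tight, the other two are slack.
W_HAT = np.array([0.5, 1.0])
GAMMA_STAR = 1.0 / np.linalg.norm(W_HAT)  # best achievable normalized margin


def normalized_margin(w):
    """Smallest margin of the unit-norm direction w / ||w||."""
    return np.min(Z @ w) / np.linalg.norm(w)


def run_fixed_step(steps, eta=0.01):
    """Plain gradient descent on the exponential loss sum_n exp(-z_n . w)."""
    w = np.zeros(2)
    for _ in range(steps):
        grad = -Z.T @ np.exp(-Z @ w)
        w = w - eta * grad
    return w


def run_normalized(steps):
    """Assumed 'aggressive' variant: unit-norm gradient, step 1/sqrt(t+1)."""
    w = np.zeros(2)
    for t in range(steps):
        m = Z @ w
        # Shift by the smallest margin before exponentiating: this rescales
        # the gradient but leaves its direction unchanged, and avoids underflow.
        p = np.exp(-(m - m.min()))
        direction = -Z.T @ p
        direction /= np.linalg.norm(direction)
        w = w - direction / np.sqrt(t + 1)
    return w


T = 10_000
gap_fixed = GAMMA_STAR - normalized_margin(run_fixed_step(T))
gap_normalized = GAMMA_STAR - normalized_margin(run_normalized(T))
print(f"margin gap, fixed step:      {gap_fixed:.5f}")
print(f"margin gap, normalized step: {gap_normalized:.5f}")
```

Under these assumptions, the normalized run should show a noticeably smaller margin gap after the same number of iterations. That is consistent with the $1/\log(t)$ versus $\log(t)/\sqrt{t}$ rates stated above, though a single toy run does not verify either rate.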

PUBLICATION RECORD

Publication year: 2018
Venue: International Conference on Artificial Intelligence and Statistics
Publication date: 2018-03-05
Fields of study: Mathematics, Computer Science
Identifiers: arXiv 1803.01905
Source metadata: Semantic Scholar

REFERENCES

Gradient descent aligns the layers of deep linear networks (2018, cited by this paper)
Stochastic Gradient Descent on Separable Data: Exact Convergence with a Fixed Learning Rate (2018, cited by this paper)
Implicit Bias of Gradient Descent on Linear Convolutional Networks (2018, influential reference)
Risk and parameter convergence of logistic regression (2018, influential reference)
Convergence of SGD in Learning ReLU Models with Separable Data (2018, cited by this paper)
Characterizing Implicit Bias in Terms of Optimization Geometry (2018, influential reference)
When will gradient methods converge to max-margin classifier under ReLU models? (2018, cited by this paper)
The Implicit Bias of Gradient Descent on Separable Data (2017, cited by this paper)
Train longer, generalize better: closing the generalization gap in large batch training of neural networks (2017, cited by this paper)
The Power of Normalization: Faster Evasion of Saddle Points (2016, cited by this paper)
Understanding deep learning requires rethinking generalization (2016, cited by this paper)
Wide Residual Networks (2016, influential reference)
In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning (2014, cited by this paper)
Boosting: Foundations and Algorithms (2013, cited by this paper)
Margins, Shrinkage, and Boosting (2013, influential reference)
Sublinear Optimization for Machine Learning (2010, cited by this paper)
Learning Multiple Layers of Features from Tiny Images (2009, cited by this paper)
Margin Maximizing Loss Functions (2003, influential reference)

CITED BY

The Effect of Mini-Batch Noise on the Implicit Bias of Adam (2026, cites this paper)
Breaking the Reversal Curse in Autoregressive Language Models via Identity Bridge (2026, cites this paper)
Over-Alignment vs Over-Fitting: The Role of Feature Learning Strength in Generalization (2026, cites this paper)
It's not a Lottery, it's a Race: Understanding How Gradient Descent Adapts the Network's Capacity to the Task (2026, cites this paper)
The Implicit Bias of Adam and Muon on Smooth Homogeneous Neural Networks (2026, influential citation)
Exponential Convergence of (Stochastic) Gradient Descent for Separable Logistic Regression (2026, cites this paper)
On the Rate of Convergence of GD in Non-linear Neural Networks: An Adversarial Robustness Perspective (2026, influential citation)
Multiclass Loss Geometry Matters for Generalization of Gradient Descent in Separable Classification (2025, cites this paper)
Variational Deep Learning via Implicit Regularization (2025, cites this paper)
Embedding principle of homogeneous neural network for classification problem (2025, influential citation)
How Transformers Learn Regular Language Recognition: A Theoretical Study on Training Dynamics and Implicit Bias (2025, cites this paper)
Minimax Optimal Convergence of Gradient Descent in Logistic Regression via Large and Adaptive Stepsizes (2025, cites this paper)
Scalable Model Merging with Progressive Layer-wise Distillation (2025, cites this paper)
The late-stage training dynamics of (stochastic) subgradient descent on homogeneous neural networks (2025, influential citation)
Implicit Bias of Spectral Descent and Muon on Multiclass Separable Data (2025, influential citation)
How Memory in Optimization Algorithms Implicitly Modifies the Loss (2025, cites this paper)
Implicit Bias and Invariance: How Hopfield Networks Efficiently Learn Graph Orbits (2025, cites this paper)
Diagonalizing the Softmax: Hadamard Initialization for Tractable Cross-Entropy Dynamics (2025, cites this paper)
PENEX: AdaBoost-Inspired Neural Network Regularization (2025, cites this paper)
Beyond Statistical Learning: Exact Learning Is Essential for General Intelligence (2025, cites this paper)
Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers (2025, cites this paper)
The Rich and the Simple: On the Implicit Bias of Adam and SGD (2025, cites this paper)
Grokking at the Edge of Linear Separability (2024, cites this paper)
Non-asymptotic Convergence of Training Transformers for Next-token Prediction (2024, cites this paper)
Implicit Geometry of Next-token Prediction: From Language Sparsity Patterns to Model Representations (2024, cites this paper)
Incremental Gauss-Newton Descent for Machine Learning (2024, influential citation)
Implicit Bias of Mirror Flow on Separable Data (2024, cites this paper)
Large Stepsize Gradient Descent for Non-Homogeneous Two-Layer Networks: Margin Improvement and Fast Optimization (2024, cites this paper)
Improving Generalization and Convergence by Enhancing Implicit Regularization (2024, cites this paper)
Achieving Group Distributional Robustness and Minimax Group Fairness with Interpolating Classifiers (2024, influential citation)
Nonconvex Stochastic Optimization under Heavy-Tailed Noises: Optimal Convergence without Gradient Clipping (2024, cites this paper)
Implicit Bias of AdamW: ℓ∞ Norm Constrained Optimization (2024, cites this paper)
Fast Test Error Rates for Gradient-Based Algorithms on Separable Data (2024, influential citation)
Posterior Uncertainty Quantification in Neural Networks using Data Augmentation (2024, cites this paper)
Implicit Regularization of Gradient Flow on One-Layer Softmax Attention (2024, influential citation)
Simplicity Bias of Transformers to Learn Low Sensitivity Functions (2024, cites this paper)
Transformers Learn Low Sensitivity Functions: Investigations and Implications (2024, cites this paper)
The Implicit Bias of Heterogeneity towards Invariance and Causality (2024, cites this paper)
Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models (2024, cites this paper)
Implicit Optimization Bias of Next-token Prediction in Linear Models (2024, cites this paper)
Implicit Bias and Fast Convergence Rates for Self-attention (2024, influential citation)
The Implicit Bias of Heterogeneity towards Invariance: A Study of Multi-Environment Matrix Sensing (2024, cites this paper)
The Implicit Bias of Gradient Descent on Separable Multiclass Data (2024, cites this paper)
Optimizing Attention with Mirror Descent: Generalized Max-Margin Token Selection (2024, cites this paper)
Faster Margin Maximization Rates for Generic Optimization Methods (2023, influential citation)
Gradient Descent Converges Linearly for Logistic Regression on Separable Data (2023, cites this paper)
Convergence beyond the over-parameterized regime using Rayleigh quotients (2023, influential citation)
Fast Convergence in Learning Two-Layer Neural Networks with Separable Data (2023, influential citation)
Implicit Bias of Gradient Descent for Logistic Regression at the Edge of Stability (2023, cites this paper)
Noise Is Not the Main Factor Behind the Gap Between SGD and Adam on Transformers, but Sign Descent Might Be (2023, cites this paper)
A Unified Approach to Controlling Implicit Regularization via Mirror Descent (2023, influential citation)
Faster margin maximization rates for generic and adversarially robust optimization methods (2023, influential citation)
Dichotomy of Early and Late Phase Implicit Biases Can Provably Induce Grokking (2023, cites this paper)
Achieving Margin Maximization Exponentially Fast via Progressive Norm Rescaling (2023, influential citation)
Hierarchical Simplicity Bias of Neural Networks (2023, cites this paper)
Relevance gradient descent for parameter optimization of image enhancement (2023, cites this paper)
The Law of Parsimony in Gradient Descent for Learning Deep Linear Networks (2023, cites this paper)
A Fast Optimization View: Reformulating Single Layer Attention in LLM Based on Tensor and SVM Trick, and Solving It in Matrix Multiplication Time (2023, cites this paper)
On the Implicit Bias of Adam (2023, cites this paper)
Transformers as Support Vector Machines (2023, cites this paper)
How to induce regularization in linear models: A guide to reparametrizing gradient flow (2023, cites this paper)
The Implicit Bias of Minima Stability in Multivariate Shallow ReLU Networks (2023, cites this paper)
Max-Margin Token Selection in Attention Mechanism (2023, cites this paper)
Tight Risk Bounds for Gradient Descent on Separable Data (2023, cites this paper)
On the Training Instability of Shuffling SGD with Batch Normalization (2023, cites this paper)
Generalization and Stability of Interpolating Neural Networks with Minimal Width (2023, cites this paper)
Implicit Regularization for Group Sparsity (2023, cites this paper)
Margin Maximization in Attention Mechanism (2023, cites this paper)
Benign Overfitting in Linear Classifiers and Leaky ReLU Networks from KKT Conditions for Margin Maximization (2023, cites this paper)
The Implicit Bias of Batch Normalization in Linear Models and Two-layer Linear Convolutional Neural Networks (2023, cites this paper)
General Loss Functions Lead to (Approximate) Interpolation in High Dimensions (2023, cites this paper)
Generalization for multiclass classification with overparameterized linear models (2022, cites this paper)
On Non-local Convergence Analysis of Deep Linear Networks (2022, cites this paper)
Implicit Bias of Gradient Descent on Reparametrized Models: On Equivalence to Mirror Descent (2022, cites this paper)
Imbalance Trouble: Revisiting Neural-Collapse Geometry (2022, cites this paper)
Kernel Memory Networks: A Unifying Framework for Memory Modeling (2022, cites this paper)
Importance Tempering: Group Robustness for Overparameterized Models (2022, cites this paper)
On the Implicit Bias in Deep-Learning Algorithms (2022, cites this paper)
Global Wasserstein Margin maximization for boosting generalization in adversarial training (2022, cites this paper)
Decentralized Learning with Separable Data: Generalization and Fast Algorithms (2022, cites this paper)
The Asymmetric Maximum Margin Bias of Quasi-Homogeneous Neural Networks (2022, influential citation)
On Accelerated Perceptrons and Beyond (2022, cites this paper)
Mechanistic Mode Connectivity (2022, cites this paper)
Iterative regularization in classification via hinge loss diagonal descent (2022, influential citation)
On Generalization of Decentralized Learning with Separable Data (2022, cites this paper)
Global Convergence Analysis of Deep Linear Networks with A One-neuron Layer (2022, cites this paper)
On Generalization Bounds for Deep Networks Based on Loss Surface Implicit Regularization (2022, cites this paper)
Implicit Regularization in Hierarchical Tensor Factorization and Deep Convolutional Neural Networks (2022, cites this paper)
Implicit Regularization Towards Rank Minimization in ReLU Networks (2022, cites this paper)
Stability vs Implicit Bias of Gradient Methods on Separable Data and Beyond (2022, cites this paper)
High-dimensional Asymptotics of Langevin Dynamics in Spiked Matrix Models (2022, cites this paper)
Mirror Descent Maximizes Generalized Margin and Can Be Implemented Efficiently (2022, cites this paper)
Spectral Bias Outside the Training Set for Deep Networks in the Kernel Regime (2022, cites this paper)
Does Momentum Change the Implicit Regularization on Separable Data? (2021, cites this paper)
Connecting Interpretability and Robustness in Decision Trees through Separation (2021, cites this paper)
Bridging the Gap Between Adversarial Robustness and Optimization Bias (2021, influential citation)
Dissecting Supervised Constrastive Learning (2021, influential citation)
Implicit Regularization in Tensor Factorization (2021, cites this paper)
Label-Imbalanced and Group-Sensitive Classification under Overparameterization (2021, cites this paper)
The Low-Rank Simplicity Bias in Deep Networks (2021, cites this paper)