Mean-Field Analysis of Two-Layer Neural Networks: Non-Asymptotic Rates and Generalization Bounds

Zixiang Chen, Yuan Cao, Quanquan Gu, Tong Zhang

Published 2020 in arXiv.org

ABSTRACT

A recent line of work in deep learning theory has used mean-field analysis to demonstrate the global convergence of noisy (stochastic) gradient descent for training over-parameterized two-layer neural networks. However, existing results in the mean-field setting do not provide a convergence rate for neural network training, and generalization error bounds are largely missing. In this paper, we provide a mean-field analysis in a generalized neural tangent kernel regime and show that noisy gradient descent with weight decay can still exhibit "kernel-like" behavior. This implies that the training loss converges linearly up to a certain accuracy in this regime. We also establish a generalization error bound for two-layer neural networks trained by noisy gradient descent with weight decay. Our results shed light on the connection between mean-field analysis and neural tangent kernel-based analysis.
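
The training procedure named in the abstract, noisy gradient descent with weight decay on a two-layer network, can be illustrated with a minimal NumPy sketch. It is not taken from the paper: the tanh activation, the output scaling `alpha / m`, the per-neuron gradient rescaling by `m`, the noise scale `sqrt(2 * lam * eta)`, and all function names and default values are assumptions chosen for illustration.

```python
import numpy as np


def forward(W, u, X, alpha):
    """Two-layer network f(x) = (alpha / m) * sum_j u_j * tanh(w_j . x)."""
    m = W.shape[0]
    return (alpha / m) * (np.tanh(X @ W.T) @ u)


def squared_loss(W, u, X, y, alpha):
    """Empirical squared loss, (1 / 2n) * sum_i (f(x_i) - y_i)^2."""
    return 0.5 * np.mean((forward(W, u, X, alpha) - y) ** 2)


def noisy_gd_step(W, u, X, y, alpha, eta, lam, rng):
    """One step of noisy gradient descent with weight decay (assumed form)."""
    m, n = W.shape[0], X.shape[0]
    H = np.tanh(X @ W.T)                      # hidden activations, shape (n, m)
    resid = (alpha / m) * (H @ u) - y         # residuals f(x_i) - y_i, shape (n,)

    # Gradients of the empirical squared loss.
    grad_u = (alpha / m) * (H.T @ resid) / n
    grad_W = (alpha / m) * ((resid[:, None] * (1.0 - H**2) * u[None, :]).T @ X) / n

    # Assumption: gradients are rescaled by m (mean-field scaling), weight
    # decay has strength lam, and Gaussian noise has scale sqrt(2 * lam * eta).
    noise_scale = np.sqrt(2.0 * lam * eta)
    W_new = W - eta * (m * grad_W + lam * W) + noise_scale * rng.standard_normal(W.shape)
    u_new = u - eta * (m * grad_u + lam * u) + noise_scale * rng.standard_normal(u.shape)
    return W_new, u_new


def train(X, y, m=200, alpha=1.0, eta=0.1, lam=1e-3, steps=200, seed=0):
    """Run noisy gradient descent from a Gaussian initialization."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((m, X.shape[1]))
    u = rng.standard_normal(m)
    losses = [squared_loss(W, u, X, y, alpha)]
    for _ in range(steps):
        W, u = noisy_gd_step(W, u, X, y, alpha, eta, lam, rng)
        losses.append(squared_loss(W, u, X, y, alpha))
    return W, u, losses
```

Calling `train(X, y)` on a small synthetic regression problem returns the final parameters and the training loss after each step. Plotting `losses` on a log scale is one way to look for the roughly linear decrease, up to a noise-dependent floor, that the abstract describes.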

PUBLICATION RECORD

Publication year
2020
Venue
arXiv.org
Publication date
2020-02-10
Fields of study
Mathematics, Computer Science
Identifiers
arXiv 2002.04026
Source metadata
Semantic Scholar

REFERENCES

Foundations of Machine Learning (2021, influential reference)
Function approximation by neural nets in the mean-field regime: Entropic regularization and controlled McKean-Vlasov dynamics (2020, cited by this paper)
Backward Feature Correction: How Deep Learning Performs Deep Learning (2020, cited by this paper)
Toward a theory of optimization for over-parameterized systems of non-linear equations: the lessons of deep learning (2020, cited by this paper)
Beyond Linearization: On Quadratic and Higher-Order Approximation of Wide Neural Networks (2019, cited by this paper)
Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks (2019, influential reference)
Tight Sample Complexity of Learning One-hidden-layer Convolutional Neural Networks (2019, influential reference)
Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks (2019, influential reference)
Toward Moderate Overparameterization: Global Convergence Guarantees for Training Shallow Neural Networks (2019, cited by this paper)
Wide neural networks of any depth evolve as linear models under gradient descent (2019, cited by this paper)
Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit (2019, influential reference)
Generalization Error Bounds of Gradient Descent for Learning Over-Parameterized Deep ReLU Networks (2019, influential reference)
Refined Generalization Analysis of Gradient Descent for Over-parameterized Two-layer Neural Networks with Smooth Activations on Classification Problems (2019, cited by this paper)
Lexicographic and Depth-Sensitive Margins in Homogeneous and Non-Homogeneous Deep Models (2019, cited by this paper)
On Exact Computation with an Infinitely Wide Neural Net (2019, influential reference)
Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks (2019, cited by this paper)
The implicit bias of gradient descent on nonseparable data (2019, cited by this paper)
Gradient Descent Maximizes the Margin of Homogeneous Neural Networks (2019, cited by this paper)
An Improved Analysis of Training Over-parameterized Deep Neural Networks (2019, influential reference)
Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks (2019, influential reference)
On Learning Over-parameterized Neural Networks: A Functional Approximation Prospective (2019, cited by this paper)
What Can ResNet Learn Efficiently, Going Beyond Kernels? (2019, influential reference)
Towards Understanding the Spectral Bias of Deep Learning (2019, cited by this paper)
Gradient descent optimizes over-parameterized deep ReLU networks (2019, influential reference)
How Much Over-parameterization Is Sufficient to Learn Deep ReLU Networks? (2019, influential reference)
Convex Formulation of Overparameterized Deep Neural Networks (2019, influential reference)
Over Parameterized Two-level Neural Networks Can Learn Near Optimal Feature Representations (2019, influential reference)
Gradient Descent can Learn Less Over-parameterized Two-layer Neural Networks on Classification Problems (2019, cited by this paper)
On Lazy Training in Differentiable Programming (2018, influential reference)
Characterizing Implicit Bias in Terms of Optimization Geometry (2018, cited by this paper)
Stronger generalization bounds for deep nets via a compression approach (2018, cited by this paper)
A mean field view of the landscape of two-layer neural networks (2018, influential reference)
On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport (2018, influential reference)
Implicit Bias of Gradient Descent on Linear Convolutional Networks (2018, cited by this paper)
Stochastic Gradient Descent on Separable Data: Exact Convergence with a Fixed Learning Rate (2018, cited by this paper)
On Tighter Generalization Bound for Deep Neural Networks: CNNs, ResNets, and Beyond (2018, influential reference)
Learning One-hidden-layer ReLU Networks via Gradient Descent (2018, cited by this paper)
Neural Tangent Kernel: Convergence and Generalization in Neural Networks (2018, influential reference)
Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data (2018, cited by this paper)
Gradient Descent Provably Optimizes Over-parameterized Neural Networks (2018, influential reference)
A Convergence Theory for Deep Learning via Over-Parameterization (2018, influential reference)
Gradient Descent Finds Global Minima of Deep Neural Networks (2018, influential reference)
Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers (2018, influential reference)
Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks (2018, cited by this paper)
Regularization Matters: Generalization and Optimization of Neural Nets v.s. their Induced Kernel (2018, influential reference)
Global Convergence of Langevin Dynamics Based Algorithms for Nonconvex Optimization (2017, influential reference)
Size-Independent Sample Complexity of Neural Networks (2017, cited by this paper)
When is a Convolutional Filter Easy To Learn? (2017, cited by this paper)
A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks (2017, cited by this paper)
Algorithmic Regularization in Over-parameterized Matrix Sensing and Neural Networks with Quadratic Activations (2017, influential reference)
Gradient Descent Learns One-hidden-layer CNN: Don't be Afraid of Spurious Local Minima (2017, influential reference)
Learning ReLUs via Gradient Descent (2017, cited by this paper)
An Analytical Formula of Population Gradient for two-layered ReLU network and its Applications in Convergence and Critical Point Analysis (2017, influential reference)
Spectrally-normalized margin bounds for neural networks (2017, influential reference)
Globally Optimal Gradient Descent for a ConvNet with Gaussian Inputs (2017, cited by this paper)
Convergence Analysis of Two-layer Neural Networks with ReLU Activation (2017, cited by this paper)
Recovery Guarantees for One-hidden-layer Neural Networks (2017, cited by this paper)
The Implicit Bias of Gradient Descent on Separable Data (2017, cited by this paper)
Implicit Regularization in Matrix Factorization (2017, cited by this paper)
Mastering the game of Go with deep neural networks and tree search (2016, cited by this paper)
Norm-Based Capacity Control in Neural Networks (2015, cited by this paper)
Understanding Machine Learning - From Theory to Algorithms (2014, influential reference)
Breaking the Curse of Dimensionality with Convex Neural Networks (2014, cited by this paper)
Analysis and Geometry of Markov Diffusion Operators (2013, influential reference)
ImageNet classification with deep convolutional neural networks (2012, cited by this paper)
Generalization Error Bounds for Bayesian Mixture Algorithms (2003, cited by this paper)
Rademacher and Gaussian Complexities: Risk Bounds and Structural Results (2003, influential reference)
Generalization of an Inequality by Talagrand and Links with the Logarithmic Sobolev Inequality (2000, influential reference)
Asymptotic evaluation of certain Markov process expectations for large time (1975, cited by this paper)

CITED BY

An Exact Kernel Equivalence for Finite Classification Models (2023, cites this paper)
On the generalization of learning algorithms that do not converge (2022, cites this paper)
One-pass Stochastic Gradient Descent in overparametrized two-layer neural networks (2021, cites this paper)
Wasserstein Flow Meets Replicator Dynamics: A Mean-Field Analysis of Representation Learning in Actor-Critic (2021, cites this paper)
Predicting the outputs of finite deep neural networks trained with noisy gradients (2020, cites this paper)
Can Temporal-Difference and Q-Learning Learn Representation? A Mean-Field Analysis (2020, influential citation)
Can Temporal-Difference and Q-Learning Learn Representation? A Mean-Field Theory (2020, cites this paper)
Modeling from Features: a Mean-field Framework for Over-parameterized Deep Neural Networks (2020, cites this paper)
Exploring entanglement and optimization within the Hamiltonian Variational Ansatz (2020, cites this paper)
Landscape Connectivity and Dropout Stability of SGD Solutions for Over-parameterized Neural Networks (2019, cites this paper)