On the Power of Over-parametrization in Neural Networks with Quadratic Activation

Published 2018 in International Conference on Machine Learning

ABSTRACT

We provide new theoretical insights on why over-parametrization is effective in learning neural networks. For a shallow network with $k$ hidden nodes, quadratic activation, and $n$ training data points, we show that as long as $k \ge \sqrt{2n}$, over-parametrization enables local search algorithms to find a \emph{globally} optimal solution for general smooth and convex loss functions. Further, although the number of parameters may exceed the sample size, we use the theory of Rademacher complexity to show that, with weight decay, the solution also generalizes well if the data is sampled from a regular distribution such as a Gaussian. To prove that the loss function has benign landscape properties when $k \ge \sqrt{2n}$, we adopt an idea from smoothed analysis, which may have other applications in studying loss surfaces of neural networks.
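The setting in the abstract can be made concrete with a minimal sketch. It is an illustration under stated assumptions, not the authors' code: the output-layer weights are fixed to 1, the loss is squared error, and the function names (`min_hidden_width`, `quadratic_net`, `regularized_loss`) and the weight-decay coefficient `lam` are hypothetical choices made here. The sketch computes the smallest width satisfying $k \ge \sqrt{2n}$ and evaluates a quadratic-activation network with a weight-decay penalty.

```python
import math
import numpy as np


def min_hidden_width(n: int) -> int:
    """Smallest integer k with k >= sqrt(2n), i.e. k*k >= 2n."""
    k = math.isqrt(2 * n)
    return k if k * k == 2 * n else k + 1


def quadratic_net(W: np.ndarray, X: np.ndarray) -> np.ndarray:
    """Shallow network with quadratic activation.

    W has shape (k, d), one row per hidden node; X has shape (n, d).
    Assumption: output weights are fixed to 1, so f(x) = sum_j (w_j . x)^2.
    """
    return np.sum((X @ W.T) ** 2, axis=1)


def regularized_loss(W: np.ndarray, X: np.ndarray, y: np.ndarray, lam: float) -> float:
    """Mean squared error plus weight decay (lam / 2) * ||W||_F^2.

    Squared error is one example of a smooth convex loss; it is an
    assumption of this sketch.
    """
    preds = quadratic_net(W, X)
    return float(0.5 * np.mean((preds - y) ** 2) + 0.5 * lam * np.sum(W ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 50, 8
    k = min_hidden_width(n)          # width suggested by k >= sqrt(2n)
    X = rng.standard_normal((n, d))  # Gaussian inputs
    W = rng.standard_normal((k, d))
    y = quadratic_net(rng.standard_normal((k, d)), X)
    print("hidden width k:", k)
    print("regularized loss:", regularized_loss(W, X, y, lam=1e-3))
```

Under these assumptions the network output equals the quadratic form $x^\top (W^\top W) x$, so the model depends on $W$ only through the positive semidefinite matrix $W^\top W$. This is the link to the low-rank semidefinite programming literature that appears in the reference list.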

PUBLICATION RECORD

Publication year
2018
Venue
International Conference on Machine Learning
Publication date
2018-03-03
Fields of study
Mathematics, Computer Science
Identifiers
arXiv 1803.01206
Source metadata
Semantic Scholar

REFERENCES

Towards Provable Learning of Polynomial Neural Networks Using Low-Rank Matrix Estimation
2018cited by this paper
Smoothed analysis for low-rank solutions to semidefinite programs in quadratic penalty form
2018influential reference
Generalization Error Bounds for Noisy, Iterative Algorithms
2018cited by this paper
Fisher-Rao Metric, Geometry, and Complexity of Neural Networks
2017cited by this paper
Algorithmic Regularization in Over-parameterized Matrix Recovery
2017cited by this paper
Exponentially vanishing sub-optimal local minima in multilayer neural networks
2017cited by this paper
The loss surface and expressivity of deep convolutional neural networks
2017cited by this paper
Optimization Landscape and Expressivity of Deep CNNs
2017cited by this paper
Spurious Local Minima are Common in Two-Layer ReLU Neural Networks
2017cited by this paper
The Loss Surface of Deep and Wide Neural Networks
2017cited by this paper
Gradient Descent Learns One-hidden-layer CNN: Don't be Afraid of Spurious Local Minima
2017influential reference
Gradient Descent Can Take Exponential Time to Escape Saddle Points
2017influential reference
Learning One-hidden-layer Neural Networks with Landscape Design
2017cited by this paper
A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks
2017cited by this paper
When is a Convolutional Filter Easy To Learn?
2017influential reference
Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data
2017cited by this paper
Generalization Bounds of SGLD for Non-convex Learning: Two Theoretical Viewpoints
2017cited by this paper
Energy Propagation in Deep Convolutional Neural Networks
2017cited by this paper
Exploring Generalization in Deep Learning
2017cited by this paper
Recovery Guarantees for One-hidden-layer Neural Networks
2017cited by this paper
Nearly-tight VC-dimension and Pseudodimension Bounds for Piecewise Linear Neural Networks
2017cited by this paper
Globally Optimal Gradient Descent for a ConvNet with Gaussian Inputs
2017cited by this paper
Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods
2017cited by this paper
An Analytical Formula of Population Gradient for two-layered ReLU network and its Applications in Convergence and Critical Point Analysis
2017cited by this paper
How to Escape Saddle Points Efficiently
2017influential reference
Nearly-tight VC-dimension bounds for piecewise linear neural networks
2017cited by this paper
Size-Independent Sample Complexity of Neural Networks
2017cited by this paper
Towards Understanding Generalization of Deep Learning: Perspective of Loss Landscapes
2017cited by this paper
Optimal Approximation with Sparsely Connected Deep Neural Networks
2017cited by this paper
Spectrally-normalized margin bounds for neural networks
2017cited by this paper
Learning Non-overlapping Convolutional Neural Networks with Multiple Kernels
2017cited by this paper
The Landscape of Deep Learning Algorithms
2017cited by this paper
SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data
2017cited by this paper
Theoretical Insights Into the Optimization Landscape of Over-Parameterized Shallow Neural Networks
2017influential reference
Convergence Analysis of Two-layer Neural Networks with ReLU Activation
2017cited by this paper
Deep Learning without Poor Local Minima
2016cited by this paper
Language Modeling with Gated Convolutional Networks
2016cited by this paper
The non-convex Burer-Monteiro approach works on smooth semidefinite programs
2016influential reference
The Power of Normalization: Faster Evasion of Saddle Points
2016cited by this paper
Gradient Descent Only Converges to Minimizers
2016influential reference
Reliably Learning the ReLU in Polynomial Time
2016cited by this paper
Topology and Geometry of Half-Rectified Network Optimization
2016cited by this paper
Understanding deep learning requires rethinking generalization
2016cited by this paper
Mastering the game of Go with deep neural networks and tree search
2016cited by this paper
Benefits of Depth in Neural Networks
2016cited by this paper
Train faster, generalize better: Stability of stochastic gradient descent
2015cited by this paper
An Introduction to Matrix Concentration Inequalities
2015cited by this paper
Global Optimality in Tensor Factorization, Deep Learning, and Beyond
2015influential reference
Escaping From Saddle Points - Online Stochastic Gradient for Tensor Decomposition
2015influential reference
Norm-Based Capacity Control in Neural Networks
2015cited by this paper
The Loss Surfaces of Multilayer Networks
2014cited by this paper
Structured Low-Rank Matrix Factorization: Optimality, Algorithm, and Applications to Image Processing
2014influential reference
Provable Methods for Training Neural Networks with Sparse Connectivity
2014cited by this paper
On the Computational Efficiency of Training Neural Networks
2014cited by this paper
ImageNet classification with deep convolutional neural networks
2012cited by this paper
Rank, Trace-Norm and Max-Norm
2005cited by this paper
Maximum-Margin Matrix Factorization
2004cited by this paper
Approximation and estimation bounds for artificial neural networks
2004cited by this paper
A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization
2003cited by this paper
Empirical margin distributions and bounding the generalization error of combined classifiers
2002cited by this paper
The Geometry of Semidefinite Programming
2000cited by this paper
On the Rank of Extreme Matrices in Semidefinite Programs and the Multiplicity of Optimal Eigenvalues
1998cited by this paper
On the Rank of Extreme Matrices in Semidefinite Programs and the Multiplicity of Optimal Eigenvalues
1997cited by this paper
Original Contribution: Training a 3-node neural network is NP-complete
1992cited by this paper
Local minima and back propagation
1991cited by this paper
Approximation and Estimation Bounds for Artificial Neural Networks
1991cited by this paper

CITED BY

The Implicit Bias of Adam and Muon on Smooth Homogeneous Neural Networks
2026cites this paper
It's not a Lottery, it's a Race: Understanding How Gradient Descent Adapts the Network's Capacity to the Task
2026cites this paper
Escaping Local Minima Provably in Non-convex Matrix Sensing: A Deterministic Framework via Simulated Lifting
2026cites this paper
Near-Quadratic Convergence of the Gauss–Newton Method for Complex Phase Retrieval
2026influential citation
Gradient-Based Adaptive Prediction and Control for Nonlinear Dynamical Systems
2026cites this paper
Why ReLU? A Bit-Model Dichotomy for Deep Network Training
2026cites this paper
Hadamard Product in Deep Learning: Introduction, Advances and Challenges
2025cites this paper
Geometry and Optimization of Shallow Polynomial Networks
2025cites this paper
A Scalable Lift-and-Project Differentiable Approach For the Maximum Cut Problem
2025cites this paper
Approximation, Estimation and Optimization Errors for a Deep Neural Network
2025cites this paper
Optimal generalisation and learning transition in extensive-width shallow neural networks near interpolation
2025cites this paper
Statistical mechanics of extensive-width Bayesian neural networks near interpolation
2025cites this paper
Information-Theoretic Guarantees for Recovering Low-Rank Tensors from Symmetric Rank-One Measurements
2025cites this paper
Evolutionary Developmental Biology Can Serve as the Conceptual Foundation for a New Design Paradigm in Artificial Intelligence
2025cites this paper
Understanding How Nonlinear Layers Create Linearly Separable Features for Low-Dimensional Data
2025influential citation
Regularized Over-Parametrized Neural Networks Learned by Gradient Descent Can Generalize Well
2025cites this paper
Convexified Message-Passing Graph Neural Networks
2025cites this paper
Foundations of a Developmental Design Paradigm for Integrated Continual Learning, Deliberative Behavior, and Comprehensibility
2025cites this paper
Why Neural Network Can Discover Symbolic Structures with Gradient-based Training: An Algebraic and Geometric Foundation for Neurosymbolic Reasoning
2025cites this paper
On the Parallels Between Evolutionary Theory and the State of AI
2025cites this paper
The Nuclear Route: Sharp Asymptotics of ERM in Overparameterized Quadratic Networks
2025cites this paper
Understanding Inverse Reinforcement Learning under Overparameterization: Non-Asymptotic Analysis and Global Optimality
2025cites this paper
Evolution, Future of AI, and Singularity
2025cites this paper
Langevin Monte-Carlo Provably Learns Depth Two Neural Nets at Any Size and Data
2025cites this paper
Dynamic Rank Adjustment in Diffusion Policies for Efficient and Flexible Training
2025cites this paper
Curse of Dimensionality in Neural Network Optimization
2025cites this paper
On a spherically lifted spin model at finite temperature
2025cites this paper
Activations Through Extensions: A Framework To Boost Performance Of Neural Networks
2024cites this paper
Future Directions in the Theory of Graph Machine Learning
2024cites this paper
Rate of Convergence of an Over-Parametrized Convolutional Neural Network Image Classifier Learned by Gradient Descent
2024cites this paper
LoRA Training in the NTK Regime has No Spurious Local Minima
2024influential citation
Scaling Convex Neural Networks with Burer-Monteiro Factorization
2024influential citation
Analysis of the rate of convergence of an over-parametrized convolutional neural network image classifier learned by gradient descent
2024cites this paper
Depth Separation in Norm-Bounded Infinite-Width Neural Networks
2024cites this paper
Loss Landscape of Shallow ReLU-like Neural Networks: Stationary Points, Saddle Escape, and Network Embedding
2024cites this paper
The Challenges of the Nonlinear Regime for Physics-Informed Neural Networks
2024cites this paper
Nonlinear Behaviour of Critical Points for a Simple Neural Network
2024cites this paper
Composing Global Solutions to Reasoning Tasks via Algebraic Objects in Neural Nets
2024cites this paper
Bayes-optimal learning of an extensive-width neural network from quadratically many samples
2024cites this paper
Neural spectrahedra and semidefinite lifts: global convex optimization of degree-two polynomial activation neural networks in polynomial-time
2024cites this paper
Stochastic Bandits with ReLU Neural Networks
2024influential citation
Connectivity Shapes Implicit Regularization in Matrix Factorization Models for Matrix Completion
2024cites this paper
MESA: Cooperative Meta-Exploration in Multi-Agent Learning through Exploiting State-Action Space Structure
2024cites this paper
Hybrid Coordinate Descent for Efficient Neural Network Learning Using Line Search and Gradient Descent
2024cites this paper
Collaborative Knowledge Distillation
2024cites this paper
Variational Stochastic Gradient Descent for Deep Neural Networks
2024cites this paper
A practical, fast method for solving sum-of-squares problems for very large polynomials
2024cites this paper
Critical Influence of Overparameterization on Sharpness-aware Minimization
2023cites this paper
Boosting Defect Detection in Manufacturing using Tensor Convolutional Neural Networks
2023cites this paper
The Local Landscape of Phase Retrieval Under Limited Samples
2023cites this paper
Signal Processing Meets SGD: From Momentum to Filter
2023cites this paper
Exploring Gradient Oscillation in Deep Neural Network Training
2023cites this paper
On the Impact of Overparameterization on the Training of a Shallow Neural Network in High Dimensions
2023influential citation
Deep Learning of Joint Scalar PDFs in Turbulent Flames from Sparse Multiscalar Data
2023cites this paper
Spurious Valleys and Clustering Behavior of Neural Networks
2023cites this paper
Global Convergence of SGD For Logistic Loss on Two Layer Neural Nets
2023cites this paper
Gradient is All You Need? How Consensus-Based Optimization can be Interpreted as a Stochastic Relaxation of Gradient Descent
2023cites this paper
Solving Large-Scale Spatial Problems with Convolutional Neural Networks
2023cites this paper
Scalable quantum neural networks by few quantum resources
2023cites this paper
Learning Task-Preferred Inference Routes for Gradient De-Conflict in Multi-Output DNNs
2023cites this paper
Rational Neural Network Controllers
2023cites this paper
Provable Guarantees for Nonlinear Feature Learning in Three-Layer Neural Networks
2023cites this paper
Initialization-Dependent Sample Complexity of Linear Predictors and Neural Networks
2023cites this paper
Understanding Augmentation-based Self-Supervised Representation Learning via RKHS Approximation
2023cites this paper
Energy-Efficient Approximate Edge Inference Systems
2023cites this paper
Greedy Pruning with Group Lasso Provably Generalizes for Matrix Sensing and Neural Networks with Quadratic Activations
2023cites this paper
On the Performance of new Higher Order Transformation Functions for Highly Efficient Dense Layers
2023cites this paper
Demystifying Causal Features on Adversarial Examples and Causal Inoculation for Robust Network by Adversarial Instrumental Variable Regression
2023cites this paper
Convex Optimization of Deep Polynomial and ReLU Activation Neural Networks
2023cites this paper
Grafting constructive algorithm in feedforward neural network learning
2022cites this paper
Global Convergence of SGD On Two Layer Neural Nets
2022cites this paper
Interpretable Polynomial Neural Ordinary Differential Equations
2022cites this paper
On the Study of Sample Complexity for Polynomial Neural Networks
2022cites this paper
Gradient Descent Provably Escapes Saddle Points in the Training of Shallow ReLU Networks
2022cites this paper
Pushing the Efficiency Limit Using Structured Sparse Convolutions
2022cites this paper
On Expressivity and Training of Quadratic Networks
2022cites this paper
Embedding Principle in Depth for the Loss Landscape Analysis of Deep Neural Networks
2022cites this paper
STN: Scalable Tensorizing Networks via Structure-Aware Training and Adaptive Compression
2022cites this paper
Identifying good directions to escape the NTK regime and efficiently learn low-degree plus sparse polynomials
2022cites this paper
Knowledge Distillation Meets Open-Set Semi-supervised Learning
2022cites this paper
On the influence of over-parameterization in manifold based surrogates and deep neural operators
2022cites this paper
The Spectral Bias of Polynomial Neural Networks
2022cites this paper
Meta-GF: Training Dynamic-Depth Neural Networks Harmoniously
2022cites this paper
Neural Networks can Learn Representations with Gradient Descent
2022cites this paper
On the existence of infinitely many realization functions of non-global local minima in the training of artificial neural networks with ReLU activation
2022cites this paper
Low-Pass Filtering SGD for Recovering Flat Optima in the Deep Learning Optimization Landscape
2022cites this paper
Complexity from Adaptive-Symmetries Breaking: Global Minima in the Statistical Mechanics of Deep Neural Networks
2022cites this paper
Noise Regularizes Over-parameterized Rank One Matrix Recovery, Provably
2022cites this paper
Phase Transition from Clean Training to Adversarial Training
2022cites this paper
COLT: Cyclic Overlapping Lottery Tickets for Faster Pruning of Convolutional Neural Networks
2022cites this paper
Learning Hybrid Precoding Efficiently for mmWave Systems with Mathematical Properties
2022cites this paper
Autoencoders for sample size estimation for fully connected neural network classifiers
2022cites this paper
The Sample Complexity of One-Hidden-Layer Neural Networks
2022cites this paper
On the Parallelization Upper Bound for Asynchronous Stochastic Gradients Descent in Non-convex Optimization
2022cites this paper
Landscape analysis for shallow ReLU neural networks: complete classification of critical points for affine target functions
2021cites this paper
The Smoking Gun: Statistical Theory Improves Neural Network Estimates
2021cites this paper
Expressivity and Trainability of Quadratic Networks
2021cites this paper
The Effect of the Intrinsic Dimension on the Generalization of Quadratic Classifiers
2021cites this paper
Investigating the locality of neural network training dynamics
2021cites this paper
Try before You Buy: Privacy-preserving Data Evaluation on Cloud-based Machine Learning Data Marketplace
2021cites this paper