Theoretical Insights Into the Optimization Landscape of Over-Parameterized Shallow Neural Networks

Published 2017 in IEEE Transactions on Information Theory

ABSTRACT

In this paper, we study the problem of learning a shallow artificial neural network that best fits a training data set. We study this problem in the over-parameterized regime, where the number of observations is smaller than the number of parameters in the model. We show that with quadratic activations, the optimization landscape of training such shallow neural networks has certain favorable characteristics that allow globally optimal models to be found efficiently using a variety of local search heuristics. This result holds for arbitrary training data of input/output pairs. For differentiable activation functions, we also show that gradient descent, when suitably initialized, converges at a linear rate to a globally optimal model. This result focuses on a realizable model where the inputs are chosen i.i.d. from a Gaussian distribution and the labels are generated according to planted weight coefficients.
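
The setting described in the abstract can be illustrated with a small numerical sketch. This is not the authors' code: the dimensions, the random seed, the fixed ±1 output weights `v`, the half-mean-squared-error loss, and the backtracking step-size rule are all assumptions chosen for illustration. It builds a one-hidden-layer network with quadratic activations, draws Gaussian inputs, generates labels from planted weights `W_star`, and runs gradient descent on the hidden-layer weights in a regime with more parameters than observations.

```python
import numpy as np


def predict(W, v, X):
    """Shallow network with quadratic activation: f(x) = sum_l v_l * (w_l . x)^2."""
    return ((X @ W.T) ** 2) @ v


def loss(W, v, X, y):
    """Half mean squared error over the training set."""
    r = predict(W, v, X) - y
    return 0.5 * np.mean(r ** 2)


def grad(W, v, X, y):
    """Gradient of `loss` with respect to the hidden-layer weights W (shape k x d)."""
    Z = X @ W.T                 # (n, k) pre-activations w_l . x_i
    r = (Z ** 2) @ v - y        # (n,) residuals f(x_i) - y_i
    # d f(x_i) / d w_l = 2 * v_l * (w_l . x_i) * x_i
    return (2.0 / len(y)) * ((r[:, None] * Z * v[None, :]).T @ X)


def train(W, v, X, y, steps=200, step0=0.1, max_halvings=50):
    """Gradient descent on W with a simple backtracking step size.

    The backtracking rule (halve the step until the loss decreases) is an
    assumption made so the sketch does not rely on a hand-tuned learning rate.
    """
    history = [loss(W, v, X, y)]
    for _ in range(steps):
        g = grad(W, v, X, y)
        step = step0
        accepted = False
        for _ in range(max_halvings):
            W_new = W - step * g
            new_loss = loss(W_new, v, X, y)
            if new_loss < history[-1]:
                W, accepted = W_new, True
                history.append(new_loss)
                break
            step *= 0.5
        if not accepted:
            break  # no decreasing step found; stop early
    return W, history


# Hypothetical problem sizes: k * d = 50 parameters, n = 20 observations.
rng = np.random.default_rng(0)
d, k, n = 5, 10, 20
X = rng.standard_normal((n, d))                          # i.i.d. Gaussian inputs
v = np.concatenate([np.ones(k // 2), -np.ones(k // 2)])  # fixed output weights (assumed)
W_star = rng.standard_normal((k, d))                     # planted hidden-layer weights
y = predict(W_star, v, X)                                # realizable labels

W0 = rng.standard_normal((k, d))                         # random initialization
W_hat, history = train(W0, v, X, y)
print(f"initial loss {history[0]:.3e}, final loss {history[-1]:.3e}")
```

The recorded `history` is monotonically decreasing by construction of the backtracking rule; how close the final loss gets to zero depends on the assumed sizes and initialization, and the sketch makes no claim about matching the rates proved in the paper.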

PUBLICATION RECORD

Publication year: 2017
Venue: IEEE Transactions on Information Theory
Publication date: 2017-07-16
Fields of study: Mathematics, Computer Science
Identifiers: DOI 10.1109/TIT.2018.2854560; arXiv 1707.04926
Source metadata: Semantic Scholar

REFERENCES

Spectrally-Normalized Margin Bounds for Neural Networks
2018 · cited by this paper
How to Escape Saddle Points Efficiently
2017 · cited by this paper
An Analytical Formula of Population Gradient for two-layered ReLU network and its Applications in Convergence and Critical Point Analysis
2017 · cited by this paper
Algorithmic Regularization in Over-parameterized Matrix Recovery
2017 · cited by this paper
Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods
2017 · cited by this paper
Globally Optimal Gradient Descent for a ConvNet with Gaussian Inputs
2017 · cited by this paper
Convergence Analysis of Two-layer Neural Networks with ReLU Activation
2017 · influential reference
The CNN as a Guided Multilayer RECOS Transform [Lecture Notes]
2017 · cited by this paper
Structured Signal Recovery From Quadratic Measurements: Breaking Sample Complexity Barriers via Nonconvex Optimization
2017 · cited by this paper
Energy Propagation in Deep Convolutional Neural Networks
2017 · cited by this paper
Electron-Proton Dynamics in Deep Learning
2017 · cited by this paper
Exponentially vanishing sub-optimal local minima in multilayer neural networks
2017 · cited by this paper
Spurious Local Minima are Common in Two-Layer ReLU Neural Networks
2017 · cited by this paper
The Loss Surface of Deep and Wide Neural Networks
2017 · cited by this paper
Algorithmic Regularization in Over-parameterized Matrix Sensing and Neural Networks with Quadratic Activations
2017 · cited by this paper
Optimal Approximation with Sparsely Connected Deep Neural Networks
2017 · cited by this paper
Convergence Results for Neural Networks via Electrodynamics
2017 · cited by this paper
Learning ReLUs via Gradient Descent
2017 · cited by this paper
Recovery Guarantees for One-hidden-layer Neural Networks
2017 · cited by this paper
Spectrally-normalized margin bounds for neural networks
2017 · cited by this paper
Accelerated Methods for Non-Convex Optimization
2016 · cited by this paper
Deep Learning without Poor Local Minima
2016 · cited by this paper
No bad local minima: Data independent training error guarantees for multilayer neural networks
2016 · cited by this paper
Gradient Descent Only Converges to Minimizers
2016 · cited by this paper
Reliably Learning the ReLU in Polynomial Time
2016 · cited by this paper
Understanding deep learning requires rethinking generalization
2016 · cited by this paper
Gradient Descent Efficiently Finds the Cubic-Regularized Non-Convex Newton Step
2016 · cited by this paper
Diversity Leads to Generalization in Neural Networks
2016 · cited by this paper
Benefits of Depth in Neural Networks
2016 · cited by this paper
Finding approximate local minima faster than gradient descent
2016 · cited by this paper
The landscape of empirical risk for nonconvex losses
2016 · cited by this paper
A trust region algorithm with a worst-case iteration complexity of O(ϵ^{-3/2})
2016 · cited by this paper
Finding Approximate Local Minima for Nonconvex Optimization in Linear Time
2016 · cited by this paper
Diverse Neural Network Learns True Target Functions
2016 · cited by this paper
The Power of Normalization: Faster Evasion of Saddle Points
2016 · cited by this paper
Gradient Descent Converges to Minimizers
2016 · cited by this paper
Beyond Convexity: Stochastic Quasi-Convex Optimization
2015 · cited by this paper
Escaping From Saddle Points - Online Stochastic Gradient for Tensor Decomposition
2015 · cited by this paper
Global Optimality in Tensor Factorization, Deep Learning, and Beyond
2015 · cited by this paper
ℓ1-regularized Neural Networks are Improperly Learnable in Polynomial Time
2015 · cited by this paper
On the Computational Efficiency of Training Neural Networks
2014 · cited by this paper
A note on the Hanson-Wright inequality for random vectors with dependencies
2014 · cited by this paper
Phase Retrieval via Wirtinger Flow: Theory and Algorithms
2014 · cited by this paper
The Loss Surfaces of Multilayer Networks
2014 · cited by this paper
Acoustic Modeling Using Deep Belief Networks
2012 · cited by this paper
ImageNet classification with deep convolutional neural networks
2012 · cited by this paper
Learning Kernel-Based Halfspaces with the 0-1 Loss
2011 · cited by this paper
Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression
2011 · cited by this paper
Introduction to the non-asymptotic analysis of random matrices
2010 · cited by this paper
Sharp bounds on the rate of convergence of the empirical covariance matrix
2010 · cited by this paper
Restricted Isometry Property of Matrices with Independent Columns and Neighborly Polytopes by Random Sampling
2009 · cited by this paper
The Isotron Algorithm: High-Dimensional Isotonic Regression
2009 · cited by this paper
A unified architecture for natural language processing: deep neural networks with multitask learning
2008 · cited by this paper
Cubic regularization of Newton method and its global performance
2006 · cited by this paper
Approximation and estimation bounds for artificial neural networks
2004 · cited by this paper
The concentration of measure phenomenon
2001 · cited by this paper
Functional optimization of online algorithms in multilayer neural networks
1997 · cited by this paper
Transient dynamics of on-line learning in two-layered neural networks
1996 · cited by this paper
On-line learning in soft committee machines.
1995 · influential reference
Original Contribution: Training a 3-node neural network is NP-complete
1992 · cited by this paper
Introduction to Random Matrices
1992 · cited by this paper
Local minima and back propagation
1991 · cited by this paper

CITED BY

Why ReLU? A Bit-Model Dichotomy for Deep Network Training
2026 · cites this paper
Assembly and iteration: Transition to linearity of wide neural networks
2026 · cites this paper
Near-Quadratic Convergence of the Gauss–Newton Method for Complex Phase Retrieval
2026 · influential citation
The Implicit Bias of Adam and Muon on Smooth Homogeneous Neural Networks
2026 · cites this paper
Visualization and Analysis of the Loss Landscape in Graph Neural Networks
2025 · cites this paper
Second-order methods for provably escaping strict saddle points in composite nonconvex and nonsmooth optimization
2025 · cites this paper
Nonlinear Dynamics In Optimization Landscape of Shallow Neural Networks with Tunable Leaky ReLU
2025 · cites this paper
Convexified Message-Passing Graph Neural Networks
2025 · cites this paper
A Metric Topology of Deep Learning for Data Classification
2025 · cites this paper
Machine Learning Based Risk Analysis and predictive Modeling of Structure Fire related Casualties
2025 · cites this paper
Statistical mechanics of extensive-width Bayesian neural networks near interpolation
2025 · cites this paper
Learning with Restricted Boltzmann Machines: Asymptotics of AMP and GD in High Dimensions
2025 · cites this paper
On the Convergence of (Stochastic) Gradient Descent for Kolmogorov–Arnold Networks
2025 · cites this paper
Curse of Dimensionality in Neural Network Optimization
2025 · cites this paper
Geometry and Optimization of Shallow Polynomial Networks
2025 · cites this paper
Efficient identification of wide shallow neural networks with biases
2025 · cites this paper
Optimizers Qualitatively Alter Solutions And We Should Leverage This
2025 · cites this paper
Statistical Inference for Online Algorithms
2025 · cites this paper
Model-Free Design and Analysis of 2DOF PI Controller for Noisy LTI Systems
2025 · influential citation
The Nuclear Route: Sharp Asymptotics of ERM in Overparameterized Quadratic Networks
2025 · cites this paper
Approximation, Estimation and Optimization Errors for a Deep Neural Network
2025 · cites this paper
Optimal generalisation and learning transition in extensive-width shallow neural networks near interpolation
2025 · cites this paper
ARGO: Overcoming hardware dependence in distributed learning
2025 · cites this paper
Quantum Relational Knowledge Distillation
2025 · cites this paper
Deep Reinforcement Learning-Based Resource Allocation with Enhanced Perception and Low-Latency for Autonomous Driving in ISAC-aided VEC
2025 · cites this paper
A Law of Data Reconstruction for Random Features (and Beyond)
2025 · cites this paper
Models of Heavy-Tailed Mechanistic Universality
2025 · cites this paper
Rethinking cell-based neural architecture search: A theoretical perspective
2025 · cites this paper
Understanding Learning Invariance in Deep Linear Networks
2025 · cites this paper
Understanding How Nonlinear Layers Create Linearly Separable Features for Low-Dimensional Data
2025 · influential citation
Information-Theoretic Guarantees for Recovering Low-Rank Tensors from Symmetric Rank-One Measurements
2025 · cites this paper
Interplay between Bayesian neural networks and deep learning: A survey
2025 · cites this paper
Training Classical Neural Networks by Quantum Machine Learning
2024 · cites this paper
Investigating the Histogram Loss in Regression
2024 · cites this paper
Near-Optimal Solutions of Constrained Learning Problems
2024 · cites this paper
LoRA Training in the NTK Regime has No Spurious Local Minima
2024 · cites this paper
Loss Landscape of Shallow ReLU-like Neural Networks: Stationary Points, Saddle Escape, and Network Embedding
2024 · cites this paper
Harmonizing florbetapir and PiB PET measurements of cortical Aβ plaque burden using multiple regions‐of‐interest and machine learning techniques: An alternative to the Centiloid approach
2024 · cites this paper
Uncertainty Quantification of Graph Convolution Neural Network Models of Evolving Processes
2024 · cites this paper
Error Analysis of Three-Layer Neural Network Trained With PGD for Deep Ritz Method
2024 · cites this paper
SF-DQN: Provable Knowledge Transfer using Successor Feature for Deep Reinforcement Learning
2024 · cites this paper
Globally Q-linear Gauss-Newton Method for Overparameterized Non-convex Matrix Sensing
2024 · cites this paper
The Persistence of Neural Collapse Despite Low-Rank Bias: An Analytic Perspective Through Unconstrained Features
2024 · cites this paper
Neural spectrahedra and semidefinite lifts: global convex optimization of degree-two polynomial activation neural networks in polynomial-time
2024 · cites this paper
Nonlinear Behaviour of Critical Points for a Simple Neural Network
2024 · cites this paper
Quantum-Train Long Short-Term Memory: Application on Flood Prediction Problem
2024 · cites this paper
Bayes-optimal learning of an extensive-width neural network from quadratically many samples
2024 · cites this paper
The Landscape of Deterministic and Stochastic Optimal Control Problems: One-Shot Optimization Versus Dynamic Programming
2024 · influential citation
Quantum-Train: rethinking hybrid quantum-classical machine learning in the model compression perspective
2024 · cites this paper
Physics-Informed Neural Networks for Power Systems Warm-Start Optimization
2024 · cites this paper
Entity Insertion in Multilingual Linked Corpora: The Case of Wikipedia
2024 · cites this paper
Leveraging Sparse Input and Sparse Models: Efficient Distributed Learning in Resource-Constrained Environments
2024 · cites this paper
Early Directional Convergence in Deep Homogeneous Neural Networks for Small Initializations
2024 · cites this paper
On the spectral bias of two-layer linear networks
2023 · cites this paper
Data-Driven Reliability Models of Quantum Circuit: From Traditional ML to Graph Neural Network
2023 · cites this paper
Boosting Defect Detection in Manufacturing using Tensor Convolutional Neural Networks
2023 · cites this paper
Learning Hierarchical Polynomials with Three-Layer Neural Networks
2023 · cites this paper
Hidden Minima in Two-Layer ReLU Networks
2023 · cites this paper
Fundamental Limits of Deep Learning-Based Binary Classifiers Trained with Hinge Loss
2023 · cites this paper
Assessing deep learning: a work program for the humanities in the age of artificial intelligence
2023 · cites this paper
Gradient Descent Finds the Global Optima of Two-Layer Physics-Informed Neural Networks
2023 · cites this paper
A Proximal Gradient Method for Regularized Deep Neural Networks
2023 · cites this paper
Global Optimality in Bivariate Gradient-based DAG Learning
2023 · cites this paper
Solving Large-Scale Spatial Problems with Convolutional Neural Networks
2023 · cites this paper
Regret Guarantees for Online Deep Control
2023 · cites this paper
On the Convergence and Sample Complexity Analysis of Deep Q-Networks with ε-Greedy Exploration
2023 · cites this paper
How Spurious Features are Memorized: Precise Analysis for Random and NTK Features
2023 · cites this paper
Resilient Constrained Learning
2023 · cites this paper
Convex Optimization of Deep Polynomial and ReLU Activation Neural Networks
2023 · cites this paper
An Experimental Survey of Missing Data Imputation Algorithms
2023 · cites this paper
An Analysis of the Influence of Surface Roughness and Clearance on the Dynamic Behavior of Deep Groove Ball Bearings Using Artificial Neural Networks
2023 · cites this paper
Should Under-parameterized Student Networks Copy or Average Teacher Weights?
2023 · cites this paper
NTK-SAP: Improving neural network pruning by aligning training dynamics
2023 · cites this paper
SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models
2023 · cites this paper
Class Notes for ASU Course CSE 691; Spring 2023 Topics in Reinforcement Learning
2023 · cites this paper
Greedy Pruning with Group Lasso Provably Generalizes for Matrix Sensing and Neural Networks with Quadratic Activations
2023 · cites this paper
Beyond the Universal Law of Robustness: Sharper Laws for Random Features and Neural Tangent Kernels
2023 · cites this paper
Generalization and Stability of Interpolating Neural Networks with Minimal Width
2023 · cites this paper
Fast Convergence of Random Reshuffling under Over-Parameterization and the Polyak-Łojasiewicz Condition
2023 · cites this paper
Practical Quantum Search by Variational Quantum Eigensolver on Noisy Intermediate-Scale Quantum Hardware
2023 · cites this paper
Essential barrier height and a probabilistic approach in characterizing potential landscape
2023 · cites this paper
Multi-field relation mining for malicious HTTP traffic detection based on attention and cross network
2023 · cites this paper
The Power of Preconditioning in Overparameterized Low-Rank Matrix Sensing
2023 · cites this paper
Dynamics in Deep Classifiers Trained with the Square Loss: Normalization, Low Rank, Neural Collapse, and Generalization Bounds
2023 · cites this paper
Provable Guarantees for Nonlinear Feature Learning in Three-Layer Neural Networks
2023 · cites this paper
Better NTK Conditioning: A Free Lunch from (ReLU) Nonlinear Activation in Wide Neural Networks
2023 · cites this paper
Convergence of stochastic gradient descent under a local Łojasiewicz condition for deep neural networks
2023 · cites this paper
Optical and Electrical Memories for Analog Optical Computing
2023 · cites this paper
Gradient is All You Need? How Consensus-Based Optimization can be Interpreted as a Stochastic Relaxation of Gradient Descent
2023 · cites this paper
Efficient Online Processing with Deep Neural Networks
2023 · cites this paper
On the Impact of Overparameterization on the Training of a Shallow Neural Network in High Dimensions
2023 · cites this paper
Spurious Valleys and Clustering Behavior of Neural Networks
2023 · cites this paper
The landscape of the optimal control problem: One-shot Optimization versus Dynamic Programming
2023 · cites this paper
Intersection of Parallels as an Early Stopping Criterion
2022 · cites this paper
Large-scale photonic natural language processing
2022 · cites this paper
Demonstration of ML-Assisted Soft-Failure Localization Based on Network Digital Twins
2022 · cites this paper
DiBB: distributing black-box optimization
2022 · cites this paper
Neural Networks can Learn Representations with Gradient Descent
2022 · cites this paper
Chaos Prediction of Power Systems by Using Deep Learning
2022 · cites this paper
Theoretical Perspectives on Deep Learning Methods in Inverse Problems
2022 · cites this paper