Learning ReLUs via Gradient Descent

Published 2017 in Neural Information Processing Systems

ABSTRACT

In this paper we study the problem of learning Rectified Linear Units (ReLUs), which are functions of the form $x \mapsto \max(0, \langle w, x \rangle)$ with $w$ denoting the weight vector. We study this problem in the high-dimensional regime where the number of observations is smaller than the dimension of the weight vector. We assume that the weight vector belongs to some closed set (convex or nonconvex) which captures known side-information about its structure. We focus on the realizable model where the inputs are chosen i.i.d. from a Gaussian distribution and the labels are generated according to a planted weight vector. We show that projected gradient descent, when initialized at 0, converges at a linear rate to the planted model with a number of samples that is optimal up to numerical constants. Our results on the dynamics of convergence of these very shallow neural nets may provide some insights towards understanding the dynamics of deeper architectures.
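As a rough illustration of the setting described in the abstract, the sketch below runs projected gradient descent from the zero vector on a least-squares ReLU loss with Gaussian inputs and a planted weight vector. Everything specific here is an assumption made for illustration and is not taken from the paper: the constraint set (vectors with at most `s` nonzero entries, a closed nonconvex set), the projection by hard thresholding, the step size of 1, the generalized derivative value 1/2 at zero, the problem sizes, and all function and variable names.

```python
import random

def relu(t):
    return t if t > 0.0 else 0.0

def relu_slope(t):
    # Generalized derivative of max(0, t). The value 1/2 at t == 0 is an
    # assumption; it lets the first step move away from the initialization w = 0.
    if t > 0.0:
        return 1.0
    if t < 0.0:
        return 0.0
    return 0.5

def project_sparse(w, s):
    # Euclidean projection onto {w : at most s nonzeros}: keep the s largest |w_j|.
    order = sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)
    keep = set(order[:s])
    return [w[j] if j in keep else 0.0 for j in range(len(w))]

def projected_gd(X, y, s, step=1.0, iters=30):
    # Minimizes (1 / 2n) * sum_i (max(0, <w, x_i>) - y_i)^2 over s-sparse w.
    n, d = len(X), len(X[0])
    w = [0.0] * d  # initialization at 0
    for _ in range(iters):
        grad = [0.0] * d
        for x_i, y_i in zip(X, y):
            t = sum(a * b for a, b in zip(w, x_i))
            c = (relu(t) - y_i) * relu_slope(t) / n
            if c != 0.0:
                for j in range(d):
                    grad[j] += c * x_i[j]
        w = project_sparse([w_j - step * g_j for w_j, g_j in zip(w, grad)], s)
    return w

# Hypothetical problem sizes: fewer observations (n) than dimensions (d).
rng = random.Random(0)
d, n, s = 200, 150, 2

# Planted s-sparse weight vector with unit Euclidean norm.
w_star = [0.0] * d
w_star[3] = 2 ** -0.5
w_star[17] = -(2 ** -0.5)

# Gaussian inputs and noiseless labels from the planted model (realizable case).
X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
y = [relu(sum(a * b for a, b in zip(w_star, x_i))) for x_i in X]

w_hat = projected_gd(X, y, s)
err = sum((a - b) ** 2 for a, b in zip(w_hat, w_star)) ** 0.5
print("distance to planted vector:", err)
```

The sparsity constraint stands in for the "known side-information" mentioned in the abstract; other closed sets would only change `project_sparse`. The printed distance can be compared with 1.0, the distance of the zero initialization from the unit-norm planted vector.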

PUBLICATION RECORD

Publication year
2017
Venue
Neural Information Processing Systems
Publication date
2017-05-10
Fields of study
Mathematics, Computer Science
Identifiers
arXiv 1705.04591
Source metadata
Semantic Scholar
