When is a Convolutional Filter Easy To Learn?

Published 2017 in International Conference on Learning Representations

ABSTRACT

We analyze the convergence of the (stochastic) gradient descent algorithm for learning a convolutional filter with a Rectified Linear Unit (ReLU) activation function. Our analysis does not rely on any specific form of the input distribution, and our proofs use only the definition of ReLU, in contrast with previous works that are restricted to standard Gaussian input. We show that (stochastic) gradient descent with random initialization can learn the convolutional filter in polynomial time, and that the convergence rate depends on the smoothness of the input distribution and the closeness of patches. To the best of our knowledge, this is the first recovery guarantee of gradient-based algorithms for a convolutional filter on non-Gaussian input distributions. Our theory also justifies the two-stage learning rate strategy in deep neural networks. While our focus is theoretical, we also present experiments that illustrate our theoretical findings.
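
The setting in the abstract can be illustrated with a minimal teacher-student sketch. It is not the paper's experimental code, and every concrete choice in it is an assumption made for illustration: an average-pooled ReLU filter, independent patches drawn uniformly from [-1, 1]^p (a non-Gaussian input), a squared loss against labels produced by a hidden filter `w_star`, a small random initialization, and one fixed learning rate. The names `predict` and `learn_filter` are hypothetical.

```python
import numpy as np


def predict(Z, w):
    """Average-pooled ReLU response of filter w; Z has shape (n, k, p)."""
    return np.maximum(Z @ w, 0.0).mean(axis=1)


def learn_filter(w_star, k=4, batch=64, steps=3000, lr=1.0, seed=0):
    """Run plain SGD on the squared loss against labels from w_star.

    Assumptions (not taken from the paper): patches are independent and
    uniform on [-1, 1]^p, and the learning rate is constant.
    """
    rng = np.random.default_rng(seed)
    p = w_star.shape[0]
    w = 0.1 * rng.standard_normal(p) / np.sqrt(p)  # small random initialization
    w0 = w.copy()
    for _ in range(steps):
        Z = rng.uniform(-1.0, 1.0, size=(batch, k, p))    # fresh non-Gaussian batch
        residual = predict(Z, w) - predict(Z, w_star)     # shape (batch,)
        active = (Z @ w > 0).astype(float)                # ReLU derivative, (batch, k)
        grad_f = (active[:, :, None] * Z).mean(axis=1)    # d predict / d w, (batch, p)
        grad = (residual[:, None] * grad_f).mean(axis=0)  # gradient of 0.5 * residual^2
        w -= lr * grad
    return w0, w


w_star = np.ones(8) / np.sqrt(8)  # hidden "teacher" filter (illustrative choice)
w0, w_hat = learn_filter(w_star)
err_init = np.linalg.norm(w0 - w_star)
err_final = np.linalg.norm(w_hat - w_star)
print(f"initial error {err_init:.3f}, final error {err_final:.3e}")
```

Because the labels are noiseless, the stochastic gradient vanishes at `w_star`, so under these assumptions the recovered filter should approach the teacher rather than hover around it. The sketch does not model overlapping or correlated patches, which is where the abstract's "closeness of patches" would enter.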

PUBLICATION RECORD

Publication year: 2017
Venue: International Conference on Learning Representations
Publication date: 2017-09-18
Fields of study: Mathematics, Computer Science
Identifiers: arXiv 1709.06129

REFERENCES

How to Escape Saddle Points Efficiently
2017cited by this paper
Theoretical Insights Into the Optimization Landscape of Over-Parameterized Shallow Neural Networks
2017cited by this paper
Failures of Gradient-Based Deep Learning
2017cited by this paper
An Analytical Formula of Population Gradient for two-layered ReLU network and its Applications in Convergence and Critical Point Analysis
2017cited by this paper
Convergence Analysis of Two-layer Neural Networks with ReLU Activation
2017influential reference
Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods
2017cited by this paper
Learning Depth-Three Neural Networks in Polynomial Time
2017cited by this paper
Globally Optimal Gradient Descent for a ConvNet with Gaussian Inputs
2017influential reference
Weight Sharing is Crucial to Succesful Optimization
2017cited by this paper
Learning ReLUs via Gradient Descent
2017influential reference
The Loss Surface of Deep and Wide Neural Networks
2017cited by this paper
Gradient Descent Can Take Exponential Time to Escape Saddle Points
2017cited by this paper
Learning Neural Networks with Two Nonlinear Layers in Polynomial Time
2017cited by this paper
Recovery Guarantees for One-hidden-layer Neural Networks
2017influential reference
The Landscape of Deep Learning Algorithms
2017cited by this paper
Deep Learning without Poor Local Minima
2016cited by this paper
Distribution-Specific Hardness of Learning Neural Networks
2016cited by this paper
Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning
2016cited by this paper
Mastering the game of Go with deep neural networks and tree search
2016cited by this paper
On the Expressive Power of Deep Neural Networks
2016cited by this paper
Gradient Descent Only Converges to Minimizers
2016cited by this paper
Reliably Learning the ReLU in Polynomial Time
2016cited by this paper
Globally Optimal Training of Generalized Polynomial Neural Networks with Nonlinear Spectral Methods
2016cited by this paper
The landscape of empirical risk for nonconvex losses
2016cited by this paper
Language Modeling with Gated Convolutional Networks
2016cited by this paper
Topology and Geometry of Half-Rectified Network Optimization
2016cited by this paper
Identity Matters in Deep Learning
2016cited by this paper
V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation
2016cited by this paper
Learning Halfspaces and Neural Networks with Random Initialization
2015cited by this paper
ℓ1-regularized Neural Networks are Improperly Learnable in Polynomial Time
2015cited by this paper
On the Quality of the Initial Basin in Overspecified Neural Networks
2015cited by this paper
Deep Residual Learning for Image Recognition
2015cited by this paper
Escaping From Saddle Points - Online Stochastic Gradient for Tensor Decomposition
2015cited by this paper
Global Optimality in Tensor Factorization, Deep Learning, and Beyond
2015cited by this paper
An adaptive accelerated proximal gradient method and its homotopy continuation for sparse optimization
2014cited by this paper
The Loss Surfaces of Multilayer Networks
2014cited by this paper
Provable Methods for Training Neural Networks with Sparse Connectivity
2014cited by this paper
On the Computational Efficiency of Training Neural Networks
2014cited by this paper
Network In Network
2013cited by this paper
ImageNet classification with deep convolutional neural networks
2012cited by this paper
Training a Single Sigmoidal Neuron Is Hard
2002cited by this paper
Training a 3-node neural network is NP-complete
1992influential reference

CITED BY

Global Convergence and Rich Feature Learning in L-Layer Infinite-Width Neural Networks under μP Parametrization
2025cites this paper
Efficient identification of wide shallow neural networks with biases
2025cites this paper
How Does Promoting the Minority Fraction Affect Generalization? A Theoretical Study of One-Hidden-Layer Neural Network on Group Imbalance
2024cites this paper
Scan and Snap: Understanding Training Dynamics and Token Composition in 1-layer Transformer
2023cites this paper
Learning High-Dimensional Single-Neuron ReLU Networks with Finite Samples
2023cites this paper
Over-Parameterization Exponentially Slows Down Gradient Descent for Learning a Single Neuron
2023cites this paper
JoMA: Demystifying Multilayer Transformers via JOint Dynamics of MLP and Attention
2023cites this paper
ML-GUIDED OPTIMIZATION
2023cites this paper
Finite-Sample Analysis of Learning High-Dimensional Single ReLU Neuron
2023influential citation
Magnitude and Angle Dynamics in Training Single ReLU Neurons
2022influential citation
Linear RNNs Provably Learn Linear Dynamic Systems
2022cites this paper
Benign Overfitting in Two-layer Convolutional Neural Networks
2022cites this paper
Finite Sample Identification of Wide Shallow Neural Networks with Biases
2022cites this paper
How Does a Deep Learning Model Architecture Impact Its Privacy?
2022cites this paper
Learning and generalization of one-hidden-layer neural networks, going beyond standard Gaussian data
2022cites this paper
An optimization and generalization analysis for max-pooling networks
2021cites this paper
Parameter identifiability of a deep feedforward ReLU neural network
2021cites this paper
A Convergence Analysis of Gradient Descent on Graph Neural Networks
2021cites this paper
On the Provable Generalization of Recurrent Neural Networks
2021cites this paper
ReLU Regression with Massart Noise
2021cites this paper
Structured Directional Pruning via Perturbation Orthogonal Projection
2021cites this paper
Learning a Single Neuron with Bias Using Gradient Descent
2021cites this paper
Learning a deep convolutional neural network via tensor decomposition
2021cites this paper
From Local Pseudorandom Generators to Hardness of Learning
2021cites this paper
Stable Recovery of Entangled Weights: Towards Robust Identification of Deep Neural Networks from Minimal Samples
2021cites this paper
On the Inductive Bias of a CNN for Orthogonal Patterns Distributions
2020cites this paper
Guaranteed Convergence of Training Convolutional Neural Networks via Accelerated Gradient Descent
2020cites this paper
Mean-Field Analysis of Two-Layer Neural Networks: Non-Asymptotic Rates and Generalization Bounds
2020cites this paper
On the Global Convergence of Training Deep Linear ResNets
2020influential citation
Is the Skip Connection Provable to Reform the Neural Network Loss Landscape?
2020cites this paper
Approximation Algorithms for Training One-Node ReLU Neural Networks
2020cites this paper
Learning Graph Neural Networks with Approximate Gradient Descent
2020influential citation
Learning Deep ReLU Networks Is Fixed-Parameter Tractable
2020cites this paper
Improved Linear Convergence of Training CNNs With Generalizability Guarantees: A One-Hidden-Layer Case
2020cites this paper
Low-rank regularization and solution uniqueness in over-parameterized matrix sensing
2020cites this paper
Fast Learning of Graph Neural Networks with Guaranteed Generalizability: One-hidden-layer Case
2020cites this paper
Validating the Theoretical Foundations of Residual Networks through Experimental Testing
2020influential citation
Provable training of a ReLU gate with an iterative non-gradient algorithm
2020cites this paper
Learning a Single Neuron with Gradient Methods
2020cites this paper
Directional Pruning of Deep Neural Networks
2020cites this paper
Optimization Theory for ReLU Neural Networks Trained with Normalization Layers
2020influential citation
Hardness of Learning Neural Networks with Natural Weights
2020cites this paper
Agnostic Learning of a Single Neuron with Gradient Descent
2020influential citation
Learning One-hidden-layer Neural Networks on Gaussian Mixture Models with Guaranteed Generalizability
2020cites this paper
Convergence of End-to-End Training in Deep Unsupervised Contrastive Learning
2020cites this paper
Gradient Descent for Non-convex Problems in Modern Machine Learning
2019cites this paper
Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks
2019cites this paper
Width Provably Matters in Optimization for Deep Linear Neural Networks
2019influential citation
A Generalization Theory of Gradient Descent for Learning Over-parameterized Deep ReLU Networks
2019influential citation
Learning Two layer Networks with Multinomial Activation and High Thresholds
2019cites this paper
Generalization Error Bounds of Gradient Descent for Learning Over-Parameterized Deep ReLU Networks
2019influential citation
On the Learnability of Deep Random Networks
2019cites this paper
Theory III: Dynamics and Generalization in Deep Networks
2019cites this paper
Memo No. 90 (April 11, 2019). Theory III: Dynamics and Generalization in Deep Networks - a simple solution
2019cites this paper
Convergence of a Relaxed Variable Splitting Coarse Gradient Descent Method for Learning Sparse Weight Binarized Activation Neural Network
2019cites this paper
First-order methods almost always avoid strict saddle points
2019cites this paper
Convergence Analyses of Online ADAM Algorithm in Convex Setting and Two-Layer ReLU Neural Network
2019cites this paper
Fast Convergence of Natural Gradient Descent for Overparameterized Neural Networks
2019cites this paper
Recursive Sketches for Modular Deep Learning
2019cites this paper
An Improved Analysis of Training Over-parameterized Deep Neural Networks
2019cites this paper
Gradient Descent Finds Global Minima of Deep Neural Networks
2019cites this paper
Memo No. 90 (July 18, 2019). Theory III: Dynamics and Generalization in Deep Networks
2019cites this paper
Memo No. 90 (August 6, 2019). Theory III: Dynamics and Generalization in Deep Networks
2019cites this paper
Theoretical Issues in Deep Networks: Approximation, Optimization and Generalization
2019cites this paper
Memo No. 90 (March 3, 2019). Theory III: Dynamics and Generalization in Deep Networks
2019cites this paper
Memo No. 90 (September 8, 2019). Theory III: Dynamics and Generalization in Deep Networks
2019cites this paper
Provable Non-linear Inductive Matrix Completion
2019cites this paper
Globally optimal score-based learning of directed acyclic graphs in high-dimensions
2019cites this paper
Tight Sample Complexity of Learning One-hidden-layer Convolutional Neural Networks
2019cites this paper
Local Geometry of Cross Entropy Loss in Learning One-Hidden-Layer Neural Networks
2019cites this paper
Generalization Bounds for Convolutional Neural Networks
2019cites this paper
Stationary Points of Shallow Neural Networks with Quadratic Activation Function
2019cites this paper
Theoretical issues in deep networks
2019cites this paper
Convex Optimization for Shallow Neural Networks
2019cites this paper
Convergence Analysis of Neural Networks
2019cites this paper
Gradient descent optimizes over-parameterized deep ReLU networks
2019cites this paper
A semi-rigorous theory of the optimization landscape of Deep Nets: Bezout theorem and Boltzman distribution
2019cites this paper
Local Geometry of One-Hidden-Layer Neural Networks for Logistic Regression
2018cites this paper
Convergence of a Relaxed Variable Splitting Method for Learning Sparse Neural Networks via ℓ1, ℓ0, and Transformed-ℓ1 Penalties
2018cites this paper
On the Convergence, Generalization and Recovery Guarantees of Deep Neural Networks
2018influential citation
Recovering the Lowest Layer of Deep Networks with High Threshold Activations
2018cites this paper
How Many Samples are Needed to Estimate a Convolutional or Recurrent Neural Network
2018cites this paper
Exponentially Vanishing Sub-optimal Local Minima in Multilayer Neural Networks
2018cites this paper
On the Benefit of Width for Neural Networks: Disappearance of Basins
2018cites this paper
A Provably Correct Algorithm for Deep Learning that Actually Works
2018cites this paper
Understanding the Loss Surface of Neural Networks for Binary Classification
2018cites this paper
On the Power of Over-parametrization in Neural Networks with Quadratic Activation
2018influential citation
Learning One Convolutional Layer with Overlapping Patches
2018cites this paper
Learning Compact Neural Networks with Regularization
2018cites this paper
Towards Understanding the Generalization Bias of Two Layer Convolutional Linear Classifiers with Gradient Descent
2018cites this paper
The Loss Surface and Expressivity of Deep Convolutional Neural Networks
2018cites this paper
An Approximation Algorithm for training One-Node ReLU Neural Network
2018influential citation
Why do Larger Models Generalize Better? A Theoretical Perspective via the XOR Problem
2018cites this paper
Guaranteed Recovery of One-Hidden-Layer Neural Networks via Cross Entropy
2018cites this paper
Over-Parameterized Deep Neural Networks Have No Strict Local Minima For Any Continuous Activations
2018cites this paper
How Many Samples are Needed to Estimate a Convolutional Neural Network?
2018cites this paper
Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks
2018cites this paper
Gradient Descent Finds Global Minima of Deep Neural Networks
2018cites this paper