Successfully and efficiently training deep multi-layer perceptrons with logistic activation function simply requires initializing the weights with an appropriate negative mean
Published 2022 in Neural Networks
ABSTRACT
The vanishing gradient problem (i.e., gradients becoming extremely small early in training, thereby effectively preventing a network from learning) is a long-standing obstacle to training deep neural networks with sigmoid activation functions using the standard back-propagation algorithm. In this paper, we found that an important contributor to the problem is weight initialization. We first developed a simple theoretical model showing how the expected value of the gradients is affected by the mean of the initial weights. We then developed a second theoretical model that allowed us to identify a sufficient condition for the vanishing gradient problem to occur. Using these theories, we found that the initial back-propagation gradients do not vanish if the mean of the initial weights is negative and inversely proportional to the number of neurons in a layer. Numerous experiments on networks with 10 and 15 hidden layers corroborated the theoretical predictions: when we initialized the weights as indicated by the theory, the standard back-propagation algorithm was both highly successful and efficient at training deep neural networks with sigmoid activation functions.
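The abstract's prescription (initial weights whose mean is negative and inversely proportional to the layer width) can be illustrated with a short sketch. The Python code below is a minimal illustration under stated assumptions, not the paper's implementation: the constant c, the Gaussian spread std, and the choice of the fan-in n_in as the relevant layer size are assumptions made for this example, since the abstract does not specify them.

import numpy as np

def init_negative_mean(n_in, n_out, c=1.0, std=0.1, rng=None):
    """Draw a weight matrix whose mean is -c / n_in, i.e. negative and
    inversely proportional to the number of neurons in the layer.

    Assumptions (not from the paper): c=1.0, Gaussian noise with
    standard deviation std, and fan-in n_in as the layer size.
    """
    rng = np.random.default_rng() if rng is None else rng
    mean = -c / n_in  # negative mean, shrinking with layer width
    return rng.normal(loc=mean, scale=std, size=(n_in, n_out))

# Example: initialize one 256 -> 256 hidden layer of a sigmoid MLP.
W = init_negative_mean(256, 256)
print(W.mean())  # close to -1/256 ~= -0.0039, a small negative mean

Stacking layers initialized this way (instead of with a zero-mean scheme) is what the paper's experiments report as sufficient for standard back-propagation to train 10- and 15-layer sigmoid networks.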
PUBLICATION RECORD
- Publication year: 2022
- Venue: Neural Networks
- Publication date: 2022-06-01
- Fields of study: Medicine, Computer Science
- Source metadata: Semantic Scholar, PubMed