Hebbian Descent: A Unified View on Log-Likelihood Learning

Abstract: This study discusses the negative impact that the derivative of the output-layer activation function has on learning in artificial neural networks, in particular in continual learning. We propose Hebbian descent as a theoretical framework to overcome this limitation. It is implemented through an alternative loss function for gradient descent, which we refer to as the Hebbian descent loss. This loss is effectively the generalized log-likelihood loss and corresponds to an alternative weight update rule for the output layer in which the derivative of the activation function is disregarded. We show how this update avoids vanishing error signals during backpropagation in saturated regions of the activation function, which is particularly helpful when training shallow neural networks and deep neural networks in which saturating activation functions are used only in the output layer. In combination with centering, Hebbian descent leads to better continual learning capabilities. It provides a unifying perspective on Hebbian learning, gradient descent, and generalized linear models, and we discuss the advantages and disadvantages of each. Given activation functions with a strictly positive derivative (as is often the case in practice), Hebbian descent inherits the convergence properties of regular gradient descent. While established pairings of loss and output-layer activation function (e.g., mean squared error with a linear output, or cross-entropy with sigmoid/softmax) are subsumed by Hebbian descent, we provide general insights for designing arbitrary combinations of loss and activation function that benefit from Hebbian descent. For shallow networks, we show that Hebbian descent outperforms Hebbian learning, performs similarly to regular gradient descent, and performs much better than all other tested update rules in continual learning.
In combination with centering, Hebbian descent implements a forgetting mechanism that prevents catastrophic interference notably better than the other tested update rules. When training deep neural networks, our experimental results suggest that Hebbian descent performs as well as or better than gradient descent.
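To make the central idea concrete, the following is a minimal sketch (not the authors' implementation) of the two weight updates for a single sigmoid output layer. It assumes a mean-squared-error loss for the gradient descent step and a fixed input offset `mu` for centering. The function name `output_layer_updates`, the variable names, and the toy values are all hypothetical. The only difference between the two steps is that the Hebbian descent step drops the factor given by the derivative of the activation function.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def output_layer_updates(W, b, x, t, mu, lr=0.1):
    """Return (gradient-descent step, Hebbian-descent step) for the weights.

    Assumptions (illustrative only): one sigmoid output layer, MSE loss
    for the gradient-descent step, and centering by a fixed offset mu.
    """
    xc = x - mu                  # centered input
    a = xc @ W + b               # pre-activation
    h = sigmoid(a)               # output
    err = h - t                  # output error
    dphi = h * (1.0 - h)         # sigmoid derivative, near 0 when saturated
    gd_step = -lr * np.outer(xc, err * dphi)  # MSE gradient: error scaled by derivative
    hd_step = -lr * np.outer(xc, err)         # Hebbian descent: derivative disregarded
    return gd_step, hd_step

# A saturated unit: large weights push the output close to 1 while the target is 0.
W = np.full((2, 1), 10.0)
b = np.zeros(1)
x = np.array([1.0, 1.0])
t = np.array([0.0])
mu = np.zeros(2)

gd, hd = output_layer_updates(W, b, x, t, mu)
print("gradient descent step:", gd.ravel())
print("Hebbian descent step: ", hd.ravel())
```

In this toy case the unit is confidently wrong, yet the gradient descent step is many orders of magnitude smaller than the Hebbian descent step because the sigmoid derivative is close to zero. For a sigmoid output trained with cross-entropy, the ordinary gradient already has the form of `hd_step`, which is consistent with the abstract's remark that such established pairings are subsumed by Hebbian descent.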

