On the difficulty of training recurrent neural networks

Razvan Pascanu, Tomas Mikolov, Yoshua Bengio

Published 2012 in International Conference on Machine Learning

ABSTRACT

There are two widely known issues with properly training recurrent neural networks: the vanishing and the exploding gradient problems detailed in Bengio et al. (1994). In this paper we attempt to improve the understanding of the underlying issues by exploring these problems from an analytical, a geometric and a dynamical systems perspective. Our analysis is used to justify a simple yet effective solution. We propose a gradient norm clipping strategy to deal with exploding gradients and a soft constraint for the vanishing gradients problem. We empirically validate our hypothesis and proposed solutions in the experimental section.
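
The gradient norm clipping idea named in the abstract can be illustrated with a minimal sketch. It assumes the gradient is a flat list of floats and that the norm in question is the L2 norm; the function name clip_gradient_norm and the threshold value are hypothetical and chosen only for illustration, not taken from the paper's code.

```python
import math


def clip_gradient_norm(grad, threshold):
    """Return a copy of grad rescaled so its L2 norm is at most threshold.

    The direction of the gradient is preserved; only its length changes.
    (Hypothetical helper for illustration.)
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > threshold:
        scale = threshold / norm
        return [g * scale for g in grad]
    # Gradients already within the threshold are left unchanged.
    return list(grad)


# A gradient of norm 5 is scaled down to norm 1 (threshold is an assumed value).
clipped = clip_gradient_norm([3.0, 4.0], threshold=1.0)

# A gradient of norm 0.5 passes through untouched.
small = clip_gradient_norm([0.3, 0.4], threshold=1.0)
```

Because the whole vector is rescaled by a single factor, the update still points in the direction of the original gradient, which is what distinguishes norm clipping from clipping each component independently.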

PUBLICATION RECORD

Publication year
2012
Venue
International Conference on Machine Learning
Publication date
2012-11-21
Fields of study
Mathematics, Computer Science
Identifiers
arXiv 1211.5063
Source metadata
Semantic Scholar

REFERENCES

Training Recurrent Neural Networks
2013 · cited by this paper
Advances in optimizing recurrent networks
2012 · cited by this paper
Theano: new features and speed improvements
2012 · cited by this paper
Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription
2012 · cited by this paper
Long Short-Term Memory in Echo State Networks: Details of a Simulation Study
2012 · cited by this paper
Statistical Language Models Based on Neural Networks
2012 · cited by this paper
Generating Text with Recurrent Neural Networks
2011 · cited by this paper
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
2011 · cited by this paper
Empirical Evaluation and Combination of Advanced Language Modeling Techniques
2011 · cited by this paper
Learning Recurrent Neural Networks with Hessian-Free Optimization
2011 · influential reference
A neurodynamical model for working memory
2011 · cited by this paper
On the training of recurrent neural networks
2011 · cited by this paper
Subword Language Modeling with Neural Networks
2011 · influential reference
A Novel Connectionist System for Unconstrained Handwriting Recognition
2009 · cited by this paper
Reservoir computing approaches to recurrent neural network training
2009 · cited by this paper
2007 Special Issue: Optimization and applications of echo state networks with leaky-integrator neurons
2007 · cited by this paper
Harnessing Nonlinearity: Predicting Chaotic Systems and Saving Energy in Wireless Communication
2004 · cited by this paper
New results on recurrent network training: unifying the algorithms and accelerating convergence
2000 · cited by this paper
Long Short-Term Memory
1997 · cited by this paper
Neural Networks with Adaptive Learning Rate and Momentum Terms
1995 · cited by this paper
Nonlinear Dynamics and Chaos: With Applications to Physics, Biology, Chemistry and Engineering
1995 · cited by this paper
Learning long-term dependencies with gradient descent is difficult
1994 · influential reference
The problem of learning long-term dependencies in recurrent networks
1993 · influential reference
Bifurcations of Recurrent Neural Networks in Gradient Descent Learning
1993 · influential reference
On the computational power of neural nets
1992 · cited by this paper
Adaptive Synchronization of Neural and Physical Oscillators
1991 · cited by this paper
Finding Structure in Time
1990 · cited by this paper
A Learning Algorithm for Continually Running Fully Recurrent Neural Networks
1989 · cited by this paper
Generalization of backpropagation with application to a recurrent gas market model
1988 · cited by this paper
Learning representations by back-propagating errors
1986 · cited by this paper

CITED BY

Early-warning the compact-to-dendritic transition via spatiotemporal learning of two-dimensional growth images
2026 · cites this paper
Transformer-Based Reinforcement Learning for Autonomous Orbital Collision Avoidance in Partially Observable Environments
2026 · cites this paper
Large language models for clinical artificial intelligence in healthcare a systematic review
2026 · cites this paper
Nonparametric prediction of ship maneuvering motions based on the heterogeneous integration model
2026 · cites this paper
CHLU: The Causal Hamiltonian Learning Unit as a Symplectic Primitive for Deep Learning
2026 · cites this paper
Adaptive Temporal Dynamics for Personalized Emotion Recognition: A Liquid Neural Network Approach
2026 · cites this paper
Forecasting Equity Correlations with Hybrid Transformer Graph Neural Network
2026 · cites this paper
Synthesizing Epileptic Seizures: Gaussian Processes for EEG Generation
2026 · cites this paper
How Controlling the Variance can Improve Training Stability of Sparsely Activated DNNs and CNNs
2026 · cites this paper
Polynomial chaos expansion for operator learning
2026 · cites this paper
Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes
2026 · cites this paper
How Large Language Models Get Stuck: Early structure with persistent errors
2026 · cites this paper
Adaptive learning rate optimization in deep recurrent architectures for precision PM2.5 forecasting under climate variability.
2026 · cites this paper
Tracking Finite-Time Lyapunov Exponents to Robustify Neural ODEs
2026 · cites this paper
ST-RTNet: An energy-efficient spike temporal residual transformer network for rock segmentation in deep space exploration
2026 · cites this paper
Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers
2026 · cites this paper
Surrogate-assisted dynamic response prediction and intelligent fault management for gas–cooled reactor–Brayton energy systems
2026 · cites this paper
Stability and Generalization of Nonconvex Optimization with Heavy-Tailed Noise
2026 · cites this paper
Invertible Memory Flow Networks
2026 · cites this paper
Tight Long-Term Tail Decay of (Clipped) SGD in Non-Convex Optimization
2026 · cites this paper
Efficient, Equivariant Predictions of Distributed Charge Models
2026 · cites this paper
Improving the Robustness of Large Language Models for Code Tasks via Fine-tuning with Perturbed Data
2026 · cites this paper
Why ReLU? A Bit-Model Dichotomy for Deep Network Training
2026 · cites this paper
Residual Koopman Spectral Profiling for Predicting and Preventing Transformer Training Instability
2026 · cites this paper
Adaptive Correlation-Weighted Intrinsic Rewards for Reinforcement Learning
2026 · cites this paper
When Does Margin Clamping Affect Training Variance? Dataset-Dependent Effects in Contrastive Forward-Forward Learning
2026 · cites this paper
The Volterra signature
2026 · cites this paper
Scalable multitask Gaussian processes for complex mechanical systems with functional covariates
2026 · cites this paper
Dynamic Compression Flows for Neuroscience Data
2026 · cites this paper
Tuning the burn-in phase in training recurrent neural networks improves their performance
2026 · cites this paper
Global renewable energy forecasting using hybrid ML/DL models: Economic and geospatial insights
2026 · cites this paper
Gated Inverted Recurrent Transformer for multivariate time series forecasting: A deep learning approach to predict KSTAR PF superconducting coil temperature
2026 · cites this paper
Inverse design of cement and cement paste composition for targeted tensile strength using sequential neural networks and hydration based micromechanical modelling
2026 · cites this paper
Ots-net: unlocking mechanistic interpretability in ECG arrhythmia classification
2026 · cites this paper
Hybrid residual reservoir computing using dynamic memristor for time series prediction
2026 · cites this paper
An automated slot stowage optimization method for container ship based on improved Actor-Critic algorithm
2026 · cites this paper
A comparative study on stochastic and AI-based approaches for predicting meteorological droughts
2026 · cites this paper
Breaking the Barriers of Molecular Dynamics With Deep‐Learning: Opportunities, Pitfalls, and How to Navigate Them
2026 · cites this paper
ParalESN: Enabling parallel information processing in Reservoir Computing
2026 · cites this paper
Grappa: Gradient-Only Communication for Scalable Graph Neural Network Training
2026 · cites this paper
DeXposure-FM: A Time-series, Graph Foundation Model for Credit Exposures and Stability on Decentralized Financial Networks
2026 · cites this paper
Why is Normalization Preferred? A Worst-Case Complexity Theory for Stochastically Preconditioned SGD under Heavy-Tailed Noise
2026 · cites this paper
RFDM: Residual Flow Diffusion Model for Efficient Causal Video Editing
2026 · cites this paper
Risk-Sensitive Exponential Actor Critic
2026 · cites this paper
How Effective Is Mamba-Augmented Transformer for Stock Market Price Forecasting?
2026 · cites this paper
PrefillShare: A Shared Prefill Module for KV Reuse in Multi-LLM Disaggregated Serving
2026 · cites this paper
Predictive E-prop: A biologically inspired approach to train predictive coding-based recurrent spiking neural networks
2026 · cites this paper
Recurrent neural networks implemented through spatiotemporal light propagation in optical fibers
2026 · cites this paper
Cooperative-Competitive Team Play of Real-World Craft Robots
2026 · cites this paper
Stream Neural Networks: Epoch-Free Learning with Persistent Temporal State
2026 · cites this paper
Deep learning-based prediction of the hysteretic behavior of buckling-restrained braces for seismic design using analysis-of-mean-based optimal hyperparameters
2026 · cites this paper
Deep Learning Framework for Damage Prediction in Low-Velocity Impact
2026 · cites this paper
Real-time recognition and localization of human respiratory activities based on acoustic signals: facing to infectious disease prevention and control
2026 · cites this paper
Design of real and complex recurrent neural networks for sound source localisation.
2026 · cites this paper
Look Forward to Walk Backward: Efficient Terrain Memory for Backward Locomotion with Forward Vision
2026 · cites this paper
Coupled evolution of meteorological and hydrological drought until 2100 based on changes in climate scenarios.
2026 · cites this paper
Attention and Representation Learning in Byte-Level Digital Forensics: A Survey of Methods, Challenges, and Applications
2026 · cites this paper
Enhancing Solar Power Forecasting Accuracy Using HMPCS and Machine Learning Techniques: An Applied Study
2026 · cites this paper
Position: Why a Dynamical Systems Perspective is Needed to Advance Time Series Modeling
2026 · cites this paper
TrasMuon: Trust-Region Adaptive Scaling for Orthogonalized Momentum Optimizers
2026 · cites this paper
Improved state mixing in higher-order and block diagonal linear recurrent networks
2026 · cites this paper
Optimizing energy management in reconfigurable distribution networks: Integrating Hybrid Transformer-CNN with wind turbines and electric vehicles
2026 · cites this paper
A channel selection MLP-Mixer network for EEG-based motor imagery BCI
2026 · cites this paper
An EEMD-based LSTM method for reconstructing the attenuated interference signals in a laser doppler vibrometry system
2026 · cites this paper
Short-term high-volatility power load forecasting in smart port energy systems using FeatureGating-BiLSTM enhanced by DualAttention mechanisms
2026 · cites this paper
Bridging the safety-specific language model gap: Domain-adaptive pretraining of transformer-based models across several industrial sectors for occupational safety applications
2026 · cites this paper
A Robotized Steerable Catheter System for Cardiovascular Intervention With Enhanced Safety
2026 · cites this paper
Fine-grained space object classification with Convolution-Boosted LSTM using light curves: A new method and a large scale dataset
2026 · cites this paper
NeuroSSM: Multiscale Differential State-Space Modeling for Context-Aware fMRI Analysis
2026 · cites this paper
Variational (Energy-Based) Spectral Learning: A Machine Learning Framework for Solving Partial Differential Equations
2026 · cites this paper
Primate-informed neural network for visual decision-making
2026 · cites this paper
A Review of Online Diffusion Policy RL Algorithms for Scalable Robotic Control
2026 · cites this paper
A novel backpropagation algorithm based on negated kurtosis loss for training shallow, convolutional, and deep neural networks.
2026 · cites this paper
Incorporating patient history into the insulin sensitivity prediction in intensive care by feedforward neural network models
2026 · cites this paper
Discovering the Potential of Automated Phraseological Interference Error Detection: A Transformer-Based Approach
2026 · cites this paper
AGGC: Adaptive Group Gradient Clipping for Stabilizing Large Language Model Training
2026 · cites this paper
Automatic Stability and Recovery for Neural Network Training
2026 · cites this paper
Lameness detection in dairy cows using pose estimation and bidirectional LSTMs
2026 · cites this paper
Implementation of Near-Real-Time Satellite Data Retrieval for CO₂ Concentration Using an Enhanced Transformer Network
2026 · influential citation
SQUAD: Scalable Quorum Adaptive Decisions via ensemble of early exit neural networks
2026 · cites this paper
Parameter conditioned interpretable U-Net surrogate model for data-driven predictions of convection-diffusion-reaction processes
2026 · cites this paper
Temporal modeling with reversible transformers
2026 · cites this paper
Understanding vision transformer robustness through the lens of out-of-distribution detection
2026 · cites this paper
MSign: An Optimizer Preventing Training Instability in Large Language Models via Stable Rank Restoration
2026 · cites this paper
SAFE: Stable Alignment Finetuning with Entropy-Aware Predictive Control for Reinforcement Learning from Human Feedback (RLHF)
2026 · cites this paper
G3DCT: An Interpretable Spatial Grid-based Framework with Temporal Convolution-Transformer for EEG Artifact Identification
2026 · cites this paper
Non-intrusive reduced order modeling of fluid flows via finite element inspired graph neural network
2026 · cites this paper
Boundary-aware and multi-angle modeling-based object tracking in polarimetric images
2026 · cites this paper
ZClip: Adaptive Spike Mitigation for LLM Pre-Training
2025 · cites this paper
Refining Long-Term Predictions: Two-Stage Spatial-Temporal Feature Learning for 3D Human Motion Prediction
2025 · cites this paper
SECONDGRAM: Self-conditioned diffusion with gradient manipulation for longitudinal MRI imputation
2025 · cites this paper
Harnessing uncertainty when learning through Equilibrium Propagation in neural networks
2025 · cites this paper
Time‐varying parameters identification for dual‐control aircraft based on efficiency learnable extended Kalman filter
2025 · cites this paper
Identifying Sparsely Active Circuits Through Local Loss Landscape Decomposition
2025 · cites this paper
Forecasting Information Operations with Hybrid Transformer Architecture
2025 · cites this paper
New analytic formulas for memory and prediction functions in reservoir computers with time delays.
2025 · cites this paper
End-to-end data-driven weather prediction
2025 · cites this paper
Rank-Based Modeling for Universal Packets Compression in Multi-Modal Communications
2025 · cites this paper
UniBERT: adversarial training for language-universal representations
2025 · cites this paper
Fixed-Point RNNs: From Diagonal to Dense in a Few Iterations
2025 · cites this paper