Parallelizing Linear Recurrent Neural Nets Over Sequence Length

Published 2017 in International Conference on Learning Representations

ABSTRACT

Recurrent neural networks (RNNs) are widely used to model sequential data, but their non-linear dependencies between sequence elements prevent parallelizing training over sequence length. We show that the training of RNNs with only linear sequential dependencies can be parallelized over the sequence length using the parallel scan algorithm, leading to rapid training on long sequences even with small minibatch sizes. We develop a parallel linear recurrence CUDA kernel and show that it can be applied to immediately speed up training and inference of several state-of-the-art RNN architectures by up to 9x. We abstract recent work on linear RNNs into a new framework of linear surrogate RNNs and develop a linear surrogate model for the long short-term memory unit, the GILR-LSTM, that utilizes parallel linear recurrence. We extend sequence learning to new extremely long sequence regimes that were previously out of reach by successfully training a GILR-LSTM on a synthetic sequence classification task with a one million timestep dependency.
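
The idea in the abstract, that a linear recurrence can be evaluated with a parallel scan, can be illustrated with a small sketch. It assumes a scalar recurrence of the form h_t = decay_t * h_{t-1} + x_t, treats each step as an affine map, and composes those maps with an associative operator. The function names (`combine`, `sequential_recurrence`, `scan_recurrence`) are hypothetical and chosen for illustration. This is plain Python using the simple Hillis-Steele scan pattern, not the authors' CUDA kernel, and it only simulates the passes that a GPU would run concurrently.

```python
from typing import List, Tuple

# (a, b) stands for the affine map h -> a * h + b (illustrative representation).
Affine = Tuple[float, float]


def combine(first: Affine, second: Affine) -> Affine:
    """Compose two affine maps: apply `first`, then `second`.

    Composition is associative, which is what allows a scan to regroup it.
    """
    a1, b1 = first
    a2, b2 = second
    return (a1 * a2, a2 * b1 + b2)


def sequential_recurrence(decay: List[float], x: List[float], h0: float = 0.0) -> List[float]:
    """Reference loop: h_t = decay_t * h_{t-1} + x_t, one step at a time."""
    h, out = h0, []
    for lam, xt in zip(decay, x):
        h = lam * h + xt
        out.append(h)
    return out


def scan_recurrence(decay: List[float], x: List[float], h0: float = 0.0) -> List[float]:
    """Same recurrence via an inclusive scan (Hillis-Steele pattern).

    Each pass doubles the span covered by every element, so about log2(n)
    passes are needed. Within a pass every position is independent of the
    others, so on parallel hardware the whole pass could run at once.
    """
    elems: List[Affine] = list(zip(decay, x))
    n = len(elems)
    step = 1
    while step < n:
        # Build the new list from the old one so all reads see the previous pass.
        elems = [
            combine(elems[i - step], elems[i]) if i >= step else elems[i]
            for i in range(n)
        ]
        step *= 2
    # elems[t] now maps h0 directly to h_t.
    return [a * h0 + b for a, b in elems]


if __name__ == "__main__":
    decay = [0.5, 0.9, 0.1, 1.0, 0.3]
    x = [1.0, -2.0, 3.0, 0.5, 4.0]
    print(sequential_recurrence(decay, x, h0=2.0))
    print(scan_recurrence(decay, x, h0=2.0))
```

The Hillis-Steele pattern performs more total work than a sequential loop, roughly n log n compositions, in exchange for a logarithmic number of dependent passes. Work-efficient scan variants reduce that overhead, and a practical GPU implementation would also need to handle vector-valued states and batching.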

PUBLICATION RECORD

Publication year
2017
Venue
International Conference on Learning Representations
Publication date
2017-09-12
Fields of study
Computer Science
Identifiers
arXiv 1709.04057
Source metadata
Semantic Scholar

REFERENCES

Convolutional Sequence to Sequence Learning (2017)
Training RNNs as Fast as CNNs (2017)
Quasi-Recurrent Neural Networks (2016)
Strongly-Typed Recurrent Neural Networks (2016)
TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems (2016)
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima (2016)
WaveNet: A Generative Model for Raw Audio (2016)
Neural Machine Translation in Linear Time (2016)
Persistent RNNs: Stashing Recurrent Weights On-Chip (2016)
Converting Static Image Datasets to Spiking Neuromorphic Datasets Using Saccades (2015)
Deep Recurrent Q-Learning for Partially Observable MDPs (2015)
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin (2015)
Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation (2014)
Adam: A Method for Stochastic Optimization (2014)
Sequence to Sequence Learning with Neural Networks (2014)
Understanding the difficulty of training deep feedforward neural networks (2010)
The MNIST database of handwritten digits (2005)
PhysioNet: components of a new research resource for complex physiologic signals (2000)
Long Short-Term Memory (1997)
On Parallel Prefix Computation (1994)
Prefix sums and their applications (1990)

CITED BY

BiSGR-Att: A sequence recommendation model based on bidirectional simplified gated recurrent network with linear attention (2026)
Flow Equivariant World Models: Memory for Partially Observed Dynamic Environments (2026)
Parallelizable Neural Turing Machines (2026)
Learning State-Tracking from Code Using Linear RNNs (2026)
Parallel Training in Spiking Neural Networks (2026)
ParalESN: Enabling parallel information processing in Reservoir Computing (2026)
Parallelizable memory recurrent units (2026)
Predictability Enables Parallelization of Nonlinear State Space Models (2025)
DiffVox: A Differentiable Model for Capturing and Analysing Vocal Effects Distributions (2025)
Fast weight programming and linear transformers: from machine learning to neurobiology (2025)
Prototype-Driven Structure Synergy Network for Remote Sensing Images Segmentation (2025)
Enhancing Industrial Soft Sensing via Optimized Message Passing in Spatial-Temporal Graph Neural Network (2025)
Minimal Convolutional RNNs Accelerate Spatiotemporal Learning (2025)
WTxGRN: Wavelet Transform-Based Extended Gated Recurrent Network for Palm Vein Recognition (2025)
HydraMamba: Multi-Head State Space Model for Global Point Cloud Learning (2025)
Knowing When to Quit: Probabilistic Early Exits for Speech Separation (2025)
mGRADE: Minimal Recurrent Gating Meets Delay Convolutions for Lightweight Sequence Modeling (2025)
DeltaProduct: Increasing the Expressivity of DeltaNet Through Products of Householders (2025)
Dual-Branch Network for Spatial–Channel Stream Modeling Based on the State-Space Model for Remote Sensing Image Segmentation (2025)
Advancing Streaming ASR with Chunk-wise Attention and Trans-chunk Selective State Spaces (2025)
Neural Dynamics Model for Temperature Estimation of Permanent Magnet Synchronous Motor (2025)
Fixed-Point RNNs: From Diagonal to Dense in a Few Iterations (2025, influential citation)
DeltaProduct: Improving State-Tracking in Linear RNNs via Householder Products (2025)
Resona: Improving Context Copying in Linear Recurrence Models with Retrieval (2025)
Bidirectional Linear Recurrent Models for Sequence-Level Multisource Fusion (2025)
DiffVox: A Differentiable Model for Capturing and Analysing Professional Effects Distributions (2025)
Hardware-aligned Hierarchical Sparse Attention for Efficient Long-term Memory Access (2025)
Fine-Grained Fusion: The Missing Piece in Area-Efficient State Space Model Acceleration (2025)
Algorithm-Hardware Co-Design for Ultra-Low-Power Large Language Models (2025)
Back to recurrent processing at the crossroad of transformers and state-space models (2025)
Maximizing Asynchronicity in Event-based Neural Networks (2025)
Learning to Dissipate Energy in Oscillatory State-Space Models (2025, influential citation)
Structured Linear CDEs: Maximally Expressive and Parallel-in-Time Sequence Models (2025)
How Does Sequence Modeling Architecture Influence Base Capabilities of Pre-trained Language Models? Exploring Key Architecture Design Principles to Avoid Base Capabilities Degradation (2025)
Parallelization of Non-linear State-Space Models: Scaling Up Liquid-Resistance Liquid-Capacitance Networks for Efficient Sequence Modeling (2025)
Revisiting Bi-Linear State Transitions in Recurrent Neural Networks (2025)
TRIS-HAR: Transmissive Reconfigurable Intelligent Surfaces-Assisted Human Activity Recognition Using State Space Models (2025)
Uncovering the Computational Roles of Nonlinearity in Sequence Modeling Using Almost-Linear RNNs (2025)
Sequential-Parallel Duality in Prefix Scannable Models (2025)
Don't Pay Attention (2025)
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention (2025)
EED-CL: Extended EEG Deformer with Contrastive Learning for Robust Emotion Recognition (2025)
MS-SSM: A Multi-Scale State Space Model for Efficient Sequence Modeling (2025)
Breaking the Memory Wall: Exact Analytical Differentiation via Tiled Operator-Space Evolution (2025)
Transformer Reconstructed with Dynamic Value Attention (2025)
Analysis of the use of Recurrent Neural Network for Multiple Frequency-Shift Keying signals decoding during Entry, Descent, and Landing phase of interplanetary missions (2025)
KineST: A Kinematics-guided Spatiotemporal State Space Model for Human Motion Tracking from Sparse Signals (2025)
Sliding Window Recurrences for Sequence Models (2025)
GDKVM: Echocardiography Video Segmentation via Spatiotemporal Key-Value Memory with Gated Delta Rule (2025)
Deep Learning-Based Joint Uplink-Downlink CSI Acquisition for Next-Generation Upper Mid-Band Systems (2025)
Selective Rotary Position Embedding (2025)
Fixed-Point RNNs: Interpolating from Diagonal to Dense (2025, influential citation)
Accelerating Automatic Differentiation of Direct Form Digital Filters (2025)
CloudMamba: Grouped Selective State Spaces for Point Cloud Analysis (2025)
MossNet: Mixture of State-Space Experts is a Multi-Head Attention (2025)
TempoPFN: Synthetic Pre-training of Linear RNNs for Zero-shot Time Series Forecasting (2025)
SHAP Meets Tensor Networks: Provably Tractable Explanations with Parallelism (2025)
Microstructure sensitive recurrent neural network surrogate model of crystal plasticity (2025)
Similarity-Aware Selective State-Space Modeling for Semantic Correspondence (2025)
A Unifying Framework for Parallelizing Sequential Models with Linear Dynamical Systems (2025, influential citation)
Structured Sparse Transition Matrices to Enable State Tracking in State-Space Models (2025)
State-space modeling in long sequence processing: a survey on recurrence in the transformer era (2025)
Sound Matching an Analogue Levelling Amplifier Using the Newton-Raphson Method (2025)
Elucidating the Design Space of Decay in Linear Attention (2025)
RAT: Bridging RNN Efficiency and Attention Accuracy via Chunk-based Sequence Modeling (2025)
Revisiting associative recall in modern recurrent models (2025)
Parallelizing MCMC Across the Sequence Length (2025)
Monocular Visual Place Recognition in LiDAR Maps via Cross-Modal State Space Model and Multi-View Matching (2024)
VMamba: Visual State Space Model (2024)
Investigating Recurrent Transformers with Dynamic Halt (2024)
On the Resurgence of Recurrent Models for Long Sequences - Survey and Research Opportunities in the Transformer Era (2024)
Recurrent Reinforcement Learning with Memoroids (2024)
Theoretical Foundations of Deep Selective State-Space Models (2024)
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models (2024)
The Hidden Attention of Mamba Models (2024)
Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling (2024)
On the low-shot transferability of [V]-Mamba (2024)
RankMamba: Benchmarking Mamba's Document Ranking Performance in the Era of Transformers (2024)
MambaMixer: Efficient Selective State Space Models with Dual Token and Channel Selection (2024)
Linear Attention Sequence Parallelism (2024)
Softmax Attention with Constant Cost per Token (2024)
Does Transformer Interpretability Transfer to RNNs? (2024)
Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence (2024)
RankMamba, Benchmarking Mamba's Document Ranking Performance in the Era of Transformers (2024)
HGRN2: Gated Linear RNNs with State Expansion (2024)
State-Free Inference of State-Space Models: The Transfer Function Approach (2024)
Mamba-Reg: Vision Mamba Also Needs Registers (2024)
Explaining Modern Gated-Linear RNNs via a Unified Implicit Attention Formulation (2024)
Unlocking the Secrets of Linear Complexity Sequence Model from A Unified Perspective (2024)
You Only Scan Once: Efficient Multi-dimension Sequential Modeling with LightNet (2024)
LongSSM: On the Length Extension of State-space Models in Language Modelling (2024)
Chimera: Effectively Modeling Multivariate Time Series with 2-Dimensional State Space Models (2024)
Parallelizing Linear Transformers with the Delta Rule over Sequence Length (2024)
Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling (2024)
Behavior-Dependent Linear Recurrent Units for Efficient Sequential Recommendation (2024)
Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers (2024)
Temporally Multi-Scale Sparse Self-Attention for Physical Activity Data Imputation (2024)
Towards Scalable and Stable Parallelization of Nonlinear RNNs (2024)
Real-Time Recurrent Learning using Trace Units in Reinforcement Learning (2024)
Gated Slot Attention for Efficient Linear-Time Sequence Modeling (2024)