Hardware-Oriented Compression of Long Short-Term Memory for Efficient Inference
Zhisheng Wang, Jun Lin, Zhongfeng Wang
Published 2018 in IEEE Signal Processing Letters

ABSTRACT
Long short-term memory (LSTM) networks and their variants have been widely adopted for processing sequential data. However, their intrinsically large memory requirements and high computational complexity make them difficult to deploy in embedded systems, which motivates model compression and dedicated hardware accelerators for LSTM. In this letter, efficient clipped gating and top-$k$ pruning schemes are introduced to convert the dense matrix computations in LSTM into structured sparse-matrix-sparse-vector multiplications. Mixed quantization schemes are then developed to eliminate most of the multiplications in LSTM. The proposed compression scheme is well suited to efficient hardware implementation. Experimental results show that the model size and the number of matrix operations are reduced by $32\times$ and $18.5\times$, respectively, at a cost of less than $1\%$ accuracy loss on a word-level language modeling task.
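The abstract names the building blocks (clipped gating, top-$k$ pruning, mixed quantization) but gives no implementation details. The following is a minimal NumPy sketch of plausible forms of each idea: top-$k$ magnitude pruning of an activation vector, gate clipping to exact zero below a small threshold, and rounding weights to signed powers of two so each multiply reduces to a shift. All function names, the threshold value, and the demo shapes are illustrative assumptions, not the authors' code, and the structured sparsity pattern used in the letter is not reproduced here.

    import numpy as np

    def top_k_prune(x, k):
        # Keep the k largest-magnitude entries of a vector, zero the rest.
        # A generic sketch of top-k pruning; the letter's structured
        # variant is not specified in the abstract.
        out = np.zeros_like(x)
        if k > 0:
            idx = np.argpartition(np.abs(x), -k)[-k:]
            out[idx] = x[idx]
        return out

    def clipped_gate(x, threshold=0.1):
        # Hypothetical clipped gating: sigmoid outputs below a small
        # threshold are clipped to exactly zero, so the corresponding
        # state updates can be skipped entirely.
        g = 1.0 / (1.0 + np.exp(-x))
        g[g < threshold] = 0.0
        return g

    def quantize_pow2(w):
        # Round each nonzero weight to the nearest signed power of two,
        # a common mixed-quantization building block (the letter's exact
        # scheme may differ); in hardware the multiply becomes a shift.
        sign = np.sign(w)
        mag = np.abs(w)
        exp = np.round(np.log2(np.where(mag > 0, mag, 1.0)))
        return np.where(mag > 0, sign * 2.0 ** exp, 0.0)

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        W = quantize_pow2(rng.normal(size=(4, 8)))
        h = top_k_prune(rng.normal(size=8), k=2)  # sparse activation vector
        y = W @ h  # only 2 of 8 columns contribute; each product is a shift
        print(y)

The pieces compose naturally: with a top-$k$ sparse activation vector only $k$ columns of the weight matrix are touched, and with power-of-two weights each surviving contribution is a shift-and-add, which is what makes such a scheme attractive for dedicated hardware.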
PUBLICATION RECORD
- Publication year: 2018
- Venue: IEEE Signal Processing Letters
- Publication date: 2018-05-14
- Fields of study: Computer Science, Engineering
- Source metadata: Semantic Scholar