Hardware-Oriented Compression of Long Short-Term Memory for Efficient Inference
Zhisheng Wang, Jun Lin, Zhongfeng Wang
Published 2018 in IEEE Signal Processing Letters

ABSTRACT
Long short-term memory (LSTM) networks and their variants have been widely adopted for processing sequential data. However, their intrinsically large memory requirements and high computational complexity make them difficult to deploy in embedded systems, which motivates model compression and dedicated hardware accelerators for LSTM. In this letter, efficient clipped gating and top-$k$ pruning schemes are introduced to convert the dense matrix computations in LSTM into structured sparse-matrix-sparse-vector multiplications. Mixed quantization schemes are then developed to eliminate most of the multiplications in LSTM. The proposed compression scheme is well suited to efficient hardware implementation. Experimental results show that the model size and the number of matrix operations are reduced by $32\times$ and $18.5\times$, respectively, at a cost of less than $1\%$ accuracy loss on a word-level language modeling task.
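The abstract names the building blocks (clipped gating, top-$k$ pruning, mixed quantization) but gives no implementation details. The following is a minimal NumPy sketch of plausible forms of each idea: top-$k$ magnitude pruning of an activation vector, gate clipping to exact zero below a small threshold, and rounding weights to signed powers of two so each multiply reduces to a shift. All function names, the threshold value, and the demo shapes are illustrative assumptions, not the authors' code, and the structured sparsity pattern used in the letter is not reproduced here.

    import numpy as np

    def top_k_prune(x, k):
        # Keep the k largest-magnitude entries of a vector, zero the rest.
        # A generic sketch of top-k pruning; the letter's structured
        # variant is not specified in the abstract.
        out = np.zeros_like(x)
        if k > 0:
            idx = np.argpartition(np.abs(x), -k)[-k:]
            out[idx] = x[idx]
        return out

    def clipped_gate(x, threshold=0.1):
        # Hypothetical clipped gating: sigmoid outputs below a small
        # threshold are clipped to exactly zero, so the corresponding
        # state updates can be skipped entirely.
        g = 1.0 / (1.0 + np.exp(-x))
        g[g < threshold] = 0.0
        return g

    def quantize_pow2(w):
        # Round each nonzero weight to the nearest signed power of two,
        # a common mixed-quantization building block (the letter's exact
        # scheme may differ); in hardware the multiply becomes a shift.
        sign = np.sign(w)
        mag = np.abs(w)
        exp = np.round(np.log2(np.where(mag > 0, mag, 1.0)))
        return np.where(mag > 0, sign * 2.0 ** exp, 0.0)

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        W = quantize_pow2(rng.normal(size=(4, 8)))
        h = top_k_prune(rng.normal(size=8), k=2)  # sparse activation vector
        y = W @ h  # only 2 of 8 columns contribute; each product is a shift
        print(y)

The pieces compose naturally: with a top-$k$ sparse activation vector only $k$ columns of the weight matrix are touched, and with power-of-two weights each surviving contribution is a shift-and-add, which is what makes such a scheme attractive for dedicated hardware.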
PUBLICATION RECORD
- Publication year: 2018
- Venue: IEEE Signal Processing Letters
- Publication date: 2018-05-14
- Fields of study: Computer Science, Engineering
- Source metadata: Semantic Scholar