Frustratingly Short Attention Spans in Neural Language Modeling

Michal Daniluk, Tim Rocktäschel, Johannes Welbl, Sebastian Riedel

Published 2017 in International Conference on Learning Representations

ABSTRACT

Neural language models predict the next token using a latent representation of the immediate token history. Recently, various methods for augmenting neural language models with an attention mechanism over a differentiable memory have been proposed. For predicting the next token, these models query information from a memory of the recent history which can facilitate learning mid- and long-range dependencies. However, conventional attention mechanisms used in memory-augmented neural language models produce a single output vector per time step. This vector is used both for predicting the next token as well as for the key and value of a differentiable memory of a token history. In this paper, we propose a neural language model with a key-value attention mechanism that outputs separate representations for the key and value of a differentiable memory, as well as for encoding the next-word distribution. This model outperforms existing memory-augmented neural language models on two corpora. Yet, we found that our method mainly utilizes a memory of the five most recent output representations. This led to the unexpected main finding that a much simpler model based only on the concatenation of recent output representations from previous time steps is on par with more sophisticated memory-augmented neural language models.
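The mechanism the abstract describes can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the equal three-way split of each output vector, and the plain dot-product scoring are assumptions made here for illustration (the abstract does not specify the scoring function), and the window of five only mirrors the abstract's remark about the five most recent output representations. The sketch also assumes at least two output vectors are available.

```python
import math


def split_three(h):
    """Split one output vector into (key, value, predict) thirds (assumed layout)."""
    k = len(h) // 3
    return h[:k], h[k:2 * k], h[2 * k:3 * k]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def key_value_attend(outputs, window=5):
    """Attend over up to `window` outputs that precede the most recent one.

    Keys are used only for scoring, values only for the context vector,
    and the predict part is returned separately. Dot-product scoring is
    a simplifying assumption of this sketch.
    """
    query_key, _, predict = split_three(outputs[-1])
    memory = [split_three(h) for h in outputs[-(window + 1):-1]]
    scores = [sum(a * b for a, b in zip(key, query_key)) for key, _, _ in memory]
    weights = softmax(scores)
    dim = len(query_key)
    context = [
        sum(w * value[j] for w, (_, value, _) in zip(weights, memory))
        for j in range(dim)
    ]
    return weights, context, predict


def concat_recent(outputs, n=3):
    """Simpler alternative: concatenate the last n output vectors (simplified)."""
    return [x for h in outputs[-n:] for x in h]


if __name__ == "__main__":
    # Three hypothetical 6-dimensional output vectors (2 dims per part).
    outputs = [
        [1.0, 0.0, 1.0, 2.0, 0.5, 0.5],
        [0.0, 1.0, 3.0, 4.0, 0.1, 0.2],
        [1.0, 0.0, 9.0, 9.0, 0.7, 0.8],
    ]
    weights, context, predict = key_value_attend(outputs)
    print(weights, context, predict)
    print(concat_recent(outputs, n=2))
```

Separating the three roles means the part of the vector that scores memory slots is never forced to also encode the next-word distribution, which is the motivation the abstract gives. The concatenation variant drops attention entirely and simply exposes a fixed number of recent outputs.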

PUBLICATION RECORD

Publication year
2017
Venue
International Conference on Learning Representations
Publication date
2017-02-15
Fields of study
Computer Science
Identifiers
arXiv 1702.04521
Source metadata
Semantic Scholar

REFERENCES

Simulating Action Dynamics with Neural Process Networks
2018cited by this paper
Recurrent Memory Networks for Language Modeling
2016influential reference
Natural Language Comprehension with the EpiReader
2016cited by this paper
Separating Answers from Queries for Neural Reading Comprehension
2016cited by this paper
Reference-Aware Language Models
2016cited by this paper
Higher Order Recurrent Neural Networks
2016influential reference
Key-Value Memory Networks for Directly Reading Documents
2016influential reference
Using Fast Weights to Attend to the Recent Past
2016influential reference
Consensus Attention-based Neural Networks for Chinese Reading Comprehension
2016cited by this paper
Long Short-Term Memory-Networks for Machine Reading
2016cited by this paper
Gated-Attention Readers for Text Comprehension
2016cited by this paper
Exploring the Limits of Language Modeling
2016cited by this paper
Attention-over-Attention Neural Networks for Reading Comprehension
2016cited by this paper
Text Understanding with the Attention Sum Reader Network
2016cited by this paper
Dynamic Neural Turing Machine with Soft and Hard Addressing Schemes
2016influential reference
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
2015cited by this paper
End-To-End Memory Networks
2015cited by this paper
Sparse Non-negative Matrix Language Modeling
2015cited by this paper
Scaling recurrent neural network language models
2015cited by this paper
The Goldilocks Principle: Reading Children's Books with Explicit Memory Representations
2015influential reference
A Neural Attention Model for Abstractive Sentence Summarization
2015cited by this paper
Attention-Based Models for Speech Recognition
2015cited by this paper
Reasoning about Entailment with Neural Attention
2015cited by this paper
Neural Programmer-Interpreters
2015cited by this paper
BlackOut: Speeding up Recurrent Neural Network Language Models With Very Large Vocabularies
2015cited by this paper
Adam: A Method for Stochastic Optimization
2014cited by this paper
Memory Networks
2014cited by this paper
Neural Machine Translation by Jointly Learning to Align and Translate
2014influential reference
Neural Turing Machines
2014cited by this paper
One billion word benchmark for measuring progress in statistical language modeling
2013cited by this paper
On the difficulty of training recurrent neural networks
2012cited by this paper
Empirical Evaluation and Combination of Advanced Language Modeling Techniques
2011cited by this paper
Recurrent neural network based language model
2010cited by this paper
Long Short-Term Memory
1997cited by this paper
Backpropagation Through Time: What It Does and How to Do It
1990cited by this paper
Learning internal representations by error propagation
1986cited by this paper
Learning Matrices and Their Applications
1963cited by this paper
Pattern Recognition by Means of Automatic Analogue Apparatus
1960cited by this paper

CITED BY

Prompt-Aware Adapter: Learning Adaptive Visual Tokens for Multimodal Large Language Models
2026cites this paper
Cross-Subject Cognitive State Assessment for Unmanned System Operators Based on Brain Functional Connectivity
2025cites this paper
Applying social media in emergency response: an attention-based bidirectional deep learning system for location reference recognition in disaster tweets
2024cites this paper
Prompt-Aware Adapter: Towards Learning Adaptive Visual Tokens for Multimodal Large Language Models
2024cites this paper
Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance
2024cites this paper
Pruning Literals for Highly Efficient Explainability at Word Level
2024cites this paper
Automated Disentangled Sequential Recommendation with Large Language Models
2024cites this paper
Data Ambiguity Strikes Back: How Documentation Improves GPT's Text-to-SQL
2023cites this paper
ReelFramer: Human-AI Co-Creation for News-to-Video Translation
2023cites this paper
Utilizing social media for emergency response: a tweet classification system using attention-based BiLSTM and CNN for resource management
2023cites this paper
The Anatomy of Deception: Technical and Human Perspectives on a Large-scale Phishing Campaign
2023cites this paper
Concentration or distraction? A synergetic-based attention weights optimization method
2023cites this paper
Lost in the Middle: How Language Models Use Long Contexts
2023cites this paper
Role of Bias Terms in Dot-Product Attention
2023cites this paper
EEG-Based Emotion Recognition via Channel-Wise Attention and Self Attention
2023cites this paper
End-to-End Learning with Text & Knowledge Bases
2023cites this paper
Neural Language Models in Natural Language Processing
2023cites this paper
FAformer: parallel Fourier-attention architectures benefits EEG-based affective computing with enhanced spatial information
2023cites this paper
Predicting Patients' Satisfaction With Mental Health Drug Treatment Using Their Reviews: Unified Interchangeable Model Fusion Approach
2023cites this paper
LLM-TAKE: Theme-Aware Keyword Extraction Using Large Language Models
2023cites this paper
In-Context Exemplars as Clues to Retrieving from Large Associative Memory
2023cites this paper
Topic-aware hierarchical multi-attention network for text classification
2022cites this paper
State-Regularized Recurrent Neural Networks to Extract Automata and Explain Predictions
2022cites this paper
A General Survey on Attention Mechanisms in Deep Learning
2022cites this paper
Segmentation and recognition of filed sweet pepper based on improved self-attention convolutional neural networks
2022cites this paper
A Knowledge Query Network Model Based on Rasch Model Embedding for Personalized Online Learning
2022cites this paper
Deep Feature Learning Based Fault Detection with High-Frequency Signals
2022cites this paper
Attention Biasing and Context Augmentation for Zero-Shot Control of Encoder-Decoder Transformers for Natural Language Generation
2022cites this paper
Attention-based bidirectional LSTM with embedding technique for classification of COVID-19 articles
2022cites this paper
The Health Gym: synthetic health-related datasets for the development of reinforcement learning algorithms
2022cites this paper
Artificial Intelligence for the Metaverse: A Survey
2022influential citation
Process data properties matter: Introducing gated convolutional neural networks (GCNN) and key-value-predict attention networks (KVP) for next event prediction with deep learning
2021cites this paper
A Survey on Aspect-Based Sentiment Classification
2021cites this paper
Embedding Graph Convolutional Networks in Recurrent Neural Networks for Predictive Monitoring
2021cites this paper
AutoAttend: Automated Attention Representation Search
2021cites this paper
Zero-Shot Controlled Generation with Encoder-Decoder Transformers
2021cites this paper
Memory-guided Unsupervised Image-to-image Translation
2021cites this paper
Attention, please! A survey of neural attention models in deep learning
2021cites this paper
Learning a Word-Level Language Model with Sentence-Level Noise Contrastive Estimation for Contextual Sentence Probability Estimation
2021cites this paper
NetBERT: A Pre-trained Language Representation Model for Computer Networking
2020cites this paper
A Survey of the Usages of Deep Learning for Natural Language Processing
2020cites this paper
A Fusion Model-Based Label Embedding and Self-Interaction Attention for Text Classification
2020cites this paper
Span-Based Neural Buffer: Towards Efficient and Effective Utilization of Long-Distance Context for Neural Sequence Models
2020influential citation
Ein Vergleich aktueller Deep-Learning-Architekturen zur Prognose von Prozessverhalten
2020cites this paper
Do Transformers Need Deep Long-Range Memory?
2020cites this paper
Stock Embeddings Acquired from News Articles and Price History, and an Application to Portfolio Optimization
2020cites this paper
Human Attention Maps for Text Classification: Do Humans and Neural Networks Focus on the Same Words?
2020cites this paper
Label-Attentive Hierarchical Attention Network for Text Classification
2020cites this paper
Neural Language Generation: Formulation, Methods, and Evaluation
2020cites this paper
Hopfield Networks is All You Need
2020cites this paper
On Memory in Human and Artificial Language Processing Systems
2020cites this paper
Modeling word and morpheme order in natural language as an efficient trade-off of memory and surprisal.
2020cites this paper
A Tale of Two Linkings: Dynamically Gating between Schema Linking and Structural Linking for Text-to-SQL Parsing
2020cites this paper
Long Range Arena: A Benchmark for Efficient Transformers
2020cites this paper
Sequential transfer learning in NLP for text summarization
2019cites this paper
Go From the General to the Particular: Multi-Domain Translation with Domain Transformation Networks
2019cites this paper
AutoFM: A hybrid collaborative filtering model with denoising autoencoders and factorization machine
2019cites this paper
Lightweight and Efficient Neural Natural Language Processing with Quaternion Networks
2019cites this paper
nLSALog: An Anomaly Detection Framework for Log Sequence in Security Management
2019cites this paper
SparseSpeech: Unsupervised Acoustic Unit Discovery with Memory-Augmented Sequence Autoencoders
2019cites this paper
Deep Sequential Models for Suicidal Ideation From Multiple Source Data
2019influential citation
Ensemble Approach for Natural Language Question Answering Problem
2019cites this paper
A survey of 25 years of evaluation
2019cites this paper
Using Dependency Information to Enhance Attention Mechanism for Aspect-Based Sentiment Analysis
2019cites this paper
Attention in Recurrent Neural Networks for Ransomware Detection
2019cites this paper
Attention, please! A Critical Review of Neural Attention Models in Natural Language Processing
2019influential citation
State-Regularized Recurrent Neural Networks
2019influential citation
Bidirectional LSTM with attention mechanism and convolutional layer for text classification
2019cites this paper
Attention in Natural Language Processing
2019cites this paper
An Efficient Model for Sentiment Analysis of Electronic Product Reviews in Vietnamese
2019cites this paper
Images2Poem: Generating Chinese Poetry from Image Streams
2018cites this paper
Persistence pays off: Paying Attention to What the LSTM Gating Mechanism Persists
2018cites this paper
Enhance Machine Reading Comprehension on Multiple Sentence Questions with Gated and Dense Coreference Information
2018cites this paper
Character-Level Language Modeling with Deeper Self-Attention
2018cites this paper
Memory Architectures in Recurrent Neural Network Language Models
2018influential citation
Neural Machine Translation with Key-Value Memory-Augmented Attention
2018influential citation
Focusing on What is Relevant: Time-Series Learning and Understanding using Attention
2018cites this paper
Natural Answer Generation with Heterogeneous Memory
2018cites this paper
Continuous Learning in a Hierarchical Multiscale Neural Network
2018cites this paper
Generative Stock Question Answering
2018cites this paper
Neural Models for Reasoning over Multiple Mentions Using Coreference
2018cites this paper
Meta-Learning a Dynamical Language Model
2018cites this paper
Recurrent Neural Network Attention Mechanisms for Interpretable System Log Anomaly Detection
2018cites this paper
Fast Weight Long Short-Term Memory
2018cites this paper
A Neural Language Model with a Modified Attention Mechanism for Software Code
2018influential citation
From Sequence to Attention; Search for a Compositional Bias in Sequence-to-Sequence Models
2018cites this paper
Extractive Summary as Discrete Latent Variables
2018cites this paper
Modeling Local Dependence in Natural Language with Multi-channel Recurrent Neural Networks
2018cites this paper
QA2Explanation: Generating and Evaluating Explanations for Question Answering Systems over Knowledge Graph
2018cites this paper
Exploring the Use of Attention within an Neural Machine Translation Decoder States to Translate Idioms
2018influential citation
Multi-Modal Sequence Fusion via Recursive Attention for Emotion Recognition
2018cites this paper
Middle-Out Decoding
2018influential citation
Modeling Localness for Self-Attention Networks
2018cites this paper
Integrating Transformer and Paraphrase Rules for Sentence Simplification
2018cites this paper
A Survey of the Usages of Deep Learning in Natural Language Processing
2018cites this paper
Recognizing Textual Entailment with Attentive Reading and Writing Operations
2018cites this paper
Applying self-attention neural networks for sentiment analysis classification and time-series regression tasks
2018cites this paper
Key-value Attention Mechanism for Neural Machine Translation
2017influential citation
Learning to Remember Translation History with a Continuous Cache
2017cites this paper