Length bias in Encoder Decoder Models and a Case for Global Conditioning

Published 2016 in Conference on Empirical Methods in Natural Language Processing

ABSTRACT

Encoder-decoder networks are popular for modeling sequences probabilistically in many applications. These models use the power of the Long Short-Term Memory (LSTM) architecture to capture the full dependence among variables, unlike earlier models such as CRFs that typically assumed conditional independence among non-adjacent variables. However, in practice encoder-decoder models exhibit a bias towards short sequences that, surprisingly, gets worse with increasing beam size. In this paper we show that this phenomenon is due to a discrepancy between the full-sequence margin and the per-element margin enforced by the locally conditioned training objective of an encoder-decoder model. The discrepancy affects long sequences more adversely, which explains the bias towards predicting short sequences. For the case where the predicted sequences come from a closed set, we show that a globally conditioned model alleviates the above problems of encoder-decoder models. From a practical point of view, our proposed model also eliminates the need for beam search during inference, which reduces to an efficient dot-product-based search in a vector space.
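
The last point of the abstract, inference as a dot-product search over a closed set of candidate outputs, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `dot_product_search`, the three-dimensional embeddings, and the exhaustive scan are all assumptions made for illustration, and the abstract does not say how the embeddings are produced or how the search is made efficient.

```python
def dot_product_search(input_vec, candidate_vecs):
    """Return the index of the candidate with the largest dot product.

    Hypothetical helper: scores every candidate in the closed set against
    the input embedding and picks the best one, with no beam search.
    """
    scores = [
        sum(a * b for a, b in zip(input_vec, cand))
        for cand in candidate_vecs
    ]
    # Index of the highest-scoring candidate.
    return max(range(len(scores)), key=scores.__getitem__)


# Made-up embeddings: one input and a closed set of three candidate outputs.
x = [0.2, 0.9, -0.1]
candidates = [
    [1.0, 0.0, 0.0],  # dot product with x: 0.2
    [0.1, 1.0, 0.0],  # dot product with x: 0.92
    [0.0, 0.2, 1.0],  # dot product with x: 0.08
]

best = dot_product_search(x, candidates)
print(best)  # 1
```

Because each candidate is scored as a whole by a single inner product, the score does not accumulate per-token terms, which is one way to read the abstract's contrast with locally conditioned decoding. A production system would likely replace the exhaustive scan with an approximate inner-product index.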

PUBLICATION RECORD

Publication year: 2016
Venue: Conference on Empirical Methods in Natural Language Processing
Publication date: 2016-06-10
Fields of study: Computer Science
Identifiers: DOI 10.18653/v1/D16-1158; arXiv 1606.03402
Source metadata: Semantic Scholar

REFERENCES

Globally Normalized Transition-Based Neural Networks
2016, cited by this paper
Smart Reply: Automated Response Suggestion for Email
2016, cited by this paper
OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles
2016, cited by this paper
Conversational Contextual Cues: The Case of Personalization and History for Response Ranking
2016, cited by this paper
Quantization based Fast Inner Product Search
2015, cited by this paper
Learning to Rank Short Text Pairs with Convolutional Deep Neural Networks
2015, cited by this paper
The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems
2015, cited by this paper
Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks
2015, cited by this paper
A Diversity-Promoting Objective Function for Neural Conversation Models
2015, cited by this paper
A Neural Conversational Model
2015, cited by this paper
Recurrent Neural Networks
2015, influential reference
Overcoming the Curse of Sentence Length for Neural Machine Translation using Automatic Segmentation
2014, cited by this paper
Show and tell: A neural image caption generator
2014, cited by this paper
On Using Very Large Target Vocabulary for Neural Machine Translation
2014, cited by this paper
Long short-term memory recurrent neural network architectures for large scale acoustic modeling
2014, cited by this paper
Sequence to Sequence Learning with Neural Networks
2014, cited by this paper
Neural Machine Translation by Jointly Learning to Align and Translate
2014, cited by this paper
On the Properties of Neural Machine Translation: Encoder–Decoder Approaches
2014, cited by this paper
Recurrent conditional random field for language understanding
2014, cited by this paper
The Most Generative Maximum Margin Bayesian Networks
2013, cited by this paper
Generating Sequences With Recurrent Neural Networks
2013, cited by this paper
Understanding the exploding gradient problem
2012, cited by this paper
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
2011, influential reference
Maximum Margin Bayesian Networks
2005, cited by this paper
Margin Maximizing Loss Functions
2003, cited by this paper
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
2001, influential reference
Maximum Entropy Markov Models for Information Extraction and Segmentation
2000, cited by this paper
Long Short-Term Memory
1997, cited by this paper

CITED BY

Beyond Reward Hacking: Causal Rewards for Large Language Model Alignment
2025, cites this paper
User-LLM: Efficient LLM Contextualization with User Embeddings
2024, cites this paper
ODIN: Disentangled Reward Mitigates Hacking in RLHF
2024, cites this paper
Principled Comparisons for End-to-End Speech Recognition: Attention vs Hybrid at the 1000-Hour Scale
2024, cites this paper
End-to-End Speech Recognition: A Survey
2023, cites this paper
RASR2: The RWTH ASR Toolkit for Generic Sequence-to-sequence Speech Recognition
2023, cites this paper
Chunked Attention-Based Encoder-Decoder Model for Streaming Speech Recognition
2023, cites this paper
Automatic Chart Understanding: A Review
2023, cites this paper
Neural machine translation from text to sign language
2023, cites this paper
Jam or Cream First? Modeling Ambiguity in Neural Machine Translation with SCONES
2022, cites this paper
EANA: Reducing Privacy Risk on Large-scale Recommendation Models
2022, cites this paper
Uncertainty Determines the Adequacy of the Mode and the Tractability of Decoding in Sequence-to-Sequence Models
2022, cites this paper
Multi-Sentence Resampling: A Simple Approach to Alleviate Dataset Length Bias and Beam-Search Degradation
2021, cites this paper
Improving neural machine translation with sentence alignment learning
2021, cites this paper
Searchable Hidden Intermediates for End-to-End Models of Decomposable Sequence Tasks
2021, cites this paper
Language Model Evaluation Beyond Perplexity
2021, cites this paper
Mode recovery in neural autoregressive sequence modeling
2021, cites this paper
Sampling-Based Minimum Bayes Risk Decoding for Neural Machine Translation
2021, cites this paper
Learning Federated Representations and Recommendations with Limited Negatives
2021, cites this paper
Decoding Methods in Neural Language Generation: A Survey
2021, cites this paper
Sampling-Based Approximations to Minimum Bayes Risk Decoding for Neural Machine Translation
2021, cites this paper
Challenges of Building an Intelligent Chatbot
2020, cites this paper
AmbigQA: Answering Ambiguous Open-domain Questions
2020, cites this paper
The Roles of Language Models and Hierarchical Models in Neural Sequence-to-Sequence Prediction
2020, cites this paper
Representation and Bias in Multilingual NLP: Insights from Controlled Experiments on Conditional Language Modeling
2020, cites this paper
Consistency of a Recurrent Language Model with Respect to Incomplete Decoding
2020, cites this paper
Robust Beam Search for Encoder-Decoder Attention Based Speech Recognition Without Length Bias
2020, cites this paper
Early Stage LM Integration Using Local and Global Log-Linear Combination
2020, cites this paper
Is MAP Decoding All You Need? The Inadequacy of the Mode in Neural Machine Translation
2020, influential citation
MLE-guided parameter search for task loss minimization in neural sequence modeling
2020, influential citation
If Beam Search Is the Answer, What Was the Question?
2020, cites this paper
Why Neural Machine Translation Prefers Empty Outputs
2020, cites this paper
Neural Machine Translation
2019, cites this paper
Calibration of Encoder Decoder Models for Neural Machine Translation
2019, cites this paper
Neural Machine Translation: A Review
2019, influential citation
On NMT Search Errors and Model Errors: Cat Got Your Tongue?
2019, cites this paper
Diversifying Reply Suggestions Using a Matching-Conditional Variational Autoencoder
2019, cites this paper
Deep learning methods for knowledge base population
2018, cites this paper
Better Conversations by Modeling, Filtering, and Optimizing for Coherence and Diversity
2018, cites this paper
Adversarial Evaluation of Dialogue Models
2017, cites this paper
YJTI at the NTCIR-13 STC Japanese Subtask
2017, cites this paper
Learning to Decode for Future Success
2017, cites this paper
UvA-DARE (Digital Academic Repository) Is MAP Decoding All You Need? The Inadequacy of the Mode in Neural Machine Translation
year unknown, influential citation