Neural Machine Translation of Rare Words with Subword Units
Rico Sennrich, Barry Haddow, Alexandra Birch
Published 2015 in Annual Meeting of the Association for Computational Linguistics
ABSTRACT
Neural machine translation (NMT) models typically operate with a fixed vocabulary, but translation is an open-vocabulary problem. Previous work addresses the translation of out-of-vocabulary words by backing off to a dictionary. In this paper, we introduce a simpler and more effective approach, making the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units. This is based on the intuition that various word classes are translatable via smaller units than words, for instance names (via character copying or transliteration), compounds (via compositional translation), and cognates and loanwords (via phonological and morphological transformations). We discuss the suitability of different word segmentation techniques, including simple character n-gram models and a segmentation based on the byte pair encoding compression algorithm, and empirically show that subword models improve over a back-off dictionary baseline for the WMT 15 translation tasks English-German and English-Russian by 1.1 and 1.3 BLEU, respectively.
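The byte-pair-encoding segmentation mentioned in the abstract can be sketched in a few lines of Python. The sketch below follows the merge-learning procedure described in the paper, with a toy vocabulary mirroring the paper's running example; treat it as an illustration, not the released implementation:

```python
import re
import collections

def get_stats(vocab):
    """Count how often each adjacent symbol pair occurs across the vocabulary."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Merge every occurrence of the given symbol pair into a single symbol."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Toy vocabulary: each word is a sequence of characters plus an end-of-word marker.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
merges = []  # ordered record of the learned merge operations
for _ in range(10):  # the number of merges controls the final subword vocabulary size
    pairs = get_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    merges.append(best)
    vocab = merge_vocab(best, vocab)
print(merges[:3])  # e.g. [('e', 's'), ('es', 't'), ('est', '</w>')]
```

Each merge both extends the subword vocabulary and defines an operation that can later be replayed to segment unseen words.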
PUBLICATION RECORD
- Publication year: 2015
- Venue: Annual Meeting of the Association for Computational Linguistics
- Publication date: 2015-08-31
- Fields of study: Computer Science
- Source metadata: Semantic Scholar
EXTRACTION MAP
LINKED PAPERS
- Effective Approaches to Attention-based Neural Machine Translation
- WMT English-German translation task (part of): The WMT English-German translation task is an evaluation task within the WMT 15 benchmark.
- BLEU score (related to): BLEU and BLEU score refer to the same automatic machine-translation evaluation metric for translation quality.
CONCEPTS
- back-off dictionary
A baseline approach for handling out-of-vocabulary words in NMT by substituting translations retrieved from an external dictionary.
Aliases: dictionary back-off
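A minimal sketch of such a back-off step, assuming the decoder emits an '<unk>' token together with an attention-derived source alignment for each output position (names and data here are illustrative, not from the paper):

```python
def backoff_unknowns(output_tokens, aligned_source_words, dictionary):
    """Replace each '<unk>' in the NMT output with the dictionary translation
    of its aligned source word, copying the source word verbatim if the
    dictionary has no entry for it."""
    return [dictionary.get(src, src) if tok == '<unk>' else tok
            for tok, src in zip(output_tokens, aligned_source_words)]

# Example: the model could not translate the name 'Obama', so it emitted '<unk>'.
print(backoff_unknowns(['<unk>', 'besucht', 'Berlin'],
                       ['Obama', 'visits', 'Berlin'],
                       {'visits': 'besucht', 'Berlin': 'Berlin'}))
# ['Obama', 'besucht', 'Berlin']
```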
- BLEU
A standard automatic metric for evaluating machine translation quality by comparing n-gram overlap with reference translations.
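For reference, BLEU's standard form (Papineni et al., 2002) combines modified n-gram precisions p_n (typically up to N = 4 with uniform weights w_n = 1/N) with a brevity penalty; in LaTeX:

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right),
\qquad
\mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases}
```

where c is the candidate length and r the effective reference length.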
- byte pair encoding
A data compression algorithm adapted here as a word segmentation technique that iteratively merges frequent character pairs into subword units.
Aliases: BPE
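Given a learned list of merge operations (see the sketch after the abstract), segmenting new text just replays them in order. A simplified, greedy replay (illustrative only; real implementations handle efficiency and merge priority more carefully):

```python
def apply_bpe(word, merges):
    """Segment one word into subword units by replaying the learned merge
    operations in the order they were learned."""
    symbols = list(word) + ['</w>']
    for left, right in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == left and symbols[i + 1] == right:
                symbols[i:i + 2] = [left + right]
            else:
                i += 1
    return symbols

# A few merges as the learning sketch above would produce them:
merges = [('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o'), ('lo', 'w')]
print(apply_bpe('lowest', merges))  # ['low', 'est</w>'] -- unseen word, no '<unk>'
```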
- character n-gram models
A word segmentation approach that splits words into fixed-length character sequences, discussed here as an alternative to BPE.
Aliases: character n-grams
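For contrast with BPE's learned merges, a minimal sketch of fixed-length splitting into non-overlapping chunks (one simple reading of character n-gram segmentation; the function name is illustrative):

```python
def char_ngram_segment(word, n=3):
    """Split a word into consecutive non-overlapping character chunks of
    length n; the final chunk may be shorter."""
    return [word[i:i + n] for i in range(0, len(word), n)]

print(char_ngram_segment('Abwasserbehandlungsanlage'))
# ['Abw', 'ass', 'erb', 'eha', 'ndl', 'ung', 'san', 'lag', 'e']
```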
- neural machine translation
A sequence-to-sequence translation approach using neural networks, here extended to handle rare and unknown words via subword segmentation.
Aliases: NMT
- open-vocabulary translation
The capability to translate words not present in the model's fixed training vocabulary, including rare and unknown words.
Aliases: OOV translation
- subword units
Sub-word segments used to encode rare and unknown words, allowing NMT models to generalize beyond a fixed vocabulary.
Aliases: subword segmentation
- WMT 15
The 2015 Workshop on Machine Translation shared task benchmark, used here for English-German and English-Russian translation evaluation.
Aliases: WMT15, WMT 2015