Neural Machine Translation of Rare Words with Subword Units
Rico Sennrich, Barry Haddow, Alexandra Birch
Published 2015 in Annual Meeting of the Association for Computational Linguistics
ABSTRACT
Neural machine translation (NMT) models typically operate with a fixed vocabulary, but translation is an open-vocabulary problem. Previous work addresses the translation of out-of-vocabulary words by backing off to a dictionary. In this paper, we introduce a simpler and more effective approach, making the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units. This is based on the intuition that various word classes are translatable via smaller units than words, for instance names (via character copying or transliteration), compounds (via compositional translation), and cognates and loanwords (via phonological and morphological transformations). We discuss the suitability of different word segmentation techniques, including simple character n-gram models and a segmentation based on the byte pair encoding compression algorithm, and empirically show that subword models improve over a back-off dictionary baseline for the WMT 15 translation tasks English-German and English-Russian by 1.1 and 1.3 BLEU, respectively.
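The byte-pair-encoding segmentation mentioned in the abstract can be sketched in a few lines of Python. The sketch below follows the merge-learning procedure described in the paper, with a toy vocabulary mirroring the paper's running example; treat it as an illustration, not the released implementation:

```python
import re
import collections

def get_stats(vocab):
    """Count how often each adjacent symbol pair occurs across the vocabulary."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Merge every occurrence of the given symbol pair into a single symbol."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Toy vocabulary: each word is a sequence of characters plus an end-of-word marker.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
merges = []  # ordered record of the learned merge operations
for _ in range(10):  # the number of merges controls the final subword vocabulary size
    pairs = get_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    merges.append(best)
    vocab = merge_vocab(best, vocab)
print(merges[:3])  # e.g. [('e', 's'), ('es', 't'), ('est', '</w>')]
```

Each merge both extends the subword vocabulary and defines an operation that can later be replayed to segment unseen words.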
PUBLICATION RECORD
- Publication year: 2015
- Venue: Annual Meeting of the Association for Computational Linguistics
- Publication date: 2015-08-31
- Fields of study: Computer Science
- Source metadata: Semantic Scholar
EXTRACTION MAP
LINKED PAPERS
- Effective Approaches to Attention-based Neural Machine Translation
- WMT English-German translation task (part of): The WMT English-German translation task is an evaluation task within the WMT 15 benchmark.
- BLEU score (related to): BLEU and BLEU score refer to the same automatic machine-translation evaluation metric for translation quality.
CONCEPTS
- back-off dictionary
A baseline approach for handling out-of-vocabulary words in NMT by substituting translations retrieved from an external dictionary.
Aliases: dictionary back-off
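A minimal sketch of such a back-off step, assuming the decoder emits an '<unk>' token together with an attention-derived source alignment for each output position (names and data here are illustrative, not from the paper):

```python
def backoff_unknowns(output_tokens, aligned_source_words, dictionary):
    """Replace each '<unk>' in the NMT output with the dictionary translation
    of its aligned source word, copying the source word verbatim if the
    dictionary has no entry for it."""
    return [dictionary.get(src, src) if tok == '<unk>' else tok
            for tok, src in zip(output_tokens, aligned_source_words)]

# Example: the model could not translate the name 'Obama', so it emitted '<unk>'.
print(backoff_unknowns(['<unk>', 'besucht', 'Berlin'],
                       ['Obama', 'visits', 'Berlin'],
                       {'visits': 'besucht', 'Berlin': 'Berlin'}))
# ['Obama', 'besucht', 'Berlin']
```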
- BLEU
A standard automatic metric for evaluating machine translation quality by comparing n-gram overlap with reference translations.
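For reference, BLEU's standard form (Papineni et al., 2002) combines modified n-gram precisions p_n (typically up to N = 4 with uniform weights w_n = 1/N) with a brevity penalty; in LaTeX:

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right),
\qquad
\mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases}
```

where c is the candidate length and r the effective reference length.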
- byte pair encoding
A data compression algorithm adapted here as a word segmentation technique that iteratively merges frequent character pairs into subword units.
Aliases: BPE
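Given a learned list of merge operations (see the sketch after the abstract), segmenting new text just replays them in order. A simplified, greedy replay (illustrative only; real implementations handle efficiency and merge priority more carefully):

```python
def apply_bpe(word, merges):
    """Segment one word into subword units by replaying the learned merge
    operations in the order they were learned."""
    symbols = list(word) + ['</w>']
    for left, right in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == left and symbols[i + 1] == right:
                symbols[i:i + 2] = [left + right]
            else:
                i += 1
    return symbols

# A few merges as the learning sketch above would produce them:
merges = [('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o'), ('lo', 'w')]
print(apply_bpe('lowest', merges))  # ['low', 'est</w>'] -- unseen word, no '<unk>'
```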
- character n-gram models
A word segmentation approach that splits words into fixed-length character sequences, discussed here as an alternative to BPE.
Aliases: character n-grams
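For contrast with BPE's learned merges, a minimal sketch of fixed-length splitting into non-overlapping chunks (one simple reading of character n-gram segmentation; the function name is illustrative):

```python
def char_ngram_segment(word, n=3):
    """Split a word into consecutive non-overlapping character chunks of
    length n; the final chunk may be shorter."""
    return [word[i:i + n] for i in range(0, len(word), n)]

print(char_ngram_segment('Abwasserbehandlungsanlage'))
# ['Abw', 'ass', 'erb', 'eha', 'ndl', 'ung', 'san', 'lag', 'e']
```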
- neural machine translation
A sequence-to-sequence translation approach using neural networks, here extended to handle rare and unknown words via subword segmentation.
Aliases: NMT
- open-vocabulary translation
The capability to translate words not present in the model's fixed training vocabulary, including rare and unknown words.
Aliases: OOV translation
- subword units
Sub-word segments used to encode rare and unknown words, allowing NMT models to generalize beyond a fixed vocabulary.
Aliases: subword segmentation
- WMT 15
The 2015 Workshop on Machine Translation shared task benchmark, used here for English-German and English-Russian translation evaluation.
Aliases: WMT15, WMT 2015