BPE and CharCNNs for Translation of Morphology: A Cross-Lingual Comparison and Analysis

Published 2018 in arXiv.org

ABSTRACT

Neural Machine Translation (NMT) in low-resource settings and of morphologically rich languages is made difficult in part by data sparsity of vocabulary words. Several methods have been used to help reduce this sparsity, notably Byte-Pair Encoding (BPE) and a character-based CNN layer (charCNN). However, the charCNN has largely been neglected, possibly because it has only been compared to BPE rather than combined with it. We argue for a reconsideration of the charCNN, based on cross-lingual improvements on low-resource data. We translate from 8 languages into English, using a multi-way parallel collection of TED transcripts. We find that in most cases, using both BPE and a charCNN performs best, while in Hebrew, using a charCNN over words is best.
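For readers unfamiliar with BPE, the following is a minimal sketch of how merge operations can be learned from word frequencies. It is an illustration only, not the paper's implementation: the function name `learn_bpe`, the `</w>` end-of-word marker, and the toy word counts are all assumptions made for this example.

```python
from collections import Counter


def learn_bpe(word_counts, num_merges):
    """Learn up to num_merges BPE merges from a {word: frequency} mapping.

    Hypothetical helper for illustration; not taken from the paper.
    """
    # Represent each word as a tuple of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        # Pick the most frequent pair (ties go to the pair seen first).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + count
        vocab = new_vocab
    return merges, vocab


# Toy usage: frequent character sequences become single subword units.
merges, vocab = learn_bpe({"abab": 3, "ab": 2, "c": 1}, num_merges=2)
```

Frequent words end up as single units while rare words stay split into smaller pieces, which is how BPE reduces vocabulary sparsity. A charCNN instead builds each word (or subword) representation from its characters, so the two can in principle be stacked, as the abstract describes.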

PUBLICATION RECORD

Publication year: 2018
Venue: arXiv.org
Publication date: 2018-09-05
Fields of study: Linguistics, Computer Science
Identifiers: arXiv 1809.01301

REFERENCES

Synthetic and Natural Noise Both Break Neural Machine Translation
2017, cited by this paper
Stronger Baselines for Trustable Results in Neural Machine Translation
2017, influential reference
What do Neural Machine Translation Models Learn about Morphology?
2017, cited by this paper
Character-based Neural Machine Translation
2016, cited by this paper
Character-Aware Neural Language Models
2015, cited by this paper
Neural Machine Translation of Rare Words with Subword Units
2015, cited by this paper
Effective Approaches to Attention-based Neural Machine Translation
2015, cited by this paper
Character-based Neural Machine Translation
2015, cited by this paper
A New Algorithm For Data Compression
2013, cited by this paper
ADADELTA: An Adaptive Learning Rate Method
2012, influential reference
Linguistically Naïve != Language Independent: Why NLP Needs Linguistic Typology
2009, influential reference
A new algorithm for data compression
1994, cited by this paper

CITED BY

Analyzing Byte-Pair Encoding on Monophonic and Polyphonic Symbolic Music: A Focus on Musical Phrase Segmentation
2024, cites this paper
System Description of the NordicsAlps Submission to the AmericasNLP 2024 Machine Translation Shared Task
2024, cites this paper
Impact of Sequence Length and Copying on Clause-Level Inflection
2022, cites this paper
Evaluating Various Tokenizers for Arabic Text Classification
2021, cites this paper
Robust Open-Vocabulary Translation from Visual Text Representations
2021, cites this paper
Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP
2021, cites this paper
On the Importance of Tokenization in Arabic Embedding Models
2020, cites this paper
On the Linguistic Representational Power of Neural Machine Translation Models
2019, cites this paper
Character-Aware Decoder for Translation into Morphologically Rich Languages
2018, cites this paper