Copied Monolingual Data Improves Low-Resource Neural Machine Translation

Antonio Valerio Miceli Barone, Kenneth Heafield

Published 2017 in Conference on Machine Translation

ABSTRACT

We train a neural machine translation (NMT) system to both translate source-language text and copy target-language text, thereby exploiting monolingual corpora in the target language. Specifically, we create a bitext from the monolingual text in the target language so that each source sentence is identical to the target sentence. This copied data is then mixed with the parallel corpus and the NMT system is trained as usual, with no metadata to distinguish the two input languages. Our proposed method proves to be an effective way of incorporating monolingual data into low-resource NMT. We see gains of up to 1.2 BLEU over a strong baseline with back-translation. Further analysis shows that the linguistic phenomena behind these gains are different from and largely orthogonal to back-translation, with our copied corpus method improving accuracy on named entities and other words that should remain identical between the source and target languages.
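
The data preparation step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the function name build_training_corpus, the in-memory list representation, the fixed seed, and the shuffling step are all assumptions made for the example. The abstract does not state the mixing ratio or any preprocessing, so none is applied here.

```python
import random


def build_training_corpus(parallel_pairs, target_monolingual, seed=0):
    """Mix a parallel corpus with copied target-language data.

    parallel_pairs: list of (source_sentence, target_sentence) tuples.
    target_monolingual: list of target-language sentences.
    Returns one shuffled list of (source, target) training pairs.
    """
    # Copied bitext: each target-language sentence is used as its own source.
    copied_pairs = [(sentence, sentence) for sentence in target_monolingual]

    # No tag or marker is added, so the model receives no metadata that
    # distinguishes copied pairs from genuine translation pairs.
    mixed = list(parallel_pairs) + copied_pairs

    # Shuffling is an assumption of this sketch, not a detail from the abstract.
    random.Random(seed).shuffle(mixed)
    return mixed
```

The returned pairs would then be passed to an ordinary NMT training pipeline without further changes, which is the sense in which the system is trained as usual.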

PUBLICATION RECORD

  • Publication year

    2017

  • Venue

    Conference on Machine Translation

  • Fields of study

    Linguistics, Computer Science

  • Source metadata

    Semantic Scholar
