Bilingually Motivated Domain-Adapted Word Segmentation for Statistical Machine Translation

Yanjun Ma, Andy Way

Published 2009 in Conference of the European Chapter of the Association for Computational Linguistics

ABSTRACT

We introduce a word segmentation approach for languages in which word boundaries are not orthographically marked, with application to Phrase-Based Statistical Machine Translation (PB-SMT). Instead of using manually segmented monolingual domain-specific corpora to train segmenters, we make use of bilingual corpora and statistical word alignment techniques. First, our approach is adapted to the specific translation task at hand by taking the corresponding source (target) language into account. Second, the approach does not rely on manually segmented training data, so it can be adapted automatically to different domains. We evaluate the performance of our segmentation approach on PB-SMT tasks from two domains and demonstrate that it scores consistently among the best results across different data conditions.
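
The abstract does not spell out how alignments are turned into segmentation decisions. As a rough illustration only, and not the authors' algorithm, the sketch below assumes the source sentence starts out split into single characters and that an external word aligner has already produced alignment links; it then merges adjacent characters that align to exactly the same single target word. The function name `candidate_words` and the `links` input format are hypothetical.

```python
from collections import defaultdict


def candidate_words(chars, links):
    """Re-segment a character sequence using word-alignment links.

    chars: list of source-side characters (character-level segmentation).
    links: set of (source_index, target_index) pairs, assumed to come
           from an external word aligner (hypothetical input format).
    Returns the source sentence as a list of merged units.
    """
    # Map each source character to the target positions it aligns to.
    targets = defaultdict(set)
    for s, t in links:
        targets[s].add(t)

    if not chars:
        return []

    words = []
    current = chars[0]
    for i in range(1, len(chars)):
        # Merge neighbours only when both align to exactly the same
        # single target word; otherwise start a new unit.
        same = len(targets[i]) == 1 and targets[i] == targets[i - 1]
        if same:
            current += chars[i]
        else:
            words.append(current)
            current = chars[i]
    words.append(current)
    return words


# Toy example: the first two characters share target word 0.
example = candidate_words(list("北京很大"), {(0, 0), (1, 0), (2, 1), (3, 2)})
```

In this toy example the first two characters are linked to the same target position and end up as one unit, while the remaining characters stay separate. A practical system would also need some way to filter unreliable alignments before trusting such merges; that step is omitted here.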

PUBLICATION RECORD

  • Publication date

    2009-03-30

  • Fields of study

    Linguistics, Computer Science

  • Source metadata

    Semantic Scholar
