When Scripts Diverge: Strengthening Low-Resource Neural Machine Translation Through Phonetic Cross-Lingual Transfer
Ammon Shurtz, Chris Richardson, Stephen D. Richardson
Published 2025 in Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025)

ABSTRACT
Multilingual Neural Machine Translation (MNMT) models enhance translation quality for low-resource languages by exploiting cross-lingual similarities during training, a process known as knowledge transfer. This transfer is particularly effective between languages that share lexical or structural features, often enabled by a common orthography. However, languages with strong phonetic and lexical similarities but distinct writing systems experience limited benefits, as the absence of a shared orthography hinders knowledge transfer. To address this limitation, we propose an approach based on phonetic information that enhances token-level alignment across scripts by leveraging transliterations. We systematically evaluate several phonetic transcription techniques and strategies for incorporating phonetic information into NMT models. Our results show that using a shared encoder to process orthographic and phonetic inputs separately consistently yields the best performance for Khmer, Thai, and Lao in both directions with English, and that our custom Cognate-Aware Transliteration (CAT) method consistently improves translation quality over the baseline.
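The best-performing configuration reported above applies one shared encoder separately to the orthographic and phonetic views of each source sentence. The following is a minimal PyTorch sketch of that idea, not the authors' released implementation: the `phonetic_transcribe` helper is a hypothetical stand-in for a transcription step such as CAT, and fusing the two views by concatenation along the sequence axis is one plausible choice among the integration strategies the paper evaluates.

```python
# Minimal sketch (assumed details, not the paper's code): a single
# Transformer encoder whose weights are shared across the orthographic
# and phonetic views of the source sentence.

import torch
import torch.nn as nn

class SharedDualViewEncoder(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 512,
                 nhead: int = 8, num_layers: int = 6):
        super().__init__()
        # One embedding table and one encoder stack, shared by both views.
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, orth_ids: torch.Tensor,
                phon_ids: torch.Tensor) -> torch.Tensor:
        # Encode each view separately with the *same* weights ...
        orth_repr = self.encoder(self.embed(orth_ids))
        phon_repr = self.encoder(self.embed(phon_ids))
        # ... then concatenate along the sequence axis so a downstream
        # decoder can attend to both orthographic and phonetic tokens.
        # (Assumption: the paper compares several fusion strategies.)
        return torch.cat([orth_repr, phon_repr], dim=1)

def phonetic_transcribe(sentence: str) -> str:
    """Hypothetical placeholder: map source-script text to a phonetic or
    romanized form, e.g. via a G2P tool or the paper's CAT method."""
    raise NotImplementedError

# Usage: a batch of 2 sentences; the two views may differ in length.
enc = SharedDualViewEncoder(vocab_size=32000)
memory = enc(torch.randint(0, 32000, (2, 10)),
             torch.randint(0, 32000, (2, 12)))
print(memory.shape)  # torch.Size([2, 22, 512])
```

Sharing the encoder, rather than using separate encoders per view, lets phonetically similar tokens from different scripts be mapped into one representation space, which is the mechanism the abstract credits for improved cross-script transfer.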