ChakmaNMT: Machine Translation for a Low-Resource and Endangered Language via Transliteration

Aunabil Chakma, Aditya Chakma, Masum Hasan, Soham Khisa, Chumui Tripura, Rifat Shahriyar

Published 2024 in Unknown venue

ABSTRACT

We present the first systematic study of machine translation for Chakma, an endangered and extremely low-resource Indo-Aryan language, with the goal of supporting language access and preservation. We introduce a new Chakma-Bangla parallel and monolingual dataset, along with a trilingual Chakma-Bangla-English benchmark for evaluation. To address script mismatch and data scarcity, we propose a character-level transliteration framework that exploits the close orthographic and phonological relationship between Chakma and Bangla, preserving semantic content while enabling effective transfer from Bangla and multilingual pretrained models. We benchmark from-scratch MT, fine-tuned pretrained models, and large language models via in-context learning. Results show that transliteration is essential and that fine-tuning and in-context learning substantially outperform from-scratch baselines, with strong asymmetry across translation directions.
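The character-level transliteration idea described in the abstract can be sketched roughly as follows. This is a minimal illustration only: the four-entry mapping table and the names `CHAKMA_TO_BANGLA`, `transliterate`, and `back_transliterate` are assumptions made for this example, not the authors' actual mapping or code, and a real system would need to cover the full Chakma Unicode block as well as vowel signs and conjuncts.

```python
# Hypothetical, partial Chakma -> Bangla character table (NOT the paper's table).
# Code points are written as escapes; the comments give the intended characters
# and should be checked against the Unicode charts before any real use.
CHAKMA_TO_BANGLA = {
    "\U00011107": "\u0995",  # Chakma letter KAA  -> Bangla letter KA
    "\U00011108": "\u0996",  # Chakma letter KHAA -> Bangla letter KHA
    "\U00011109": "\u0997",  # Chakma letter GAA  -> Bangla letter GA
    "\U00011101": "\u0982",  # Chakma anusvara    -> Bangla anusvara
}

# Reverse table for mapping model output back into Chakma script.
# This inversion is only valid because the toy table above is one-to-one.
BANGLA_TO_CHAKMA = {bangla: chakma for chakma, bangla in CHAKMA_TO_BANGLA.items()}

_FORWARD = str.maketrans(CHAKMA_TO_BANGLA)
_REVERSE = str.maketrans(BANGLA_TO_CHAKMA)


def transliterate(text: str) -> str:
    """Replace each mapped Chakma character with its Bangla counterpart.

    Characters without an entry (spaces, digits, Latin text) pass through unchanged.
    """
    return text.translate(_FORWARD)


def back_transliterate(text: str) -> str:
    """Map Bangla-script text back to Chakma script using the reverse table."""
    return text.translate(_REVERSE)


if __name__ == "__main__":
    source = "\U00011107\U00011109 \U00011108"
    bangla_script = transliterate(source)
    # The round trip should recover the original Chakma-script string.
    print(back_transliterate(bangla_script) == source)
```

Under this assumption, Chakma text rewritten in Bangla script shares a character inventory with Bangla, which is what would let a Bangla or multilingual pretrained tokenizer and model process it; the reverse table is applied when the translation target is Chakma.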

PUBLICATION RECORD

Publication year: 2024
Venue: Unknown venue
Publication date: 2024-10-14
Fields of study: Linguistics, Computer Science
Identifiers: arXiv 2410.10219
Source metadata: Semantic Scholar
