The Web as a Parallel Corpus

Published 2003 in Computational Linguistics

ABSTRACT

Parallel corpora have become an essential resource for work in multilingual natural language processing. In this article, we report on our work using the STRAND system for mining parallel text on the World Wide Web, first reviewing the original algorithm and results and then presenting a set of significant enhancements. These enhancements include the use of supervised learning based on structural features of documents to improve classification performance, a new content-based measure of translational equivalence, and adaptation of the system to take advantage of the Internet Archive for mining parallel text from the Web on a large scale. Finally, the value of these techniques is demonstrated in the construction of a significant parallel corpus for a low-density language pair.

PUBLICATION RECORD

Publication year
2003
Venue
Computational Linguistics
Publication date
2003-09-01
Fields of study
Linguistics, Computer Science
Identifiers
DOI 10.1162/089120103322711578
Source metadata
Semantic Scholar

REFERENCES

Bootstrapping
2020cited by this paper
An Unsupervised Method for Word Sense Tagging using Parallel Corpora
2002cited by this paper
Inducing Information Extraction Systems for New Languages via Cross-language Projection
2002cited by this paper
A Hands-On Study of the Reliability and Coherence of Evaluation Metrics
2002cited by this paper
From Words to Corpora: Recognizing Translation
2002influential reference
Building a Shallow Arabic Morphological Analyser in One Day
2002cited by this paper
Discriminative Training and Maximum Entropy Models for Statistical Machine Translation
2002cited by this paper
Word-level Alignment for Multilingual Resource Acquisition
2002cited by this paper
Evaluating Translational Correspondence using Annotation Projection
2002cited by this paper
Inducing Multilingual Text Analysis Tools via Robust Projection across Aligned Corpora
2001cited by this paper
Spanish Language Processing at University of Maryland: Building Infrastructure for Multilingual Applications
2001cited by this paper
Inducing Multilingual POS Taggers and NP Bracketers via Robust Projection Across Aligned Corpora
2001cited by this paper
Improved Cross-Language Retrieval using Backoff Translation
2001cited by this paper
Filtering noisy parallel corpora of web pages
2001influential reference
A best-first alignment algorithm for automatic extraction of transfer mappings from bilingual corpora
2001cited by this paper
Detection of Translational Equivalence
2001cited by this paper
Article
2000cited by this paper
Models of translation equivalence among words
2000influential reference
Parallel Web text mining for cross-language IR
2000influential reference
The Bible as a Parallel Corpus: Annotating the ‘Book of 2000 Tongues’
1999cited by this paper
BITS: a method for bilingual text search over the Web
1999influential reference
Automatic Construction of Weighted String Similarity Measures
1999cited by this paper
Mining the Web for Bilingual Text
1999influential reference
An Information-Theoretic Definition of Similarity
1998cited by this paper
Parallel strands: a preliminary investigation into mining the Web for bilingual text
1998cited by this paper
Semi-Automatic Acquisition of Domain-Specific Translation Lexicons
1997cited by this paper
Syntactic Clustering of the Web
1997cited by this paper
Automatic Discovery of Non-Compositional Compounds in Parallel Data
1997cited by this paper
Cross-Language Text Retrieval Research in the USA
1997cited by this paper
A TREC Evaluation of Query Translation Methods For Multi-Lingual Text Retrieval
1995cited by this paper
A hierarchical Dirichlet language model
1995cited by this paper
Unsupervised Word Sense Disambiguation Rivaling Supervised Methods
1995cited by this paper
N-gram-based text categorization
1994cited by this paper
Statistical Identification of Language
1994cited by this paper
Introduction to the Special Issue on Computational Linguistics Using Large Corpora
1993cited by this paper
Network Flows: Theory, Algorithms, and Applications
1993influential reference
Identifying word correspondence in parallel texts
1991cited by this paper
Identifying Word Correspondences in Parallel Texts
1991cited by this paper
A Statistical Approach to Machine Translation
1990cited by this paper
Language Identifier: A Computer Program for Automatic Natural-Language Identification of On-line Tex
1988cited by this paper
The mathematical theory of communication
1950cited by this paper

CITED BY

Comparable Corpora: Opportunities for New Research Directions
2025cites this paper
From Unaligned to Aligned: Scaling Multilingual LLMs with Multi-Way Parallel Corpora
2025cites this paper
PRALEKHA: Cross-Lingual Document Alignment for Indic Languages
2024cites this paper
Smart Bilingual Focused Crawling of Parallel Documents
2024cites this paper
Emerging resources, enduring challenges: a comprehensive study of Kashmiri parallel corpus
2024cites this paper
A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism
2024cites this paper
Watching the Watchers: Exposing Gender Disparities in Machine Translation Quality Estimation
2024cites this paper
Findings of the WMT 2023 Shared Task on Parallel Data Curation
2023cites this paper
Bilex Rx: Lexical Data Augmentation for Massively Multilingual Machine Translation
2023cites this paper
Supporting Global Context under Evolving User Intents during Data Exploration
2023cites this paper
Parallel Corpus Creation for NMT using Web Scraping and Filtering
2023cites this paper
A General-Purpose Multilingual Document Encoder
2023cites this paper
GATITOS: Using a New Multilingual Lexicon for Low-resource Machine Translation
2023cites this paper
Loanword identification based on web resources: A case study on wikipedia
2023cites this paper
Design of New Word Retrieval Algorithm for Chinese-English Bilingual Parallel Corpus
2022cites this paper
Finetuning a Kalaallisut-English machine translation system using web-crawled data
2022cites this paper
Making the most of comparable corpora in Neural Machine Translation: a case study
2022cites this paper
Building Machine Translation Systems for the Next Thousand Languages
2022cites this paper
Simplification of English and Bengali Sentences for Improving Quality of Machine Translation
2022cites this paper
Exploiting bilingual lexicons to improve multilingual embedding-based document and sentence alignment for low-resource languages
2022cites this paper
Building Comparable Corpora for Assessing Multi-Word Term Alignment
2022cites this paper
Domain Adaptation of Machine Translation with Crowdworkers
2022cites this paper
NECAT-CLWE: A Simple But Efficient Parallel Data Generation Approach for Low Resource Neural Machine Translation
2022cites this paper
Can Synthetic Translations Improve Bitext Quality?
2022cites this paper
Mining Japanese-Vietnamese multi-level parallel text corpus from Wikipedia data resource
2021cites this paper
Augmenting Training Data for Low-Resource Neural Machine Translation via Bilingual Word Embeddings and BERT Language Modelling
2021cites this paper
RANLP 2021 Workshop Recent Advances in Natural Language Processing 14th Workshop on Building and Using Comparable Corpora
2021cites this paper
CDA: a Cost Efficient Content-based Multilingual Web Document Aligner
2021influential citation
Low-Resource Machine Translation Training Curriculum Fit for Low-Resource Languages
2021cites this paper
Towards a Unified Framework for Learning and Reasoning
2021cites this paper
Impact assessment indicators for the UK Web Archive
2021cites this paper
Tamizhi-Net OCR Creating A Quality Large Scale Tamil-Sinhala-English Parallel Corpus Using Deep Learning Based Printed Character Recognition (PCR)
2021cites this paper
Semantic-Aware Deep Neural Attention Network for Machine Translation Detection
2021cites this paper
“Wikily” Supervised Neural Translation Tailored to Cross-Lingual Tasks
2021cites this paper
Recent advances of low-resource neural machine translation
2021cites this paper
Majority Voting with Bidirectional Pre-translation For Bitext Retrieval
2021cites this paper
Exploiting Parallel Corpora to Improve Multilingual Embedding based Document and Sentence Alignment
2021cites this paper
Deep learning approach for Translating Arabic Holy Quran into Italian language
2021cites this paper
Don’t Rule Out Monolingual Speakers: A Method For Crowdsourcing Machine Translation Data
2021cites this paper
Adapting the Tesseract Open-Source OCR Engine for Tamil and Sinhala Legacy Fonts and Creating a Parallel Corpus for Tamil-Sinhala-English
2021cites this paper
Majority Voting with Bidirectional Pre-translation Improves Bitext Retrieval
2021cites this paper
"Wikily" Neural Machine Translation Tailored to Cross-Lingual Tasks
2021cites this paper
CREATING AN EFFICIENT PARALLEL CORPUS FOR BANGLA-ENGLISH STATISTICAL MACHINE TRANSLATION
2021cites this paper
A Systematic Literature Review on Extraction of Parallel Corpora from Comparable Corpora
2021cites this paper
Parallel sentence extraction to improve cross-language information retrieval from Wikipedia
2021cites this paper
Kelantan and Sarawak Malay Dialects: Parallel Dialect Text Collection and Alignment Using Hybrid Distance-Statistical-Based Phrase Alignment Algorithm
2021cites this paper
Improved Cross-Lingual Document Similarity Measurement
2020cites this paper
Sinhala and English Document Alignment using Statistical Machine Translation
2020cites this paper
Self-Supervised Learning for Pairwise Data Refinement
2020cites this paper
Near Synonymy Analysis of the Descriptive Adjective Pale in English and Bled, -a, -o in Serbian
2020cites this paper
Refined Urdu Lexicon Development K-Means Clustering Based Computational Model Using Colloquial Romanized Urdu Dataset
2020cites this paper
Unsupervised Bitext Mining and Translation via Self-Trained Contextual Embeddings
2020cites this paper
Leveraging Multilingual News Websites for Building a Kurdish Parallel Corpus
2020cites this paper
On Learning Language-Invariant Representations for Universal Machine Translation
2020cites this paper
Preparation of Sentiment tagged Parallel Corpus and Testing its effect on Machine Translation
2020cites this paper
Genre Analysis and Machine Translation: a comparison between Italian and Chinese trade fair promotional brochures
2020influential citation
Pretrained Transformers for Text Ranking: BERT and Beyond
2020cites this paper
Mono- and cross-lingual paraphrased text reuse and extrinsic plagiarism detection
2020cites this paper
Fostering intuitive competence in L2 for a better performance in EAP writing through fraze.it in a Turkish context
2020cites this paper
When and Why is Unsupervised Neural Machine Translation Useless?
2020cites this paper
Exploiting Sentence Order in Document Alignment
2020cites this paper
The Roles of Language Models and Hierarchical Models in Neural Sequence-to-Sequence Prediction
2020cites this paper
Parallel Corpora
2020cites this paper
Handle with Care: A Case Study in Comparable Corpora Exploitation for Neural Machine Translation
2020cites this paper
Two Huge Title and Keyword Generation Corpora of Research Articles
2020cites this paper
ParaCrawl: Web-Scale Acquisition of Parallel Corpora
2020cites this paper
The suffix -ee: history, productivity, frequency and violation of st
2020cites this paper
Massively Multilingual Document Alignment with Cross-lingual Sentence-Mover’s Distance
2020cites this paper
Document Alignment for Generation of English-Punjabi Comparable Corpora from Wikipedia
2020cites this paper
Construction of Amharic-arabic Parallel Text Corpus for Neural Machine Translation
2020cites this paper
Tailoring and evaluating the Wikipedia for in-domain comparable corpora extraction
2020influential citation
Mining Scientific and Technical Literature
2020cites this paper
Parallel Sentence Mining by Constrained Decoding
2020cites this paper
On the Pronunciation Dictionaries of Contemporary German: Principles of Selection and Lemmatization of Lexical Material
2020cites this paper
Web Corpora
2020cites this paper
Automated Building of Classic Chinese-English Dictionary and Chinese-Hungarian Dictionary
2019cites this paper
Online Parallel Data Extraction with Neural Machine Translation
2019influential citation
Hierarchical Document Encoder for Parallel Corpus Mining
2019cites this paper
International Journal of Recent Technology and Engineering (IJRTE)
2019cites this paper
Learning Bilingual Sentence Embeddings via Autoencoding and Computing Similarities with a Multilayer Perceptron
2019cites this paper
Language-Independent Methods for Identifying Cross-Lingual Similarity in Wikipedia
2019influential citation
RANLP 2019 12th Workshop on Building and Using Comparable Corpora
2019cites this paper
Self-Induced Curriculum Learning in Neural Machine Translation
2019cites this paper
Sentence and Word Weighting for Neural Machine Translation Domain Adaptation
2019cites this paper
A Hybrid of Sentence-Level Approach and Fragment-Level Approach of Parallel Text Extraction from Comparable Text
2019cites this paper
Amharic-Arabic Neural Machine Translation
2019cites this paper
PC-Corpus: A Persian-Chinese Parallel Corpora
2019cites this paper
Working with parallel corpora
2019cites this paper
An Automatic and a Machine-assisted Method to Clean Bilingual Corpus
2019cites this paper
CCMatrix: Mining Billions of High-Quality Parallel Sentences on the Web
2019cites this paper
A Massive Collection of Cross-Lingual Web-Document Pairs
2019cites this paper
Crawling Chinese-Myanmar Parallel Corpus: Automatic Collection, Screening and Cleaning Corpus
2019cites this paper
Neural Machine Translation: A Review
2019influential citation
Fully Unsupervised Crosslingual Semantic Textual Similarity Metric Based on BERT for Identifying Parallel Data
2019cites this paper
Improving Multilingual Sentence Embedding using Bi-directional Dual Encoder with Additive Margin Softmax
2019cites this paper
The TransBank Aligner: Cross-Sentence Alignment with Deep Neural Networks
2019cites this paper
Knowledge-Enhanced Bilingual Textual Representations for Cross-Lingual Semantic Textual Similarity
2019cites this paper
Efficient document alignment across scenarios
2019cites this paper