On the Limitations of Unsupervised Bilingual Dictionary Induction

Anders Søgaard, Sebastian Ruder, Ivan Vulić

Published 2018 in Annual Meeting of the Association for Computational Linguistics

ABSTRACT

Unsupervised machine translation - i.e., not assuming any cross-lingual supervision signal, whether a dictionary, translations, or comparable corpora - seems impossible, but nevertheless, Lample et al. (2017) recently proposed a fully unsupervised machine translation (MT) model. The model relies heavily on an adversarial, unsupervised cross-lingual word embedding technique for bilingual dictionary induction (Conneau et al., 2017), which we examine here. Our results identify the limitations of current unsupervised MT: unsupervised bilingual dictionary induction performs much worse on morphologically rich languages that are not dependent marking, when monolingual corpora from different domains or different embedding algorithms are used. We show that a simple trick, exploiting a weak supervision signal from identical words, enables more robust induction and establish a near-perfect correlation between unsupervised bilingual dictionary induction performance and a previously unexplored graph similarity metric.
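
The abstract describes the "simple trick" only as exploiting a weak supervision signal from identical words. One plausible reading, sketched below, is to collect strings that occur in both monolingual vocabularies and treat them as a seed dictionary for a downstream alignment step. The function name, the toy vocabularies, and the `min_len` filter are hypothetical illustrations, not the authors' implementation.

```python
def identical_word_seed(src_vocab, tgt_vocab, min_len=1):
    """Return (src_index, tgt_index) pairs for strings present in both vocabularies.

    Assumption: identically spelled words (names, numerals, loanwords) are
    treated as translation pairs. `min_len` is a hypothetical filter for
    dropping very short strings; it is not taken from the paper.
    """
    # Map each target word to its position for constant-time lookup.
    tgt_index = {word: j for j, word in enumerate(tgt_vocab)}
    pairs = []
    for i, word in enumerate(src_vocab):
        j = tgt_index.get(word)
        if j is not None and len(word) >= min_len:
            pairs.append((i, j))
    return pairs


# Toy vocabularies, invented for illustration only.
src_vocab = ["the", "berlin", "taxi", "house", "2018"]
tgt_vocab = ["der", "taxi", "haus", "berlin", "2018"]

seed = identical_word_seed(src_vocab, tgt_vocab)
print(seed)  # index pairs for "berlin", "taxi", and "2018"
```

The resulting index pairs could serve as the initial dictionary for any seed-based mapping method; how the paper actually uses them is not stated in the abstract.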

PUBLICATION RECORD

Publication year
2018
Venue
Annual Meeting of the Association for Computational Linguistics
Publication date
2018-05-09
Fields of study
Mathematics, Linguistics, Computer Science
Identifiers
DOI 10.18653/v1/P18-1072 arXiv 1805.03620
Source metadata
Semantic Scholar

REFERENCES

Generative Adversarial Nets
2018 (cited by this paper)
Phrase-Based & Neural Unsupervised Machine Translation
2018 (cited by this paper)
Learning bilingual word embeddings with (almost) no bilingual data
2017 (influential reference)
Word Translation Without Parallel Data
2017 (influential reference)
Unsupervised Neural Machine Translation
2017 (cited by this paper)
Unsupervised Machine Translation Using Monolingual Corpora Only
2017 (cited by this paper)
Adversarial Training for Unsupervised Bilingual Lexicon Induction
2017 (influential reference)
Offline bilingual word vectors, orthogonal transformations and the inverted softmax
2017 (cited by this paper)
A Survey of Cross-lingual Word Embedding Models
2017 (cited by this paper)
SemEval-2017 Task 2: Multilingual and Cross-lingual Semantic Word Similarity
2017 (influential reference)
Enriching Word Vectors with Subword Information
2016 (influential reference)
Cross-lingual Models of Word Embeddings: An Empirical Comparison
2016 (cited by this paper)
SimVerb-3500: A Large-Scale Evaluation Set of Verb Similarity
2016 (cited by this paper)
A Strong Baseline for Learning Cross-Lingual Word Embeddings from Sentence Alignments
2016 (influential reference)
On the Role of Seed Lexicons in Learning Bilingual Word Embeddings
2016 (cited by this paper)
Finnish web corpus fiWaC 1.0
2016 (cited by this paper)
Towards cross-lingual distributed representations without parallel text trained with adversarial autoencoders
2016 (cited by this paper)
Domain-Adversarial Training of Neural Networks
2015 (cited by this paper)
Normalized Word Embedding and Orthogonal Transform for Bilingual Word Translation
2015 (cited by this paper)
Symmetric Pattern Based Word Embeddings for Improved Word Similarity Prediction
2015 (cited by this paper)
Improving zero-shot learning by mitigating the hubness problem
2014 (cited by this paper)
A Study on Bootstrapping Bilingual Vector Spaces from Non-Parallel Data (and Nothing Else)
2013 (cited by this paper)
Exploiting Similarities among Languages for Machine Translation
2013 (influential reference)
Polyglot: Distributed Word Representations for Multilingual NLP
2013 (influential reference)
Distributed Representations of Words and Phrases and their Compositionality
2013 (influential reference)
Inducing Crosslingual Distributed Representations of Words
2012 (cited by this paper)
A New Spectral Technique Using Normalized Adjacency Matrices for Graph Matching
2011 (cited by this paper)
Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
2010 (cited by this paper)
News from OPUS - A collection of multilingual parallel corpora with tools and interfaces
2009 (cited by this paper)
Learning Bilingual Lexicons from Monolingual Corpora
2008 (cited by this paper)
Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification
2007 (cited by this paper)
Europarl: A Parallel Corpus for Statistical Machine Translation
2005 (cited by this paper)
An Improved Algorithm for Matching Large Graphs
2001 (cited by this paper)
One cannot hear the shape of a drum
1992 (cited by this paper)
The Role of Priors in Active Bayesian Learning in the Sequential Statistical Decision Framework
1991 (cited by this paper)

CITED BY

Word-level Cross-lingual Structure in Large Language Models
2025 (cites this paper)
Bilingual language processing relies on shared semantic representations that are modulated by each language
2025 (cites this paper)
Few-Shot Learning Translation from New Languages
2025 (cites this paper)
A Bibliometric Analysis of Embedding Techniques for Addressing Meaning Conflation Deficiency in Low-Resourced Languages
2025 (cites this paper)
Cross-Domain Bilingual Lexicon Induction via Pretrained Language Models
2025 (cites this paper)
MACE: Morphology Aware Cross-Lingual Embedding Using Contrastive Learning
2025 (cites this paper)
How Good is BLI as an Alignment Measure: A Study in Word Embedding Paradigm
2025 (cites this paper)
Structure-Aware Dual Adversarial Autoencoder for Unsupervised Bilingual Lexicon Induction
2025 (influential citation)
Lost in Alignment: A Survey on Cross-Lingual Alignment Methods for Contextualized Representation
2025 (influential citation)
SeNSe: embedding alignment via semantic anchors selection
2024 (cites this paper)
Low Resource Arabic Dialects Transformer Neural Machine Translation Improvement through Incremental Transfer of Shared Linguistic Features
2024 (cites this paper)
When Elote, Choclo and Mazorca are not the Same. Isomorphism-Based Perspective to the Spanish Varieties Divergences
2024 (cites this paper)
Concept Space Alignment in Multilingual LLMs
2024 (cites this paper)
The Shape of Word Embeddings: Quantifying Non-Isometry with Topological Data Analysis
2024 (cites this paper)
Unsupervised Bilingual Lexicon Induction for Low Resource Languages
2024 (cites this paper)
A survey on multilingual large language models: corpora, alignment, and bias
2024 (cites this paper)
Cross-lingual Contextualized Phrase Retrieval
2024 (cites this paper)
A survey of neural-network-based methods utilising comparable data for finding translation equivalents
2024 (influential citation)
Enhancing isomorphism between word embedding spaces for distant languages bilingual lexicon induction
2024 (cites this paper)
Representational Isomorphism and Alignment of Multilingual Large Language Models
2024 (cites this paper)
How Lexical is Bilingual Lexicon Induction?
2024 (cites this paper)
Alignment of Multilingual Embeddings to Estimate Job Similarities in Online Labour Market
2024 (cites this paper)
Understanding Cross-Lingual Alignment - A Survey
2024 (cites this paper)
Modular Sentence Encoders: Separating Language Specialization from Cross-Lingual Alignment
2024 (cites this paper)
Decipherment-Aware Multilingual Learning in Jointly Trained Language Models
2024 (cites this paper)
DM-BLI: Dynamic Multiple Subspaces Alignment for Unsupervised Bilingual Lexicon Induction
2024 (cites this paper)
Cross-Lingual Word Embedding Generation Based on Procrustes-Hungarian Linear Projection
2024 (cites this paper)
Clustering of Monolingual Embedding Spaces
2023 (influential citation)
Do Vision and Language Models Share Concepts? A Vector Space Alignment Study
2023 (influential citation)
Unsupervised Neural Machine Translation between the Portuguese language and the Chinese and Korean languages
2023 (cites this paper)
Bilingual word embedding fusion for robust unsupervised bilingual lexicon induction
2023 (cites this paper)
Grounding the Vector Space of an Octopus: Word Meaning from Raw Text
2023 (cites this paper)
GARI: Graph Attention for Relative Isomorphism of Arabic Word Embeddings
2023 (cites this paper)
Cultural Adaptation of Recipes
2023 (cites this paper)
Code-switching as a cross-lingual Training Signal: an Example with Unsupervised Bilingual Embedding
2023 (cites this paper)
ProMap: Effective Bilingual Lexicon Induction via Language Model Prompting
2023 (cites this paper)
English-Manipuri Cross-Lingual Embedding: A Preliminary Study
2023 (cites this paper)
FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models
2023 (cites this paper)
Unsupervised Cross-lingual Word Embedding Representation for English-isiZulu
2023 (cites this paper)
Implications of the Convergence of Language and Vision Model Geometries
2023 (influential citation)
MUSEDA: Multilingual Unsupervised and Supervised Embedding for Domain Adaption
2023 (cites this paper)
Accessing Higher Dimensions for Unsupervised Word Translation
2023 (influential citation)
Low-resource Bilingual Dialect Lexicon Induction with Large Language Models
2023 (cites this paper)
Dual Word Embedding for Robust Unsupervised Bilingual Lexicon Induction
2023 (cites this paper)
Beyond Shared Vocabulary: Increasing Representational Word Similarities across Languages for Multilingual Machine Translation
2023 (cites this paper)
Neural Machine Translation for the Indigenous Languages of the Americas: An Introduction
2023 (cites this paper)
Domain Adaptation: Challenges, Methods, Datasets, and Applications
2023 (cites this paper)
Transfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese
2023 (cites this paper)
Multilingual Word Embeddings for Low-Resource Languages using Anchors and a Chain of Related Languages
2023 (cites this paper)
A Structure-Aware Generative Adversarial Network for Bilingual Lexicon Induction
2023 (influential citation)
A Novel Unsupervised Approach for Cross-Lingual Word Alignment in Low Isomorphic Embedding Spaces
2023 (cites this paper)
Learning bilingual word embedding for automatic text summarization in low resource language
2023 (cites this paper)
GRI: Graph-based Relative Isomorphism of Word Embedding Spaces
2023 (cites this paper)
Did AI get more negative recently?
2022 (influential citation)
The (Undesired) Attenuation of Human Biases by Multilinguality
2022 (cites this paper)
Improving Low-Resource Languages in Pre-Trained Multilingual Language Models
2022 (cites this paper)
Systematic Investigation of Strategies Tailored for Low-Resource Settings for Low-Resource Dependency Parsing
2022 (cites this paper)
Bilingual Lexicon Induction for Low-Resource Languages using Graph Matching via Optimal Transport
2022 (cites this paper)
Multi-Stage Framework with Refinement Based Point Set Registration for Unsupervised Bi-Lingual Word Alignment
2022 (influential citation)
Don’t Forget Cheap Training Signals Before Building Unsupervised Bilingual Word Embeddings
2022 (influential citation)
IsoVec: Controlling the Relative Isomorphism of Word Embedding Spaces
2022 (influential citation)
The Impact of Cross-Lingual Adjustment of Contextual Word Representations on Zero-Shot Transfer
2022 (cites this paper)
Challenges and Strategies in Cross-Cultural NLP
2022 (cites this paper)
A cross-lingual sentence pair interaction feature capture model based on pseudo-corpus and multilingual embedding
2022 (cites this paper)
Cross-lingual Feature Extraction from Monolingual Corpora for Low-resource Unsupervised Bilingual Lexicon Induction
2022 (cites this paper)
Manipuri–English comparable corpus for cross-lingual studies
2022 (cites this paper)
Delving Deeper into Cross-lingual Visual Question Answering
2022 (cites this paper)
Detecting Stance in Scientific Papers: Did we get more Negative Recently?
2022 (cites this paper)
Beyond English: Considering Language and Culture in Psychological Text Analysis
2022 (cites this paper)
Constrained Density Matching and Modeling for Cross-lingual Alignment of Contextualized Representations
2022 (cites this paper)
Unsupervised Alignment of Distributional Word Embeddings
2022 (cites this paper)
Improving Word Translation via Two-Stage Contrastive Learning
2022 (cites this paper)
Robust Unsupervised Cross-Lingual Word Embedding using Domain Flow Interpolation
2022 (influential citation)
Isomorphic Cross-lingual Embeddings for Low-Resource Languages
2022 (influential citation)
Sub-Word Alignment is Still Useful: A Vest-Pocket Method for Enhancing Low-Resource Machine Translation
2022 (cites this paper)
Utilizing Language-Image Pretraining for Efficient and Robust Bilingual Word Alignment
2022 (cites this paper)
Domain Mismatch Doesn’t Always Prevent Cross-lingual Transfer Learning
2022 (influential citation)
Improving Translation of Out Of Vocabulary Words using Bilingual Lexicon Induction in Low-Resource Machine Translation
2022 (cites this paper)
RAPO: An Adaptive Ranking Paradigm for Bilingual Lexicon Induction
2022 (cites this paper)
AutoMap: Automatic Medical Code Mapping for Clinical Prediction Model Deployment
2022 (cites this paper)
An Analysis of Euclidean vs. Graph-Based Framing for Bilingual Lexicon Induction from Word Embedding Spaces
2021 (influential citation)
English–Welsh Cross-Lingual Embeddings
2021 (cites this paper)
NUIG at TIAD 2021: Cross-lingual word embeddings for translation inference
2021 (cites this paper)
Do Language Models Know the Way to Rome?
2021 (cites this paper)
Leveraging Vector Space Similarity for Learning Cross-Lingual Word Embeddings: A Systematic Review
2021 (cites this paper)
Multi-source Learning via Completion of Block-wise Overlapping Noisy Matrices
2021 (cites this paper)
Semantics in High-Dimensional Space
2021 (cites this paper)
Disentangled Code Representation Learning for Multiple Programming Languages
2021 (cites this paper)
Subword Mapping and Anchoring across Languages
2021 (influential citation)
Examining Cross-lingual Contextual Embeddings with Orthogonal Structural Probes
2021 (cites this paper)
LexFit: Lexical Fine-Tuning of Pretrained Language Models
2021 (cites this paper)
Fully unsupervised word translation from cross-lingual word embeddings especially for healthcare professionals
2021 (cites this paper)
Neural Machine Translation for Low-resource Languages: A Survey
2021 (cites this paper)
Filtered Inner Product Projection for Crosslingual Embedding Alignment
2021 (cites this paper)
Learning a Reversible Embedding Mapping using Bi-Directional Manifold Alignment
2021 (cites this paper)
Itihasa: A large-scale corpus for Sanskrit to English translation
2021 (cites this paper)
Combining Static Word Embeddings and Contextual Representations for Bilingual Lexicon Induction
2021 (cites this paper)
Cross-lingual alignments of ELMo contextual embeddings
2021 (cites this paper)
Word Embedding Transformation for Robust Unsupervised Bilingual Lexicon Induction
2021 (cites this paper)
Cross-Lingual BERT Contextual Embedding Space Mapping with Isotropic and Isometric Conditions
2021 (cites this paper)