Unwritten languages demand attention too! Word discovery with encoder-decoder models

Marcely Zanon Boito, Alexandre Berard, Aline Villavicencio, L. Besacier

Published 2017 in Automatic Speech Recognition & Understanding

ABSTRACT

Word discovery is the task of extracting words from unsegmented text. In this paper we examine to what extent neural networks can be applied to this task in a realistic unwritten language scenario, where only small corpora and limited annotations are available. We investigate two scenarios: one with no supervision, and another with limited supervision in which the most frequent words are known. The results show that it is possible to retrieve at least 27% of the gold standard vocabulary by training an encoder-decoder neural machine translation system with only 5,157 sentences. This result is close to those obtained with a task-specific Bayesian nonparametric model. Moreover, our approach has the advantage of generating translation alignments, which could be used to create a bilingual lexicon. As a future perspective, this approach is also well suited to working directly from speech.
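
As an illustration of how translation alignments could yield a segmentation, the sketch below assigns each source symbol to the target word with the highest attention weight and places a boundary wherever two neighbouring symbols point to different target words. It then scores the discovered word types against a gold vocabulary. The abstract does not spell out this procedure, so the matrix layout, the function names `segment_by_alignment` and `type_recall`, and the toy data are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Sequence, Set


def segment_by_alignment(
    symbols: Sequence[str],
    attention: Sequence[Sequence[float]],
) -> List[str]:
    """Group neighbouring symbols that align to the same target word.

    Assumed layout: attention[i][j] is the weight between source
    symbol i and target word j (hypothetical, for illustration).
    """
    if len(symbols) != len(attention):
        raise ValueError("one attention row is required per source symbol")

    units: List[str] = []
    current = ""
    previous_target = None
    for symbol, weights in zip(symbols, attention):
        # Hard-assign the symbol to its most attended target word.
        target = max(range(len(weights)), key=lambda j: weights[j])
        # A change of aligned target word marks a word boundary.
        if previous_target is not None and target != previous_target:
            units.append(current)
            current = ""
        current += symbol
        previous_target = target
    if current:
        units.append(current)
    return units


def type_recall(discovered: Set[str], gold: Set[str]) -> float:
    """Fraction of gold vocabulary types that were discovered."""
    if not gold:
        return 0.0
    return len(discovered & gold) / len(gold)


# Toy example: six source symbols aligned to two target words.
symbols = list("thecat")
attention = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.7, 0.3],
    [0.2, 0.8],
    [0.1, 0.9],
    [0.4, 0.6],
]
units = segment_by_alignment(symbols, attention)
recall = type_recall(set(units), {"the", "cat", "sat", "mat"})
```

In this toy case the first three symbols align to one target word and the last three to another, so a single boundary is placed between them. A figure such as the 27% reported above would correspond to a vocabulary-level score of this kind, computed over the full corpus rather than one sentence.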

PUBLICATION RECORD

Publication year: 2017
Venue: Automatic Speech Recognition & Understanding
Publication date: 2017-09-17
Fields of study: Linguistics, Computer Science
Identifiers: DOI 10.1109/ASRU.2017.8268972; arXiv 1709.05631
Source metadata: Semantic Scholar

REFERENCES

A case study on using speech-to-translation alignments for language documentation
2017, cited by this paper
Sequence-to-Sequence Models Can Directly Transcribe Foreign Speech
2017, cited by this paper
Six Challenges for Neural Machine Translation
2017, cited by this paper
Sequence-to-Sequence Models Can Directly Translate Foreign Speech
2017, cited by this paper
Breaking the Unwritten Language Barrier: The BULB Project
2016, cited by this paper
Weakly supervised spoken term discovery using cross-lingual side information
2016, cited by this paper
The Zero Resource Speech Challenge 2015: Proposed Approaches and Results
2016, cited by this paper
Learning a Lexicon and Translation Model from Phoneme Lattices
2016, cited by this paper
Toward human-assisted lexical unit discovery without text resources
2016, cited by this paper
Fully Character-Level Neural Machine Translation without Explicit Segmentation
2016, cited by this paper
An Attentional Model for Speech Translation Without Transcription
2016, influential reference
Parallel Speech Collection for Under-resourced Language Studies Using the Lig-Aikuma Mobile Device App
2016, cited by this paper
Morphological Segmentation with Window LSTM Neural Networks
2016, cited by this paper
Preliminary Experiments on Unsupervised Word Discovery in Mboshi
2016, influential reference
Phoneme Boundary Detection using Deep Bidirectional LSTMs
2016, cited by this paper
Listen and Translate: A Proof of Concept for End-to-End Speech-to-Text Translation
2016, influential reference
Unsupervised Lexicon Discovery from Acoustic Input
2015, cited by this paper
Inducing Bilingual Lexicons from Small Quantities of Sentence-Aligned Phonemic Transcriptions
2015, cited by this paper
Neural Machine Translation by Jointly Learning to Align and Translate
2014, influential reference
A Joint Learning Model of Word Segmentation, Lexical Acquisition, and Phonetic Variability
2013, cited by this paper
A summary of the 2012 JHU CLSP workshop on zero resource speech technologies and models of early language acquisition
2013, cited by this paper
The Cambridge handbook of endangered languages
2011, cited by this paper
Recession Segmentation: Simpler Online Word Segmentation Using Limited Resources
2010, cited by this paper
A Bayesian framework for word segmentation: exploring the effects of context
2009, influential reference
Improving nonparameteric Bayesian inference: experiments on unsupervised word segmentation with adaptor grammars
2009, cited by this paper
Nonparametric Bayesian models of lexical acquisition
2007, cited by this paper
Moses: Open Source Toolkit for Statistical Machine Translation
2007, cited by this paper
Towards Speech Translation of Non Written Languages
2006, cited by this paper
Contextual Dependencies in Unsupervised Word Segmentation
2006, cited by this paper
Applications and Explanations of Zipf’s Law
1998, cited by this paper
Proceedings of the Joint Conferences on New Methods in Language Processing and Computational Natural Language Learning
1998, cited by this paper

CITED BY

Learning Through Transcription
2022, cites this paper
Unsupervised Word Segmentation from Discrete Speech Units in Low-Resource Settings
2021, cites this paper
Sparse Transcription
2021, cites this paper
AlloST: Low-resource Speech Translation without Source Transcription
2021, cites this paper
Local Word Discovery for Interactive Transcription
2021, cites this paper
On the Difficulty of Segmenting Words with Attention
2021, cites this paper
AlloVera: A Multilingual Allophone Database
2020, cites this paper
Investigating Language Impact in Bilingual Approaches for Computational Language Documentation
2020, cites this paper
ORTHROS: Non-autoregressive End-to-end Speech Translation with Dual-decoder
2020, cites this paper
Investigating alignment interpretability for low-resource NMT
2020, cites this paper
Unsupervised word discovery for computational language documentation
2019, cites this paper
Multilingual End-to-End Speech Translation
2019, cites this paper
Controlling Utterance Length in NMT-based Word Segmentation with Attention
2019, influential citation
Weakly Supervised Attention Networks for Entity Recognition
2019, cites this paper
Tied Multitask Learning for Neural Speech Translation
2018, influential citation
Collecter, Transcrire, Analyser : quand la machine assiste le linguiste dans son travail de terrain. (Collecting, Transcribing, Analyzing: Machine-Assisted Linguistic Fieldwork)
2018, cites this paper
A small Griko-Italian speech translation corpus
2018, cites this paper
Unsupervised Word Segmentation from Speech with Attention
2018, influential citation
Linguistic Unit Discovery from Multi-Modal Inputs in Unwritten Languages: Summary of the “Speaking Rosetta” JSALT 2017 Workshop
2018, cites this paper
A Very Low Resource Language Speech Corpus for Computational Language Documentation Experiments
2017, cites this paper