Word discovery is the task of extracting words from un-segmented text. In this paper we examine to what extent neural networks can be applied to this task in a realistic unwritten language scenario, where only small corpora and limited annotations are available. We investigate two scenarios: one with no supervision and another with limited supervision with access to the most frequent words. Obtained results show that it is possible to retrieve at least 27% of the gold standard vocabulary by training an encoder-decoder neural machine translation system with only 5,157 sentences. This result is close to those obtained with a task-specific Bayesian nonparametric model. Moreover, our approach has the advantage of generating translation alignments, which could be used to create a bilingual lexicon. As a future perspective, this approach is also well suited to work directly from speech.
Unwritten languages demand attention too! Word discovery with encoder-decoder models
Marcely Zanon Boito,Alexandre Berard,Aline Villavicencio,L. Besacier
Published 2017 in Automatic Speech Recognition & Understanding
ABSTRACT
PUBLICATION RECORD
- Publication year
2017
- Venue
Automatic Speech Recognition & Understanding
- Publication date
2017-09-17
- Fields of study
Linguistics, Computer Science
- Identifiers
- External record
- Source metadata
Semantic Scholar
CITATION MAP
EXTRACTION MAP
CLAIMS
- No claims are published for this paper.
CONCEPTS
- No concepts are published for this paper.
REFERENCES
Showing 1-31 of 31 references · Page 1 of 1
CITED BY
Showing 1-21 of 21 citing papers · Page 1 of 1