External Lexical Information for Multilingual Part-of-Speech Tagging

Published 2016 in arXiv.org

ABSTRACT

Morphosyntactic lexicons and word vector representations have both proven useful for improving the accuracy of statistical part-of-speech taggers. Here we compare the performance of four systems on datasets covering 16 languages: two feature-based systems (MEMMs and CRFs) and two neural systems (bi-LSTMs). We show that, on average, all four approaches perform similarly and reach state-of-the-art results. However, our feature-based models perform better on lexically richer datasets (e.g. for morphologically rich languages), whereas the neural models score higher on datasets with less lexical variability (e.g. for English). These conclusions hold in particular for the MEMM models relying on our system MElt, which benefited from newly designed features. This shows that, under certain conditions, feature-based approaches enriched with morphosyntactic lexicons are competitive with neural methods.
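
As a rough illustration of what "enriched with morphosyntactic lexicons" can mean for a feature-based tagger, the sketch below turns lexicon lookups into binary features for one token and its neighbours. It is a minimal sketch under stated assumptions, not the MElt feature set: the toy `LEXICON`, the function name `lexicon_features`, and the feature naming scheme are all hypothetical and chosen only for this example.

```python
from typing import Dict, List, Set

# Hypothetical toy lexicon (an assumption for illustration):
# lowercased word form -> set of tags the lexicon allows for it.
LEXICON: Dict[str, Set[str]] = {
    "the": {"DET"},
    "dog": {"NOUN"},
    "barks": {"NOUN", "VERB"},
}


def lexicon_features(
    tokens: List[str], i: int, lexicon: Dict[str, Set[str]]
) -> Dict[str, bool]:
    """Build binary lexicon features for tokens[i] and its direct neighbours."""
    feats: Dict[str, bool] = {}
    for offset, name in ((-1, "prev"), (0, "cur"), (1, "next")):
        j = i + offset
        if not 0 <= j < len(tokens):
            continue  # no neighbour at a sentence boundary
        tags = lexicon.get(tokens[j].lower(), set())
        if not tags:
            # The word is absent from the lexicon.
            feats[f"{name}_lex=UNKNOWN"] = True
            continue
        # One feature per tag the lexicon proposes for this word.
        for tag in sorted(tags):
            feats[f"{name}_lex={tag}"] = True
        if offset == 0:
            # For the current word, also record the full ambiguity class.
            feats["cur_lex_all=" + "|".join(sorted(tags))] = True
    return feats


features = lexicon_features(["The", "dog", "barks"], 2, LEXICON)
```

A feature dictionary of this kind could be passed, together with the usual word-form and context features, to a MEMM or CRF trainer. The ambiguity-class feature is one plausible way to let the model learn which lexicon readings are likely in context.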

PUBLICATION RECORD

Publication year: 2016
Venue: arXiv.org
Publication date: 2016-06-12
Fields of study: Linguistics, Computer Science
Identifiers: arXiv 1606.03676


CITED BY

Corpus and Models for Lemmatisation and POS-tagging of Classical French Theatre (2020)
OFrLex: A Computational Morphological and Syntactic Lexicon for Old French (2020)