Smoothing a Lexicon-based POS Tagger for Arabic and Hebrew

Published 2007 in SEMITIC@ACL

ABSTRACT

We propose an enhanced Part-of-Speech (POS) tagger for Semitic languages that treats Modern Standard Arabic (henceforth Arabic) and Modern Hebrew (henceforth Hebrew) using the same probabilistic model and architectural setting. We start by porting an existing Hidden Markov Model POS tagger for Hebrew to Arabic, replacing the Hebrew morphological analyzer with Buckwalter's (2002) morphological analyzer for Arabic. This gives state-of-the-art accuracy (96.12%), comparable to Habash and Rambow's (2005) analyzer-based POS tagger on the same Arabic datasets. However, further improvement of such analyzer-based tagging methods is hindered by the incomplete coverage of standard morphological analyzers (Bar Haim et al., 2005). To overcome this coverage problem, we supplement the output of Buckwalter's analyzer with synthetically constructed analyses proposed by a model that uses character information (Diab et al., 2004), in a way similar to Nakagawa's (2004) system for Chinese and Japanese. A version of this extended model that, unlike Nakagawa's, also incorporates synthetically constructed analyses for known words achieves 96.28% accuracy on the standard Arabic test set.
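The general idea of analyzer-constrained HMM tagging described in the abstract can be illustrated with a toy sketch. This is not the authors' model: the bigram HMM, the probability tables, the placeholder tokens, the `viterbi` function, and the `open_tags` fallback (a crude stand-in for the character-based proposal of analyses for uncovered words) are all assumptions made for illustration.

```python
import math

def viterbi(words, analyses, trans, emit, open_tags, floor=1e-6):
    """Return the best tag sequence, limiting each word to its candidate tags.

    analyses  : dict word -> list of tags (hypothetical analyzer output)
    trans     : dict (prev_tag, tag) -> probability, "<s>" marks sentence start
    emit      : dict (tag, word) -> probability
    open_tags : fallback candidates for words the analyzer does not cover
    floor     : small probability used for unseen events (assumed smoothing)
    """
    if not words:
        return []
    # Candidate tags per word: analyzer output if available, else the fallback.
    cands = [analyses.get(w) or open_tags for w in words]
    # best[i][t] = (log-probability of best path ending in t at i, previous tag)
    best = [{}]
    for t in cands[0]:
        score = math.log(trans.get(("<s>", t), floor))
        score += math.log(emit.get((t, words[0]), floor))
        best[0][t] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for t in cands[i]:
            e = math.log(emit.get((t, words[i]), floor))
            prev, score = max(
                ((p, best[i - 1][p][0] + math.log(trans.get((p, t), floor)))
                 for p in cands[i - 1]),
                key=lambda x: x[1],
            )
            best[i][t] = (score + e, prev)
    # Backtrace from the best final tag.
    tag = max(best[-1], key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]

# Toy data: all tokens and probabilities are made up for illustration.
analyses = {"tok_a": ["NOUN", "VERB"], "tok_b": ["PREP"]}
open_tags = ["NOUN", "VERB", "ADJ"]
trans = {("<s>", "VERB"): 0.6, ("<s>", "NOUN"): 0.4,
         ("VERB", "PREP"): 0.5, ("NOUN", "PREP"): 0.3,
         ("PREP", "NOUN"): 0.8, ("PREP", "VERB"): 0.05, ("PREP", "ADJ"): 0.15}
emit = {("NOUN", "tok_a"): 0.01, ("VERB", "tok_a"): 0.02, ("PREP", "tok_b"): 0.2}

tags = viterbi(["tok_a", "tok_b", "tok_unk"], analyses, trans, emit, open_tags)
print(tags)
```

In this sketch, the word the analyzer does not cover ("tok_unk") falls back to a fixed open-class tag set, so its tag is chosen by context alone. The abstract describes something richer: analyses proposed from character information, and added for known words as well.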

PUBLICATION RECORD

Publication year
2007
Venue
SEMITIC@ACL
Publication date
2007-06-28
Fields of study
Linguistics, Computer Science
Identifiers
DOI 10.3115/1654576.1654593
Source metadata
Semantic Scholar

REFERENCES

Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop
2005 (influential reference)
Choosing an Optimal Architecture for Segmentation and POS-Tagging of Modern Hebrew
2005 (cited by this paper)
Developing an Arabic Treebank: Methods, Guidelines, Procedures, and Tools
2004 (cited by this paper)
Chinese and Japanese Word Segmentation Using Word-Level and Character-Level Information
2004 (cited by this paper)
Automatic Tagging of Arabic Text: From Raw Text to Base Phrase Chunks
2004 (influential reference)
Language Model Based Arabic Word Segmentation
2003 (cited by this paper)
Foundations of Statistical Natural Language Processing
2001 (cited by this paper)
Book Reviews: Foundations of Statistical Natural Language Processing
1999 (cited by this paper)
Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging
1995 (cited by this paper)
Coping with Ambiguity and Unknown Words through Probabilistic Models
1993 (cited by this paper)
A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text
1988 (cited by this paper)

CITED BY

Arabic Natural Language Processing (NLP): A Comprehensive Review of Challenges, Techniques, and Emerging Trends
2025 (cites this paper)
Experimenting Machine-Learning Algorithms for Morphological Disambiguation of Arabic Texts
2022 (cites this paper)
PART OF SPEECH TAGGER FOR ARABIC TEXT BASED SUPPORT VECTOR MACHINES: A REVIEW
2019 (cites this paper)
Recherche d'Information Possibiliste: De la Désambiguïsation et la Reformulation de Requêtes vers la Fiabilité de l'Information Recherchée
2018 (cites this paper)
Part of Speech Tagging for Arabic Long Sentence
2018 (cites this paper)
The Use of Hidden Markov Model in Natural ARABIC Language Processing: a survey
2017 (cites this paper)
Probabilistic Arabic part of speech tagger with unknown words handling
2016 (influential citation)
Linking Arabic social media based on similarity and sentiment
2016 (cites this paper)
Le traitement automatique de l’arabe dialectalisé : aspects méthodologiques et algorithmiques
2015 (cites this paper)
Joint Arabic Segmentation and Part-Of-Speech Tagging
2015 (cites this paper)
Automatic Domain-Relevant Collocation Extraction from Arabic Corpus
2015 (cites this paper)
Statistical Machine Translation
2014 (cites this paper)
Improving Arabic Tokenization and POS Tagging Using Morphological Analyzer
2014 (cites this paper)
Evaluation of a possibilistic classification approach for Arabic texts disambiguation (Evaluation d’une approche de classification possibiliste pour la désambiguïsation des textes arabes) [in French]
2014 (cites this paper)
Statistical-based System for Morphological Annotation of Arabic Texts
2013 (cites this paper)
Amélioration des systèmes de traduction par analyse linguistique et thématique : application à la traduction depuis l'arabe. (Improvements for Machine Translation Systems Using Linguistic and Thematic Analysis : an Application to the Translation from Arabic)
2013 (cites this paper)
Arabic Morphosyntactic Raw Text Part of Speech Tagging System
2013 (cites this paper)
Arabic Named Entity Recognition: A Corpus-Based Study
2012 (cites this paper)
Arabic-Segmentation Combination Strategies for Statistical Machine Translation
2012 (cites this paper)
Joint Segmentation and POS Tagging for Arabic Using a CRF-based Classifier
2012 (cites this paper)
A comparison of segmentation methods and extended lexicon models for Arabic statistical machine translation
2012 (cites this paper)
An efficient part-of-speech tagger for arabic
2011 (cites this paper)
MorphTagger: HMM-based Arabic segmentation for statistical machine translation
2010 (influential citation)
The RWTH Aachen Machine Translation System for WMT 2010
2010 (cites this paper)
A Probabilistic Morphological Analyzer for Syriac
2010 (cites this paper)
Between Logic and Common Sense: The Formal Semantics of Words
2010 (cites this paper)
The RWTH Aachen machine translation system for IWSLT 2010
2010 (cites this paper)
Robust machine translation for multi-domain tasks
2010 (cites this paper)
Classifiers combination to arabic morphosyntactic disambiguation
2009 (cites this paper)
The RWTH Machine Translation System for WMT 2009
2009 (cites this paper)
Arabic part of speech tagging using Tranformation-Based Learning
2009 (cites this paper)
Arabic Part Of Speech Disambiguation: A Survey
2009 (cites this paper)
The RWTH machine translation system for IWSLT 2008.
2008 (influential citation)
Automatic Annotation of Morpho-Syntactic Dependencies in a Modern Hebrew Treebank
2008 (cites this paper)
Unsupervised Lexicon-Based Resolution of Unknown Words for Full Morphological Analysis
2008 (cites this paper)