Letter Sequence Labeling for Compound Splitting

Published 2016 in Special Interest Group on Computational Morphology and Phonology Workshop

ABSTRACT

For languages such as German where compounds occur frequently and are written as single tokens, a wide variety of NLP applications benefits from recognizing and splitting compounds. As the traditional word frequency-based approach to compound splitting has several drawbacks, this paper introduces a letter sequence labeling approach, which can utilize rich word form features to build discriminative learning models that are optimized for splitting. Experiments show that the proposed method significantly outperforms state-of-the-art compound splitters.
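To make the idea of letter sequence labeling concrete, the following is a minimal sketch rather than the paper's actual method. It assumes a simple two-label scheme ("B" for the first letter of a compound component, "I" for every other letter) and a toy context-window feature. The function names `encode`, `decode`, and `letter_features`, the label set, and the feature names are all hypothetical and chosen only for illustration; the abstract does not specify them.

```python
from typing import Dict, List, Tuple


def encode(parts: List[str]) -> Tuple[str, List[str]]:
    """Turn a gold split such as ["Haus", "boot"] into a word plus one label per letter.

    Assumed label scheme (not taken from the paper):
    "B" marks the first letter of a component, "I" marks all other letters.
    """
    word = "".join(parts)
    labels: List[str] = []
    for part in parts:
        if not part:
            continue  # skip empty components so labels stay aligned with letters
        labels.extend(["B"] + ["I"] * (len(part) - 1))
    return word, labels


def decode(word: str, labels: List[str]) -> List[str]:
    """Recover the components from a word and its predicted per-letter labels."""
    parts: List[str] = []
    for letter, label in zip(word, labels):
        if label == "B" or not parts:
            parts.append(letter)  # start a new component
        else:
            parts[-1] += letter  # extend the current component
    return parts


def letter_features(word: str, i: int, window: int = 2) -> Dict[str, str]:
    """Toy word-form features for the letter at position i (hypothetical feature set)."""
    padded = "_" * window + word.lower() + "_" * window
    center = i + window
    return {
        "letter": padded[center],
        "context": padded[center - window : center + window + 1],
    }


word, labels = encode(["Haus", "boot"])
print(word, labels)
print(decode(word, labels))
print(letter_features(word, 4, window=1))
```

In a setup like this, a discriminative sequence model (for example a conditional random field, which the reference list suggests is relevant) would be trained to predict the label sequence from such per-letter features, and `decode` would turn its predictions back into a split.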

PUBLICATION RECORD

Fields of study
Linguistics, Computer Science
Identifiers
DOI 10.18653/v1/W16-2012

REFERENCES

Proceedings of the 21st Annual Conference of the European Association for Machine Translation (2018)
Accurate Linear-Time Chinese Word Segmentation via Embedding Matching (2015)
Chasing the Perfect Splitter: A Comparison of Different Compound Splitting Tools (2014)
An explicit statistical model of learning lexical segmentation using multiple cues (2014)
Experiments with crowdsourced re-annotation of a POS tagging data set (2014)
Analyzing and Aligning German compound nouns (2012)
Determining Immediate Constituents of Compounds in GermaNet (2011)
Language-independent compound splitting with morphological operations (2011)
GernEdiT - The GermaNet Editing Tool (2010)
Practical Very Large Scale CRFs (2010)
How to Avoid Burning Ducks: Combining Linguistic Analysis and Corpus Statistics for German Compound Processing (2010)
The pay-offs of preprocessing for German-English statistical machine translation (2010)
Using a maximum entropy model to build segmentation lattices for MT (2009)
Morphological pre-processing for Turkish to English statistical machine translation (2009)
German Decompounding in a Difficult Corpus (2008)
Decompounding query keywords from compounding languages (2008)
Processing of Swedish compounds for phrase-based statistical machine translation (2008)
German Compounds in Factored Statistical Machine Translation (2008)
Morphology-aware statistical machine translation based on morphs induced in an unsupervised manner (2007)
Statistical Machine Translation of German Compound Words (2006, influential reference)
TAGH: A Complete Morphology for German Based on Weighted Finite State Automata (2005)
Chinese Segmentation and New Word Detection using Conditional Random Fields (2004, influential reference)
Empirical Methods for Compound Splitting (2003, influential reference)
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data (2001, influential reference)
Compound splitting and lexical unit recombination for improved performance of a speech recognition system for German parliamentary speeches (2000)
GermaNet - a Lexical-Semantic Net for German (1997)
From Phoneme to Morpheme (1955)

CITED BY

Word Formation Analyzer for Czech: Automatic Parent Retrieval and Classification of Word Formation Processes (2022)
Interpreting Statistical Models for Denominal Adjective Formation in Russian (2022)
Splitting and Identifying Czech Compounds: A Pilot Study (2021)
Semi-supervised URL Segmentation with Recurrent Neural Networks Pre-trained on Knowledge Graph Entities (2020)
Building and Exploiting Lexical Databases for Morphological Parsing (2019)
MiNgMatch - A Fast N-gram Model for Word Segmentation of the Ainu Language (2019)
Augmenting a German Morphological Database by Data-Intense Methods (2019)
Combining Data-Intense and Compute-Intense Methods for Fine-Grained Morphological Analyses (2019)
AkkuBohrHammer vs. AkkuBohrhammer: Experiments towards the Evaluation of Compound Splitting Tools for General Language and Specific Domains (2019)
Merging the Trees - Building a Morphological Treebank for German from Two Resources (2018)
Sanskrit Word Segmentation Using Character-level Recurrent and Convolutional Neural Networks (2018, influential citation)
Building a Morphological Treebank for German from a Linguistic Database (2018)
Converting the TüBa-D/Z Treebank of German to Universal Dependencies (2017)