Nonparametric Bayesian Semi-supervised Word Segmentation

Published 2017 in Transactions of the Association for Computational Linguistics

ABSTRACT

This paper presents a novel hybrid generative/discriminative model of word segmentation based on nonparametric Bayesian methods. Unlike ordinary discriminative word segmentation, which relies only on labeled data, our semi-supervised model also leverages large amounts of unlabeled text to automatically learn new “words”, and further constrains them using labeled data in order to segment non-standard texts such as those found in social networking services. Specifically, our hybrid model combines a discriminative classifier (CRF; Lafferty et al. (2001)) and unsupervised word segmentation (NPYLM; Mochihashi et al. (2009)), with a transparent exchange of information between these two model structures within the semi-supervised framework (JESS-CM; Suzuki and Isozaki (2008)). We confirmed that it can appropriately segment non-standard texts like those in Twitter and Weibo and has nearly state-of-the-art accuracy on standard datasets in Japanese, Chinese, and Thai.
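
As a rough illustration of what combining a discriminative score with a generative word model can look like, the sketch below segments an unspaced string by dynamic programming over a summed score. It is not the paper's method: the CRF, NPYLM, and JESS-CM training are not implemented, and every name here (`segment`, `toy_word_logprob`, `toy_boundary_score`, `TOY_LM`, the weight `lam`) is a hypothetical stand-in chosen for the example.

```python
import math

# Hypothetical toy "generative" model: log-probabilities for a few known words.
TOY_LM = {"the": -1.0, "cat": -2.0, "sat": -2.0}


def toy_word_logprob(word):
    """Stand-in for a word language model; unknown words pay a per-character penalty."""
    return TOY_LM.get(word, -5.0 * len(word))


def toy_boundary_score(text, end):
    """Stand-in for a discriminative boundary score at position `end` (neutral here)."""
    return 0.0


def segment(text, word_logprob, boundary_score, lam=1.0, max_len=4):
    """Return the segmentation maximizing the sum, over words, of
    lam * word_logprob(word) + boundary_score(text, word_end)."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score of any segmentation of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last word in that segmentation
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            score = best[start] + lam * word_logprob(word) + boundary_score(text, end)
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Follow back-pointers from the end of the string to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]


print(segment("thecatsat", toy_word_logprob, toy_boundary_score))
# prints ['the', 'cat', 'sat']
```

In this toy setup the weight `lam` only scales the generative term relative to the boundary term; how the two components are actually weighted and trained jointly is described in the paper itself, not here.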

PUBLICATION RECORD

Publication year: 2017
Venue: Transactions of the Association for Computational Linguistics
Publication date: 2017-06-23
Fields of study: Computer Science
Identifiers: DOI 10.1162/tacl_a_00054

REFERENCES

Transition-Based Neural Word Segmentation (2016), cited by this paper
Inducing Word and Part-of-Speech with Pitman-Yor Hidden Semi-Markov Models (2015), cited by this paper
Gated Recursive Neural Network for Chinese Word Segmentation (2015), cited by this paper
Feature-Frequency–Adaptive On-line Training for Fast and Accurate Natural Language Processing (2014), cited by this paper
Iterative Bayesian word segmentation for unsupervised vocabulary discovery from phoneme lattices (2014), cited by this paper
Mutual learning of an object concept and language model based on MLDA and NPYLM (2014), cited by this paper
Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation and New Word Detection (2012), influential reference
A Nonparametric Bayesian Approach to Acoustic Model Discovery (2012), cited by this paper
Density Ratio Estimation in Machine Learning (2012), cited by this paper
Enhancing Chinese Word Segmentation Using Unlabeled Data (2011), cited by this paper
High-Performance Semi-Supervised Learning using Discriminatively Constrained Generative Models (2010), cited by this paper
Nonparametric Word Segmentation for Machine Translation (2010), cited by this paper
Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling (2009), influential reference
A Discriminative Latent Variable Chinese Segmenter with Hybrid Word/Character Information (2009), cited by this paper
Direct Density Ratio Estimation for Large-scale Covariate Shift Adaptation (2009), cited by this paper
A Word and Character-Cluster Hybrid Model for Thai Word Segmentation (2009), cited by this paper
Semi-Supervised Sequential Labeling and Segmentation Using Giga-Word Scale Unlabeled Data (2008), influential reference
The Infinite Markov Model (2007), cited by this paper
KOTONOHA and BCCWJ: Development of a Balanced Corpus of Contemporary Written Japanese (2007), cited by this paper
A Hybrid Markov/Semi-Markov Conditional Random Field for Sequence Segmentation (2006), influential reference
A Hierarchical Bayesian Language Model Based On Pitman-Yor Processes (2006), influential reference
Contextual Dependencies in Unsupervised Word Segmentation (2006), cited by this paper
The Second International Chinese Word Segmentation Bakeoff (2005), cited by this paper
Semi-Markov Conditional Random Fields for Information Extraction (2004), cited by this paper
Applying Conditional Random Fields to Japanese Morphological Analysis (2004), cited by this paper
Special issue: finite state methods in natural language processing (2003), cited by this paper
Bayesian Methods for Hidden Markov Models (2002), cited by this paper
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data (2001), cited by this paper
A bit of progress in language modeling (2001), cited by this paper
A hierarchical Dirichlet language model (1995), cited by this paper
A Monte Carlo Implementation of the EM Algorithm and the Poor Man's Data Augmentation Algorithms (1990), cited by this paper

CITED BY

Improve word segmentation performance from unknown language by decreasing meaningless segmentation (2023), cites this paper
Joint Persian Word Segmentation Correction and Zero-Width Non-Joiner Recognition Using BERT (2020), cites this paper
Unsupervised Learning Helps Supervised Neural Word Segmentation (2019), cites this paper
Towards Burmese (Myanmar) Morphological Analysis (2019), cites this paper
A scalable framework for cross-lingual authorship identification (2018), cites this paper
Natural Language Generation Using Monte Carlo Tree Search (2018), cites this paper
Word Segmentation From Phoneme Sequences Based On Pitman-Yor Semi-Markov Model Exploiting Subword Information (2018), cites this paper
Towards easier and faster sequence labeling for natural language processing: A search-based probabilistic online learning framework (SAPO) (2015), cites this paper
Domain-Specific Unsupervised Named Entity Recognition (year unknown), cites this paper
EasyChair Preprint No 856 A Hybrid Generative / Discriminative Model for Rapid Prototyping of Domain-Specific Named Entity Recognition (year unknown), cites this paper