Mutual Information Maximization for Simple and Accurate Part-Of-Speech Induction

Published 2018 in North American Chapter of the Association for Computational Linguistics

ABSTRACT

We address part-of-speech (POS) induction by maximizing the mutual information between the induced label and its context. We focus on two training objectives that are amenable to stochastic gradient descent (SGD): a novel generalization of the classical Brown clustering objective and a recently proposed variational lower bound. While both objectives are subject to noise in gradient updates, we show through analysis and experiments that the variational lower bound is robust whereas the generalized Brown objective is vulnerable. We obtain strong performance on a multitude of datasets and languages with a simple architecture that encodes morphology and context.
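As a rough illustration of the quantity the abstract refers to, the sketch below computes the empirical mutual information between a label and its context from a list of (label, context) pairs. It is not the paper's training objective: neither the generalized Brown objective nor the variational lower bound is reproduced here. The function name `mutual_information` and the toy tag and context symbols are assumptions made for this example.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Empirical mutual information (in nats) between the two items of each pair.

    `pairs` is a list of (label, context) tuples; probabilities are estimated
    by relative frequency. This is a plug-in estimate for illustration only.
    """
    n = len(pairs)
    joint = Counter(pairs)                      # counts of (label, context)
    labels = Counter(y for y, _ in pairs)       # marginal counts of labels
    contexts = Counter(z for _, z in pairs)     # marginal counts of contexts

    mi = 0.0
    for (y, z), count in joint.items():
        p_yz = count / n
        # p(y, z) / (p(y) * p(z)) simplifies to count * n / (labels[y] * contexts[z])
        ratio = count * n / (labels[y] * contexts[z])
        mi += p_yz * math.log(ratio)
    return mi


# Hypothetical toy data: the label fully determines the context ...
dependent = [("N", "a"), ("N", "a"), ("V", "b"), ("V", "b")]
# ... versus label and context being independent.
independent = [("N", "a"), ("N", "b"), ("V", "a"), ("V", "b")]

mi_dependent = mutual_information(dependent)
mi_independent = mutual_information(independent)
```

On this toy data the dependent case gives log 2 nats and the independent case gives 0, which matches the intuition that a labeling is informative when it predicts its context well.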

PUBLICATION RECORD

Publication year: 2018
Venue: North American Chapter of the Association for Computational Linguistics
Publication date: 2018-04-20
Fields of study: Computer Science
Identifiers: DOI 10.18653/v1/N19-1113; arXiv 1804.07849

CITED BY

InfoBridge: Mutual Information estimation via Bridge Matching (2025)
Information theory for complex systems scientists: What, why, and how (2025)
INFO-SEDD: Continuous Time Markov Chains as Scalable Information Metrics Estimators (2025)
Information Theoretic Text-to-Image Alignment (2024)
Inducing Systematicity in Transformers by Attending to Structurally Quantized Embeddings (2024, influential citation)
MINDE: Mutual Information Neural Diffusion Estimation (2023)
Review of Unsupervised POS Tagging and Its Implications on Language Acquisition (2023, influential citation)
An Information-theoretic Approach to Prompt Engineering Without Ground Truth Labels (2022)
Bridging Pre-trained Language Models and Hand-crafted Features for Unsupervised POS Tagging (2022)
MPII: Multi-Level Mutual Promotion for Inference and Interpretation (2022)
MVP: Multi-task Supervised Pre-training for Natural Language Generation (2022)
Masked Part-Of-Speech Model: Does Modeling Long Context Help Unsupervised POS-tagging? (2022, influential citation)
MICO: Selective Search with Mutual Information Co-training (2022)
Decomposed Mutual Information Optimization for Generalized Context in Meta-Reinforcement Learning (2022)
Mutual Exclusivity Training and Primitive Augmentation to Induce Compositionality (2022)
On planetary systems as ordered sequences (2021, influential citation)
Decomposed Mutual Information Estimation for Contrastive Representation Learning (2021)
CLIP-Lite: Information Efficient Visual Representation Learning from Textual Annotations (2021)
Recurrent Neural Hidden Markov Model for High-order Transition (2021, influential citation)
CLIP-Lite: Information Efficient Visual Representation Learning with Language Supervision (2021)
Learning Discrete Structured Representations by Adversarially Maximizing Mutual Information (2020)
Semi-supervised Autoencoding Projective Dependency Parsing (2020)
Deep Clustering of Text Representations for Supervision-Free Probing of Syntax (2020, influential citation)
Clustering Contextualized Representations of Text for Unsupervised Syntax Induction (2020, influential citation)
Semi-supervised Parsing with a Variational Autoencoding Parser (2020)
Compound Probabilistic Context-Free Grammars for Grammar Induction (2019, influential citation)
Formal Limitations on the Measurement of Mutual Information (2018)
Deep Latent Variable Models of Natural Language (2018)