Unbiased scalable softmax optimization

Published 2018 in arXiv.org

ABSTRACT

Recent neural network and language models rely on softmax distributions with an extremely large number of categories. Since calculating the softmax normalizing constant in this context is prohibitively expensive, there is a growing literature on efficiently computable but biased estimates of the softmax. In this paper we propose the first unbiased algorithms for maximizing the softmax likelihood whose work per iteration is independent of the number of classes and datapoints (and no extra work is required at the end of each epoch). We show that our proposed unbiased methods comprehensively outperform the state-of-the-art on seven real-world datasets.

PUBLICATION RECORD

Publication year: 2018
Venue: arXiv.org
Publication date: 2018-02-15
Fields of study: Mathematics, Computer Science
Identifiers: arXiv 1803.08577
Source metadata: Semantic Scholar

REFERENCES

Augment and Reduce: Stochastic Inference for Large Categorical Distributions
2018, cited by this paper
Aggressive Sampling for Multi-class to Binary Reduction with Applications to Text Classification
2017, cited by this paper
Efficient softmax approximation for GPUs
2016, cited by this paper
From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification
2016, cited by this paper
The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables
2016, cited by this paper
Simultaneous Learning of Trees and Representations for Extreme Classification, with Application to Language Modeling
2016, cited by this paper
Accelerating Stochastic Composition Optimization
2016, influential reference
Logarithmic Time One-Against-Some
2016, cited by this paper
DS-MLR: Exploiting Double Separability for Scaling up Distributed Multinomial Logistic Regression
2016, influential reference
Simultaneous Learning of Trees and Representations for Extreme Classification and Density Estimation
2016, cited by this paper
One-vs-Each Approximation to Softmax for Scalable Estimation of Probabilities
2016, influential reference
When and why are log-linear models self-normalizing?
2015, cited by this paper
Towards Stability and Optimality in Stochastic Gradient Descent
2015, influential reference
BlackOut: Speeding up Recurrent Neural Network Language Models With Very Large Vocabularies
2015, cited by this paper
An Exploration of Softmax Alternatives Belonging to the Spherical Loss Family
2015, cited by this paper
LSHTC: A Benchmark for Large-Scale Text Classification
2015, cited by this paper
Implicit stochastic approximation
2015, cited by this paper
The proximal Robbins–Monro method
2015, cited by this paper
Efficient Exact Gradient Update for training Deep Networks with Very Large Sparse Targets
2014, cited by this paper
Distributed training of Large-scale Logistic models
2013, cited by this paper
One billion word benchmark for measuring progress in statistical language modeling
2013, cited by this paper
A fast and simple algorithm for training neural probabilistic language models
2012, influential reference
A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method
2012, cited by this paper
Convex optimization
2010, cited by this paper
Adaptive Importance Sampling to Accelerate Training of a Neural Probabilistic Language Model
2008, influential reference
Sparse multinomial logistic regression: fast algorithms and generalization bounds
2005, cited by this paper
Quick Training of Probabilistic Neural Nets by Importance Sampling
2003, cited by this paper
Essential Medical Statistics
2003, cited by this paper
Customer satisfaction, customer retention, and market share
1993, cited by this paper
The maximum concurrent flow problem
1990, cited by this paper
Incremental Proximal Methods for Large Scale Convex Optimization
year unknown, cited by this paper

CITED BY

A Geometry-Aware Efficient Algorithm for Compositional Entropic Risk Minimization
2026, cites this paper
Iterative Distributed Multinomial Regression
2024, cites this paper
Discovering signals of platform failure risks from customer sentiment: the case of online P2P lending
2022, cites this paper
Soft Labels and Supervised Image Classification (DRAFT: October 7, 2021)
2021, cites this paper
Scalable Gaussian Process for Extreme Classification
2020, cites this paper
Sub-linear convergence of a stochastic proximal iteration method in Hilbert space
2020, cites this paper
Sampled Softmax with Random Fourier Features
2019, cites this paper
ADMM-Softmax: an ADMM approach for multinomial logistic regression
2019, cites this paper
Large-Scale Classification using Multinomial Regression and ADMM
2019, cites this paper
On Fenchel Mini-Max Learning
2019, cites this paper
Augment and Reduce: Stochastic Inference for Large Categorical Distributions
2018, cites this paper
Compositional Stochastic Average Gradient for Machine Learning and Related Applications
2018, cites this paper
Stochastic Negative Mining for Learning with Large Output Spaces
2018, cites this paper
Sampled Estimators For Softmax Must Be Biased
year unknown, cites this paper