Task Loss Estimation for Sequence Prediction

Dzmitry Bahdanau, Dmitriy Serdyuk, Philemon Brakel, Nan Rosemary Ke, J. Chorowski, Aaron C. Courville, Yoshua Bengio

Published 2015 in arXiv.org

ABSTRACT

Often, the performance on a supervised machine learning task is evaluated with a task loss function that cannot be optimized directly. Examples of such loss functions include the classification error, the edit distance and the BLEU score. A common workaround for this problem is to instead optimize a surrogate loss function, such as cross-entropy or hinge loss. For this remedy to be effective, it is important to ensure that minimization of the surrogate loss results in minimization of the task loss, a condition that we call consistency with the task loss. In this work, we propose another method for deriving differentiable surrogate losses that provably meet this requirement. We focus on the broad class of models that define a score for every input-output pair. Our idea is that this score can be interpreted as an estimate of the task loss, and that the estimation error may be used as a consistent surrogate loss. A distinct feature of such an approach is that it defines the desirable value of the score for every input-output pair. We use this property to design specialized surrogate losses for Encoder-Decoder models, which are often used for sequence prediction tasks. In our experiment, we benchmark on the task of speech recognition. Using the new surrogate loss instead of cross-entropy to train an Encoder-Decoder speech recognizer brings a significant ~13% relative improvement in terms of Character Error Rate (CER) in the case when no extra corpora are used for language modeling.
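
The central idea of the abstract, treating a model's score for an input-output pair as an estimate of the task loss and training on the estimation error, can be illustrated with a minimal sketch. The abstract does not state the exact form of the estimation error or the specialized Encoder-Decoder losses, so the squared error used here is an assumption, and the names edit_distance and estimation_error are hypothetical helpers chosen for illustration only.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (used here as the task loss)."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        curr = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def estimation_error(scores, candidates, target):
    """Mean squared gap between each score and the true task loss.

    Assumption: squared error stands in for the estimation error; the
    abstract does not specify which form the paper uses.
    """
    total = 0.0
    for score, candidate in zip(scores, candidates):
        total += (score - edit_distance(candidate, target)) ** 2
    return total / len(candidates)


candidates = ["cat", "cut", "cart"]
target = "cat"

# Scores that equal the task loss give zero estimation error, and picking
# the lowest-scoring candidate then also minimizes the task loss.
perfect_scores = [float(edit_distance(c, target)) for c in candidates]
best = min(zip(perfect_scores, candidates))[1]
```

In this toy setting, a score function with zero estimation error ranks candidates exactly by their edit distance to the target, which gives an intuition for why such a surrogate is tied to the task loss. It also shows the property noted in the abstract: every candidate output has a well-defined desired score.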

PUBLICATION RECORD

Publication year
2015
Venue
arXiv.org
Publication date
2015-11-19
Fields of study
Computer Science
Identifiers
arXiv 1511.06456
Source metadata
Semantic Scholar

REFERENCES

Statistical Learning Theory
2021cited by this paper
Delving into Transferable Adversarial Examples and Black-box Attacks
2016cited by this paper
Blocks and Fuel: Frameworks for deep learning
2015cited by this paper
EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding
2015influential reference
Attention-Based Models for Speech Recognition
2015cited by this paper
End-to-end attention-based large vocabulary speech recognition
2015cited by this paper
Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks
2015cited by this paper
How transferable are features in deep neural networks?
2014cited by this paper
Towards End-To-End Speech Recognition with Recurrent Neural Networks
2014cited by this paper
Sequence to Sequence Learning with Neural Networks
2014cited by this paper
Neural Machine Translation by Jointly Learning to Align and Translate
2014cited by this paper
First-Pass Large Vocabulary Continuous Speech Recognition using Bi-Directional Recurrent DNNs
2014cited by this paper
Deep Speech: Scaling up end-to-end speech recognition
2014influential reference
Training MRF-Based Phrase Translation Models using Gradient Ascent
2013cited by this paper
Theano: new features and speed improvements
2012cited by this paper
ImageNet classification with deep convolutional neural networks
2012cited by this paper
Generic Methods for Optimization-Based Modeling
2012cited by this paper
Direct Error Rate Minimization of Hidden Markov Models
2011cited by this paper
The Kaldi Speech Recognition Toolkit
2011cited by this paper
Empirical Risk Minimization of Graphical Model Parameters Given Approximate Inference, Decoding, and Model Structure
2011cited by this paper
Direct Loss Minimization for Structured Prediction
2010cited by this paper
Curriculum learning
2009cited by this paper
Discriminative learning in sequential pattern recognition
2008cited by this paper
Pattern Recognition and Machine Learning
2006cited by this paper
Minimum Risk Annealing for Training Log-Linear Models
2006cited by this paper
A Tutorial on Energy-Based Learning
2006cited by this paper
Large Margin Methods for Structured and Interdependent Output Variables
2005influential reference
Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning
2004cited by this paper
Minimum Phone Error and I-smoothing for improved discriminative training
2002cited by this paper
Gradient-based learning applied to document recognition
1998cited by this paper

CITED BY

TO-FLOW: Efficient Continuous Normalizing Flows with Temporal Optimization adjoint with Moving Speed
2022cites this paper
Learning with Algorithmic Supervision via Continuous Relaxations
2021cites this paper
Beyond In-Place Corruption: Insertion and Deletion In Denoising Probabilistic Models
2021cites this paper
Convolutional Neural Networks-An Extensive arena of Deep Learning. A Comprehensive Study
2021cites this paper
Imputer: Sequence Modelling via Imputation and Dynamic Programming
2020cites this paper
Token-wise Training for Attention Based End-to-end Speech Recognition
2019cites this paper
Non-Monotonic Sequential Text Generation
2019cites this paper
Ectc-Docd: An End-to-End Structure with CTC Encoder and OCD Decoder for Speech Recognition
2019cites this paper
Correlation Distance Skip Connection Denoising Autoencoder (CDSK-DAE) for Speech Feature Enhancement
2019cites this paper
Bitcoin Price Prediction Through Opinion Mining
2019cites this paper
A Fully Differentiable Beam Search Decoder
2019cites this paper
SEQUENCE TRAINING OF ENCODER-DECODER MODEL USING POLICY GRADIENT FOR END-TO-END SPEECH RECOGNITION
2018influential citation
Abstractive Text Classification Using Sequence-to-convolution Neural Networks
2018cites this paper
Rescoring of N-Best Hypotheses Using Top-Down Selective Attention for Automatic Speech Recognition
2018cites this paper
Netze in der automatischen Spracherkennung - ein Paradigmenwechsel? Neural Networks in Automatic Speech Recognition - a Paradigm Change?
2018cites this paper
Optimal Completion Distillation for Sequence Learning
2018cites this paper
Promising Accurate Prefix Boosting for Sequence-to-sequence ASR
2018cites this paper
Reward Only Training of Encoder-Decoder Digit Recognition Systems Based on Policy Gradient Methods
2018cites this paper
Task-oriented learning of structured probability distributions
2017cites this paper
Structured prediction and generative modeling using neural networks
2017influential citation
Translation Quality Estimation Using Only Bilingual Corpora
2017cites this paper
Exploring neural network architectures for acoustic modeling
2017cites this paper
End-to-End Architectures for Speech Recognition
2017cites this paper
Twin Networks: Matching the Future for Sequence Generation
2017cites this paper
Tunable Sensitivity to Large Errors in Neural Network Training
2016cites this paper
Automatic Speech Recognition Based on Neural Networks
2016cites this paper
Very deep convolutional networks for end-to-end speech recognition
2016influential citation
Reward Augmented Maximum Likelihood for Neural Structured Prediction
2016cites this paper
Lattice Based Transcription Loss for End-to-End Speech Recognition
2016influential citation
End-to-End Speech Recognition Models
2016influential citation
Noisy Parallel Approximate Decoding for Conditional Recurrent Language Model
2016cites this paper
Latent Sequence Decompositions
2016influential citation
On Online Attention-Based Speech Recognition and Joint Mandarin Character-Pinyin Training
2016influential citation