Improving Trajectory Modelling for DNN-Based Speech Synthesis by Using Stacked Bottleneck Features and Minimum Generation Error Training

Published 2016 in IEEE/ACM Transactions on Audio Speech and Language Processing

ABSTRACT

We propose two novel techniques, stacking bottleneck features and a minimum generation error (MGE) training criterion, to improve the performance of deep neural network (DNN)-based speech synthesis. The techniques address two related issues in typical current DNN-based synthesis frameworks: frames are modelled independently of one another, and the relationship between static and dynamic features is ignored. Stacking bottleneck features, which are an acoustically informed linguistic representation, provides an efficient way to include more detailed linguistic context at the input. The MGE training criterion minimises the overall output trajectory error across an utterance, rather than minimising the error of each frame independently, and thus takes into account the interaction between static and dynamic features. The two techniques can be easily combined to further improve performance. We present both objective and subjective results that demonstrate the effectiveness of the proposed techniques. The subjective results show that combining the two techniques leads to significantly more natural synthetic speech than that produced by conventional DNN or long short-term memory recurrent neural network systems.
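The two ideas in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and everything specific in it is an assumption: that "stacking" means concatenating bottleneck vectors from neighbouring frames, the context width, the delta window 0.5 * (c[t+1] - c[t-1]), and the hypothetical function names. The first function widens the per-frame input; the remaining functions turn per-frame static and delta predictions into one smooth trajectory and measure the error of that whole trajectory, which is the quantity an utterance-level criterion such as MGE would minimise.

```python
import numpy as np

def stack_frames(feats, context=4):
    """Concatenate each frame with its +/- `context` neighbours.

    feats: (T, D) array, e.g. bottleneck activations per frame (assumed).
    Edge frames are repeated as padding. Returns (T, D * (2 * context + 1)).
    """
    T = feats.shape[0]
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])

def window_matrix(T):
    """Matrix W mapping T static values to 2T interleaved [static, delta] values.

    Assumed delta window: 0.5 * (c[t+1] - c[t-1]), with zeros beyond the edges.
    """
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                 # static row
        if t > 0:
            W[2 * t + 1, t - 1] = -0.5    # delta row, left neighbour
        if t < T - 1:
            W[2 * t + 1, t + 1] = 0.5     # delta row, right neighbour
    return W

def generate_trajectory(pred, var):
    """Static trajectory that best explains per-frame [static, delta] predictions.

    pred, var: (T, 2) arrays for a single feature dimension.
    Solves (W' P W) c = W' P mu, where P = diag(1 / var).
    """
    T = pred.shape[0]
    W = window_matrix(T)
    P = 1.0 / var.reshape(-1)             # precisions, interleaved like pred
    A = W.T @ (P[:, None] * W)
    b = W.T @ (P * pred.reshape(-1))
    return np.linalg.solve(A, b)

def generation_error(pred, var, natural_static):
    """Squared error of the generated trajectory over the whole utterance."""
    c = generate_trajectory(pred, var)
    return float(np.sum((c - natural_static) ** 2))
```

Because every generated value depends on the static and delta predictions of neighbouring frames through W, the error of a single frame cannot be reduced in isolation. A per-frame squared-error loss on `pred` has no such coupling, which is the contrast the abstract draws.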

PUBLICATION RECORD

Publication year
2016
Venue
IEEE/ACM Transactions on Audio Speech and Language Processing
Publication date
2016-02-22
Fields of study
Computer Science
Identifiers
DOI 10.1109/TASLP.2016.2551865 arXiv 1602.06727
Source metadata
Semantic Scholar

REFERENCES

A reading list of recent advances in speech synthesis
2017, cited by this paper
Top Downloads in IEEE Xplore [Reader's Choice]
2017, cited by this paper
Deep neural network-guided unit selection synthesis
2016, cited by this paper
Investigating gated recurrent neural networks for speech synthesis
2016, cited by this paper
From HMMs to DNNs: Where do the improvements come from?
2016, cited by this paper
Modelling acoustic feature dependencies with artificial neural networks: Trajectory-RNADE
2015, cited by this paper
Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis
2015, cited by this paper
Minimum trajectory error training for deep neural networks, combined with stacked bottleneck features
2015, cited by this paper
Sequence generation error (SGE) minimization based deep neural networks training for text-to-speech synthesis
2015, cited by this paper
Towards minimum perceptual error training for DNN-based speech synthesis
2015, cited by this paper
Deep neural networks employing Multi-Task Learning and stacked bottleneck features for speech synthesis
2015, cited by this paper
Deep neural network context embeddings for model selection in rich-context HMM synthesis
2015, cited by this paper
Attributing modelling errors in HMM synthesis by stepping gradually from natural to modelled speech
2015, cited by this paper
Deep Learning for Acoustic Modeling in Parametric Speech Generation: A systematic review of existing techniques and future trends
2015, cited by this paper
The effect of neural networks in statistical parametric speech synthesis
2015, cited by this paper
A Deep Generative Architecture for Postfiltering in Statistical Parametric Speech Synthesis
2015, cited by this paper
Fusion of multiple parameterisations for DNN-based sinusoidal speech synthesis with multi-task learning
2015, cited by this paper
TTS synthesis with bidirectional LSTM based recurrent neural networks
2014, cited by this paper
On the training aspects of Deep Neural Network (DNN) for parametric TTS synthesis
2014, cited by this paper
Statistical parametric speech synthesis using weighted multi-distribution deep belief network
2014, cited by this paper
Measuring a decade of progress in Text-to-Speech
2014, cited by this paper
A postfilter to modify the modulation spectrum in HMM-based speech synthesis
2014, cited by this paper
Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis
2014, cited by this paper
Sequence error (SE) minimization training of neural network for voice conversion
2014, cited by this paper
The voice bank corpus: Design, collection and data analysis of a large regional accent speech database
2013, cited by this paper
Multi-distribution deep belief network for speech synthesis
2013, cited by this paper
Modeling Spectral Envelopes Using Restricted Boltzmann Machines and Deep Belief Networks for Statistical Parametric Speech Synthesis
2013, cited by this paper
Statistical parametric speech synthesis using deep neural networks
2013, cited by this paper
Combining a vector space representation of linguistic context with a deep neural network for text-to-speech synthesis
2013, cited by this paper
A Practical Guide to Training Restricted Boltzmann Machines
2012, cited by this paper
Auto-encoder bottleneck features using deep belief networks
2012, cited by this paper
Improved Bottleneck Features Using Pretrained Deep Neural Networks
2011, cited by this paper
A perceptual study of acceleration parameters in HMM-based TTS
2010, cited by this paper
Optimizing bottle-neck features for LVCSR
2008, cited by this paper
Statistical Parametric Speech Synthesis
2007, cited by this paper
A Speech Parameter Generation Algorithm Considering Global Variance for HMM-Based Speech Synthesis
2007, cited by this paper
Minimum Generation Error Training for HMM-Based Speech Synthesis
2006, cited by this paper
Reformulating the HMM as a trajectory model by imposing explicit relationships between static and dynamic feature vector sequences
2004, cited by this paper
An HMM-based speech synthesis system applied to English
2002, cited by this paper
Speech parameter generation algorithms for HMM-based speech synthesis
2000, cited by this paper
Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds
1999, cited by this paper
A high quality text-to-speech system composed of multiple neural networks
1998, cited by this paper
Unit selection in a concatenative speech synthesis system using a large speech database
1996, cited by this paper
A neural-network-based model of segmental duration for speech synthesis
1995, cited by this paper
Speech synthesis using artificial neural networks trained on cepstral coefficients
1993, cited by this paper
Speech synthesis with artificial neural networks
1993, cited by this paper
LSP speech synthesis using backpropagation networks
1993, cited by this paper
The Design for the Wall Street Journal-based CSR Corpus
1992, cited by this paper
Learning representations by back-propagating errors
1986, cited by this paper

CITED BY

Modelling of Speech Parameters of Punjabi by Pre-trained Deep Neural Network Using Stacked Denoising Autoencoders
2023, cites this paper
An Experimental Analysis of Deep Learning Architectures for Supervised Speech Enhancement
2020, cites this paper
A new fuzzy unit selection cost function optimized by relaxed gradient descent algorithm
2020, cites this paper
Tonal Contour Generation for Isarn Speech Synthesis Using Deep Learning and Sampling-Based F0 Representation
2020, cites this paper
Dynamic Speech Trajectory Based Parameters for Low Resource Languages
2020, cites this paper
Augmented Latent Features of Deep Neural Network-Based Automatic Speech Recognition for Motor-Driven Robots
2020, cites this paper
The NTUT+III's Chinese Text-to-Speech System for Blizzard Challenge 2019
2019, cites this paper
Vocoder-free text-to-speech synthesis incorporating generative adversarial networks using low-/multi-frequency STFT amplitude spectra
2019, cites this paper
Speech Technology Progress Based on New Machine Learning Paradigm
2019, cites this paper
Neural speech synthesis for resource-scarce languages
2019, cites this paper
Analysing Shortcomings of Statistical Parametric Speech Synthesis
2018, influential citation
Extracting Spectral Features Using Deep Autoencoders With Binary Distributed Hidden Units for Statistical Parametric Speech Synthesis
2018, cites this paper
Phone-Level Embeddings for Unit Selection Speech Synthesis
2018, cites this paper
Statistical Language and Speech Processing
2018, cites this paper
Measuring Uncertainty in Deep Regression Models: The Case of Age Estimation from Speech
2018, cites this paper
Articulatory-to-speech Conversion Using Bi-directional Long Short-term Memory
2018, cites this paper
DNN-Based Talking Movie Generation Using Minimum Generation Error Criterion and Adversarial Learning with Face Direction Consideration
2018, cites this paper
Augmenting Bottleneck Features of Deep Neural Network Employing Motor State for Speech Recognition at Humanoid Robots
2018, cites this paper
Performance evaluation of front- and back-end techniques for ASV spoofing detection systems based on deep features
2018, cites this paper
Voice Conversion Using Input-to-Output Highway Networks
2017, cites this paper
Statistical parametric speech synthesis using generative adversarial networks under a multi-task learning framework
2017, cites this paper
Denoising Recurrent Neural Network for Deep Bidirectional LSTM Based Voice Conversion
2017, cites this paper
Generative Adversarial Network-Based Glottal Waveform Model for Statistical Parametric Speech Synthesis
2017, cites this paper
Combining unidirectional long short-term memory with convolutional output layer for high-performance speech synthesis
2017, cites this paper
Merlin: An Open Source Neural Network Speech Synthesis System
2017, cites this paper
Statistical Parametric Speech Synthesis Incorporating Generative Adversarial Networks
2017, cites this paper
Training algorithm to deceive Anti-Spoofing Verification for DNN-based speech synthesis
2017, cites this paper
Prosody Aware Word-Level Encoder Based on BLSTM-RNNs for DNN-Based Speech Synthesis
2017, cites this paper
Individuality-Preserving Speech Synthesis System for Hearing Loss Using Deep Neural Networks
2017, cites this paper
Deep Elman recurrent neural networks for statistical parametric speech synthesis
2017, influential citation
Predicting articulatory movement from text using deep architecture with stacked bottleneck features
2016, cites this paper
Deep features for automatic spoofing detection
2016, cites this paper
Contextual Representation using Recurrent Neural Network Hidden State for Statistical Parametric Speech Synthesis
2016, influential citation
Idlak Tangle: An Open Source Kaldi Based Parametric Speech Synthesiser Based on DNN
2016, cites this paper
Demo of Idlak Tangle, An Open Source DNN-Based Parametric Speech Synthesiser
2016, cites this paper
Shape-adaptive image compression using lossy shape coding, SA-prediction, and SA-deblocking
2016, cites this paper
Merlin: An Open Source Neural Network Speech Synthesis System
2016, cites this paper
On the use of I-vectors and average voice model for voice conversion without parallel data
2016, cites this paper