Fast, Compact, and High Quality LSTM-RNN Based Statistical Parametric Speech Synthesizers for Mobile Devices

H. Zen, Yannis Agiomyrgiannakis, Niels Egberts, Fergus Henderson, Przemyslaw Szczepaniak

Published 2016 in Interspeech

ABSTRACT

Acoustic models based on long short-term memory recurrent neural networks (LSTM-RNNs) were applied to statistical parametric speech synthesis (SPSS) and showed significant improvements in naturalness and latency over those based on hidden Markov models (HMMs). This paper describes further optimizations of LSTM-RNN-based SPSS for deployment on mobile devices: weight quantization, multi-frame inference, and robust inference using an ε-contaminated Gaussian loss function. Experimental results in subjective listening tests show that these optimizations can make LSTM-RNN-based SPSS comparable to HMM-based SPSS in runtime speed while maintaining naturalness. Evaluations comparing LSTM-RNN-based SPSS with HMM-driven unit selection speech synthesis are also presented.
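The three optimizations named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation, and several details are assumptions: a symmetric per-matrix linear 8-bit quantization scheme, a reading of multi-frame inference as one recurrent step emitting k consecutive acoustic frames, and an ε-contaminated Gaussian of the form (1 − ε)·N(y; μ, σ²) + ε·N(y; μ, cσ²) with hypothetical defaults ε = 0.1 and c = 10. The function names (`quantize_int8`, `split_frames`, `eps_contaminated_nll`, etc.) are likewise hypothetical.

```python
import math

# Hedged sketch of the three optimizations named in the abstract.
# Function names, the quantization scheme, the frame-bundling reading,
# and the eps/c defaults are illustrative assumptions, not the paper's code.

# 1. Weight quantization (assumed: symmetric, per-matrix, linear 8-bit).
def quantize_int8(weights):
    """Map float weights to integer codes in [-127, 127] plus one float scale."""
    max_abs = max(abs(w) for w in weights) or 1.0  # all-zero input: any scale works
    scale = max_abs / 127.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate float weights; error per weight is about scale / 2 at most."""
    return [c * scale for c in codes]


# 2. Multi-frame inference (assumed: one recurrent step emits k frames,
#    so the network runs k times fewer steps for the same utterance).
def split_frames(step_output, k):
    """Split one output vector of length k * D into k frames of length D."""
    d, rem = divmod(len(step_output), k)
    if rem:
        raise ValueError("output length must be a multiple of k")
    return [step_output[i * d:(i + 1) * d] for i in range(k)]


# 3. epsilon-contaminated Gaussian loss (assumed form):
#    p(y) = (1 - eps) N(y; mu, var) + eps N(y; mu, c * var), with c > 1.
def gaussian_logpdf(y, mu, var):
    """Log density of a univariate Gaussian N(y; mu, var)."""
    return -0.5 * ((y - mu) ** 2 / var + math.log(2.0 * math.pi * var))


def eps_contaminated_nll(y, mu, var, eps=0.1, c=10.0):
    """Negative log-likelihood under the contaminated Gaussian; needs 0 < eps < 1."""
    a = math.log(1.0 - eps) + gaussian_logpdf(y, mu, var)
    b = math.log(eps) + gaussian_logpdf(y, mu, c * var)
    m = max(a, b)  # log-sum-exp keeps far outliers from underflowing to log(0)
    return -(m + math.log(math.exp(a - m) + math.exp(b - m)))
```

Under these assumptions, an outlying target y = 8 with μ = 0 and σ² = 1 costs roughly 32.9 nats under a plain Gaussian but only about 7.6 nats under the contaminated loss, because the wide component decays far more slowly. This bounded penalty on outliers is the kind of robustness the abstract attributes to the ε-contaminated loss.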

PUBLICATION RECORD

Publication year
2016
Venue
Interspeech
Publication date
2016-06-20
Fields of study
Computer Science, Engineering
Identifiers
DOI 10.21437/Interspeech.2016-522
arXiv 1606.06061
Source metadata
Semantic Scholar

REFERENCES

Recent Advances in Google Real-Time HMM-Driven Unit Selection Synthesizer
2016 · influential reference
From HMMS to DNNS: Where do the improvements come from?
2016 · cited by this paper
Robust TTS duration modelling using DNNS
2016 · cited by this paper
On the Efficient Representation and Execution of Deep Acoustic Models
2016 · cited by this paper
Investigating gated recurrent neural networks for speech synthesis
2016 · cited by this paper
Modelling acoustic feature dependencies with artificial neural networks: Trajectory-RNADE
2015 · cited by this paper
Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis
2015 · influential reference
Acoustic Modeling for Speech Synthesis: from HMM to RNN
2015 · cited by this paper
Fusion of multiple parameterisations for DNN-based sinusoidal speech synthesis with multi-task learning
2015 · cited by this paper
Multi-speaker modeling and speaker adaptation for DNN-based TTS synthesis
2015 · cited by this paper
The effect of neural networks in statistical parametric speech synthesis
2015 · cited by this paper
Vocaine the vocoder and applications in speech synthesis
2015 · influential reference
Deep neural networks employing Multi-Task Learning and stacked bottleneck features for speech synthesis
2015 · cited by this paper
Sentence-level control vectors for deep neural network speech synthesis
2015 · cited by this paper
Minimum trajectory error training for deep neural networks, combined with stacked bottleneck features
2015 · cited by this paper
Long short-term memory recurrent neural network architectures for large scale acoustic modeling
2014 · cited by this paper
Voice source modelling using deep neural networks for statistical parametric speech synthesis
2014 · cited by this paper
On the training aspects of Deep Neural Network (DNN) for parametric TTS synthesis
2014 · cited by this paper
An investigation of implementation and performance analysis of DNN based speech synthesis system
2014 · cited by this paper
Modeling DCT parameterized F0 trajectory at intonation phrase level with DNN or decision tree
2014 · cited by this paper
Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis
2014 · cited by this paper
TTS synthesis with bidirectional LSTM based recurrent neural networks
2014 · cited by this paper
Sequence error (SE) minimization training of neural network for voice conversion
2014 · cited by this paper
Prosody contour prediction with long short-term memory, bi-directional, deep recurrent neural networks
2014 · cited by this paper
An empirical study of learning rates in deep neural networks for speech recognition
2013 · cited by this paper
Deep learning in speech synthesis
2013 · cited by this paper
Deep Neural Networks for Acoustic Modeling
2013 · cited by this paper
Loss Minimization and Parameter Estimation with Heavy Tails
2013 · cited by this paper
On rectified linear units for speech processing
2013 · cited by this paper
Statistical parametric speech synthesis using deep neural networks
2013 · cited by this paper
Combining a vector space representation of linguistic context with a deep neural network for text-to-speech synthesis
2013 · cited by this paper
Quantized HMMs for low footprint text-to-speech synthesis
2010 · cited by this paper
Probablistic modelling of F0 in unvoiced regions in HMM based speech synthesis
2009 · cited by this paper
Statistical Parametric Speech Synthesis
2007 · influential reference
Simultaneous modeling of phonetic and prosodic parameters, and characteristic conversion for HMM-based text-to-speech systems
2002 · cited by this paper
Speech parameter generation algorithms for HMM-based speech synthesis
2000 · cited by this paper
Tail distribution modelling using the richter and power exponential distributions
1999 · cited by this paper
Vector quantization of speech spectral parameters using statistics of dynamic features
1997 · cited by this paper
Long Short-Term Memory
1997 · cited by this paper
Backpropagation Through Time: What It Does and How to Do It
1990 · cited by this paper
The acoustic-modeling problem in automatic speech recognition
1987 · cited by this paper
Static and Dynamic Error Propagation Networks with Application to Speech Coding
1987 · cited by this paper
A survey of sampling from contaminated distributions
1960 · cited by this paper

CITED BY

One-class neural network with hybrid pooling on dual-band frequency for spoofing speech detection
2026 · cites this paper
Evaluation framework for deepfake speech detection: a comparative study of state-of-the-art deepfake speech detectors
2025 · cites this paper
Performance Optimization and Practical Exploration of Transformer Architecture in Speech Synthesis
2025 · cites this paper
A Comprehensive Survey with Critical Analysis for Deepfake Speech Detection
2024 · cites this paper
Pre-training Neural Transducer-based Streaming Voice Conversion for Faster Convergence and Alignment-free Training
2024 · cites this paper
ArSa-Tweets: A novel Arabic sarcasm detection system based on deep learning model
2024 · cites this paper
Deepfake Speech Detection: A Spectrogram Analysis
2024 · cites this paper
One-Class Neural Network With Directed Statistics Pooling for Spoofing Speech Detection
2024 · cites this paper
The empirical study of tweet classification system for disaster response using shallow and deep learning models
2024 · cites this paper
Blind and Low-Vision Individuals' Detection of Audio Deepfakes
2024 · cites this paper
Advancing NLP for Underrepresented Languages: A Data-Driven Study on Shahmukhi Punjabi to Retrieve NER, Using RNN and LSTM
2024 · cites this paper
Towards more flexible human-machine speech communication
2023 · cites this paper
Whisper Model Adaptation for FSR-2023 Hakka Speech Recognition Challenge
2023 · cites this paper
Data-driven Communicative Behaviour Generation: A Survey
2023 · cites this paper
E2E-DASR: End-to-end deep learning-based dysarthric automatic speech recognition
2023 · cites this paper
On incorporating social speaker characteristics in synthetic speech
2022 · cites this paper
Automatic Speech Recognition Using Limited Vocabulary: A Survey
2022 · cites this paper
On the Use of LSTM-RNN for Detecting Audio Spoofing Attacks
2022 · cites this paper
FCH-TTS: Fast, Controllable and High-quality Non-Autoregressive Text-to-Speech Synthesis
2022 · cites this paper
Synthesising Audio Adversarial Examples for Automatic Speech Recognition
2022 · cites this paper
C-PAK: Correcting and Completing Variable-Length Prefix-Based Abbreviated Keystrokes
2022 · cites this paper
Investigating a neural all pass warp in modern TTS applications
2022 · cites this paper
Detection of Doctored Speech: Towards an End-to-End Parametric Learn-able Filter Approach
2022 · cites this paper
Master Computer Science Enginetron: realtime car exhaust note synthesis using on-board diagnostics through Text-to-Speech networks
2022 · cites this paper
Deep convolutional neural network based secure wireless voice communication for underground mines
2021 · cites this paper
Automatic Speech Recognition And Limited Vocabulary: A Survey
2021 · cites this paper
Automatic identification of synthetically generated interlanguage transfer phenomena between brazilian portuguese (L1) and english (L2)
2021 · cites this paper
TalkNet: Non-Autoregressive Depth-Wise Separable Convolutional Model for Speech Synthesis
2021 · cites this paper
Intelligibility and naturalness of articulatory synthesis with VocalTractLab compared to established speech synthesis technologies
2021 · cites this paper
A novel method for Mandarin speech synthesis by inserting prosodic structure prediction into Tacotron2
2021 · cites this paper
A Survey of On-Device Machine Learning
2021 · cites this paper
Fcl-Taco2: Towards Fast, Controllable and Lightweight Text-to-Speech Synthesis
2021 · cites this paper
A Survey on Neural Speech Synthesis
2021 · cites this paper
A sample-level DCNN for music auto-tagging
2021 · cites this paper
Review of end-to-end speech synthesis technology based on deep learning
2021 · cites this paper
TalkNet 2: Non-Autoregressive Depth-Wise Separable Convolutional Model for Speech Synthesis with Explicit Pitch and Duration Prediction
2021 · cites this paper
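Pengenalan Karakter Optis untuk Pencatatan Meter Air dengan Long Short Term Memory Recurrent Neural Network (Optical Character Recognition for Water Meter Recording with a Long Short-Term Memory Recurrent Neural Network)
2021 · cites this paper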
Context-Based Feature Technique for Sarcasm Identification in Benchmark Datasets Using Deep Learning and BERT Model
2021 · cites this paper
LIG-Doctor: Efficient patient trajectory prediction using bidirectional minimal gated-recurrent networks
2021 · cites this paper
Chinese Speech Synthesis System Based on End to End
2020 · cites this paper
CLAI: A Platform for AI Skills on the Command Line
2020 · cites this paper
Small-Footprint Open-Vocabulary Keyword Spotting with Quantized LSTM Networks
2020 · cites this paper
A survey on automatic speech recognition systems for Portuguese language and its variations
2020 · cites this paper
TalkNet: Fully-Convolutional Non-Autoregressive Speech Synthesis Model
2020 · cites this paper
Investigation of learning abilities on linguistic features in sequence-to-sequence text-to-speech synthesis
2020 · cites this paper
Crowdsourcing Latin American Spanish for Low-Resource Text-to-Speech
2020 · influential citation
Burmese Speech Corpus, Finite-State Text Normalization and Pronunciation Grammars with an Application to Text-to-Speech
2020 · cites this paper
Deep Neural Mobile Networking
2020 · cites this paper
Robust model training and generalisation with Studentising flows
2020 · cites this paper
Project CLAI: Instrumenting the Command Line as a New Environment for AI Agents
2020 · cites this paper
Frugal Paradigm Completion
2020 · cites this paper
Speech Technology for Healthcare: Opportunities, Challenges, and State of the Art
2020 · cites this paper
PPSpeech: Phrase based Parallel End-to-End TTS System
2020 · cites this paper
Lig-Doctor: Real-World Clinical Prognosis using a Bi-Directional Neural Network
2020 · cites this paper
Improving the Prosody of RNN-Based English Text-To-Speech Synthesis by Incorporating a BERT Model
2020 · cites this paper
Tonal Contour Generation for Isarn Speech Synthesis Using Deep Learning and Sampling-Based F0 Representation
2020 · cites this paper
DurIAN: Duration Informed Attention Network for Speech Synthesis
2020 · cites this paper
Integrated Framework for Data Quality and Security Evaluation on Mobile Devices
2020 · cites this paper
Deep Learning based NLP Techniques In Text to Speech Synthesis for Communication Recognition
2020 · cites this paper
Towards the implementation of an Attention-based Neural Machine Translation with artificial pronunciation for Nahuatl as a mobile application
2020 · cites this paper
ASVspoof 2019: a large-scale public database of synthetic, converted and replayed speech
2020 · cites this paper
Master thesis: Automatic Multispeaker Voice Cloning
2019 · influential citation
Speech Technology Progress Based on New Machine Learning Paradigm
2019 · cites this paper
DurIAN: Duration Informed Attention Network For Multimodal Synthesis
2019 · cites this paper
Patient trajectory prediction in the Mimic-III dataset, challenges and pitfalls
2019 · influential citation
Central Audio-Library of the University of Novi Sad
2019 · cites this paper
DNN-based laughter synthesis
2019 · cites this paper
On-Device Machine Learning: An Algorithms and Learning Theory Perspective
2019 · cites this paper
The ASVspoof 2019 database
2019 · cites this paper
Using generative modelling to produce varied intonation for speech synthesis
2019 · influential citation
DNN Based Expressive Text-to-Speech with Limited Training Data
2019 · cites this paper
A2Text-Net: A Novel Deep Neural Network for Sarcasm Detection
2019 · cites this paper
The IIM System for Blizzard Challenge 2019
2019 · influential citation
CHiVE: Varying Prosody in Speech Synthesis with a Linguistically Driven Dynamic Hierarchical Conditional Variational Network
2019 · influential citation
ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech
2019 · cites this paper
A Prosodic Mandarin Text-to-Speech System Based on Tacotron
2019 · cites this paper
Analysing Deep Learning-Spectral Envelope Prediction Methods for Singing Synthesis
2019 · cites this paper
Intrinsically Sparse Long Short-Term Memory Networks
2019 · cites this paper
Real-time Voice Cloning
2019 · influential citation
Prosody generation for text-to-speech synthesis
2019 · cites this paper
Deep Learning in Mobile and Wireless Networking: A Survey
2018 · cites this paper
Sample Efficient Adaptive Text-to-Speech
2018 · influential citation
An Investigation of Noise Shaping with Perceptual Weighting for Wavenet-Based Speech Generation
2018 · cites this paper
Deep Learning and its Applications Surveys on Future Mobile Networks Deep Learning Driven Networking Applications Fundamental Principles Advantages Multilayer Perceptron Boltzmann Machine Auto-encoder Convolutional Neural Network Recurrent Neural Network Generative Adversarial Network Deep Reinforce
2018 · cites this paper
Emphatic Speech Prosody Prediction with Deep Lstm Networks
2018 · cites this paper
Interactive Design of 3D Dynamic Gesture Based on SVM-LSTM Model
2018 · cites this paper
EMPHASIS: An Emotional Phoneme-based Acoustic Model for Speech Synthesis System
2018 · influential citation
Localized Mandarin Speech Synthesis Services for Enterprise Scenarios
2018 · cites this paper
High-Quality Statistical Parametric Speech Synthesis Using Generative Adversarial Networks
2018 · cites this paper
Reinforcement learning and reward estimation for dialogue policy optimisation
2018 · cites this paper
Modelling customer online behaviours with neural networks: applications to conversion prediction and advertising retargeting
2018 · cites this paper
The USTC System for Blizzard Challenge 2018
2018 · influential citation
Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
2018 · cites this paper
Grow and Prune Compact, Fast, and Accurate LSTMs
2018 · cites this paper
ATTS2S-VC: Sequence-to-sequence Voice Conversion with Attention and Context Preservation Mechanisms
2018 · cites this paper
Building Open Javanese and Sundanese Corpora for Multilingual Text-to-Speech
2018 · cites this paper
An autoregressive recurrent mixture density network for parametric speech synthesis
2017 · cites this paper
Unit Selection with Hierarchical Cascaded Long Short Term Memory Bidirectional Recurrent Neural Nets
2017 · cites this paper
Google's Next-Generation Real-Time Unit-Selection Synthesizer Using Sequence-to-Sequence LSTM-Based Autoencoders
2017 · influential citation
Uniform Multilingual Multi-Speaker Acoustic Model for Statistical Parametric Speech Synthesis of Low-Resourced Languages
2017 · influential citation