AudioVisual Speech Synthesis: A brief literature review

Efthymios Georgiou, Athanasios Katsamanis

Published 2021 in arXiv.org

ABSTRACT

This literature review studies the problem of audiovisual speech synthesis. In essence, we examine how, given some input text, we can synthesize a human-like visual stream together with the corresponding speech. Because the problem is highly complex, we study it in two parts: text-to-speech synthesis, and the synthesis of a human-like visual stream from speech. For speech synthesis, we study both the networks that map text to an intermediate representation and the networks that generate speech from these intermediate representations. For visual synthesis, we categorize the approaches according to whether they generate human faces or human-like figures. We also attempt to show how much the choice of face model matters in the latter case. Throughout the review, we present what are, in our opinion, the most important works in both fields, emphasizing the advantages and disadvantages of each.

PUBLICATION RECORD

Publication year
2021
Venue
arXiv.org
Publication date
2021-02-18
Fields of study
Linguistics, Engineering, Computer Science
Identifiers
arXiv 2103.03927
Source metadata
Semantic Scholar
