Robust Audio-Visual Speech Recognition Under Noisy Audio-Video Conditions

D. Stewart, Rowan Seymour, Adrian Pass, Ji Ming

Published 2014 in IEEE Transactions on Cybernetics

ABSTRACT

This paper presents the maximum weighted stream posterior (MWSP) model as a robust and efficient stream integration method for audio-visual speech recognition in environments where the audio or video streams may be subjected to unknown and time-varying corruption. A significant advantage of MWSP is that it does not require any specific measurements of the signal in either stream to calculate appropriate stream weights during recognition, and as such it is modality-independent. This also means that MWSP complements, and can be used alongside, many of the other approaches that have been proposed in the literature for this problem. For evaluation, we used the large XM2VTS database for speaker-independent audio-visual speech recognition. The extensive tests include both clean and corrupted utterances, with corruption added to the video stream, the audio stream, or both, using a variety of types (e.g., MPEG-4 video compression) and levels of noise. The experiments show that this approach gives excellent performance in comparison to another well-known dynamic stream weighting approach, and also compared to any fixed-weighted integration approach, both in clean conditions and when noise is added to either stream. Furthermore, our experiments show that the MWSP approach dynamically selects suitable integration weights on a frame-by-frame basis according to the level of noise in the streams, and also according to the naturally fluctuating relative reliability of the modalities even in clean conditions. The MWSP approach is shown to maintain robust recognition performance in all tested conditions, while requiring no prior knowledge about the type or level of noise.
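
The abstract does not give the MWSP formula, so the following is only an illustrative sketch of the general idea it describes: choosing a stream weight per frame from the posteriors themselves, with no separate measurement of noise. It assumes per-state log-likelihoods are already available for each stream, a uniform state prior, and a coarse grid of candidate weights. It also simplifies by picking one weight per frame (the one that makes the most probable state as confident as possible). The function names, the grid, and this frame-level selection rule are all hypothetical and are not taken from the paper.

```python
import math


def weighted_posteriors(audio_loglik, video_loglik, weight):
    """State posteriors for one frame given an audio weight in [0, 1].

    Assumption: uniform prior over states, so the prior cancels out
    in the normalization and is omitted.
    """
    # Log-domain weighted combination: w * log p(audio|s) + (1 - w) * log p(video|s)
    scores = [
        weight * a + (1.0 - weight) * v
        for a, v in zip(audio_loglik, video_loglik)
    ]
    # Normalize with log-sum-exp for numerical stability.
    peak = max(scores)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [math.exp(s - log_norm) for s in scores]


def select_frame_weight(audio_loglik, video_loglik, candidate_weights=None):
    """Grid-search the weight whose top state posterior is largest.

    Hypothetical simplification: one weight per frame, chosen by the
    peak posterior. Returns (chosen_weight, posteriors_at_that_weight).
    """
    if candidate_weights is None:
        candidate_weights = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
    best_weight, best_post = None, None
    for w in candidate_weights:
        post = weighted_posteriors(audio_loglik, video_loglik, w)
        if best_post is None or max(post) > max(best_post):
            best_weight, best_post = w, post
    return best_weight, best_post
```

In this sketch, a frame whose video likelihoods are nearly flat across states (for example, because of heavy compression artifacts) pushes the selected weight toward the audio stream, and the reverse happens when the audio is uninformative. That mirrors, in a very reduced form, the frame-by-frame behaviour the abstract reports.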

PUBLICATION RECORD

Publication year
2014
Venue
IEEE Transactions on Cybernetics
Publication date
2014-01-13
Fields of study
Medicine, Computer Science, Engineering
Identifiers
DOI: 10.1109/TCYB.2013.2250954; PMID: 23757540
Source metadata
Semantic Scholar, PubMed

REFERENCES

On Dynamic Stream Weighting for Audio-Visual Speech Recognition
2012 (cited by this paper)
A multi-stream ASR framework for BLSTM modeling of conversational speech
2011 (cited by this paper)
“Your Word is my Command”: Google Search by Voice: A Case Study
2010 (cited by this paper)
Dynamic visual features for audio-visual speaker verification
2010 (cited by this paper)
A Novel Algorithm for Acoustic and Visual Classifiers Decision Fusion in Audio-Visual Speech Recognition System
2010 (cited by this paper)
Improved decision trees for multi-stream HMM-based audio-visual continuous speech recognition
2009 (cited by this paper)
Robust face recognition using posterior union model based neural networks
2009 (cited by this paper)
Feature space video stream consistency estimation for dynamic stream weighting in audio-visual speech recognition
2008 (cited by this paper)
Comparison of Image Transform-Based Features for Visual Speech Recognition in Clean and Corrupted Videos
2008 (cited by this paper)
Fused HMM-adaptation of multi-stream HMMs for audio-visual speech recognition
2007 (cited by this paper)
Audio-visual integration for robust speech recognition using maximum weighted stream posteriors
2007 (cited by this paper)
An Examination of Audio-Visual Fused HMMs for Speaker Recognition
2006 (cited by this paper)
A Posterior Union Model with Applications to Robust Speech and Speaker Recognition
2006 (cited by this paper)
A new posterior based audio-visual integration method for robust speech recognition
2005 (cited by this paper)
Personal identity.
2005 (cited by this paper)
Frame-dependent multi-stream reliability indicators for audio-visual speech recognition
2003 (cited by this paper)
Noisy audio feature enhancement using audio-visual speech data
2002 (cited by this paper)
Maximum entropy and MCE based HMM stream weight estimation for audio-visual ASR
2002 (cited by this paper)
DCT-based video features for audio-visual speech recognition
2002 (cited by this paper)
Dynamic Bayesian Networks for Audio-Visual Speech Recognition
2002 (cited by this paper)
Optimal weighting of posteriors for audio-visual speech recognition
2001 (cited by this paper)
Hierarchical discriminant features for audio-visual LVCSR
2001 (cited by this paper)
Asynchronous stream modeling for large vocabulary audio-visual speech recognition
2001 (cited by this paper)
Stream weight optimization of speech and lip image sequence for audio-visual speech recognition
2000 (cited by this paper)
Aspects of facial biometrics for verification of personal identity
2000 (cited by this paper)
Audio-Visual Speech Modeling for Continuous Speech Recognition
2000 (influential reference)
XM2VTSDB: The Extended M2VTS Database
1999 (cited by this paper)
An image transform approach for HMM based automatic lipreading
1998 (cited by this paper)
On the Integration of Auditory and Visual Parameters in an HMM-based ASR
1996 (cited by this paper)
Lip synchronization using speech-assisted video processing
1995 (cited by this paper)
Hearing lips and seeing voices
1976 (cited by this paper)

CITED BY

Visual-Informed Speech Enhancement Using Attention-Based Beamforming
2026 (cites this paper)
MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens
2025 (cites this paper)
Bridging semantics across modalities: Decoupled representation learning for audio-visual speech recognition
2025 (cites this paper)
AVE Speech: A Comprehensive Multimodal Dataset for Speech Recognition Integrating Audio, Visual, and Electromyographic Signals
2025 (cites this paper)
Efficient Audiovisual Speech Processing via MUTUD: Multimodal Training and Unimodal Deployment
2025 (cites this paper)
WiVi-GR: Wireless-Visual Joint Representation-Based Accurate Gesture Recognition
2024 (cites this paper)
Improving classification performance of motor imagery BCI through EEG data augmentation with conditional generative adversarial networks
2024 (cites this paper)
Unified Cross-Modal Attention: Robust Audio-Visual Speech Recognition and Beyond
2024 (cites this paper)
Restoring Speaking Lips from Occlusion for Audio-Visual Speech Recognition
2024 (cites this paper)
A Comprehensive Review of Recent Advances in Deep Neural Networks for Lipreading With Sign Language Recognition
2024 (cites this paper)
Tagging before Alignment: Integrating Multi-Modal Tags for Video-Text Retrieval
2023 (cites this paper)
A Multiple-Teacher Pruning Based Self-Distillation (MT-PSD) Approach to Model Compression for Audio-Visual Wake Word Spotting
2023 (cites this paper)
G-Mix: A Generalized Mixup Learning Framework Toward Flat Minima
2023 (cites this paper)
Audio-visual keyword transformer for unconstrained sentence-level keyword spotting
2023 (cites this paper)
Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring
2023 (cites this paper)
An Empirical Investigation of Human Identity Verification Methods
2023 (cites this paper)
Complementary models for audio-visual speech classification
2022 (cites this paper)
A Self-Supervised Representation Learning Method of Speech Recognition for Smart Grid
2022 (cites this paper)
A Novel Approach to Structured Pruning of Neural Network for Designing Compact Audio-Visual Wake Word Spotting System
2022 (cites this paper)
Predict-and-Update Network: Audio-Visual Speech Recognition Inspired by Human Speech Perception
2022 (cites this paper)
Reliability-Based Large-Vocabulary Audio-Visual Speech Recognition
2022 (cites this paper)
Robust Threshold Selection for Environment Specific Voice in Speaker Recognition
2022 (cites this paper)
COLD Fusion: Calibrated and Ordinal Latent Distribution Fusion for Uncertainty-Aware Multimodal Emotion Recognition
2022 (cites this paper)
E2E-V2SResNet: Deep residual convolutional neural networks for end-to-end video driven speech synthesis
2022 (cites this paper)
End-to-End Video-to-Speech Synthesis Using Generative Adversarial Networks
2021 (cites this paper)
Development of Visual and Audio Speech Recognition Systems Using Deep Neural Networks
2021 (cites this paper)
DARE: Deceiving Audio-Visual speech Recognition model
2021 (cites this paper)
A Survey on Visual Speech Recognition Approaches
2021 (cites this paper)
Robust Audio-Visual Speech Recognition Based on Hybrid Fusion
2021 (cites this paper)
Automatic Lip-Reading with Hierarchical Pyramidal Convolution and Self-Attention for Image Sequences with No Word Boundaries
2021 (cites this paper)
Large-vocabulary Audio-visual Speech Recognition in Noisy Environments
2021 (cites this paper)
A Novel Task-Oriented Approach Toward Automated Lip-Reading System Implementation
2021 (cites this paper)
Speech Recognition Using Spectrogram-Based Visual Features
2020 (cites this paper)
Improved Lite Audio-Visual Speech Enhancement
2020 (cites this paper)
An Analysis of Visual Speech Features for Recognition of Non-articulatory Sounds using Machine Learning
2019 (cites this paper)
On the Importance of Video Action Recognition for Visual Lipreading
2019 (cites this paper)
Improved features and dynamic stream weight adaption for robust Audio-Visual Speech Recognition framework
2019 (cites this paper)
Learning Spatio-Temporal Features with Two-Stream Deep 3D CNNs for Lipreading
2019 (cites this paper)
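A Study of the Effect of High-Speed Video Data on the Accuracy of Audio-Visual Speech Recognition [Исследование влияния высокоскоростных видеоданных на точность распознавания аудиовизуальной речи]
2019 (cites this paper)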
Automatic Speech Recognition: A Review
2019 (cites this paper)
Audio-Visual Speech Recognition System Using Recurrent Neural Network
2019 (cites this paper)
Scope for Deep Learning: A Study in Audio-Visual Speech Recognition
2019 (cites this paper)
Audio Tracking in Noisy Environments by Acoustic Map and Spectral Signature
2018 (cites this paper)
Audio-visual feature fusion via deep neural networks for automatic speech recognition
2018 (cites this paper)
Survey on automatic lip-reading in the era of deep learning
2018 (influential citation)
Audio-Visual Keyword Spotting Based on Multidimensional Convolutional Neural Network
2018 (cites this paper)
The Impact of Reduced Video Quality on Visual Speech Recognition
2018 (cites this paper)
Multimodal speech recognition: increasing accuracy using high speed video data
2018 (cites this paper)
Multimodal Interfaces of Human–Computer Interaction
2018 (cites this paper)
Bioinspired Integration of Auditory and Haptic Perception in Bone Milling Surgery
2018 (cites this paper)
Development on SNR estimator for audio-visual speech recognition based on waveform amplitude distribution analysis
2018 (cites this paper)
A Review of Audio-Visual Speech Recognition
2018 (cites this paper)
Gating Neural Network for Large Vocabulary Audiovisual Speech Recognition
2018 (cites this paper)
Robust Audio-Visual Speech Recognition System based on Gabor Features and Dynamic Stream Weight Adaption
2018 (cites this paper)
A Large-Scale Depth-Based Multimodal Audio-Visual Corpus in Mandarin
2018 (cites this paper)
Block Energy Based Visual Features Using Histogram Of Oriented Gradient For Bimodal Hindi Speech Recognition
2018 (cites this paper)
Using a High-Speed Video Camera for Robust Audio-Visual Speech Recognition in Acoustically Noisy Conditions
2017 (cites this paper)
A cascade gray-stereo visual feature extraction method for visual and audio-visual speech recognition
2017 (cites this paper)
Lip detection and adaptive tracking
2017 (cites this paper)
Audio and visual modality combination in speech processing applications
2017 (cites this paper)
An audio-visual corpus for multimodal automatic speech recognition
2017 (cites this paper)
Principal Component 2-D Long Short-Term Memory for Font Recognition on Single Chinese Characters
2016 (cites this paper)
Performance Enhancement of Automatic Speech Recognition (ASR) Using Robust Wavelet-Based Feature Extraction Techniques
2016 (cites this paper)
Multiple cameras audio visual speech recognition using active appearance model visual features in car environment
2016 (cites this paper)
Bimodal Speech Recognition Fusing Audio-Visual Modalities
2016 (cites this paper)
An interactive speech therapy session using linear predictive coding in Matlab and Arduino
2016 (cites this paper)
Discrimination Between Native and Non-Native Speech Using Visual Features Only
2016 (cites this paper)
Audiovisual Fusion: Challenges and New Approaches
2015 (cites this paper)
A novel approach for multimodal graph dimensionality reduction
2015 (cites this paper)
Impact of each camera on multiple camera visual speech recognizer using ANOVA: A brief study
2015 (cites this paper)
AAM Based Features for Multiple Camera Visual Speech Recognition in Car Environment
2015 (cites this paper)
VidTIMIT audio visual phoneme recognition using AAM visual features and human auditory motivated acoustic wavelet features
2015 (influential citation)
Multiple camera in car audio-visual speech recognition using phonetic and visemic information
2015 (cites this paper)
Performance Evaluation of Bimodal Hindi Speech Recognition under Adverse Environment
2014 (cites this paper)
A review of recent advances in visual speech decoding
2014 (cites this paper)
The selective use of gaze in automatic speech recognition
2014 (cites this paper)
A Survey on Speech Recognition from Lip Movement
Year unknown (cites this paper)
An Improved Structured Pruning Approach to Channel-level Pruning for Designing Compact Audio-Visual Wake Word Spotting System
Year unknown (cites this paper)