ABSTRACT
In this paper, we present our work on Visual Speech Recognition (VSR) for the Mandarin Audio-Visual Speech Recognition (MAVSR) Challenge 2025, with a particular focus on improving lipreading under challenging visual conditions. The proposed system leverages cross-modal knowledge transfer and employs a progressive training strategy based on large-scale speech and visual speech datasets. Furthermore, we introduce LIPER, a visual enhancement module designed to generate improved lip-region visual data under conditions such as low resolution, poor illumination, and color distortion. LIPER also facilitates the synthesis of multi-view lip movements through lip pose estimation and 3D reconstruction. These enhancements significantly improve the robustness of the VSR system under low-quality visual conditions. Experimental results show that the proposed approach achieves a relative character error rate (CER) reduction of 16.1% on the MOV20-Test set compared to the official Track 1 baseline system, and places second among the systems submitted to the challenge. The code is available at https://github.com/yaku122/RVSR.
PUBLICATION RECORD
- Publication year
2025
- Venue
IEEE International Conference on Automatic Face & Gesture Recognition
- Publication date
2025-05-26
- Fields of study
Computer Science
- Source metadata
Semantic Scholar