ABSTRACT
In this paper, we present our work on Visual Speech Recognition (VSR) for the Mandarin Audio-Visual Speech Recognition (MAVSR) Challenge 2025, with a particular focus on improving lipreading under challenging visual conditions. The proposed system leverages cross-modal knowledge transfer and employs a progressive training strategy based on large-scale speech and visual speech datasets. Furthermore, we introduce LIPER, a visual enhancement module designed to generate improved lip-region visual data under conditions such as low resolution, poor illumination, and color distortion. LIPER also facilitates the synthesis of multi-view lip movements through lip pose estimation and 3D reconstruction. These enhancements significantly improve the robustness of the VSR system under low-quality visual conditions. Experimental results show that the proposed approach achieves a relative character error rate (CER) reduction of 16.1% on the MOV20-Test set compared to the official Track 1 baseline system, and places second among the systems submitted to the challenge. The code is available at https://github.com/yaku122/RVSR.
PUBLICATION RECORD
- Publication year
2025
- Venue
IEEE International Conference on Automatic Face & Gesture Recognition
- Publication date
2025-05-26
- Fields of study
Computer Science
- Source metadata
Semantic Scholar