CNVSRC 2024: The Second Chinese Continuous Visual Speech Recognition Challenge

Zehua Liu, Xiaolou Li, Chen Chen, Lantian Li, Dong Wang

Published 2025 in Interspeech

ABSTRACT

This paper presents the second Chinese Continuous Visual Speech Recognition Challenge (CNVSRC 2024), which builds on CNVSRC 2023 to advance research in Chinese Large Vocabulary Continuous Visual Speech Recognition (LVC-VSR). The challenge evaluates two test scenarios: reading in recording studios and Internet speech. CNVSRC 2024 uses the same datasets as its predecessor, CNVSRC 2023: CN-CVS for training and CNVSRC-Single/Multi for development and evaluation. However, CNVSRC 2024 introduces two key improvements: (1) a stronger baseline system, and (2) an additional dataset, CN-CVS2-P1, for open tracks to improve data volume and diversity. The new challenge has demonstrated several important innovations in data preprocessing, feature extraction, model design, and training strategies, further advancing the state of the art in Chinese LVC-VSR. More details and resources are available at the official website.

PUBLICATION RECORD

Publication year
2025
Venue
Interspeech
Publication date
2025-05-27
Fields of study
Computer Science, Engineering
Identifiers
DOI: 10.48550/arXiv.2506.02010; arXiv: 2506.02010
Source metadata
Semantic Scholar

REFERENCES

CNVSRC 2023: The First Chinese Continuous Visual Speech Recognition Challenge
2024, influential reference
Enhancing Lip Reading with Multi-Scale Video and Multi-Encoder
2024, cited by this paper
The NPU-ASLP-LiAuto System Description for Visual Speech Recognition in CNVSRC 2023
2024, influential reference
Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels
2023, cited by this paper
Multi-Temporal Lip-Audio Memory for Visual Speech Recognition
2023, cited by this paper
CN-CVS: A Mandarin Audio-Visual Dataset for Large Vocabulary Continuous Visual to Speech Synthesis
2023, cited by this paper
Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition
2022, cited by this paper
Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction
2022, cited by this paper
MFA-Conformer: Multi-scale Feature Aggregation Conformer for Automatic Speaker Verification
2022, cited by this paper
Distinguishing Homophenes Using Multi-Head Visual-Audio Memory for Lip Reading
2022, cited by this paper
Training Strategies for Improved Lip-Reading
2022, cited by this paper
On the Parameterization and Initialization of Diagonal State Space Models
2022, cited by this paper
E-Branchformer: Branchformer with Enhanced Merging for Speech Recognition
2022, cited by this paper
Structured State Space Decoder for Speech Recognition and Synthesis
2022, cited by this paper
Improved Word-level Lipreading with Temporal Shrinkage Network and NetVLAD
2022, cited by this paper
Efficiently Modeling Long Sequences with Structured State Spaces
2021, cited by this paper
U2++: Unified Two-pass Bidirectional End-to-end Model for Speech Recognition
2021, cited by this paper
Lips Don't Lie: A Generalisable and Robust Approach to Face Forgery Detection
2020, cited by this paper
Conformer: Convolution-augmented Transformer for Speech Recognition
2020, cited by this paper
Large-Scale Visual Speech Recognition
2018, cited by this paper
Decoupled Weight Decay Regularization
2017, cited by this paper
Hybrid CTC/Attention Architecture for End-to-End Speech Recognition
2017, cited by this paper
Combining Residual Networks with LSTMs for Lipreading
2017, cited by this paper
Lip Reading in the Wild
2016, cited by this paper
Visual Speech Recognition
2011, cited by this paper
Audiovisual integration and lipreading abilities of older adults with normal and impaired hearing
2006, cited by this paper
A post-processing system to yield reduced word error rates: Recognizer Output Voting Error Reduction (ROVER)
1997, cited by this paper
Continuous automatic speech recognition by lipreading
1993, cited by this paper

CITED BY

VALLR-Pin: Uncertainty-Factorized Visual Speech Recognition for Mandarin with Pinyin Guidance
2025, cites this paper