InfoSyncNet: Information Synchronization Temporal Convolutional Network for Visual Speech Recognition

Junxiao Xue, Xiaozhen Liu, Xue-Qing Wu, Fei Yu, Jun Wang

Published 2025 in IEEE International Joint Conference on Neural Networks

ABSTRACT

Estimating spoken content from silent videos is crucial for applications in Assistive Technology (AT) and Augmented Reality (AR). However, accurately mapping lip movement sequences in videos to words poses significant challenges due to variability across sequences and the uneven distribution of information within each sequence. To tackle this, we introduce InfoSyncNet, a non-uniform sequence modeling network enhanced by tailored data augmentation techniques. Central to InfoSyncNet is a non-uniform quantization module positioned between the encoder and decoder, which dynamically adjusts the network’s focus and effectively handles the natural inconsistencies in visual speech data. Additionally, multiple training strategies are incorporated to enhance the model’s capability to handle variations in lighting and the speaker’s orientation. Comprehensive experiments on the LRW and LRW-1000 datasets confirm the superiority of InfoSyncNet, which achieves new state-of-the-art Top-1 accuracies of 92.0% and 60.7%, respectively.

PUBLICATION RECORD

Publication year
2025
Venue
IEEE International Joint Conference on Neural Networks
Publication date
2025-06-30
Fields of study
Computer Science
Identifiers
DOI 10.1109/IJCNN64981.2025.11228396 arXiv 2508.02460
Source metadata
Semantic Scholar

REFERENCES

Audio-Visual Speech Recognition In-The-Wild: Multi-Angle Vehicle Cabin Corpus and Attention-Based Method
2024, influential reference
GSLip: A Global Lip-Reading Framework with Solid Dilated Convolutions
2024, influential reference
Audio-visual speech recognition based on regulated transformer and spatio-temporal fusion strategy for driver assistive systems
2024, cited by this paper
Sequential Modeling by Leveraging Non-Uniform Distribution of Speech Emotion
2023, cited by this paper
Multi-Temporal Lip-Audio Memory for Visual Speech Recognition
2023, influential reference
Audio-Visual Efficient Conformer for Robust Speech Recognition
2023, cited by this paper
Accurate and Resource-Efficient Lipreading with Efficientnetv2 and Transformers
2022, influential reference
Lipreading Model Based On Whole-Part Collaborative Learning
2022, influential reference
Training Strategies for Improved Lip-Reading
2022, influential reference
Improved Word-level Lipreading with Temporal Shrinkage Network and NetVLAD
2022, influential reference
On the Role of Lip Articulation in Visual Speech Perception
2022, cited by this paper
Speaker-adaptive Lip Reading with User-dependent Padding
2022, cited by this paper
Distinguishing Homophenes Using Multi-Head Visual-Audio Memory for Lip Reading
2022, influential reference
EfficientNetV2: Smaller Models and Faster Training
2021, cited by this paper
Learn an Effective Lip Reading Model without Pains
2020, influential reference
Lipreading Using Temporal Convolutional Networks
2020, influential reference
Can We Read Speech Beyond the Lips? Rethinking RoI Selection for Deep Visual Speech Recognition
2020, influential reference
Discriminative Multi-Modality Speech Recognition
2020, influential reference
Towards Practical Lipreading with Distilled and Efficient Models
2020, influential reference
Lip-reading with Densely Connected Temporal Convolutional Networks
2020, influential reference
Learning Spatio-Temporal Features with Two-Stream Deep 3D CNNs for Lipreading
2019, cited by this paper
Spatio-Temporal Fusion Based Convolutional Sequence Learning for Lip Reading
2019, cited by this paper
Towards Pose-Invariant Lip-Reading
2019, cited by this paper
LRW-1000: A Naturally-Distributed Large-Scale Benchmark for Lip Reading in the Wild
2018, cited by this paper
End-to-End Audiovisual Speech Recognition
2018, influential reference
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling
2018, cited by this paper
ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design
2018, cited by this paper
Deep Audio-Visual Speech Recognition
2018, cited by this paper
Lip-Interact: Improving Mobile Device Interaction with Silent Speech Commands
2018, cited by this paper
Decoupled Weight Decay Regularization
2017, cited by this paper
Combining Residual Networks with LSTMs for Lipreading
2017, cited by this paper
Attention is All you Need
2017, cited by this paper
Learning Spatio-Temporal Features with 3D Residual Networks for Action Recognition
2017, cited by this paper
Squeeze-and-Excitation Networks
2017, influential reference
Deep complementary bottleneck features for visual speech recognition
2016, cited by this paper
Lip Reading in the Wild
2016, cited by this paper
Temporal Convolutional Networks for Action Segmentation and Detection
2016, cited by this paper
Densely Connected Convolutional Networks
2016, cited by this paper
A Decision Tree Framework for Spatiotemporal Sequence Prediction
2015, cited by this paper
Going deeper with convolutions
2014, cited by this paper
ImageNet Large Scale Visual Recognition Challenge
2014, cited by this paper
Expressive Visual Text-to-Speech Using Active Appearance Models
2013, cited by this paper
Visual speech synthesis by modelling coarticulation dynamics using a non-parametric switching state-space model
2010, cited by this paper
A PCA Based Visual DCT Feature Extraction Method for Lip-Reading
2006, cited by this paper

CITED BY

No citing papers are available for this paper.