Using Multi-Layer Bidirectional Distillation to Enhance Local and Global Features for Action Recognition

Shilu Kang, Hua Huo, Jiaxin Xu, Aokun Mei, Chen Zhang

Published 2025 in Italian National Conference on Sensors

ABSTRACT

Action recognition tasks differ significantly in how much they rely on local versus global features. For long-video understanding in particular, dynamically balancing the contributions of the two has become a critical challenge for improving recognition accuracy. This paper proposes a Multi-Layer Bidirectional Distillation (MBD) model built on a two-stream architecture. It employs a 3D CNN and a video Transformer to capture local and global spatio-temporal features of videos, respectively, with the aim of exploring how the two feature types complement each other and of enhancing both across diverse recognition scenarios. The model quantifies the contribution of each feature type on a given recognition task to map feature dominance, and categorizes videos into distinct feature-dominant groups. This grouping gives knowledge transfer a clear direction, overcoming the limitations of traditional unidirectional knowledge distillation. Bidirectional knowledge distillation is then performed at the intermediate and final layers, training the model to learn complementary relationships between features and addressing the insufficient representational capacity of non-dominant features. During inference, an adaptive fusion strategy based on feature dominance fuses the features via dynamic weighted summation. This suppresses noise from non-dominant features while maximizing the discriminative advantages of dominant features. The MBD model is evaluated in systematic comparative experiments on four classic action recognition benchmarks (UCF101, HMDB51, Kinetics-400, Something-Something V2). The results show that the MBD model not only excels at short-video recognition but also performs strongly when analyzing complex actions in long-video scenarios.
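The abstract gives no formulas, so the following is only a minimal sketch of the two ideas it names: dominance-directed distillation and dominance-weighted fusion. Everything concrete here is an assumption made for illustration, not the authors' method: the dominance score (each stream's peak softmax confidence), the temperature-scaled KL loss in the style of standard knowledge distillation, and all function names are hypothetical. The sketch covers only the final (logit) layer, not intermediate-layer distillation.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def dominance_weights(local_logits, global_logits):
    """ASSUMPTION: dominance is approximated by each stream's peak softmax
    confidence, normalized so the two weights sum to 1."""
    c_local = max(softmax(local_logits))
    c_global = max(softmax(global_logits))
    total = c_local + c_global
    return c_local / total, c_global / total


def directed_distillation_loss(local_logits, global_logits, temperature=2.0):
    """ASSUMPTION: the dominant stream acts as teacher for the other stream,
    using a temperature-scaled KL loss on the final-layer logits."""
    w_local, w_global = dominance_weights(local_logits, global_logits)
    if w_local >= w_global:
        teacher, student = local_logits, global_logits
    else:
        teacher, student = global_logits, local_logits
    p_teacher = softmax(teacher, temperature)
    p_student = softmax(student, temperature)
    return (temperature ** 2) * kl_divergence(p_teacher, p_student)


def adaptive_fusion(local_logits, global_logits):
    """Fuse the two streams' class probabilities by a weighted sum whose
    weights are the (hypothetical) dominance scores."""
    w_local, w_global = dominance_weights(local_logits, global_logits)
    p_local = softmax(local_logits)
    p_global = softmax(global_logits)
    return [w_local * a + w_global * b for a, b in zip(p_local, p_global)]


# Toy example: the local (3D CNN) stream is confident, the global stream is not.
local_logits = [4.0, 0.5, 0.2]
global_logits = [1.0, 0.9, 0.8]
fused = adaptive_fusion(local_logits, global_logits)
loss = directed_distillation_loss(local_logits, global_logits)
```

In this toy case the local stream receives the larger weight, so the fused prediction follows it, and the distillation loss pushes the global stream toward the local stream's softened distribution. A real implementation would operate on batched tensors and would need the paper's actual dominance measure and intermediate-layer losses.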

PUBLICATION RECORD

Publication year
2025
Venue
Italian National Conference on Sensors
Publication date
2025-11-01
Fields of study
Medicine, Computer Science
Identifiers
DOI 10.3390/s25226849; PMID 41305058; PMCID 12656224
Source metadata
Semantic Scholar, PubMed
