Using Multi-Layer Bidirectional Distillation to Enhance Local and Global Features for Action Recognition

Shilu Kang, Hua Huo, Jiaxin Xu, Aokun Mei, Chen Zhang

Published 2025 in Italian National Conference on Sensors

ABSTRACT

Action recognition tasks differ significantly in how much they rely on local versus global features. For long-video understanding in particular, dynamically balancing the contributions of the two has become a critical challenge for improving recognition accuracy. This paper proposes a Multi-Layer Bidirectional Distillation (MBD) model based on the two-stream architecture. It uses a 3D CNN to capture local spatio-temporal features of videos and a video Transformer to capture global ones, with the aim of exploring how the two feature types complement each other and of enhancing both across diverse recognition scenarios. The model quantifies the contribution of each feature type on a given recognition task to map feature dominance, and categorizes videos into distinct feature-dominant groups. This mapping gives knowledge transfer a clear direction, overcoming the limitations of traditional unidirectional knowledge distillation. Bidirectional knowledge distillation is then performed at the intermediate and final layers, training the model to learn the complementary relationships between features and addressing the insufficient representational capacity of non-dominant features. During inference, an adaptive fusion strategy based on feature dominance fuses the two streams via dynamic weighted summation. This suppresses noise from non-dominant features while maximizing the discriminative advantage of dominant features. MBD is evaluated in systematic comparative experiments on four classic action recognition benchmarks (UCF101, HMDB51, Kinetics-400, and Something-Something V2). The results show that MBD not only excels at short-video recognition but also outperforms the compared methods on complex actions in long-video scenarios.
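The dominance-directed distillation and weighted fusion described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the abstract does not say how dominance is scored, so `dominance_weight` uses a hypothetical ratio of true-class confidences, and the names `directed_distillation_loss` and `adaptive_fusion`, the 0.5 threshold, and the temperature value are likewise hypothetical. Only the final-layer (logit) case is shown; intermediate-layer distillation is omitted.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def dominance_weight(local_logits, global_logits, label):
    """Hypothetical dominance score in (0, 1): the local stream's share of
    the confidence that both streams assign to the true class."""
    p_local = softmax(local_logits)[label]
    p_global = softmax(global_logits)[label]
    return p_local / (p_local + p_global)


def directed_distillation_loss(local_logits, global_logits, w_local,
                               temperature=2.0):
    """Distill from the dominant stream (teacher) to the other (student).
    The 0.5 threshold and the temperature are assumptions."""
    p_local = softmax(local_logits, temperature)
    p_global = softmax(global_logits, temperature)
    if w_local >= 0.5:
        # Local features dominate: the local stream teaches the global one.
        return kl_divergence(p_local, p_global)
    # Global features dominate: the global stream teaches the local one.
    return kl_divergence(p_global, p_local)


def adaptive_fusion(local_logits, global_logits, w_local):
    """Weighted summation of the two streams' logits."""
    return [w_local * l + (1.0 - w_local) * g
            for l, g in zip(local_logits, global_logits)]


# Toy two-class example in which the local stream is more confident.
local, global_ = [2.0, 0.0], [0.0, 0.0]
w = dominance_weight(local, global_, label=0)
loss = directed_distillation_loss(local, global_, w)
fused = adaptive_fusion(local, global_, w)
```

In this sketch the weight is computed from a labeled sample; at inference time a weight would instead have to come from the feature-dominant group the video is assigned to, which the abstract mentions but does not specify.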
