Aligning Multimodal Data for Fine-Grained Video Understanding via Cross-Attentive Recurrent Fusion

Nam-Ho Kim, Jun-Hwa Kim

Published in the 2025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)

ABSTRACT

Fine-grained video classification requires understanding complex spatio-temporal and semantic cues that often exceed the capacity of a single modality. In this paper, we propose a multimodal framework that fuses video, image, and text representations using GRU-based sequence encoders and cross-modal attention mechanisms. The model is trained with a classification or regression loss, depending on the task, and is further regularized through feature-level augmentation and autoencoding techniques. To evaluate the generality of our framework, we conduct experiments on two challenging benchmarks: the DVD dataset for real-world violence detection and the Aff-Wild2 dataset for valence-arousal estimation. Our results demonstrate that the proposed fusion strategy significantly outperforms unimodal baselines, with cross-attention and feature augmentation contributing notably to robustness and performance. The code used in the 9th ABAW competition is available at https://github.com/namho-96/ABAW-9th.
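The abstract does not give implementation details, but the cross-modal attention it describes can be illustrated with a minimal scaled dot-product attention sketch, where one modality's features act as queries and another's as keys and values. The feature dimensions, sequence lengths, and the assumption that features are already linearly projected into a shared space are hypothetical choices for illustration, not the paper's actual configuration:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d_k):
    # Scaled dot-product attention: one modality (queries) attends
    # over another modality (keys_values). Both are assumed to be
    # already projected into a shared d_k-dimensional space.
    scores = queries @ keys_values.T / np.sqrt(d_k)   # (Tq, Tk)
    weights = softmax(scores, axis=-1)                # rows sum to 1
    return weights @ keys_values                      # (Tq, d_k)

rng = np.random.default_rng(0)
video_feats = rng.standard_normal((8, 64))  # e.g. 8 frame-level features
text_feats = rng.standard_normal((5, 64))   # e.g. 5 token-level features

# Video features attend over text features; the fused output keeps the
# video sequence length but mixes in text information.
fused = cross_attention(video_feats, text_feats, d_k=64)
print(fused.shape)  # (8, 64)
```

In the full framework, such fused sequences would then be passed through GRU-based sequence encoders before the task head; here only the attention step is shown.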
