Aligning Multimodal Data for Fine-Grained Video Understanding via Cross-Attentive Recurrent Fusion

Nam-Ho Kim, Jun-Hwa Kim

Published in the 2025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)

ABSTRACT

Fine-grained video classification requires understanding complex spatio-temporal and semantic cues that often exceed the capacity of a single modality. In this paper, we propose a multimodal framework that fuses video, image, and text representations using GRU-based sequence encoders and cross-modal attention mechanisms. The model is trained with a classification or regression loss, depending on the task, and is further regularized through feature-level augmentation and autoencoding techniques. To evaluate the generality of our framework, we conduct experiments on two challenging benchmarks: the DVD dataset for real-world violence detection and the Aff-Wild2 dataset for valence-arousal estimation. Our results demonstrate that the proposed fusion strategy significantly outperforms unimodal baselines, with cross-attention and feature augmentation contributing notably to robustness and performance. The code used in the 9th ABAW competition is available at https://github.com/namho-96/ABAW-9th.
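The abstract does not give implementation details, but the cross-modal attention it describes can be illustrated with a minimal scaled dot-product attention sketch, where one modality's features act as queries and another's as keys and values. The feature dimensions, sequence lengths, and the assumption that features are already linearly projected into a shared space are hypothetical choices for illustration, not the paper's actual configuration:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d_k):
    # Scaled dot-product attention: one modality (queries) attends
    # over another modality (keys_values). Both are assumed to be
    # already projected into a shared d_k-dimensional space.
    scores = queries @ keys_values.T / np.sqrt(d_k)   # (Tq, Tk)
    weights = softmax(scores, axis=-1)                # rows sum to 1
    return weights @ keys_values                      # (Tq, d_k)

rng = np.random.default_rng(0)
video_feats = rng.standard_normal((8, 64))  # e.g. 8 frame-level features
text_feats = rng.standard_normal((5, 64))   # e.g. 5 token-level features

# Video features attend over text features; the fused output keeps the
# video sequence length but mixes in text information.
fused = cross_attention(video_feats, text_feats, d_k=64)
print(fused.shape)  # (8, 64)
```

In the full framework, such fused sequences would then be passed through GRU-based sequence encoders before the task head; here only the attention step is shown.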
