Joint Learning for Relationship and Interaction Analysis in Video with Multimodal Feature Fusion

Beibei Zhang, Fan Yu, Yanxin Gao, Tongwei Ren, Gangshan Wu

Published 2021 in ACM Multimedia

ABSTRACT

To comprehend long-duration videos, the deep video understanding (DVU) task is proposed to recognize interactions at the scene level and relationships at the movie level, and to answer questions at both levels. In this paper, we propose a solution to the DVU task that applies joint learning of interaction and relationship prediction together with multimodal feature fusion. Our solution handles the DVU task with three jointly learned sub-tasks: scene sentiment classification, scene interaction recognition, and super-scene video relationship recognition. All three use text, visual, and audio features, and predict representations in a semantic space. Because sentiment, interaction, and relationship are related to each other, we train a unified framework with joint learning. We then answer the DVU video analysis questions based on the results of the three sub-tasks. We conduct experiments on the HLVU dataset to evaluate the effectiveness of our method.
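The joint learning idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes concatenation as the fusion operator, one linear classification head per sub-task, and a weighted sum of cross-entropy losses as the joint objective. The abstract instead describes predicting representations in a semantic space, and relationship recognition operates above the scene level, so sharing a single fused vector across all three heads is a simplification. All function names and parameters below are hypothetical.

```python
import math

def fuse(text_feat, visual_feat, audio_feat):
    """Fuse modalities by concatenation (an assumed fusion operator)."""
    return list(text_feat) + list(visual_feat) + list(audio_feat)

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, target):
    """Negative log-probability of the target class."""
    return -math.log(softmax(scores)[target])

def linear_head(weights, features):
    """One score per class: dot product of each weight row with the features."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]

def joint_loss(fused, heads, targets, task_weights):
    """Weighted sum of per-task losses computed from one shared fused input."""
    total = 0.0
    for task, weights in heads.items():
        scores = linear_head(weights, fused)
        total += task_weights[task] * cross_entropy(scores, targets[task])
    return total

# Toy features for one scene (hypothetical values and dimensions).
fused = fuse([0.5, -0.2], [1.0], [0.3])

# Zero-initialised heads with 2, 3 and 2 classes (hypothetical label counts).
heads = {
    "sentiment": [[0.0] * 4 for _ in range(2)],
    "interaction": [[0.0] * 4 for _ in range(3)],
    "relationship": [[0.0] * 4 for _ in range(2)],
}
targets = {"sentiment": 0, "interaction": 2, "relationship": 1}
task_weights = {"sentiment": 1.0, "interaction": 1.0, "relationship": 1.0}

loss = joint_loss(fused, heads, targets, task_weights)
```

Because the heads are zero-initialised, each one predicts a uniform distribution, so the joint loss here equals log 2 + log 3 + log 2 = log 12. In a trained model, a gradient step on this single scalar would update all three heads at once, which is the sense in which the sub-tasks are learned jointly.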

PUBLICATION RECORD

Publication year: 2021
Venue: ACM Multimedia
Publication date: 2021-10-17
Fields of study: Computer Science
DOI: 10.1145/3474085.3479214
Source metadata: Semantic Scholar

REFERENCES

Shot Contrastive Self-Supervised Learning for Scene Boundary Detection
2021, cited by this paper
Deep Relationship Analysis in Video with Multimodal Feature Fusion
2020, influential reference
Online Multi-modal Person Search in Videos
2020, cited by this paper
RetinaFace: Single-Shot Multi-Level Face Localisation in the Wild
2020, cited by this paper
HLVU: A New Challenge to Test Deep Understanding of Movies the Way Humans do
2020, cited by this paper
A Local-to-Global Approach to Multi-Modal Movie Scene Segmentation
2020, cited by this paper
Learning Interactions and Relationships Between Movie Characters
2020, cited by this paper
Tracking Objects as Points
2020, cited by this paper
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
2019, cited by this paper
Tracking Without Bells and Whistles
2019, cited by this paper
Video Summarization by Learning From Unpaired Data
2018, cited by this paper
A Domain Based Approach to Social Relation Recognition
2017, cited by this paper
Video Visual Relation Detection
2017, cited by this paper
Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks
2016, cited by this paper
Learning Spatiotemporal Features with 3D Convolutional Networks
2014, cited by this paper
Learning to Segment a Video to Clips Based on Scene and Camera Motion
2012, cited by this paper
Multiple feature hashing for real-time large scale near-duplicate video retrieval
2011, cited by this paper
Observing Human-Object Interactions: Using Spatial and Functional Compatibility for Recognition
2009, cited by this paper
Reducing drift in differential tracking
2008, cited by this paper
SURF: Speeded Up Robust Features
2006, cited by this paper
Robust online appearance models for visual tracking
2001, cited by this paper

CITED BY

Synergizing Multimodal Temporal Knowledge Graphs and Large Language Models for Social Relation Recognition
2025, cites this paper
Key Clues Guided Video Character Social Relationship Recognition Enhanced by LLM
2025, cites this paper
Multimodal early fusion operators for temporal video scene segmentation tasks
2023, cites this paper
Shifted GCN-GAT and Cumulative-Transformer based Social Relation Recognition for Long Videos
2023, influential citation
Development of a MultiModal Annotation Framework and Dataset for Deep Video Understanding
2022, cites this paper
Two stage Multi-Modal Modeling for Video Interaction Analysis in Deep Video Understanding Challenge
2022, cites this paper
Multimodal Analysis for Deep Video Understanding with Video Language Transformer
2022, cites this paper
A Multi-Stream Approach for Video Understanding
2022, cites this paper
RETRACTED ARTICLE: ICDN: integrating consistency and difference networks by transformer for multimodal sentiment analysis
2022, cites this paper
Text Reconstruction Method of College English Textbooks from the Perspective of Language Images
2022, cites this paper
TSPNet: Translation supervised prototype network via residual learning for multimodal social relation extraction
2022, cites this paper
Hybrid Improvements in Multimodal Analysis for Deep Video Understanding
2021, cites this paper