Three-Stream Fusion Network for First-Person Interaction Recognition

Published 2020 in Pattern Recognition

ABSTRACT

First-person interaction recognition is a challenging task because of unstable video conditions resulting from the camera wearer's movement. For human interaction recognition from a first-person viewpoint, this paper proposes a three-stream fusion network with two main parts: a three-stream architecture and three-stream correlation fusion. The three-stream architecture captures the characteristics of the target appearance, target motion, and camera ego-motion. Meanwhile, the three-stream correlation fusion combines the feature maps of the three streams to account for the correlations among the target appearance, target motion, and camera ego-motion. The fused feature vector is robust to camera movement and compensates for the noise of the camera ego-motion. Short-term intervals are modeled using the fused feature vector, and a long short-term memory (LSTM) model considers the temporal dynamics of the video. We evaluated the proposed method on two public benchmark datasets to validate the effectiveness of our approach. The experimental results show that the proposed fusion method successfully generated a discriminative feature vector, and our network outperformed all competing activity recognition methods in first-person videos where considerable camera ego-motion occurs.
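
The abstract does not specify the fusion operation, so the following is only an illustrative sketch of the general idea: three per-segment feature vectors (appearance, motion, ego-motion) are combined so that cross-stream interactions appear explicitly in the fused vector. The function names (`l2_normalize`, `correlation_fuse`, `fuse_segments`) and the choice of pairwise element-wise products as the "correlation" term are assumptions made for illustration, not the authors' method. The LSTM stage is omitted; the list returned by `fuse_segments` stands in for the sequence such a model would consume.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def l2_normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit L2 norm (all-zero vectors are returned unchanged)."""
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v] if norm > 0 else list(v)


def correlation_fuse(
    appearance: Sequence[float],
    motion: Sequence[float],
    ego_motion: Sequence[float],
) -> Vector:
    """Hypothetical fusion: concatenate the three normalized streams with
    their pairwise element-wise products (an assumed stand-in for the
    paper's correlation fusion, which the abstract does not define)."""
    if not (len(appearance) == len(motion) == len(ego_motion)):
        raise ValueError("all three streams must have the same dimension")
    a, m, e = (l2_normalize(s) for s in (appearance, motion, ego_motion))
    pairwise: Vector = []
    for x, y in ((a, m), (a, e), (m, e)):
        pairwise.extend(p * q for p, q in zip(x, y))
    return a + m + e + pairwise


def fuse_segments(
    segments: Sequence[Tuple[Sequence[float], Sequence[float], Sequence[float]]],
) -> List[Vector]:
    """Fuse each short-term segment independently, preserving temporal order."""
    return [correlation_fuse(a, m, e) for a, m, e in segments]


# Two toy segments with 2-dimensional features per stream.
sequence = fuse_segments([
    ([3.0, 4.0], [1.0, 0.0], [0.0, 2.0]),
    ([1.0, 1.0], [0.0, 1.0], [1.0, 0.0]),
])
```

With d-dimensional inputs, each fused vector has 6d entries: three normalized streams followed by three pairwise product terms. A recurrent model would then be run over `sequence` to capture the temporal dynamics across segments.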

PUBLICATION RECORD

Publication year
2020
Venue
Pattern Recognition
Publication date
2020-02-19
Fields of study
Computer Science
Identifiers
DOI: 10.1016/j.patcog.2020.107279; arXiv: 2002.08219

CITED BY

Distilling interaction knowledge for semi-supervised egocentric action recognition
2024, cites this paper
Dual-branch Cross-scale Feature Interaction for Temporal Action Detection
2024, cites this paper
Rewarded meta-pruning: Meta Learning with Rewards for Channel Pruning
2023, cites this paper
Speeding Up Action Recognition Using Dynamic Accumulation of Residuals in Compressed Domain
2022, cites this paper
HTNet: Anchor-free Temporal Action Localization with Hierarchical Transformers
2022, cites this paper
Three-stream spatio-temporal attention network for first-person action and interaction recognition
2021, cites this paper
Visual Question Answering based on Local-Scene-Aware Referring Expression Generation
2021, cites this paper
DIGITAL MEDIA IMPLICATION ON BEHAVIORAL TRANSFORMATION; EMOTIONAL APPROACHES
2020, cites this paper