Learning Spatio-Temporal Features with 3D Residual Networks for Action Recognition

Published 2017 in 2017 IEEE International Conference on Computer Vision Workshops (ICCVW)

ABSTRACT

Convolutional neural networks with spatio-temporal 3D kernels (3D CNNs) can extract spatio-temporal features directly from videos for action recognition. Although 3D kernels tend to overfit because of their large number of parameters, 3D CNNs have improved greatly with the recent availability of huge video databases. However, the architectures of 3D CNNs remain relatively shallow compared with the very deep networks that have succeeded among 2D-based CNNs, such as residual networks (ResNets). In this paper, we propose 3D CNNs based on ResNets toward a better action representation. We describe the training procedure of our 3D ResNets in detail. We experimentally evaluate the 3D ResNets on the ActivityNet and Kinetics datasets. The 3D ResNets trained on Kinetics did not suffer from overfitting despite the large number of parameters of the model, and achieved better performance than relatively shallow networks, such as C3D. Our code and pretrained models (e.g., for Kinetics and ActivityNet) are publicly available at https://github.com/kenshohara/3D-ResNets.
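
The abstract's remark about the parameter count of 3D kernels can be illustrated with a small calculation. The sketch below is not taken from the paper: the helper name conv_weight_count and the layer width of 64 channels are assumptions chosen only for illustration. It counts the weights (ignoring biases) of a single convolution layer as the product of the input channels, the output channels, and each kernel dimension.

```python
def conv_weight_count(in_channels, out_channels, kernel_size):
    """Return the number of weights (biases excluded) in one convolution layer.

    kernel_size is a tuple such as (3, 3) for a 2D kernel or (3, 3, 3) for a
    3D kernel that also spans the temporal axis.
    """
    count = in_channels * out_channels
    for k in kernel_size:
        count *= k
    return count


# Hypothetical layer width, chosen only for illustration.
channels = 64

params_2d = conv_weight_count(channels, channels, (3, 3))      # spatial kernel
params_3d = conv_weight_count(channels, channels, (3, 3, 3))   # spatio-temporal kernel

print(params_2d)               # 36864
print(params_3d)               # 110592
print(params_3d // params_2d)  # 3
```

Under these assumptions, extending a 3x3 kernel to 3x3x3 triples the weights of the layer, which is one way to see why a deep 3D network carries many more parameters than its 2D counterpart of the same depth and width.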

PUBLICATION RECORD

Publication year
2017
Venue
2017 IEEE International Conference on Computer Vision Workshops (ICCVW)
Publication date
2017-08-25
Fields of study
Computer Science
Identifiers
DOI: 10.1109/ICCVW.2017.373; arXiv: 1708.07632
Source metadata
Semantic Scholar

REFERENCES

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
2017, influential reference
The Kinetics Human Action Video Dataset
2017, cited by this paper
Long-Term Temporal Convolutions for Action Recognition
2016, cited by this paper
Convolutional Two-Stream Network Fusion for Video Action Recognition
2016, influential reference
Densely Connected Convolutional Networks
2016, cited by this paper
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
2016, cited by this paper
Spatiotemporal Residual Networks for Video Action Recognition
2016, cited by this paper
YouTube-8M: A Large-Scale Video Classification Benchmark
2016, cited by this paper
Towards Good Practices for Very Deep Two-Stream ConvNets
2015, cited by this paper
Deep Residual Learning for Image Recognition
2015, influential reference
Action recognition with trajectory-pooled deep-convolutional descriptors
2015, cited by this paper
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
2015, cited by this paper
ActivityNet: A large-scale video benchmark for human activity understanding
2015, cited by this paper
Learning Spatiotemporal Features with 3D Convolutional Networks
2014, influential reference
Two-Stream Convolutional Networks for Action Recognition in Videos
2014, cited by this paper
Large-Scale Video Classification with Convolutional Neural Networks
2014, cited by this paper
Going deeper with convolutions
2014, cited by this paper
UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild
2012, cited by this paper
HMDB: A large video database for human motion recognition
2011, cited by this paper
ImageNet: A large-scale hierarchical image database
2009, influential reference
3D Convolutional Neural Networks for Human Action Recognition (IEEE Transactions on Pattern Analysis and Machine Intelligence)
year unknown, cited by this paper

CITED BY

A systematic review on human action detection and classification architectures using deep learning methodology
2026, cites this paper
CCT-HAR: Cross-modal contrastive sample mining with temporal alignment for self-supervised human action recognition
2026, cites this paper
Integrating psychological profiling with deep learning for enhanced boxing action recognition
2026, cites this paper
WiVi-UF: Unified Feature Learning in Cross-Modal Transformers with WiFi and Vision Data Fusion for Enhanced Human Activity Recognition
2026, cites this paper
Progressive multi-level learning for gloss-free sign language translation
2026, cites this paper
Wi-Fitness: Improving Wi-Fi Sensing With Video Perception for Smart Fitness
2025, cites this paper
Neuroplasticity-Inspired GANs for Simulating Functional Brain Regeneration and Adaptive Cognitive Recovery
2025, cites this paper
A Lightweight 3D-CNN for Event-Based Human Action Recognition With Privacy-Preserving Potential
2025, cites this paper
Attention-Gated CNN and Discrete Wavelet Transform based Ensemble Framework for Brain Hemorrhage Classification
2025, cites this paper
LiGaussOcc: Fully Self-Supervised 3D Semantic Occupancy Prediction from LiDAR via Gaussian Splatting
2025, cites this paper
Few-Shot Fingerprinting Subject Re-Identification in 3D-MRI and 2D-X-Ray
2025, cites this paper
MEDIATOR: Enhancing Medical Diagnosis via Gated Distillation and Decoupled Learning
2025, influential citation
FlowerAction: a federated deep learning framework for video-based human action recognition
2025, cites this paper
Reduced Spatial Dependency for More General Video-level Deepfake Detection
2025, cites this paper
Predicting axillary lymph node metastasis in breast cancer patients using CNN-GCN on DCE-MRI: a multicenter study
2025, cites this paper
A Dual-Branch Fusion Model for Deepfake Detection Using Video Frames and Microexpression Features
2025, cites this paper
Semi-supervised action recognition using logit aligned consistency and adaptive negative learning
2025, cites this paper
A dual-layer state space network for Multivariate Time Series Forecasting with dimensional interdependency
2025, cites this paper
Application of contrast-enhanced CT-driven multimodal machine learning models for pulmonary metastasis prediction in head and neck adenoid cystic carcinoma.
2025, cites this paper
DL-KDD: Dual-Lightness Knowledge Distillation for Action Recognition in the Dark
2025, cites this paper
Mitigating Surgical Data Imbalance with Dual-Prediction Video Diffusion Model
2025, cites this paper
Unraveling spatiotemporal dynamics in transdiagnosis subtypes of major depressive disorder and bipolar disorder: insights from co-activation patterns and treatment response
2025, cites this paper
Automated HFrEF Diagnosis Using an Optimized TimeSformer Model in Echocardiography.
2025, cites this paper
Text-Aligned Radar-Based Sign Language Recognition for Healthcare Communication
2025, cites this paper
Video Classification of Marchantia Polymorpha Using a Video Vision Transformer with Emphasized Channel Information
2025, cites this paper
MMGC-Net: Deep neural network for classification of mineral grains using multi-modal polarization images
2025, cites this paper
RWGCN: Random walk graph convolutional network for group activity recognition
2025, cites this paper
Decision Support Systems in Neurosurgery: Current Applications and Future Directions
2025, cites this paper
Bag-Level Multiple Instance Learning for Acute Stress Detection from Video Data
2025, influential citation
Video-DPRP: A Differentially Private Approach for Visual Privacy-Preserving Video Human Activity Recognition
2025, cites this paper
Automatic gesture recognition and evaluation in peg transfer tasks of laparoscopic surgery training
2025, cites this paper
MVQA: Mamba with Unified Sampling for Efficient Video Quality Assessment
2025, cites this paper
Deep Learning for Sports Video Event Detection: Tasks, Datasets, Methods, and Challenges
2025, cites this paper
A Large-Scale Analysis on Contextual Self-Supervised Video Representation Learning
2025, cites this paper
Deep learning dosiomics for the pretreatment prediction of radiation dermatitis in nasopharyngeal carcinoma patients treated with radiotherapy.
2025, cites this paper
Deep Learning for Video Fluoroscopic Swallowing Study Analysis: A Survey on Classification, Detection, and Segmentation Techniques
2025, cites this paper
Deep Learning-Based Event Classification of Mass Photometry Data for Optimal Mass Measurement at the Single-Molecule Level
2025, cites this paper
MoMa: Modulating Mamba for Adapting Image Foundation Models to Video Recognition
2025, cites this paper
Smartphone video-based early diagnosis of blepharospasm using dual cross-attention modeling enhanced by facial pose estimation
2025, cites this paper
NEEDLE: Nurse Education Enhanced by Vision-based Deep Learning Evaluation
2025, cites this paper
Dosiomics-guided deep learning for radiation esophagitis prediction in lung cancer: optimal region of interest definition via multi-branch fusion auxiliary learning.
2025, cites this paper
Underground Diagnosis in 3D GPR Data by Learning in CuCoRes Model Space
2025, cites this paper
EgoInstruct: An Egocentric Video Dataset of Face-to-face Instructional Interactions with Multi-modal LLM Benchmarking
2025, cites this paper
SATSN: A Spatial-Adaptive Two-Stream Network for Automatic Detection of Giraffe Daily Behaviors
2025, cites this paper
Alzheimer's Disease Early Prediction: A Synergistic Fusion of Deep Learning Models
2025, cites this paper
Learning to align across frames: a prompt-aware framework for video action recognition
2025, cites this paper
Predicting reward-based crowdfunding success with multimodal data: A theory-guided framework
2025, cites this paper
Equipment-centric workpiece localization in near real-time using deep learning-based vision and event-driven finite state machines
2025, cites this paper
Accelerating Violence Detection: A Comparative Study of Frame Difference and Traditional Methods on UCF Crime Dataset
2025, cites this paper
Stimulus-Response Pattern: The Core of Robust Cross-Stimulus Facial Depression Recognition
2025, cites this paper
Cross-Modal Consistency Learning for Sign Language Recognition
2025, cites this paper
Real-Time Forgery Detection via Dynamic Frequency-Domain Selection and Phoneme Alignment
2025, cites this paper
MCAT: Visual Query-Based Localization of Standard Anatomical Clips in Fetal Ultrasound Videos Using Multi-Tier Class-Aware Token Transformer
2025, cites this paper
Joint 3-D Human Reconstruction and Hybrid Pose Self-Supervision for Action Recognition
2025, cites this paper
LMLCC-Net: A Semi-Supervised Deep Learning Model for Lung Nodule Malignancy Prediction from CT Scans using a Novel Hounsfield Unit-Based Intensity Filtering
2025, cites this paper
Bridging animal models and humans: neuroimaging as intermediate phenotypes linking genetic or stress factors to anhedonia
2025, cites this paper
Dual Invariance Self-Training for Reliable Semi-Supervised Surgical Phase Recognition
2025, cites this paper
Punching Bag vs. Punching Person: Motion Transferability in Videos
2025, cites this paper
A Survey of FPGA-based 3D CNN Accelerators and Hardware-aware Algorithmic Optimizations
2025, influential citation
Task recognition integrating worker actions and machine operations: A video-based sensing approach without physical sensors
2025, cites this paper
DapFall: Dynamic Amplitude Probability Density Profile-Based Wi-Fi CSI Sensing for Fall Detection
2025, cites this paper
RTSA: A Run-Through Sparse Attention Framework for Video Transformer
2025, cites this paper
Accurate and Efficient Two-Stage Gun Detection in Video
2025, cites this paper
VIVAR: learning view-invariant embedding for video action recognition
2025, cites this paper
MemRank: Memory-Augmented Similarity Ranking for Video-Based Depression Severity Estimation
2025, cites this paper
Action Recognition with 3D Residual Attention and Cross Entropy
2025, cites this paper
ECViT: Efficient Convolutional Vision Transformer with Local-Attention and Multi-scale Stages
2025, cites this paper
Hybrid attention-inflated 3D architecture for human action recognition
2025, cites this paper
Enhanced diagnosis of axial spondyloarthritis using machine learning with sacroiliac joint MRI: a multicenter study
2025, cites this paper
Quantization-Based 3D-CNNs Through Circular Gradual Unfreezing for DeepFake Detection
2025, cites this paper
TSLFormer: A Lightweight Transformer Model for Turkish Sign Language Recognition Using Skeletal Landmarks
2025, cites this paper
Quantifying behavioural patterns for group-housed pigs based on deep learning and statistical analysis
2025, cites this paper
Multi-Task Learning for Joint Action and Gesture Recognition
2025, cites this paper
SW-ViT: A Spatio-Temporal Vision Transformer Network with Post Denoiser for Sequential Multi-Push Ultrasound Shear Wave Elastography
2025, cites this paper
Parameter efficient multi-model vision assistant for polymer solvation behaviour inference
2025, influential citation
Robustness Verification of Video Classification Neural Networks
2025, cites this paper
IR-Based Sleep Monitoring: Movement, Respiration, and Sleep Staging With a Structure-Aware Cross-Modal EEG Knowledge Distillation
2025, cites this paper
3D Visualization System of Breast Magnetic Resonance Images Based on Deep Learning and Volume Rendering
2025, cites this paper
InfoSyncNet: Information Synchronization Temporal Convolutional Network for Visual Speech Recognition
2025, cites this paper
Artificial intelligence supported colonoscopy bowel preparation assessment: a video-based approach
2025, cites this paper
Fine tuning 3D Convolutional Networks for enhanced Action Recognition
2025, cites this paper
Q-CLIP: Unleashing the Power of Vision-Language Models for Video Quality Assessment through Unified Cross-Modal Adaptation
2025, cites this paper
3D CT Slice Image-Based Algorithm for Non-Wet Defect Inspection in Solder Joints
2025, cites this paper
Physics-informed tensor autoencoder with memory for video anomaly detection
2025, cites this paper
Automatic Visual Lip Reading: A Comparative Review of Machine-Learning Approaches
2025, cites this paper
Domain-Adaptive Pretraining Improves Primate Behavior Recognition
2025, influential citation
Spatio-temporal Sign Language Representation and Translation
2025, cites this paper
Attention-enhanced 3D residual networks for knee abnormality classification
2025, cites this paper
DualFocus: Depth from Focus with Spatio-Focal Dual Variational Constraints
2025, cites this paper
JSS-CLIP: Boosting image-to-video transfer learning with JigSaw side network
2025, cites this paper
A Novel Multimodal Hand Gesture Recognition Model Using Combined Approach of Inter-Frame Motion and Shared Attention Weights
2025, cites this paper
Biomechanically Consistent Real-Time Action Recognition for Human-Robot Interaction
2025, cites this paper
VP2Net: Visual Perception-Inspired Network for Exploring the Causes of Drivers’ Attention Shift
2025, cites this paper
EgoOops: A Dataset for Mistake Action Detection from Egocentric Videos referring to Procedural Texts
2024, cites this paper
Shared representations of human actions across vision and language
2024, cites this paper
A Heatmap-Based Weighted Multi-stream Fusion Network in Skeleton Modality for Action Recognition
2024, cites this paper
HDD4DBP: A Large-Scale Multi-Modal Benchmark on Driving Behavior Prediction
2024, cites this paper
Intelligent Recognition System of Nursing Students’ Procedural Steps of Cardiopulmonary Resuscitation Based on 3D-ResNet
2024, cites this paper
Spatial-spectral joint preprocessing for hyperspectral image analysis using 3D-ResNet: Application to coal ash content estimation
2024, cites this paper
AViSal360: Audiovisual Saliency Prediction for 360° Video
2024, cites this paper