Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?

Published 2017 in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition

ABSTRACT

The purpose of this study is to determine whether current video datasets have sufficient data for training very deep convolutional neural networks (CNNs) with spatio-temporal three-dimensional (3D) kernels. Recently, the performance of 3D CNNs in action recognition has improved significantly. However, to date, conventional research has explored only relatively shallow 3D architectures. We examine the architectures of various 3D CNNs, from relatively shallow to very deep, on current video datasets. Based on the results of those experiments, we draw the following conclusions: (i) ResNet-18 training resulted in significant overfitting for UCF-101, HMDB-51, and ActivityNet but not for Kinetics. (ii) The Kinetics dataset has sufficient data for training deep 3D CNNs and enables training of ResNets with up to 152 layers, interestingly similar to 2D ResNets on ImageNet. ResNeXt-101 achieved 78.4% average accuracy on the Kinetics test set. (iii) Kinetics-pretrained simple 3D architectures outperform complex 2D architectures, and the pretrained ResNeXt-101 achieved 94.5% and 70.2% on UCF-101 and HMDB-51, respectively. The use of 2D CNNs trained on ImageNet has produced significant progress in various image tasks. We believe that using deep 3D CNNs together with Kinetics will retrace the successful history of 2D CNNs and ImageNet, and stimulate advances in computer vision for videos. The code and pretrained models used in this study are publicly available.
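
For readers unfamiliar with the term, the following is a minimal sketch of what a spatio-temporal 3D kernel computes: a single filter slides over time as well as height and width, so its response depends on how pixels change across frames. This is a generic illustration, not the authors' code. The function name `conv3d_valid`, the toy clip, and the 3x3x3 temporal-difference kernel are all assumptions chosen for the example.

```python
def conv3d_valid(clip, kernel):
    """Naive single-channel 3D cross-correlation, no padding, stride 1.

    clip:   nested list indexed [t][y][x]  (frames, height, width)
    kernel: nested list indexed [dt][dy][dx]
    """
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):          # slide along time
        plane = []
        for y in range(H - kh + 1):      # slide along height
            row = []
            for x in range(W - kw + 1):  # slide along width
                acc = 0.0
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += clip[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Hypothetical toy clip: 4 frames of 4x4 pixels, each pixel equal to its frame index.
clip = [[[float(t)] * 4 for _ in range(4)] for t in range(4)]

# Hypothetical 3x3x3 kernel: -1 on the first time slice, 0 on the middle, +1 on the last.
kernel = [[[w] * 3 for _ in range(3)] for w in (-1.0, 0.0, 1.0)]

response = conv3d_valid(clip, kernel)  # shape: 2 frames x 2 rows x 2 columns
```

A 2D kernel applied frame by frame would see only a constant image in each frame of this clip; the 3D kernel responds to the brightness change between frames, which is the kind of motion cue the abstract's 3D CNNs are meant to capture.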

PUBLICATION RECORD

Publication year
2017
Venue
2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
Publication date
2017-11-27
Fields of study
Computer Science
Identifiers
DOI 10.1109/CVPR.2018.00685; arXiv 1711.09577
Source metadata
Semantic Scholar

REFERENCES

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
2017influential reference
Spatiotemporal Multiplier Networks for Video Action Recognition
2017cited by this paper
ConvNet Architecture Search for Spatiotemporal Feature Learning
2017influential reference
Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks
2017cited by this paper
The Kinetics Human Action Video Dataset
2017influential reference
Learning Spatio-Temporal Features with 3D Residual Networks for Action Recognition
2017influential reference
Identity Mappings in Deep Residual Networks
2016influential reference
Long-Term Temporal Convolutions for Action Recognition
2016cited by this paper
Spatiotemporal Residual Networks for Video Action Recognition
2016cited by this paper
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
2016cited by this paper
Aggregated Residual Transformations for Deep Neural Networks
2016cited by this paper
Densely Connected Convolutional Networks
2016influential reference
YouTube-8M: A Large-Scale Video Classification Benchmark
2016cited by this paper
Convolutional Two-Stream Network Fusion for Video Action Recognition
2016cited by this paper
Wide Residual Networks
2016influential reference
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
2015cited by this paper
Deep Residual Learning for Image Recognition
2015influential reference
Action recognition with trajectory-pooled deep-convolutional descriptors
2015cited by this paper
ActivityNet: A large-scale video benchmark for human activity understanding
2015influential reference
Towards Good Practices for Very Deep Two-Stream ConvNets
2015cited by this paper
Two-Stream Convolutional Networks for Action Recognition in Videos
2014cited by this paper
Learning Spatiotemporal Features with 3D Convolutional Networks
2014influential reference
Large-Scale Video Classification with Convolutional Neural Networks
2014cited by this paper
Going deeper with convolutions
2014cited by this paper
Dense Trajectories and Motion Boundary Descriptors for Action Recognition
2013cited by this paper
UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild
2012cited by this paper
HMDB: A large video database for human motion recognition
2011influential reference
Rectified Linear Units Improve Restricted Boltzmann Machines
2010cited by this paper
ImageNet: A large-scale hierarchical image database
2009cited by this paper
3D Convolutional Neural Networks for Human Action Recognition (IEEE Transactions on Pattern Analysis and Machine Intelligence)
year unknowncited by this paper

CITED BY

Enhanced medical image segmentation via synergistic feature guidance and multi-scale refinement
2026cites this paper
DRAD: A new model for dynamic real-time Avalanche detection from videos with residual depth-separable convolution and feature pyramid networks
2026cites this paper
UV-M3TL: A Unified and Versatile Multimodal Multi-Task Learning Framework for Assistive Driving Perception
2026cites this paper
Subjective-Objective Emotion-Correlated Generation Network for Subjective Video Captioning
2026cites this paper
AI-assisted multimodal assessment for right ventricular function from echocardiography predicts mortality in patients with pulmonary hypertension and right heart failure
2026cites this paper
Sounding Highlights: Dual-Pathway Audio Encoders for Audio-Visual Video Highlight Detection
2026cites this paper
Enhancing video captioning with contextual anchor-guided semantic modeling
2026cites this paper
Event-based Facial Expression Recognition via Large Vision-Language Models
2026cites this paper
MSDM: A Lightweight Multi-Scale Dynamic Mamba for Dynamic Facial Expression Recognition in Smart Classrooms
2026cites this paper
Multi-scale BiTemporal fusion for dynamic facial expression recognition in the wild
2026cites this paper
Towards Generalized Video Captioning: An Effective Multi-modal Knowledge Graph Perspective
2026cites this paper
CLVSR: Concept-Guided Language-Visual Feature Learning and Sample Rebalance for Dynamic Facial Expression Recognition
2026cites this paper
Spatiotemporal video encoders and zero-shot segmentation for 3D action recognition and behavior analysis of broiler chickens associated with different welfare indicators and body weight
2026cites this paper
Spaceborne SAR Operating Mode Recognition Network Based on Azimuth Time-Frequency Amplitude Sequence
2026cites this paper
Enhancing Perceptron Constancy for Real-World Dynamic Hand Gesture Authentication
2026cites this paper
Predicting Conversion from Mild Cognitive Impairment to Alzheimer's Disease Using a Vision Transformer and Hippocampal MRI Slices
2026cites this paper
OrthoDiffusion: A Generalizable Multi-Task Diffusion Foundation Model for Musculoskeletal MRI Interpretation
2026influential citation
Synergistic hierarchical and graph attention networks for robust leptomeningeal metastasis detection in brain MRI
2026cites this paper
Ask and Focus More: Question-Prompt Uncertainty Allocation for Dual-Controllable Video Captioning
2026cites this paper
A unified deep network for thin- and dense-slice reconstruction: Improving through-plane resolution in clinical MRI
2026cites this paper
A Lightweight Spatio-Temporal Skeleton Attention Transformer for Long-Distance Gesture Recognition in UAV Control
2026cites this paper
LitePMix: Facial action recognition under full facial occlusion for few-shot learning
2026cites this paper
Exposing and Defending the Achilles' Heel of Video Mixture-of-Experts
2026cites this paper
CLIPoint3D: Language-Grounded Few-Shot Unsupervised 3D Point Cloud Domain Adaptation
2026cites this paper
SD-YOLOv8: Automated Motion Detection System for Aerobics Students
2025cites this paper
ChronoTailor: Harnessing Attention Guidance for Fine-Grained Video Virtual Try-On
2025cites this paper
MVC: Multi-stage video caption generation model based on multi-modality
2025cites this paper
Time-Lapse Video-Based Embryo Grading via Complementary Spatial-Temporal Pattern Mining
2025cites this paper
Synthetic Human Action Video Data Generation with Pose Transfer
2025cites this paper
Temporal Attention-based Vision Transformer for Source-Free Video Unsupervised Domain Adaptation
2025cites this paper
MRASM: A multiscale residual attention spatiotemporal model for breast tumor prediction
2025cites this paper
GeoModel: Technology and Innovation for a Diverse and Interactive Geometric Education
2025cites this paper
AgriFM: A Multi-source Temporal Remote Sensing Foundation Model for Crop Mapping
2025influential citation
Parameter efficient multi-model vision assistant for polymer solvation behaviour inference
2025cites this paper
On Using Spatial and Temporal Features for Robust Multiuser Pose Estimation Based on WiFi CSI
2025cites this paper
BusterX: MLLM-Powered AI-Generated Video Forgery Detection and Explanation
2025cites this paper
The Journey of Action Recognition
2025cites this paper
Spatiotemporal Analysis of Forest Machine Operations Using 3D Video Classification
2025cites this paper
Attention-based multimodal deep learning for interpretable and generalizable prediction of pathological complete response in breast cancer
2025cites this paper
An integration framework based on deep learning and CFD for early detection of lithium-ion battery thermal runaway
2025cites this paper
Anomaly detection method of surveillance video based on global-local information
2025cites this paper
TRIDENT: Tri-modal Real-time Intrusion Detection Engine for New Targets
2025cites this paper
Efficient Human Action Recognition With Fine-Grained Spatiotemporal Feature Extraction From Millimeter-Wave Point Clouds
2025cites this paper
EML-SlowFast: A behavior recognition model for lion-head goose
2025cites this paper
Hybrid attention-inflated 3D architecture for human action recognition
2025cites this paper
3DV-TON: Textured 3D-Guided Consistent Video Try-on via Diffusion Models
2025cites this paper
Mission Balance: Generating Under-represented Class Samples using Video Diffusion Models
2025cites this paper
A novel YOLO LSTM approach for enhanced human action recognition in video sequences
2025cites this paper
Multi-stage query-based feature generating and encoding for robust early action recognition
2025cites this paper
Pursuing Temporal-Consistent Video Virtual Try-On via Dynamic Pose Interaction
2025cites this paper
EAVFormer: an end-to-end audio and visual emotion recognition network based on transformers
2025cites this paper
Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering
2025cites this paper
ScanAhead: Simplifying standard plane acquisition of fetal head ultrasound
2025cites this paper
VAU-R1: Advancing Video Anomaly Understanding via Reinforcement Fine-Tuning
2025cites this paper
Supervised regularized attention-aware clock-triggered recurrent neural network for video summarization
2025cites this paper
Appearance-Agnostic Representation Learning for Compositional Action Recognition
2025cites this paper
EAViz: a user-friendly deep learning-based epilepsy analysis visualizer using multimodal data
2025cites this paper
Robustness Evaluation for Video Models with Reinforcement Learning
2025cites this paper
Enhancing Video Memorability Prediction with Text-Motion Cross-modal Contrastive Loss and Its Application in Video Summarization
2025cites this paper
Violence Recognition with Adaptive Temporal Down-Sampling
2025cites this paper
Low-Barrier Dataset Collection with Real Human Body for Interactive Per-Garment Virtual Try-On
2025cites this paper
Self-supervised Bidirectional Synchronization Estimation for Multimodal Deepfake Detection with Short-term Dependency
2025cites this paper
MMTL-UniAD: A Unified Framework for Multimodal and Multi-Task Learning in Assistive Driving Perception
2025cites this paper
OwlSight: A Robust Illumination Adaptation Framework for Dark Video Human Action Recognition
2025influential citation
AsyReC: A Multimodal Graph-Based Framework for Spatio-Temporal Asymmetric Dyadic Relationship Classification
2025cites this paper
SIMMA: Multimodal Automatic Depression Detection via Spatiotemporal Ensemble and Cross-Modal Alignment
2025cites this paper
Multi-Prototype Grouping for Continual Learning in Visual Question Answering
2025cites this paper
Enhancing patient-level diagnosis on MR images: a multi-instance learning framework with contributive feature mining
2025cites this paper
SVLTA: Benchmarking Vision-Language Temporal Alignment via Synthetic Video Situation
2025cites this paper
A Dynamic Prognostic Prediction Method for Colorectal Cancer Liver Metastasis
2025cites this paper
Comparative Analysis of Convolutional Neural Networks on The Stability and Performance for Micro-Expression Recognition
2025cites this paper
Vehicle lane change behavior recognition based on multi-scale three-stream 3D ResNets
2025cites this paper
Robust Dynamic Facial Expression Recognition
2025influential citation
AVERFormer: End-to-end audio-visual emotion recognition transformer framework with balanced modal contributions
2025cites this paper
Video-DPRP: A Differentially Private Approach for Visual Privacy-Preserving Video Human Activity Recognition
2025cites this paper
Safeguarding AI in Medical Imaging: Post-Hoc Out-of-Distribution Detection with Normalizing Flows
2025cites this paper
Video Captioning Method Based on Semantic Topic Association
2025cites this paper
Data augmented lung cancer prediction framework using the nested case control NLST cohort
2025cites this paper
AI-Driven Automated Tool for Abdominal CT Body Composition Analysis in Gastrointestinal Cancer Management
2025cites this paper
Advancing Dark Action Recognition via Modality Fusion and Dark-to-Light Diffusion Model
2025cites this paper
Tables Guide Vision: Learning to See the Heart through Tabular Data
2025cites this paper
Agentic Keyframe Search for Video Question Answering
2025cites this paper
Stme-net: spatio-temporal motion excitation network for action recognition
2025cites this paper
The Devil is in the Distributions: Explicit Modeling of Scene Content is Key in Zero-Shot Video Captioning
2025cites this paper
Hybrid Attention Vision Transformer-based Deep Learning Model for Video Caption Generation
2025cites this paper
Beyond Static Scenes: Camera-controllable Background Generation for Human Motion
2025cites this paper
Spatiotemporal uncertainty guided non maximum suppression for video event detection
2025cites this paper
The dual stream network with embedding temporal convolution for micro-expression recognition
2025cites this paper
Aerial Video Classification by Integrating Global-Local Semantics in ConvNets
2025influential citation
Vision-Language Adaptive Clustering and Meta-Adaptation for Unsupervised Few-Shot Action Recognition
2025cites this paper
UMD-Net: A Unified Multi-Task Assistive Driving Network Based on Multimodal Fusion
2025cites this paper
Fast Adversarial Training With Weak-to-Strong Spatial-Temporal Consistency in the Frequency Domain on Videos
2025cites this paper
Attention mechanism based multimodal feature fusion network for human action recognition
2025cites this paper
MVQA: Mamba with Unified Sampling for Efficient Video Quality Assessment
2025cites this paper
AI-Enabled Accurate Non-Invasive Assessment of Pulmonary Hypertension Progression via Multi-Modal Echocardiography
2025cites this paper
YOLO-Act: Unified Spatiotemporal Detection of Human Actions Across Multi-Frame Sequences
2025cites this paper
AFES: Attention-Based Feature Excitation and Sorting for Action Recognition
2025cites this paper
Deep learning-based computer-aided diagnostic system for lumbar degenerative diseases classification using MRI
2025cites this paper
Multi-modal Collaborative Optimization and Expansion Network for Event-assisted Single-eye Expression Recognition
2025cites this paper
DeepGuard: Enhancing Violence Detection in Smart Cities Through Deep Learning
2025cites this paper