The “Something Something” Video Database for Learning and Evaluating Visual Common Sense

Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, S. Westphal, Heuna Kim, V. Haenel, Ingo Fründ, P. Yianilos, Moritz Mueller-Freitag, F. Hoppe, Christian Thurau, Ingo Bax, R. Memisevic

Published 2017 in IEEE International Conference on Computer Vision

ABSTRACT

Neural networks trained on datasets such as ImageNet have led to major advances in visual object classification. One obstacle that prevents networks from reasoning more deeply about complex scenes and situations, and from integrating visual knowledge with natural language as humans do, is their lack of common-sense knowledge about the physical world. Videos, unlike still images, contain a wealth of detailed information about the physical world. However, most labelled video datasets represent high-level concepts rather than detailed physical aspects of actions and scenes. In this work, we describe our ongoing collection of the “something-something” database of video prediction tasks whose solutions require a common-sense understanding of the depicted situation. The database currently contains more than 100,000 videos across 174 classes, which are defined as caption templates. We also describe the challenges of crowd-sourcing this data at scale.
PUBLICATION RECORD
- Publication year: 2017
- Venue: IEEE International Conference on Computer Vision
- Publication date: 2017-06-13
- Fields of study: Computer Science
- Source metadata: Semantic Scholar