Unsupervised Learning of Video Representations using LSTMs

Nitish Srivastava, Elman Mansimov, R. Salakhutdinov

Published 2015 in International Conference on Machine Learning

ABSTRACT

We use Long Short-Term Memory (LSTM) networks to learn representations of video sequences. Our model uses an encoder LSTM to map an input sequence into a fixed-length representation. This representation is decoded using single or multiple decoder LSTMs to perform different tasks, such as reconstructing the input sequence or predicting the future sequence. We experiment with two kinds of input sequences: patches of image pixels and high-level representations ("percepts") of video frames extracted using a pretrained convolutional net. We explore different design choices, such as whether the decoder LSTMs should condition on the generated output. We analyze the outputs of the model qualitatively to see how well the model can extrapolate the learned video representation into the future and into the past. We further evaluate the representations by finetuning them for a supervised learning problem: human action recognition on the UCF-101 and HMDB-51 datasets. We show that the representations help improve classification accuracy, especially when there are only a few training examples. Even models pretrained on unrelated datasets (300 hours of YouTube videos) can help action recognition performance.
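
The encoder/decoder arrangement described in the abstract can be sketched as a forward pass in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a standard LSTM cell without peephole connections, random untrained weights, and toy frame vectors, and every name in it (LSTMCell, Decoder, encode, the conditional flag) is hypothetical. It shows one encoder producing a fixed-length state, and two decoders started from that state, one unconditioned and one conditioned on its own previous output.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class LSTMCell:
    """Standard LSTM cell (no peepholes) with random, untrained weights."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        cols = input_size + hidden_size
        # Rows are grouped by gate: input, forget, output, candidate.
        self.W = [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                  for _ in range(4 * hidden_size)]
        self.b = [0.0] * (4 * hidden_size)

    def step(self, x, h, c):
        z = x + h  # concatenate the input and the previous hidden state
        pre = [sum(w * v for w, v in zip(row, z)) + b
               for row, b in zip(self.W, self.b)]
        n = self.hidden_size
        i = [sigmoid(p) for p in pre[0:n]]
        f = [sigmoid(p) for p in pre[n:2 * n]]
        o = [sigmoid(p) for p in pre[2 * n:3 * n]]
        g = [math.tanh(p) for p in pre[3 * n:4 * n]]
        c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
        return h_new, c_new


def encode(cell, frames):
    """Run the encoder over all frames and return its final (h, c) state."""
    h = [0.0] * cell.hidden_size
    c = [0.0] * cell.hidden_size
    for x in frames:
        h, c = cell.step(x, h, c)
    return h, c


class Decoder:
    """Decoder LSTM plus a linear readout from hidden state to a frame."""

    def __init__(self, frame_size, hidden_size, rng, conditional):
        self.cell = LSTMCell(frame_size, hidden_size, rng)
        self.readout = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_size)]
                        for _ in range(frame_size)]
        self.frame_size = frame_size
        self.conditional = conditional

    def decode(self, state, steps):
        h, c = state  # start from the encoder's fixed-length representation
        prev = [0.0] * self.frame_size
        outputs = []
        for _ in range(steps):
            # Conditional decoders see their previous output; others see zeros.
            x = prev if self.conditional else [0.0] * self.frame_size
            h, c = self.cell.step(x, h, c)
            frame = [sum(w * v for w, v in zip(row, h)) for row in self.readout]
            outputs.append(frame)
            prev = frame
        return outputs


rng = random.Random(0)
frame_size, hidden_size = 6, 4
video = [[rng.random() for _ in range(frame_size)] for _ in range(5)]

encoder = LSTMCell(frame_size, hidden_size, rng)
recon_decoder = Decoder(frame_size, hidden_size, rng, conditional=False)
future_decoder = Decoder(frame_size, hidden_size, rng, conditional=True)

state = encode(encoder, video)
reconstruction = recon_decoder.decode(state, steps=len(video))
prediction = future_decoder.decode(state, steps=3)
print(len(state[0]), len(reconstruction), len(prediction))
```

The encoder state has the same length however many frames are fed in, which is what lets a single representation drive both a reconstruction decoder and a future-prediction decoder. Training (for example, minimizing a reconstruction or prediction loss) is deliberately left out of this sketch.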

PUBLICATION RECORD

Publication year: 2015
Venue: International Conference on Machine Learning
Publication date: 2015-02-16
Fields of study: Computer Science
Identifiers: arXiv 1502.04681
Source metadata: Semantic Scholar

REFERENCES

DRAW: A Recurrent Neural Network For Image Generation
2015, cited by this paper
Show and tell: A neural image caption generator
2014, cited by this paper
Beyond Gaussian Pyramid: Multi-skip Feature Stacking for action recognition
2014, cited by this paper
Sequence to Sequence Learning with Neural Networks
2014, cited by this paper
Video (language) modeling: a baseline for generative models of natural videos
2014, influential reference
Very Deep Convolutional Networks for Large-Scale Image Recognition
2014, influential reference
Two-Stream Convolutional Networks for Action Recognition in Videos
2014, influential reference
Large-Scale Video Classification with Convolutional Neural Networks
2014, cited by this paper
C3D: Generic Features for Video Analysis
2014, cited by this paper
Recurrent Neural Network Regularization
2014, cited by this paper
Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation
2014, cited by this paper
Long-term recurrent convolutional networks for visual recognition and description
2014, cited by this paper
Towards End-To-End Speech Recognition with Recurrent Neural Networks
2014, cited by this paper
Modeling Deep Temporal Dependencies with Recurrent "Grammar Cells"
2014, cited by this paper
Generating Sequences With Recurrent Neural Networks
2013, cited by this paper
Auto-Encoding Variational Bayes
2013, cited by this paper
UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild
2012, cited by this paper
Learning to Relate Images
2011, cited by this paper
Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis
2011, cited by this paper
Modeling the joint density of two images under a variety of transformations
2011, cited by this paper
HMDB: A large video database for human motion recognition
2011, cited by this paper
Learning to Represent Spatial Transformations with Factored Higher-Order Boltzmann Machines
2010, cited by this paper
Deep learning from temporal coherence in video
2009, cited by this paper
Image quality assessment: from error visibility to structural similarity
2004, cited by this paper
Simple-Cell-Like Receptive Fields Maximize Temporal Coherence in Natural Video
2003, cited by this paper
Independent component analysis of natural image sequences yields spatio-temporal filters similar to simple cells in primary visual cortex
1998, cited by this paper
Long Short-Term Memory
1997, cited by this paper
3D Convolutional Neural Networks for Human Action Recognition
year unknown, cited by this paper
Action Recognition with Improved Trajectories
2013, cited by this paper

CITED BY

WinG-LSTM: a precipitation nowcasting model integrating swin transformer and LSTM
2026, cites this paper
Self-Supervised Learning from Structural Invariance
2026, influential citation
Improving precipitation nowcasting via multiphysical parameter fusion in radar echo extrapolation
2026, cites this paper
PredLDM: Spatiotemporal Sequence Prediction with Latent Diffusion Models
2026, cites this paper
A systematic review on human action detection and classification architectures using deep learning methodology
2026, cites this paper
Graph-Enhanced Long-Term Indoor Behavior Prediction With Attention Mechanisms
2026, cites this paper
Implicit hierarchical temporal-spatial residual model for long-term video prediction
2026, cites this paper
Enhancing multivariate weather forecasting via temporal attention and spatiotemporal fusion
2026, cites this paper
Advances in digital twins and AI/ML for condition monitoring in nuclear applications
2026, cites this paper
Temporal modeling with reversible transformers
2026, cites this paper
PFGNet: A Fully Convolutional Frequency-Guided Peripheral Gating Network for Efficient Spatiotemporal Predictive Learning
2026, cites this paper
A Lightweight Library for Energy-Based Joint-Embedding Predictive Architectures
2026, cites this paper
Self-Supervised Learning for Extracting Interpersonal Dynamics in Video Robbery Detection
2026, cites this paper
CloDS: Visual-Only Unsupervised Cloth Dynamics Learning in Unknown Conditions
2026, cites this paper
VideoMamba-LLaVA: Efficient Video Question Answering with Selective Scanning and Lightweight Inference
2025, cites this paper
Gather-Scatter Mamba: Accelerating Propagation with Efficient State Space Model
2025, cites this paper
Deep Learning for Regular Raster Spatio-Temporal Prediction: An Overview
2025, cites this paper
A Comprehensive Survey on Text-to-Video Generation
2025, cites this paper
Video Prediction of Dynamic Physical Simulations With Pixel-Space Spatiotemporal Transformers
2025, cites this paper
GHS-VDG: Graph and Hybrid Spatio-Temporal Attention for Video Diffusion Generation
2025, cites this paper
A Comparative Study of LSTM Efficiency vs. Transformer Power for Localized Time Series Forecasting
2025, cites this paper
Enhancing Scene Transition Awareness in Video Generation via Post-Training
2025, cites this paper
Geometry-aware 4D Video Generation for Robot Manipulation
2025, cites this paper
Interpretable AI for explaining and predicting battery state of health using PSO-enhanced deep learning models
2025, cites this paper
Physics-aware parameter diffusion for spatio-temporal continuum field modeling for governance
2025, cites this paper
Kdhiera: boosting self-supervised masked video modeling via hierarchical knowledge distillation
2025, cites this paper
Learning-Aided Evolutionary Algorithm for Solving Energy-Minimized Deadline-Constrained Task Scheduling Problem in Human-Cyber-Physical Systems
2025, cites this paper
REALIGN: Regularized Procedure Alignment with Matching Video Embeddings via Partial Gromov-Wasserstein Optimal Transport
2025, cites this paper
Too Clever by Half: Detecting Sampling-based Model Stealing Attacks by Their Own Cleverness
2025, cites this paper
Time-Correlated Video Bridge Matching
2025, cites this paper
Normalize Filters! Classical Wisdom for Deep Vision
2025, cites this paper
Life Pattern-Based Human Attribute Estimation Using Only GPS Data
2025, cites this paper
Axial-UNet: A Neural Weather Model for Precipitation Nowcasting
2025, cites this paper
A Large-Scale Analysis on Contextual Self-Supervised Video Representation Learning
2025, cites this paper
MMST-LSTM: Leveraging Radar Echo Prediction for Emerging Consumer Applications in Edge Computing
2025, cites this paper
VISC: mmWave Radar Scene Flow Estimation using Pervasive Visual-Inertial Supervision
2025, cites this paper
Probabilistic Forecasting for Dynamical Systems with Missing or Imperfect Data
2025, cites this paper
Disentangled and reassociated deep representation for dynamic survival analysis with competing risks
2025, cites this paper
UniSTD: Towards Unified Spatio-Temporal Learning across Diverse Disciplines
2025, cites this paper
Spatiotemporal semantic structural representation learning for image sequence prediction
2025, cites this paper
Low-resolution human pose estimation and action recognition via pose-driven super-resolution reconstruction
2025, cites this paper
Intelligent Human Behaviour Detection Leveraging ConvLSTM and LRCN Deep Learning Approaches
2025, cites this paper
The Journey of Action Recognition
2025, influential citation
A Global Spatial-Temporal Attention method based on motion decomposition for precipitation nowcasting
2025, cites this paper
UST-SU: a U-shaped video prediction network based on partial autoregression
2025, cites this paper
RadarSeq: A Temporal Vision Framework for User Churn Prediction via Radar Chart Sequences
2025, cites this paper
State-space modeling in long sequence processing: a survey on recurrence in the transformer era
2025, cites this paper
Representation learning with a transformer by contrastive learning for money laundering detection
2025, cites this paper
Using Instruction-Following LLM Hidden States as Conditioning for Video Diffusion Model
2025, cites this paper
Aligning Moments in Time using Video Queries
2025, cites this paper
A Spatial-Aware Temporal Modeling Network for Imitation Learning-Based Drone Navigation
2025, cites this paper
Deep graph neural networks for spatiotemporal forecasting of sub-seasonal sea ice: A case study in Hudson Bay
2025, cites this paper
Behavior-Enhanced Representation Learning for User Behavior Analysis
2025, cites this paper
Diffractive tensorized unit for million-TOPS general-purpose computing
2025, cites this paper
Towards Robust Deterministic and Probabilistic Modeling for Predictive Learning
2025, cites this paper
Spatiotemporal Interactive Pedestrian Intention Prediction With Illumination Transformation
2025, cites this paper
Disjointed representation learning for better fall recognition
2025, cites this paper
Temporal vs. Spatial: Comparing DINOv3 and V-JEPA2 Feature Representations for Video Action Analysis
2025, cites this paper
Aligning Perception, Reasoning, Modeling and Interaction: A Survey on Physical AI
2025, cites this paper
Midway Network: Learning Representations for Recognition and Motion from Latent Dynamics
2025, cites this paper
Translution: Unifying Self-attention and Convolution for Adaptive and Relative Modeling
2025, cites this paper
MultiCOIN: Multi-Modal COntrollable Video INbetweening
2025, cites this paper
Using Autoencoders and Automatic Differentiation to Reconstruct Missing Variables in a Set of Time Series
2025, cites this paper
Rethinking Video Super-Resolution: Towards Diffusion-Based Methods without Motion Alignment
2025, cites this paper
MALT Diffusion: Memory-Augmented Latent Transformers for Any-Length Video Generation
2025, cites this paper
MVSAPNet: A Multivariate Data-Driven Method for Detecting Disc Cutter Wear States in Composite Strata Shield Tunneling
2025, cites this paper
EOST-LSTM: Long Short-Term Memory Model Combined with Attention Module and Full-Dimensional Dynamic Convolution Module
2025, cites this paper
Image Motion Blur Removal in the Temporal Dimension with Video Diffusion Models
2025, cites this paper
A masked autoencoder network for spatiotemporal predictive learning
2025, cites this paper
A novel hybrid architecture for video frame prediction: combining convolutional LSTM and 3D CNN
2025, cites this paper
Neural network modeling of microstructure complexity using digital libraries
2025, cites this paper
On Self-Adaptive Perception Loss Function for Sequential Lossy Compression
2025, cites this paper
Spectral graph convolution networks for microbialite lithology identification based on conventional well logs
2025, cites this paper
C-Drag: Chain-of-Thought Driven Motion Controller for Video Generation
2025, cites this paper
Text2Story: Advancing Video Storytelling with Text Guidance
2025, cites this paper
PISA Experiments: Exploring Physics Post-Training for Video Diffusion Models by Watching Stuff Drop
2025, cites this paper
Low correlation portfolio formation with preselection using rich relational data
2025, cites this paper
GMG: A Video Prediction Method Based on Global Focus and Motion Guided
2025, cites this paper
Generating Long-Take Videos via Effective Keyframes and Guidance
2025, cites this paper
Enhancing Subsequent Video Retrieval via Vision-Language Models (VLMs)
2025, cites this paper
Exploiting Temporal State Space Sharing for Video Semantic Segmentation
2025, cites this paper
Learning to predict rare events: the case of abnormal grain growth
2025, cites this paper
SRVP: Strong Recollection Video Prediction Model Using Attention-Based Spatiotemporal Correlation Fusion
2025, influential citation
DriveDreamer-2: LLM-Enhanced World Models for Diverse Driving Video Generation
2025, cites this paper
Enhancing Diesel Backup Power Forecasting With LSTM, GRU, and Autoencoder-based Input Encoding
2025, cites this paper
Comparative Study on Curiosity with Attention, Memory and Empowerment
2025, cites this paper
ST-GRU: spatiotemporal gated recurrent unit for video prediction
2025, influential citation
Self-Supervised Learning for Image Segmentation: A Comprehensive Survey
2025, cites this paper
DATC-STP: Towards Accurate yet Efficient Spatiotemporal Prediction With Transformer-Style CNN
2025, cites this paper
GoMatching++: Parameter- and Data-Efficient Arbitrary-Shaped Video Text Spotting and Benchmarking
2025, cites this paper
CCLSTM: Coupled Convolutional Long-Short Term Memory Network for Occupancy Flow Forecasting
2025, cites this paper
A Convolutional Recurrent Bottleneck and Vertical Nonlocal Operations for the Weakly Supervised Semantic Segmentation of Radar Sounder Data
2025, cites this paper
Efficient Autoencoders for Anomaly Detection in Distributed Acoustic Sensing Data
2025, cites this paper
Hierarchical Temporal Sequence Segmentation for weakly supervised video anomaly detection
2025, cites this paper
A Fast Identification Method for Seismic Responses of Bridge Structures by Integrating Digital Signal Features and Deep Learning
2025, cites this paper
DIFFUMA: High-Fidelity Spatio-Temporal Video Prediction via Dual-Path Mamba and Diffusion Enhancement
2025, cites this paper
Cross-Modal Contrastive Masked AutoEncoder for Compressed Video Pre-Training
2025, cites this paper
Procedure Learning via Regularized Gromov-Wasserstein Optimal Transport
2025, cites this paper
Fast and Flexible Probabilistic Forecasting of Dynamical Systems using Flow Matching and Physical Perturbation
2025, cites this paper
A Hybrid Approach for Sports Activity Recognition Using Key Body Descriptors and Hybrid Deep Learning Classifier
2025, cites this paper