The Pose Knows: Video Forecasting by Generating Pose Futures

Jacob Walker, Kenneth Marino, A. Gupta, M. Hebert

Published 2017 in IEEE International Conference on Computer Vision

ABSTRACT

Current approaches to video forecasting attempt to generate videos directly in pixel space using Generative Adversarial Networks (GANs) or Variational Autoencoders (VAEs). However, because these approaches try to model all the structure and scene dynamics at once, they often generate uninterpretable results in unconstrained settings. Our insight is that forecasting should first be done at a higher level of abstraction. Specifically, we exploit human pose detectors as a free source of supervision and break the video forecasting problem into two discrete steps. First, we explicitly model the high-level structure of the active objects in the scene (humans) and use a VAE to model the possible future movements of humans in pose space. We then use the generated future poses as conditional information for a GAN that predicts the future frames of the video in pixel space. By using the structured space of pose as an intermediate representation, we sidestep the problems that GANs have in generating video pixels directly. We show through quantitative and qualitative evaluation that our method outperforms state-of-the-art methods for video prediction.
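The two-step decomposition described in the abstract can be illustrated with a toy data-flow sketch. This is not the authors' implementation: untrained random weights stand in for the pose VAE decoder, a simple blending rule stands in for the conditional GAN generator, and every name and size (`sample_pose_future`, `pose_to_heatmap`, `render_frame`, `N_JOINTS`, `T_FUTURE`, `Z_DIM`, `SIZE`) is a hypothetical choice made for illustration. Rasterizing joints as Gaussian heatmaps is likewise an assumed way of presenting pose to an image model, not something stated in the abstract.

```python
import math
import random

# All sizes below are illustrative assumptions, not values from the paper.
N_JOINTS, T_FUTURE, Z_DIM, SIZE = 5, 4, 3, 16

rng = random.Random(0)
# Untrained random weights stand in for a learned pose decoder.
W_DEC = [[rng.gauss(0, 0.3) for _ in range(Z_DIM)]
         for _ in range(T_FUTURE * N_JOINTS * 2)]


def sample_pose_future(last_pose, z=None):
    """Step 1 stand-in: latent sample + last observed pose -> future poses."""
    if z is None:
        z = [rng.gauss(0, 1) for _ in range(Z_DIM)]  # draw from the prior N(0, I)
    flat = [sum(w * zi for w, zi in zip(row, z)) for row in W_DEC]
    poses, current = [], list(last_pose)
    for t in range(T_FUTURE):
        step = flat[t * N_JOINTS * 2:(t + 1) * N_JOINTS * 2]
        # Each step is a per-joint displacement added to the previous pose.
        current = [(x + step[2 * j], y + step[2 * j + 1])
                   for j, (x, y) in enumerate(current)]
        poses.append(current)
    return poses


def pose_to_heatmap(pose, sigma=1.0):
    """Rasterize joints as Gaussian blobs so pose can condition an image model."""
    return [[max(math.exp(-((c - x) ** 2 + (r - y) ** 2) / (2 * sigma ** 2))
                 for x, y in pose)
             for c in range(SIZE)]
            for r in range(SIZE)]


def render_frame(last_frame, heatmap):
    """Step 2 stand-in: combine the observed frame with the pose conditioning."""
    return [[min(1.0, (1 - h) * p + h) for p, h in zip(prow, hrow)]
            for prow, hrow in zip(last_frame, heatmap)]


# Toy inputs: one observed pose and one observed grayscale frame.
last_pose = [(rng.uniform(4, 12), rng.uniform(4, 12)) for _ in range(N_JOINTS)]
last_frame = [[rng.random() for _ in range(SIZE)] for _ in range(SIZE)]

future_poses = sample_pose_future(last_pose)
future_frames = [render_frame(last_frame, pose_to_heatmap(p)) for p in future_poses]
print(len(future_poses), len(future_frames),
      len(future_frames[0]), len(future_frames[0][0]))
```

Calling `sample_pose_future` repeatedly with the same `last_pose` draws a new latent vector each time, so it returns different pose sequences. In this sketch, that is the mechanism by which a model of "possible future movements" can yield several distinct futures for a single input.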

PUBLICATION RECORD

Publication year
2017
Venue
IEEE International Conference on Computer Vision
Publication date
2017-04-28
Fields of study
Computer Science
Identifiers
DOI 10.1109/ICCV.2017.361 arXiv 1705.00053
Source metadata
Semantic Scholar
