Grounding Action Descriptions in Videos
Michaela Regneri, Marcus Rohrbach, Dominikus Wetzel, Stefan Thater, B. Schiele, Manfred Pinkal
Published 2013 in Transactions of the Association for Computational Linguistics
ABSTRACT
Recent work has shown that the integration of visual information into text-based models can substantially improve model predictions, but so far only visual information extracted from static images has been used. In this paper, we consider the problem of grounding sentences describing actions in visual information extracted from videos. We present a general-purpose corpus that aligns high quality videos with multiple natural language descriptions of the actions portrayed in the videos, together with an annotation of how similar the action descriptions are to each other. Experimental results demonstrate that a text-based model of similarity between actions improves substantially when combined with visual information from videos depicting the described actions.
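The abstract's central claim is that a text-based model of action similarity improves when combined with visual information from videos. As an illustration only, here is a minimal sketch of one way such a combination could look, assuming cosine similarity over feature vectors and a simple linear interpolation of the two scores; the function names, vector dimensions, and the weight `alpha` are hypothetical stand-ins and are not taken from the paper.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def combined_action_similarity(text_vec_a, text_vec_b,
                               video_vec_a, video_vec_b,
                               alpha=0.5):
    """Linearly interpolate a text-based and a video-based similarity score.

    alpha weights the text-based score; (1 - alpha) weights the
    video-based score. The value 0.5 is a placeholder, not a
    parameter reported in the paper.
    """
    text_sim = cosine(text_vec_a, text_vec_b)
    video_sim = cosine(video_vec_a, video_vec_b)
    return alpha * text_sim + (1 - alpha) * video_sim

# Toy usage with random stand-in features.
rng = np.random.default_rng(0)
t_a, t_b = rng.random(300), rng.random(300)  # e.g. distributional text vectors
v_a, v_b = rng.random(512), rng.random(512)  # e.g. video descriptor vectors
print(combined_action_similarity(t_a, t_b, v_a, v_b))
```

The linear interpolation is only the simplest possible fusion scheme; the paper's actual model may combine the two information sources differently.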
PUBLICATION RECORD
- Publication year: 2013
- Venue: Transactions of the Association for Computational Linguistics
- Publication date: 2013-03-31
- Fields of study: Computer Science
- Source: Semantic Scholar