Visual Semantic Search: Retrieving Videos via Complex Textual Queries

Dahua Lin, S. Fidler, Chen Kong, R. Urtasun

Published 2014 in 2014 IEEE Conference on Computer Vision and Pattern Recognition

ABSTRACT

In this paper, we tackle the problem of retrieving videos using complex natural language queries. Towards this goal, we first parse the sentential descriptions into a semantic graph, which is then matched to visual concepts using a generalized bipartite matching algorithm. Our approach exploits object appearance, motion and spatial relations, and learns the importance of each term using structure prediction. We demonstrate the effectiveness of our approach on a new dataset designed for semantic search in the context of autonomous driving, which exhibits complex and highly dynamic scenes with many objects. We show that our approach is able to locate a major portion of the objects described in the query with high accuracy, and improve the relevance in video retrieval.
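The matching step described in the abstract can be illustrated with a minimal sketch. It is not the paper's method: it uses plain one-to-one matching solved by brute force rather than the generalized bipartite matching the authors describe, and the cue weights are hand-set rather than learned by structured prediction. The names `WEIGHTS`, `pair_score`, `best_matching`, the track labels, and all scores are hypothetical.

```python
from itertools import permutations

# Hypothetical cue weights; in the paper these are learned, here they are hand-set.
WEIGHTS = {"appearance": 0.5, "motion": 0.3, "spatial": 0.2}


def pair_score(cue_values):
    """Weighted sum of cue scores for one (query node, video object) pair."""
    return sum(WEIGHTS[name] * value for name, value in cue_values.items())


def best_matching(query_nodes, tracks, cues):
    """Exhaustively search for the highest-scoring one-to-one assignment.

    cues[(node, track)] maps each cue name to a score in [0, 1].
    Brute force is only practical for a handful of nodes and tracks, and it
    returns an empty assignment if there are fewer tracks than query nodes.
    """
    best_total, best_assignment = float("-inf"), {}
    for chosen in permutations(tracks, len(query_nodes)):
        total = sum(pair_score(cues[(n, t)]) for n, t in zip(query_nodes, chosen))
        if total > best_total:
            best_total, best_assignment = total, dict(zip(query_nodes, chosen))
    return best_assignment, best_total


# Toy query with two noun nodes and three candidate object tracks (made-up scores).
query_nodes = ["car", "cyclist"]
tracks = ["track_1", "track_2", "track_3"]
cues = {
    ("car", "track_1"): {"appearance": 0.9, "motion": 0.8, "spatial": 0.7},
    ("car", "track_2"): {"appearance": 0.2, "motion": 0.5, "spatial": 0.4},
    ("car", "track_3"): {"appearance": 0.8, "motion": 0.1, "spatial": 0.3},
    ("cyclist", "track_1"): {"appearance": 0.1, "motion": 0.3, "spatial": 0.2},
    ("cyclist", "track_2"): {"appearance": 0.9, "motion": 0.6, "spatial": 0.8},
    ("cyclist", "track_3"): {"appearance": 0.3, "motion": 0.2, "spatial": 0.1},
}

assignment, total = best_matching(query_nodes, tracks, cues)
print(assignment, round(total, 2))
```

In this toy setting the total score of the best assignment could serve as a relevance score for ranking videos against the query; a realistic system would replace the exhaustive search with a proper assignment or linear programming solver.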

PUBLICATION RECORD

Publication year
2014
Venue
2014 IEEE Conference on Computer Vision and Pattern Recognition
Publication date
2014-06-01
Fields of study
Computer Science
Identifiers
DOI 10.1109/CVPR.2014.340
Source metadata
Semantic Scholar

REFERENCES

What Are You Talking About? Text-to-Image Coreference
2014cited by this paper
A Sentence Is Worth a Thousand Pixels
2013cited by this paper
Parsing with Compositional Vector Grammars
2013cited by this paper
Zero-shot video retrieval using content and concepts
2013cited by this paper
Translating Video Content to Natural Language Descriptions
2013cited by this paper
Robust Monocular Epipolar Flow Estimation
2013cited by this paper
A Joint Model of Language and Perception for Grounded Attribute Learning
2012cited by this paper
Video In Sentences Out
2012cited by this paper
Are we ready for autonomous driving? The KITTI vision benchmark suite
2012influential reference
Indoor Segmentation and Support Inference from RGBD Images
2012cited by this paper
Globally-optimal greedy algorithms for tracking a variable number of objects
2011cited by this paper
StereoScan: Dense 3D reconstruction in real-time
2011cited by this paper
Object Detection with Discriminatively Trained Part Based Models
2010cited by this paper
I2T: Image Parsing to Text Description
2010cited by this paper
Every Picture Tells a Story: Generating Sentences from Images
2010cited by this paper
Concept-Based Video Retrieval
2009cited by this paper
Towards total scene understanding: Classification, annotation and segmentation in an automatic framework
2009cited by this paper
Global data association for multi-object tracking using network flows
2008cited by this paper
Utilizing semantic word similarity measures for video retrieval
2008cited by this paper
The importance of query-concept-mapping for automatic video retrieval
2007cited by this paper
Adding Semantics to Detectors for Video Retrieval
2007cited by this paper
Content-based multimedia information retrieval: State of the art and challenges
2006cited by this paper
Joint visual-text modeling for automatic retrieval of multimedia documents
2005cited by this paper
Learning structured prediction models: a large margin approach
2005cited by this paper
Large Margin Methods for Structured and Interdependent Output Variables
2005cited by this paper
Matching Words and Pictures
2003cited by this paper
Video Google: a text retrieval approach to object matching in videos
2003cited by this paper
Shape Matching and Object Recognition Using Shape Contexts
2002cited by this paper
Content-Based Multimedia Information Retrieval
2000cited by this paper

CITED BY

AI Video Retrieval: A Semantic Search & Timestamp Alignment System
2025cites this paper
Exploring Opportunities to Support Novice Visual Artists' Inspiration and Ideation with Generative AI
2025cites this paper
ATARS: An Aerial Traffic Atomic Activity Recognition and Temporal Segmentation Dataset
2025cites this paper
Teleology-Driven Affective Computing: A Causal Framework for Sustained Well-Being
2025cites this paper
DPHANet: Discriminative Parallel and Hierarchical Attention Network for Natural Language Video Localization
2024cites this paper
Structural and Contrastive Guidance Mining for Weakly-Supervised Language Moment Localization
2024cites this paper
Guided Querying over Videos using Autocompletion Suggestions
2024cites this paper
Ordered Atomic Activity for Fine-grained Interactive Traffic Scenario Understanding
2023cites this paper
Semantic Relevance Learning for Video-Query Based Video Moment Retrieval
2023cites this paper
Scene representation using a new two-branch neural network model
2023cites this paper
Action-Slot: Visual Action-Centric Representations for Multi-Label Atomic Activity Recognition in Traffic Scenes
2023influential citation
Semantic Collaborative Learning for Cross-Modal Moment Localization
2023cites this paper
A novel multilabel video retrieval method using multiple video queries and deep hash codes
2022cites this paper
Multiple cross-attention for video-subtitle moment retrieval
2022cites this paper
A Multi-granularity Retrieval System for Natural Language-based Vehicle Retrieval
2022cites this paper
Towards a large-scale person search by Vietnamese natural language: dataset and methods
2022cites this paper
Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for Temporal Sentence Grounding
2022cites this paper
Scene Graph Embeddings Using Relative Similarity Supervision
2021cites this paper
VSRNet: End-to-end video segment retrieval with text query
2021cites this paper
Hierarchical Cross-Modal Graph Consistency Learning for Video-Text Retrieval
2021cites this paper
Multi-Directional Convolution Networks with Spatial-Temporal Feature Pyramid Module for Action Recognition
2021cites this paper
Connecting Language and Vision for Natural Language-Based Vehicle Retrieval
2021cites this paper
Context-aware network with foreground recalibration for grounding natural language in video
2021cites this paper
Attention feature matching for weakly-supervised video relocalization
2021influential citation
Improving Natural Language Queries Search and Retrieval through Semantic Image Annotation Understanding
2021cites this paper
Temporal Textual Localization in Video via Adversarial Bi-Directional Interaction Networks
2021cites this paper
Person Tube Retrieval via Language Description
2020cites this paper
Moment Retrieval via Cross-Modal Interaction Networks With Query Reconstruction
2020cites this paper
SYNC—Short, Yet Novel Concise Natural Language Description: Generating a Short Story Sequence of Album Images Using Multimodal Network
2020cites this paper
A Bottom-up Paradigm for Traffic Scene Graph Representation
2020cites this paper
Modality correlation-based video summarization
2020cites this paper
Hierarchical Visual-Textual Graph for Temporal Activity Localization via Language
2020cites this paper
Universal Embeddings for Spatio-Temporal Tagging of Self-Driving Logs
2020cites this paper
Cross-modal video moment retrieval based on visual-textual relationship alignment
2020cites this paper
Frame-Wise Cross-Modal Matching for Video Moment Retrieval
2020cites this paper
Character Grounding and Re-identification in Story of Videos and Text Descriptions
2020cites this paper
vtGraphNet: Learning weakly-supervised scene graph for complex visual grounding
2020cites this paper
Frame-wise Cross-modal Match for Video Moment Retrieval
2020cites this paper
Generating Adjacency Matrix for Video Relocalization
2020cites this paper
Graph Neural Network for Video Relocalization
2020cites this paper
Generating Adjacency Matrix for Video-Query based Video Moment Retrieval
2020cites this paper
Enriching Video Captions With Contextual Text
2020cites this paper
Graph Neural Network for Video-Query based Video Moment Retrieval
2020cites this paper
Tree-Augmented Cross-Modal Encoding for Complex-Query Video Retrieval
2020cites this paper
Evaluation of Text Generation: A Survey
2020cites this paper
Temporally Grounding Language Queries in Videos by Contextual Boundary-aware Prediction
2019cites this paper
Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos
2019cites this paper
Scene graph captioner: Image captioning based on structural visual representation
2019cites this paper
Localizing Natural Language in Videos
2019cites this paper
Semantic Proposal for Activity Localization in Videos via Sentence Query
2019cites this paper
Neural Sequential Phrase Grounding (SeqGROUND)
2019cites this paper
Visual to Text: Survey of Image and Video Captioning
2019cites this paper
VaTeX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research
2019cites this paper
Cross-Modal Video Moment Retrieval with Spatial and Language-Temporal Attention
2019cites this paper
Cross-Modal Interaction Networks for Query-Based Moment Retrieval in Videos
2019cites this paper
Relationship Detection Based on Object Semantic Inference and Attention Mechanisms
2019cites this paper
SLTFNet: A spatial and language-temporal tensor fusion network for video moment retrieval
2019cites this paper
Visual Understanding through Natural Language
2019cites this paper
A Graph-Based Framework to Bridge Movies and Synopses
2019cites this paper
Automatic Alignment Methods for Visual and Textual Data with Narrative Content
2019influential citation
Fast Video Clip Retrieval Method via Language Query
2019cites this paper
Deep Bayesian Active Learning for Multiple Correct Outputs
2019cites this paper
Learning to detect visual relations
2019cites this paper
Retrieval of Sentence Sequences for an Image Stream via Coherence Recurrent Convolutional Networks
2018cites this paper
MLN: Moment localization Network and Samples Selection for Moment Retrieval
2018cites this paper
A Joint Sequence Fusion Model for Video Question Answering and Retrieval
2018cites this paper
Vision and Language Learning: From Image Captioning and Visual Question Answering towards Embodied Agents
2018influential citation
Visual Coreference Resolution in Visual Dialog using Neural Module Networks
2018cites this paper
LSTM stack-based Neural Multi-sequence Alignment TeCHnique (NeuMATCH)
2018cites this paper
Learning spoken language through vision
2018cites this paper
Attentive Moment Retrieval in Videos
2018cites this paper
Unsupervised Textual Grounding: Linking Words to Image Concepts
2018cites this paper
Find and Focus: Retrieve and Localize Video Events with Natural Language Queries
2018cites this paper
Illustrate your travel notes: web-based story visualization
2018cites this paper
Interpretable and Globally Optimal Prediction for Textual Grounding using Image Concepts
2018cites this paper
Semantic Based Video Retrieval System: Survey
2018cites this paper
Cross-modal Moment Localization in Videos
2018cites this paper
A Neural Multi-sequence Alignment TeCHnique (NeuMATCH)
2018cites this paper
Grounding natural language phrases in images and video
2018cites this paper
Where to Play: Retrieval of Video Segments using Natural-Language Queries
2018cites this paper
Vision-Based Passenger Activity Analysis System in Public Transport and Bus Stop Areas
2018cites this paper
Temporally Grounding Natural Sentence in Video
2018cites this paper
Vision as an Interlingua: Learning Multilingual Semantic Embeddings of Untranscribed Speech
2018cites this paper
Self-View Grounding Given a Narrated 360 Degree Video
2018cites this paper
Improving instance search performance in video collections
2017cites this paper
Sentence Directed Video Object Codiscovery
2017cites this paper
The Art of Deep Connection - Towards Natural and Pragmatic Conversational Agent Interactions
2017cites this paper
Describing human activities in video streams
2017cites this paper
Online Cross-Modal Scene Retrieval by Binary Representation and Semantic Graph
2017cites this paper
Deep Visual-Semantic Alignments for Generating Image Descriptions
2017cites this paper
VSE++: Improved Visual-Semantic Embeddings
2017cites this paper
Generating Descriptions with Grounded and Co-referenced People
2017cites this paper
Harnessing A.I. for Augmenting Creativity: Application to Movie Trailer Creation
2017cites this paper
Localizing Moments in Video with Natural Language
2017influential citation
LxTube - Processing of massive video archives in order to index and search information
2017cites this paper
Probabilistic Semantic Retrieval for Surveillance Videos With Activity Graphs
2017influential citation
Spatio-Temporal Person Retrieval via Natural Language Queries
2017cites this paper