End-to-End Concept Word Detection for Video Captioning, Retrieval, and Question Answering

Youngjae Yu, Hyungjin Ko, Jongwook Choi, Gunhee Kim

Published 2016 in Computer Vision and Pattern Recognition

ABSTRACT

We propose a high-level concept word detector that can be integrated with any video-to-language model. It takes a video as input and generates a list of concept words that serve as semantic priors for language generation models. The proposed word detector has two important properties. First, it does not require any external knowledge sources for training. Second, it is trainable end-to-end, jointly with any video-to-language model. To exploit the detected words effectively, we also develop a semantic attention mechanism that selectively focuses on the detected concept words and fuses them with the word encoding and decoding in the language model. To demonstrate that the proposed approach improves performance on multiple video-to-language tasks, we participate in all four tasks of LSMDC 2016 [18]. Our approach won three of them: fill-in-the-blank, multiple-choice test, and movie retrieval.
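
The semantic attention idea in the abstract can be illustrated with a minimal sketch. This is not the paper's formulation: the dot-product scoring, the additive fusion, the two-dimensional vectors, and the names `semantic_attention` and `fuse` are all assumptions made for illustration. It only shows the general pattern of weighting detected concept words by their relevance to the current language-model state and mixing the result into a word representation.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def semantic_attention(state, concept_vectors):
    # Score each concept word against the current state (assumed dot-product scoring),
    # then return the attention weights and the weighted sum of concept vectors.
    scores = [dot(state, c) for c in concept_vectors]
    weights = softmax(scores)
    dim = len(state)
    context = [sum(w * c[i] for w, c in zip(weights, concept_vectors))
               for i in range(dim)]
    return weights, context

def fuse(word_vector, context):
    # Assumed fusion: add the attended concept context to the word vector.
    return [x + c for x, c in zip(word_vector, context)]

# Hypothetical concept words and toy embeddings, for illustration only.
concepts = {"car": [1.0, 0.0], "road": [0.8, 0.2], "kiss": [0.0, 1.0]}
state = [2.0, 0.0]        # toy language-model state
word_vector = [0.5, 0.5]  # toy embedding of the current word

weights, context = semantic_attention(state, list(concepts.values()))
fused = fuse(word_vector, context)
for name, w in zip(concepts, weights):
    print(f"{name}: {w:.3f}")
```

In this toy setup the state points toward the "car" direction, so "car" and "road" receive more weight than "kiss". In a trained model the scoring and fusion would use learned parameters rather than fixed dot products and addition.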

PUBLICATION RECORD

Publication year
2016
Venue
Computer Vision and Pattern Recognition
Publication date
2016-10-10
Fields of study
Computer Science
Identifiers
DOI 10.1109/CVPR.2017.347 arXiv 1610.02947
Source metadata
Semantic Scholar

REFERENCES

Video Fill in the Blank with Merging LSTMs
2016cited by this paper
Temporal Tessellation for Video Annotation and Summarization
2016cited by this paper
Image Captioning and Visual Question Answering Based on Attributes and Their Related External Knowledge
2016cited by this paper
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding
2016influential reference
Temporal Tessellation: A Unified Approach for Video Analysis
2016cited by this paper
Image Captioning and Visual Question Answering Based on Attributes and External Knowledge
2016cited by this paper
Learning Language-Visual Embedding for Movie Understanding with Natural-Language
2016cited by this paper
Layer Normalization
2016cited by this paper
Captioning Images with Diverse Objects
2016cited by this paper
Image Captioning with Semantic Attention
2016influential reference
Movie Description
2016influential reference
Sequence to Sequence -- Video to Text
2015cited by this paper
Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks
2015cited by this paper
The Long-Short Story of Movie Description
2015influential reference
Deep Residual Learning for Image Recognition
2015cited by this paper
What Value Do Explicit High Level Concepts Have in Vision to Language Problems?
2015cited by this paper
MovieQA: Understanding Stories in Movies through Question-Answering
2015cited by this paper
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
2015cited by this paper
Describing Videos by Exploiting Temporal Structure
2015cited by this paper
Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework
2015cited by this paper
Dropout: a simple way to prevent neural networks from overfitting
2014cited by this paper
ImageNet Large Scale Visual Recognition Challenge
2014cited by this paper
Long-term recurrent convolutional networks for visual recognition and description
2014cited by this paper
Adam: A Method for Stochastic Optimization
2014cited by this paper
CIDEr: Consensus-based image description evaluation
2014cited by this paper
From captions to visual concepts and back
2014cited by this paper
Coherent Multi-sentence Video Description with Variable Level of Detail
2014cited by this paper
Translating Videos to Natural Language Using Deep Recurrent Neural Networks
2014cited by this paper
Translating Video Content to Natural Language Descriptions
2013cited by this paper
YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition
2013cited by this paper
Distributed Representations of Words and Phrases and their Compositionality
2013influential reference
Maxout Networks
2013cited by this paper
A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching
2013cited by this paper
Collecting Highly Parallel Data for Paraphrase Evaluation
2011cited by this paper
Understanding the difficulty of training deep feedforward neural networks
2010cited by this paper
METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments
2005cited by this paper
ROUGE: A Package for Automatic Evaluation of Summaries
2004cited by this paper
Bleu: a Method for Automatic Evaluation of Machine Translation
2002cited by this paper
Long Short-Term Memory
1997influential reference
Bidirectional recurrent neural networks
1997influential reference

CITED BY

Visualization methods for explainable medical imaging diagnosis: A survey
2026cites this paper
A Direct Zero-Shot Indoor Scene Recognition Method Based on Visual Question Answering
2025cites this paper
Learning Audio-guided Video Representation with Gated Attention for Video-Text Retrieval
2025cites this paper
Needle in a haystack: Coarse-to-fine alignment network for moment retrieval from large-scale video collections
2025cites this paper
Temporal Modeling With Frozen Vision–Language Foundation Models for Parameter-Efficient Text–Video Retrieval
2025cites this paper
Capturing Rich Behavior Representations: A Dynamic Action Semantic-Aware Graph Transformer for Video Captioning
2025cites this paper
Leveraging Auxiliary Information in Text-to-Video Retrieval: A Review
2025cites this paper
Joint multi-grained similarity contrastive learning for video-text retrieval
2025cites this paper
ELIOT: Zero-Shot Video-Text Retrieval through Relevance-Boosted Captioning and Structural Information Extraction
2025cites this paper
Rebalancing Contrastive Alignment with Bottlenecked Semantic Increments in Text-Video Retrieval
2025cites this paper
LSECA: local semantic enhancement and cross aggregation for video-text retrieval
2024cites this paper
Video-Language Alignment via Spatio-Temporal Graph Transformer
2024cites this paper
Video captioning – a survey
2024cites this paper
Learning with Noisy Correspondence
2024cites this paper
Reliable Phrase Feature Mining for Hierarchical Video-Text Retrieval
2024cites this paper
Video–text retrieval via multi-modal masked transformer and adaptive attribute-aware graph convolutional network
2024cites this paper
Language-aware Visual Semantic Distillation for Video Question Answering
2024cites this paper
Streaming Detection of Queried Event Start
2024cites this paper
Memory-Based Augmentation Network for Video Captioning
2024cites this paper
Natural Language Understanding and Inference with MLLM in Visual Question Answering: A Survey
2024cites this paper
MICap: A Unified Model for Identity-Aware Movie Descriptions
2024cites this paper
Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives
2024cites this paper
Disentangle and denoise: Tackling context misalignment for video moment retrieval
2024cites this paper
Hierarchical Video-Moment Retrieval and Step-Captioning
2023cites this paper
Simple Baselines for Interactive Video Retrieval with Questions and Answers
2023cites this paper
Edit As You Wish: Video Description Editing with Multi-grained Commands
2023cites this paper
Cross-Modal Retrieval: A Systematic Review of Methods and Future Directions
2023cites this paper
Text-video retrieval method based on enhanced self-attention and multi-task learning
2023cites this paper
Concept-Aware Video Captioning: Describing Videos With Effective Prior Information
2023cites this paper
Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval
2023cites this paper
Cali-NCE: Boosting Cross-modal Video Representation Learning with Calibrated Alignment
2023cites this paper
Experts Collaboration Learning for Continual Multi-Modal Reasoning
2023cites this paper
Learning Action Changes by Measuring Verb-Adverb Textual Relationships
2023cites this paper
Deep sequential collaborative cognition of vision and language based model for video description
2023cites this paper
Step by Step: A Gradual Approach for Dense Video Captioning
2023cites this paper
Edit As You Wish: Video Caption Editing with Multi-grained User Control
2023cites this paper
Video-Text Retrieval by Supervised Sparse Multi-Grained Learning
2023cites this paper
Relation Triplet Construction for Cross-modal Text-to-Video Retrieval
2023cites this paper
RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos
2023cites this paper
Deep learning and knowledge graph for image/video captioning: A review of datasets, evaluation metrics, and methods
2023cites this paper
Advancing Video Question Answering with a Multi-modal and Multi-layer Question Enhancement Network
2023cites this paper
Semantic Collaborative Learning for Cross-Modal Moment Localization
2023cites this paper
Transfer Knowledge from Natural Language to Electrocardiography: Can We Detect Cardiovascular Disease Through Language Models?
2023cites this paper
Expert-guided contrastive learning for video-text retrieval
2023cites this paper
SpaceCLIP: A Vision-Language Pretraining Framework With Spatial Reconstruction On Text
2023cites this paper
How You Feelin’? Learning Emotions and Mental States in Movie Scenes
2023cites this paper
SCCS: Semantics-Consistent Cross-domain Summarization via Optimal Transport Alignment
2023cites this paper
Many Hands Make Light Work: Transferring Knowledge From Auxiliary Tasks for Video-Text Retrieval
2023cites this paper
HOME: 3D Human–Object Mesh Topology-Enhanced Interaction Recognition in Images
2022cites this paper
Dynamic self-attention with vision synchronization networks for video question answering
2022cites this paper
Cross-Lingual Cross-Modal Consolidation for Effective Multilingual Video Corpus Moment Retrieval
2022cites this paper
Dynamic Multistep Reasoning based on Video Scene Graph for Video Question Answering
2022cites this paper
Automatic Concept Extraction for Concept Bottleneck-based Video Classification
2022cites this paper
Video2Subtitle: Matching Weakly-Synchronized Sequences via Dynamic Temporal Alignment
2022cites this paper
Content-Based Video Big Data Retrieval with Extensive Features and Deep Learning
2022cites this paper
Multi-grained encoding and joint embedding space fusion for video and text cross-modal retrieval
2022cites this paper
CRET: Cross-Modal Retrieval Transformer for Efficient Text-Video Retrieval
2022cites this paper
Guided Graph Attention Learning for Video-Text Matching
2022cites this paper
DoSSIER at MedVidQA 2022: Text-based Approaches to Medical Video Answer Localization Problem
2022cites this paper
MHMS: Multimodal Hierarchical Multimedia Summarization
2022cites this paper
Video Captioning: a comparative review of where we are and which could be the route
2022cites this paper
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval
2022cites this paper
Enhanced Video BERT for Fast Video Advertisement Retrieval
2022cites this paper
SQ2SV: Sequential Queries to Sequential Videos retrieval
2022cites this paper
ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound
2022cites this paper
Multi-task Ranking with User Behaviors for Text-video Search
2022cites this paper
Level-wise aligned dual networks for text–video retrieval
2022cites this paper
MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization
2022cites this paper
Parallel Pathway Dense Video Captioning With Deformable Transformer
2022cites this paper
Video-Text Representation Learning via Differentiable Weak Temporal Alignment
2022cites this paper
Hybrid Contrastive Quantization for Efficient Cross-View Video Retrieval
2022cites this paper
Coarse-to-fine dual-level attention for video-text cross modal retrieval
2022cites this paper
Cross-language multimodal scene semantic guidance and leap sampling for video captioning
2022cites this paper
Reading-Strategy Inspired Visual Representation Learning for Text-to-Video Retrieval
2022cites this paper
Semantics-Consistent Cross-domain Summarization via Optimal Transport Alignment
2022cites this paper
Differentiate Visual Features with Guidance Signals for Video Captioning
2022cites this paper
Adversarial Multi-Grained Embedding Network for Cross-Modal Text-Video Retrieval
2022cites this paper
Tree-based Text-Vision BERT for Video Search in Baidu Video Advertising
2022cites this paper
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations
2022cites this paper
Differentiated Attention with Multi-modal Reasoning for Video Question Answering
2022cites this paper
Visual and language semantic hybrid enhancement and complementary for video description
2022cites this paper
FeatInter: Exploring fine-grained object features for video-text retrieval
2022cites this paper
CenterCLIP: Token Clustering for Efficient Text-Video Retrieval
2022cites this paper
Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions
2021cites this paper
CLIP Meets Video Captioners: Attribute-Aware Representation Learning Promotes Accurate Captioning
2021cites this paper
Bridging Vision and Language from the Video-to-Text Perspective: A Comprehensive Review
2021cites this paper
MDMMT: Multidomain Multimodal Transformer for Video Retrieval
2021cites this paper
SUTD-TrafficQA: A Question Answering Benchmark and an Efficient Network for Video Reasoning over Traffic Events
2021cites this paper
SwAMP: Swapped Assignment of Multi-Modal Pairs for Cross-Modal Retrieval
2021cites this paper
HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval
2021cites this paper
Aligning Source Visual and Target Language Domains for Unpaired Video Captioning
2021cites this paper
VGNMN: Video-grounded Neural Module Networks for Video-Grounded Dialogue Systems
2021cites this paper
CLIP Meets Video Captioning: Concept-Aware Representation Learning Does Matter
2021cites this paper
Deep Learning Methods for Sign Language Translation
2021cites this paper
Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling
2021cites this paper
VGNMN: Video-grounded Neural Module Network to Video-Grounded Language Tasks
2021cites this paper
Fine-grained Cross-modal Alignment Network for Text-Video Retrieval
2021cites this paper
Learning from Inside: Self-driven Siamese Sampling and Reasoning for Video Question Answering
2021cites this paper
Pairwise VLAD Interaction Network for Video Question Answering
2021influential citation