Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models

Bryan A. Plummer, Liwei Wang, Christopher M. Cervantes, Juan C. Caicedo, J. Hockenmaier, Svetlana Lazebnik

Published 2015 in International Journal of Computer Vision

ABSTRACT

The Flickr30k dataset has become a standard benchmark for sentence-based image description. This paper presents Flickr30k Entities, which augments the 158k captions from Flickr30k with 244k coreference chains, linking mentions of the same entities across different captions for the same image, and associating them with 276k manually annotated bounding boxes. Such annotations are essential for continued progress in automatic image description and grounded language understanding. They enable us to define a new benchmark for localization of textual entity mentions in an image. We present a strong baseline for this task that combines an image-text embedding, detectors for common objects, a color classifier, and a bias towards selecting larger objects. While our baseline rivals in accuracy more complex state-of-the-art models, we show that its gains cannot be easily parlayed into improvements on such tasks as image-sentence retrieval, thus underlining the limitations of current methods and the need for further research.
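
The baseline described in the abstract combines several cues to pick an image region for a textual mention. The following is a minimal sketch of one way such cues could be combined, assuming a simple weighted sum over candidate boxes. The `Candidate` fields, the `weights` dictionary, and the use of relative box area as the size bias are all hypothetical choices made for illustration; the abstract does not specify how the cues are actually combined.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Candidate:
    """One candidate region for a phrase (all scores are hypothetical inputs)."""
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
    embedding_sim: float   # assumed phrase-region similarity from an image-text embedding
    detector_score: float  # assumed confidence from a common-object detector
    color_score: float     # assumed confidence from a color classifier


def box_area(box: Tuple[float, float, float, float]) -> float:
    """Area of an axis-aligned box; degenerate boxes have area 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def score(c: Candidate, image_area: float, weights: Dict[str, float]) -> float:
    """Weighted sum of the four cues (an assumption, not the paper's formula)."""
    size_prior = box_area(c.box) / image_area  # larger boxes score higher
    return (
        weights["embedding"] * c.embedding_sim
        + weights["detector"] * c.detector_score
        + weights["color"] * c.color_score
        + weights["size"] * size_prior
    )


def localize(candidates: List[Candidate], image_area: float,
             weights: Dict[str, float]) -> Candidate:
    """Return the highest-scoring candidate region for the phrase."""
    return max(candidates, key=lambda c: score(c, image_area, weights))
```

In this sketch, setting a weight to zero switches the corresponding cue off, which makes it easy to see how much each cue changes the selected box.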

PUBLICATION RECORD

Publication year
2015
Venue
International Journal of Computer Vision
Publication date
2015-05-19
Fields of study
Computer Science
Identifiers
DOI 10.1007/s11263-016-0965-7; arXiv 1505.04870
Source metadata
Semantic Scholar
