Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics

Published 2013 in Journal of Artificial Intelligence Research

ABSTRACT

The ability to associate images with natural language sentences that describe what is depicted in them is a hallmark of image understanding, and a prerequisite for applications such as sentence-based image search. In analogy to image search, we propose to frame sentence-based image annotation as the task of ranking a given pool of captions. We introduce a new benchmark collection for sentence-based image description and search, consisting of 8,000 images that are each paired with five different captions which provide clear descriptions of the salient entities and events. We introduce a number of systems that perform quite well on this task, even though they are only based on features that can be obtained with minimal supervision. Our results clearly indicate the importance of training on multiple captions per image, and of capturing syntactic (word order-based) and semantic features of these captions. We also perform an in-depth comparison of human and automatic evaluation metrics for this task, and propose strategies for collecting human judgments cheaply and on a very large scale, allowing us to augment our collection with additional relevance judgments of which captions describe which image. Our analysis shows that metrics that consider the ranked list of results for each query image or sentence are significantly more robust than metrics that are based on a single response per query. Moreover, our study suggests that the evaluation of ranking-based image description systems may be fully automated.
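The abstract's contrast between metrics computed over the full ranked list and metrics based on a single response per query can be illustrated with a small sketch. This is not the authors' evaluation code. The score matrix, the `gold` indices, and the helper names `rank_of_gold` and `recall_at_k` are hypothetical, and the sketch assumes each query image has exactly one correct caption in the pool.

```python
from statistics import median

def rank_of_gold(scores, gold_index):
    """Return the 1-based rank of the gold caption for one query image.

    `scores` holds one similarity score per caption in the pool; a higher
    score means a better match. Ties are resolved in favour of the gold item.
    """
    gold_score = scores[gold_index]
    return 1 + sum(1 for s in scores if s > gold_score)

def recall_at_k(ranks, k):
    """Fraction of queries whose gold caption appears in the top k results."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Hypothetical toy data: rows are query images, columns are pool captions.
scores = [
    [0.9, 0.2, 0.1, 0.4],
    [0.3, 0.5, 0.6, 0.1],
    [0.2, 0.1, 0.8, 0.7],
    [0.6, 0.5, 0.4, 0.3],
]
gold = [0, 1, 2, 3]  # assumed index of the correct caption for each image

ranks = [rank_of_gold(row, g) for row, g in zip(scores, gold)]

# Single-response view: only the top-ranked caption counts.
top1 = recall_at_k(ranks, 1)

# Ranked-list view: credit depends on where the gold caption lands.
top2 = recall_at_k(ranks, 2)
median_rank = median(ranks)

print(ranks, top1, top2, median_rank)
```

In this toy case the second query counts as a plain failure under the single-response view, while the ranked-list measures record that its gold caption was a near miss at rank 2.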

PUBLICATION RECORD

Publication year: 2013
Venue: Journal of Artificial Intelligence Research
Publication date: 2013-05-01
Fields of study: Computer Science
DOI: 10.1613/jair.3994
Source metadata: Semantic Scholar
