Microsoft COCO Captions: Data Collection and Evaluation Server

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, C. L. Zitnick

Published 2015 in arXiv.org

ABSTRACT

In this paper we describe the Microsoft COCO Caption dataset and evaluation server. When completed, the dataset will contain over one and a half million captions describing over 330,000 images. For the training and validation images, five independent human-generated captions will be provided. To ensure consistency in the evaluation of automatic caption generation algorithms, an evaluation server is used. The evaluation server receives candidate captions and scores them using several popular metrics, including BLEU, METEOR, ROUGE and CIDEr. Instructions for using the evaluation server are provided.
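To make the idea of scoring a candidate caption against several human references concrete, the sketch below computes clipped n-gram precision, the building block of BLEU. It is an illustration only, not the evaluation server's code: the function names and the example captions are hypothetical, tokenization is a plain whitespace split, and the brevity penalty and the geometric mean over n-gram orders that full BLEU uses are omitted.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, references, n=1):
    """Clipped n-gram precision of one candidate against several references.

    Each candidate n-gram is credited at most as many times as it occurs
    in the single reference where it is most frequent.
    """
    cand_counts = ngrams(candidate.lower().split(), n)
    if not cand_counts:
        return 0.0
    # Maximum count of each n-gram over the individual references.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.lower().split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return matched / sum(cand_counts.values())


# Hypothetical captions; the dataset provides five references per
# training and validation image, two are enough to show the mechanics.
references = [
    "a man riding a horse on a beach",
    "a person rides a horse along the shore",
]
candidate = "a man rides a horse"

unigram_precision = clipped_precision(candidate, references, n=1)
bigram_precision = clipped_precision(candidate, references, n=2)
print(unigram_precision, bigram_precision)
```

Pooling the references this way means a candidate can earn credit for wording that appears in any one of the human captions, which is one reason multiple independent references per image are useful.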

PUBLICATION RECORD

Publication year
2015
Venue
arXiv.org
Publication date
2015-04-01
Fields of study
Computer Science
Identifiers
arXiv 1504.00325
Source metadata
Semantic Scholar

REFERENCES

Déjà Image-Captions: A Corpus of Expressive Descriptions in Repetition
2015, cited by this paper
Phrase-based Image Captioning
2015, cited by this paper
Combining Language and Vision with a Multimodal Skip-gram Model
2015, cited by this paper
Improving Image-Sentence Embeddings Using Large Weakly Annotated Photo Collections
2014, cited by this paper
Show and tell: A neural image caption generator
2014, cited by this paper
Deep Fragment Embeddings for Bidirectional Image Sentence Mapping
2014, cited by this paper
From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions
2014, cited by this paper
AutoCaption: Automatic caption generation for personal photos
2014, cited by this paper
Explain Images with Multimodal Recurrent Neural Networks
2014, cited by this paper
Long-term recurrent convolutional networks for visual recognition and description
2014, cited by this paper
Meteor Universal: Language Specific Translation Evaluation for Any Target Language
2014, cited by this paper
Simple Image Description Generator via a Linear Phrase-Based Approach
2014, cited by this paper
Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models
2014, cited by this paper
Deep visual-semantic alignments for generating image descriptions
2014, cited by this paper
CIDEr: Consensus-based image description evaluation
2014, influential reference
From captions to visual concepts and back
2014, cited by this paper
Comparing Automatic Evaluation Measures for Image Description
2014, cited by this paper
Multimodal Neural Language Models
2014, cited by this paper
Nonparametric Method for Data-driven Image Captioning
2014, cited by this paper
TreeTalk: Composition and Compression of Trees for Image Descriptions
2014, cited by this paper
The Stanford CoreNLP Natural Language Processing Toolkit
2014, influential reference
Microsoft COCO: Common Objects in Context
2014, cited by this paper
Is this a wampimuk? Cross-modal mapping between distributional semantics and the visual world
2014, cited by this paper
Learning a Recurrent Visual Representation for Image Caption Generation
2014, cited by this paper
Automatic Caption Generation for News Images
2013, cited by this paper
Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics
2013, influential reference
BabyTalk: Understanding and Generating Simple Image Descriptions
2013, cited by this paper
Image Description using Visual Dependency Representations
2013, cited by this paper
Collective Generation of Natural Image Descriptions
2012, cited by this paper
Distributional Semantics in Technicolor
2012, cited by this paper
Choosing Linguistics over Vision to Describe Images
2012, cited by this paper
ImageNet classification with deep convolutional neural networks
2012, cited by this paper
Midge: Generating Image Descriptions From Computer Vision Detections
2012, cited by this paper
Im2Text: Describing Images Using 1 Million Captioned Photographs
2011, cited by this paper
Corpus-Guided Sentence Generation of Natural Images
2011, cited by this paper
Every Picture Tells a Story: Generating Sentences from Images
2010, cited by this paper
ImageNet: A large-scale hierarchical image database
2009, cited by this paper
Re-evaluating the Role of Bleu in Machine Translation Research
2006, cited by this paper
The IAPR TC-12 Benchmark: A New Evaluation Resource for Visual Information Systems
2006, cited by this paper
ROUGE: A Package for Automatic Evaluation of Summaries
2004, cited by this paper
Matching Words and Pictures
2003, cited by this paper
A Model for Learning the Semantics of Pictures
2003, cited by this paper
Bleu: a Method for Automatic Evaluation of Machine Translation
2002, cited by this paper
Learning the semantics of words and pictures
2001, cited by this paper
Long Short-Term Memory
1997, cited by this paper
WordNet: A Lexical Database for English
1995, cited by this paper

CITED BY

Q Cache: Visual Attention is Valuable in Less than Half of Decode Layers for Multimodal Large Language Model
2026, cites this paper
Concept Heterogeneity-aware Representation Steering
2026, cites this paper
Dataset and benchmark for captioning images depicting complex construction activities
2026, cites this paper
SANEval: Open-Vocabulary Compositional Benchmarks with Failure-mode Diagnosis
2026, cites this paper
TechING: Towards Real World Technical Image Understanding via VLMs
2026, cites this paper
Underrepresented in Foundation Model Pretraining Data? A One-Shot Probe
2026, cites this paper
Semantic Leakage from Image Embeddings
2026, cites this paper
Art2Mus: Artwork-to-Music Generation via Visual Conditioning and Large-Scale Cross-Modal Alignment
2026, cites this paper
See and Fix the Flaws: Enabling VLMs and Diffusion Models to Comprehend Visual Artifacts via Agentic Data Synthesis
2026, influential citation
ReasonEdit: Editing Vision-Language Models using Human Reasoning
2026, influential citation
UltraEval-Audio: A Unified Framework for Comprehensive Evaluation of Audio Foundation Models
2026, influential citation
UML: uncertainty-aware and mutual learning for noise-robust cross-lingual cross-modal retrieval
2026, cites this paper
LoGoSeg: Integrating Local and Global Features for Open-Vocabulary Semantic Segmentation
2026, cites this paper
Uncertainty Quantification for Multimodal Large Language Models with Incoherence-adjusted Semantic Volume
2026, cites this paper
Revisiting Multi-Task Visual Representation Learning
2026, cites this paper
UEval: A Benchmark for Unified Multimodal Generation
2026, cites this paper
A Geometry-Aware Efficient Algorithm for Compositional Entropic Risk Minimization
2026, cites this paper
Xray-Visual Models: Scaling Vision models on Industry Scale Data
2026, cites this paper
3D-DRES: Detailed 3D Referring Expression Segmentation
2026, cites this paper
Quant Experts: Token-aware Adaptive Error Reconstruction with Mixture of Experts for Large Vision-Language Models Quantization
2026, cites this paper
Beyond Next-Token Alignment: Distilling Multimodal Large Language Models via Token Interactions
2026, cites this paper
A Systematic Study of Data Modalities and Strategies for Co-training Large Behavior Models for Robot Manipulation
2026, cites this paper
FedUMM: A General Framework for Federated Learning with Unified Multimodal Models
2026, cites this paper
Hierarchical Refinement of Universal Multimodal Attacks on Vision-Language Models
2026, cites this paper
Quality Evaluation of AI-Generated Images: Subjective Study and Objective Methodology
2026, cites this paper
Reasoning text-to-image retrieval with large language models and digital twin representations
2026, cites this paper
MixFusion: A Patch-Level Parallel Serving System for Mixed-Resolution Diffusion Models
2026, cites this paper
Generative Engine Optimization: A VLM and Agent Framework for Pinterest Acquisition Growth
2026, cites this paper
Reinforced Curriculum Pre-Alignment for Domain-Adaptive VLMs
2026, influential citation
Traffic flow information detection under resource limitations based on framework optimization
2026, cites this paper
Overcoming the Retrieval Barrier: Indirect Prompt Injection in the Wild for LLM Systems
2026, influential citation
Construction site fall hazard identification and automated captioning using adapted vision-language models
2026, cites this paper
Unknown Category Classification by Transferring Knowledge From Known
2026, cites this paper
Multimodal learning with next-token prediction for large multimodal models
2026, cites this paper
ObjEmbed: Towards Universal Multimodal Object Embeddings
2026, cites this paper
RMFlow: Refined Mean Flow by a Noise-Injection Step for Multimodal Generation
2026, cites this paper
Emergent Language Symbolic Autoencoder (ELSA) with weak supervision to model hierarchical brain networks
2026, cites this paper
Sparrow: Text-Anchored Window Attention with Visual-Semantic Glimpsing for Speculative Decoding in Video LLMs
2026, cites this paper
Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning
2026, cites this paper
CAPability: A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness
2025, cites this paper
Mimic In-Context Learning for Multimodal Tasks
2025, cites this paper
Rethinking Natural Language Generation with Layer-Wise Multi-View Decoding
2025, cites this paper
V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric Capabilities in Multimodal Large Language Models
2025, cites this paper
XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery?
2025, cites this paper
Perception in Reflection
2025, cites this paper
Accelerating LLM Inference with Flexible N:M Sparsity via A Fully Digital Compute-in-Memory Accelerator
2025, cites this paper
MAVERIX: Multimodal Audio-Visual Evaluation and Recognition IndeX
2025, cites this paper
Coeff-Tuning: A Graph Filter Subspace View for Tuning Attention-Based Large Models
2025, cites this paper
CAFE: Unifying Representation and Generation with Contrastive-Autoregressive Finetuning
2025, cites this paper
Unified Multimodal Discrete Diffusion
2025, cites this paper
UniCrossAdapter: Multimodal Adaptation of CLIP for Radiology Report Generation
2025, cites this paper
Gemma 3 Technical Report
2025, cites this paper
GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers
2025, cites this paper
Mitigating Low-Level Visual Hallucinations Requires Self-Awareness: Database, Model, and Training Strategy
2025, cites this paper
Harmonizing Visual Representations for Unified Multimodal Understanding and Generation
2025, cites this paper
Sparse-Guided Partial Dense for Cross-Modal Remote Sensing Image–Text Retrieval
2025, cites this paper
It’s a (Blind) Match! Towards Vision-Language Correspondence without Parallel Data
2025, cites this paper
HA-FGOVD: Highlighting Fine-Grained Attributes via Explicit Linear Composition for Open-Vocabulary Object Detection
2025, cites this paper
Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions
2025, cites this paper
Multimodal artificial intelligence approaches using large language models for expert‐level landslide image analysis
2025, cites this paper
HyperCore: The Core Framework for Building Hyperbolic Foundation Models with Comprehensive Modules
2025, cites this paper
Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation
2025, cites this paper
BadToken: Token-level Backdoor Attacks to Multi-modal Large Language Models
2025, cites this paper
Aerial Mirage: Unmasking Hallucinations in Large Vision Language Models
2025, cites this paper
Dynamic Relation Inference via Verb Embeddings
2025, cites this paper
Balanced-Simplified Spatiotemporal Memory Attention for Image Captioning
2025, cites this paper
Can Large Vision Language Models Read Maps Like a Human?
2025, cites this paper
4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding
2025, cites this paper
A Survey on Remote Sensing Foundation Models: From Vision to Multimodality
2025, cites this paper
Towards Explainable Fake Image Detection with Multi-Modal Large Language Models
2025, cites this paper
Neurons: Emulating the Human Visual Cortex Improves Fidelity and Interpretability in fMRI-to-Video Reconstruction
2025, cites this paper
Spatiotemporal-Aware Visual Captioning using Vision-Language Pre-Training Model
2025, cites this paper
FlowTok: Flowing Seamlessly Across Text and Image Tokens
2025, cites this paper
DiffMEL: A large-scale difficulty-graded dataset for Multimodal Entity Linking
2025, cites this paper
Scaling Laws for Conditional Emergence of Multilingual Image Captioning via Generalization from Translation
2025, influential citation
EventLens: Enhancing Visual Commonsense Reasoning by Leveraging Event-Aware Pretraining and Cross-modal Linking
2025, cites this paper
RONA: Pragmatically Diverse Image Captioning with Coherence Relations
2025, cites this paper
Emotion-Oriented Cross-Modal Prompting and Alignment for Human-Centric Emotional Video Captioning
2025, cites this paper
Stealthy Backdoor Attack in Self-Supervised Learning Vision Encoders for Large Vision Language Models
2025, cites this paper
UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface
2025, cites this paper
Task-Agnostic Attacks Against Vision Foundation Models
2025, cites this paper
Capability Instruction Tuning: A New Paradigm for Dynamic LLM Routing
2025, cites this paper
InterFeedback: Unveiling Interactive Intelligence of Large Multimodal Models via Human Feedback
2025, cites this paper
Co-InsCap: Collaborative Detection and Image Caption for Defective Insulator of High-Speed Railway Catenary
2025, cites this paper
Are Large Vision Language Models Good Game Players?
2025, cites this paper
Language-Guided Visual Perception Disentanglement for Image Quality Assessment and Conditional Image Generation
2025, cites this paper
SuperCap: Multi-resolution Superpixel-based Image Captioning
2025, influential citation
LongProLIP: A Probabilistic Vision-Language Model with Long Context Text
2025, influential citation
Teaching LMMs for Image Quality Scoring and Interpreting
2025, cites this paper
Role Bias in Diffusion Models: Diagnosing and Mitigating through Intermediate Decomposition
2025, cites this paper
Improving Multimodal Large Language Models through Combining Resampler and MLP Projections
2025, cites this paper
Non-Autoregressive Image Captioning with Multi-Label Classification and Self-Critical Sequence Training
2025, influential citation
OmniDiff: A Comprehensive Benchmark for Fine-grained Image Difference Captioning
2025, cites this paper
Cyclic Contrastive Knowledge Transfer for Open-Vocabulary Object Detection
2025, cites this paper
Hyperbolic Safety-Aware Vision-Language Models
2025, cites this paper
Compose Your Aesthetics: Empowering Text-to-Image Models with the Principles of Art
2025, influential citation
Scale Efficient Training for Large Datasets
2025, cites this paper
Deeply Supervised Flow-Based Generative Models
2025, cites this paper
Individual gaze predicts individual scene descriptions
2025, cites this paper
Fine-Grained Captioning of Long Videos through Scene Graph Consolidation
2025, influential citation