Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, Lei Zhang

First published in 2017 (arXiv preprint); appeared in the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition

ABSTRACT

Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge.
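The split described in the abstract (a bottom-up stage that supplies one feature vector per region, and a top-down stage that assigns a weight to each) can be illustrated with a small sketch. This is not the authors' implementation: the additive tanh scoring form, the parameter names `w_v`, `w_q`, `w_a`, the helper functions, and the toy sizes are all assumptions made for illustration, and random vectors stand in for Faster R-CNN region features.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def top_down_attention(region_features, query, w_v, w_q, w_a):
    """Weight bottom-up region features using a top-down query.

    region_features: k vectors of size d (stand-ins for detector outputs)
    query: task context of size q (e.g. a question or decoder state)
    w_v (h x d), w_q (h x q), w_a (h): hypothetical learned parameters
    Returns (weights, attended), where attended is the weighted sum.
    """
    projected_query = matvec(w_q, query)
    scores = []
    for feature in region_features:
        projected_feature = matvec(w_v, feature)
        hidden = [math.tanh(a + b)
                  for a, b in zip(projected_feature, projected_query)]
        scores.append(sum(w * h for w, h in zip(w_a, hidden)))
    weights = softmax(scores)  # one non-negative weight per region
    dim = len(region_features[0])
    attended = [
        sum(weight * feature[j]
            for weight, feature in zip(weights, region_features))
        for j in range(dim)
    ]
    return weights, attended


def random_matrix(rows, cols, rng):
    """Toy parameter or feature matrix with entries in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)]
            for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    k, d, q, h = 5, 8, 4, 6  # toy sizes, not the paper's
    regions = random_matrix(k, d, rng)
    query = [rng.uniform(-1.0, 1.0) for _ in range(q)]
    weights, attended = top_down_attention(
        regions, query,
        random_matrix(h, d, rng), random_matrix(h, q, rng),
        [rng.uniform(-1.0, 1.0) for _ in range(h)],
    )
    print("weights:", [round(w, 3) for w in weights])
    print("attended length:", len(attended))
```

The point of the sketch is only the division of labour: the set of regions is fixed before the task context is seen, and the context changes nothing but the weights over that set.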

PUBLICATION RECORD

Publication year
2017
Venue
2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
Publication date
2017-07-25
Fields of study
Computer Science
Identifiers
DOI: 10.1109/CVPR.2018.00636
arXiv: 1707.07998
Source metadata
Semantic Scholar

REFERENCES

Show, Ask, Attend, and Answer: A Strong Baseline For Visual Question Answering
2017, cited by this paper
Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge
2017, cited by this paper
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
2016, influential reference
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
2016, cited by this paper
Zero-Shot Visual Question Answering
2016, cited by this paper
Identity Mappings in Deep Residual Networks
2016, cited by this paper
Review Networks for Caption Generation
2016, cited by this paper
Language Modeling with Gated Convolutional Networks
2016, cited by this paper
Boosting Image Captioning with Attributes
2016, cited by this paper
Areas of Attention for Image Captioning
2016, cited by this paper
End-to-End Concept Word Detection for Video Captioning, Retrieval, and Question Answering
2016, cited by this paper
Self-Critical Sequence Training for Image Captioning
2016, influential reference
Hierarchical Question-Image Co-Attention for Visual Question Answering
2016, cited by this paper
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding
2016, cited by this paper
Image Captioning with Semantic Attention
2016, influential reference
SPICE: Semantic Propositional Image Caption Evaluation
2016, influential reference
Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning
2016, cited by this paper
Improved Image Captioning via Policy Gradient optimization of SPIDEr
2016, cited by this paper
Highway Networks
2015, cited by this paper
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
2015, cited by this paper
SSD: Single Shot MultiBox Detector
2015, cited by this paper
Microsoft COCO Captions: Data Collection and Evaluation Server
2015, cited by this paper
VQA: Visual Question Answering
2015, influential reference
Visual7W: Grounded Question Answering in Images
2015, cited by this paper
Spatial Transformer Networks
2015, cited by this paper
Stacked Attention Networks for Image Question Answering
2015, cited by this paper
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
2015, cited by this paper
Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering
2015, cited by this paper
Order-Embeddings of Images and Language
2015, influential reference
DenseCap: Fully Convolutional Localization Networks for Dense Captioning
2015, cited by this paper
Aligning where to see and what to tell: image caption with region-based attention and scene factorization
2015, cited by this paper
What Value Do Explicit High Level Concepts Have in Vision to Language Problems?
2015, cited by this paper
Deep Residual Learning for Image Recognition
2015, cited by this paper
You Only Look Once: Unified, Real-Time Object Detection
2015, cited by this paper
Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation
2014, cited by this paper
From captions to visual concepts and back
2014, influential reference
Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)
2014, cited by this paper
ImageNet Large Scale Visual Recognition Challenge
2014, influential reference
Microsoft COCO: Common Objects in Context
2014, cited by this paper
GloVe: Global Vectors for Word Representation
2014, cited by this paper
CIDEr: Consensus-based image description evaluation
2014, cited by this paper
Deep visual-semantic alignments for generating image descriptions
2014, cited by this paper
Meteor Universal: Language Specific Translation Evaluation for Any Target Language
2014, cited by this paper
Edge Boxes: Locating Object Proposals from Edges
2014, cited by this paper
Show and tell: A neural image caption generator
2014, influential reference
Long-term recurrent convolutional networks for visual recognition and description
2014, cited by this paper
Selective Search for Object Recognition
2013, cited by this paper
Maximum Expected BLEU Training of Phrase and Lexicon Translation Models
2012, cited by this paper
ADADELTA: An Adaptive Learning Rate Method
2012, cited by this paper
Top-Down Versus Bottom-Up Control of Attention in the Prefrontal and Posterior Parietal Cortices
2007, cited by this paper
ROUGE: A Package for Automatic Evaluation of Summaries
2004, cited by this paper
Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning
1992, cited by this paper
Control of goal-directed and stimulus-driven attention in the brain
2002, cited by this paper
Bleu: a Method for Automatic Evaluation of Machine Translation
2002, influential reference
Objects and attention: the state of the art
2001, cited by this paper
Long Short-Term Memory
1997, influential reference
Shifting visual attention between objects and locations: evidence from normal and parietal lesion subjects
1994, cited by this paper
Perceptual grouping and attention in visual search for features and for objects
1982, cited by this paper
A feature-integration theory of attention
1980, cited by this paper

CITED BY

Dual-space intervention for mitigating bias in robust visual question answering
2026, cites this paper
Semantic-Aware Remote Sensing Visual Question Answering via Segmentation-Guided Learning
2026, cites this paper
Unbiased diagnostic report generation via multi-modal counterfactual inference
2026, cites this paper
Focal equilibrium: Bias reshaping for generalizable and robust visual understanding
2026, cites this paper
Cross-multi-modal seamless training for image captioning
2026, cites this paper
ETV-Attack: Efficient Text-Driven Visual-Variable Adversarial Attacks on Visual Question Answering with Pre-trained Language Models
2026, cites this paper
Neighborhood-interference independent graph mechanisms for image-text matching
2026, cites this paper
Geo-TCAM: a Thangka captioning method integrating topic modeling with geometry-guided spatial attention
2026, cites this paper
Visualization methods for explainable medical imaging diagnosis: A survey
2026, cites this paper
SHED Light on Segmentation for Dense Prediction
2026, cites this paper
EarthVL: A Progressive Earth Vision-Language Understanding and Generation Framework
2026, cites this paper
Medical VLP Model Is Vulnerable: Toward Multimodal Adversarial Attack on Large Medical Vision-Language Models
2026, cites this paper
Large Foundation Model Empowered Region-aware Underwater Image Captioning
2026, cites this paper
Improving Episodic Few-shot Visual Question Answering via Spatial and Frequency Domain Dual-calibration
2026, cites this paper
Global Entity Relationship Enhancement Network for Multimodal Sarcasm Detection
2026, cites this paper
Privacy-preserving image captioning using virtual photon-limited imaging and federated learning
2026, cites this paper
Visual Knowledge-Enhanced LLaVA for Fine-Grained Multimodal Named Entity Recognition and Grounding
2026, influential citation
SGHA-Attack: Semantic-Guided Hierarchical Alignment for Transferable Targeted Attacks on Vision-Language Models
2026, cites this paper
SongEcho: Towards Cover Song Generation via Instance-Adaptive Element-wise Linear Modulation
2026, cites this paper
NeuroSymb-MRG: Differentiable Abductive Reasoning with Active Uncertainty Minimization for Radiology Report Generation
2026, cites this paper
AGMark: Attention-Guided Dynamic Watermarking for Large Vision-Language Models
2026, cites this paper
VTFCGNet: a novel cross-modal reasoning network integrating Fourier self-attention and graph attention for visual text question answering
2026, cites this paper
Construction site fall hazard identification and automated captioning using adapted vision-language models
2026, cites this paper
CMS-UNet: A CNN-Mamba Synergistic Dual-Encoder u-net for lung tumor segmentation in CT images
2026, cites this paper
NeSyVQA: Neurosymbolic Visual Question Answering With Knowledge-Enriched Scene Graphs
2026, cites this paper
ICQ-TransE: LLM-Enhanced Image-Caption-Question Translating Embeddings for Knowledge-Based Visual Question Answering
2026, cites this paper
Few-shot harmful meme detection via self-adaption mixture-of-experts
2026, cites this paper
PrQAC: Prompting LLaMA3 with question-aware image captions and answer candidates for knowledge-based VQA
2026, cites this paper
SSTrack: Joint scale-aware temporal prompts and spatio-temporal prior transformer for visual object tracking
2026, cites this paper
OACI: Object-aware contextual integration for image captioning
2026, cites this paper
Beyond One and Two Tower: Cross-Modal Consensus Learning for Image-Text Retrieval
2026, influential citation
Question-guided attention and cross-modal alignment for knowledge-based visual question answering
2026, cites this paper
HISF: Hierarchical Interactive Semantic Fusion for Multimodal Prompt Learning
2026, cites this paper
Mathematical Frameworks in Image Captioning: A Comprehensive Survey and Real-Time Processing Analysis
2026, influential citation
Dual-Stream Collaborative Transformer for Image Captioning
2026, influential citation
Omniscient bottom-up double-stream symmetric network for image captioning
2026, cites this paper
Rewarding Fine-grained Image Captioning with Keyword Group Contrastive
2026, cites this paper
Leveraging Textual-Cues for Enhancing Multimodal Sentiment Analysis by Object Recognition
2026, cites this paper
DHSA: dual hierarchical semantic alignment for visual question answering
2026, cites this paper
Automated Histopathology Report Generation via Pyramidal Feature Extraction and the UNI Foundation Model
2026, cites this paper
MedFusionT5: Cross-Modal Attention Boosts Semantic Quality and Reduces Hallucinations in Dental AI
2026, cites this paper
Refined Generation-Based Framework for Consistent and Reliable Visual Question Answering
2026, cites this paper
How Do Inpainting Artifacts Propagate to Language?
2026, cites this paper
DART: Disease-aware Image-Text Alignment and Self-correcting Re-alignment for Trustworthy Radiology Report Generation
2025, cites this paper
Multimodal artificial intelligence approaches using large language models for expert-level landslide image analysis
2025, cites this paper
Enhanced Video Captioning through Residual and Bottleneck CNNS with LSTM Integration
2025, cites this paper
Combination of Phrase Matchings based cross-modal retrieval
2025, cites this paper
Dual-visual collaborative enhanced transformer for image captioning
2025, cites this paper
Matryoshka Learning With Metric Transfer for Image-Text Matching
2025, cites this paper
Building Trustworthy Multimodal AI: A Review of Fairness, Transparency, and Ethics in Vision-Language Tasks
2025, cites this paper
Multi-View mutual learning network for multimodal fake news detection
2025, cites this paper
Ambiguity-Aware and High-order Relation learning for multi-grained image-text matching
2025, cites this paper
Semantic–Spatial Feature Fusion With Dynamic Graph Refinement for Remote Sensing Image Captioning
2025, cites this paper
De-Confounding Feature Fusion Transformer Network for Image Captioning in Assistive Navigation Applications for the Visually Impaired
2025, influential citation
Fully Semantic Gap Recovery for End-to-End Image Captioning
2025, cites this paper
QIRL: Boosting Visual Question Answering via Optimized Question-Image Relation Learning
2025, influential citation
Feedback-Enhanced Hallucination-Resistant Vision-Language Model for Real-Time Scene Understanding
2025, cites this paper
Enabling Collaborative Parametric Knowledge Calibration for Retrieval-Augmented Vision Question Answering
2025, cites this paper
Visual Question Answering: A Survey of Methods, Datasets, Evaluation, and Challenges
2025, influential citation
Concept Pinpoint Eraser for Text-to-image Diffusion Models via Residual Attention Gate
2025, cites this paper
AeroLite: Tag-Guided Lightweight Generation of Aerial Image Captions
2025, cites this paper
Hadamard Product in Deep Learning: Introduction, Advances and Challenges
2025, cites this paper
A Review of Automated Report Generation Technologies for Ophthalmic Medical Imaging
2025, cites this paper
Image-to-Text for Medical Reports Using Adaptive Co-Attention and Triple-LSTM Module
2025, cites this paper
PRISM-0: A Predicate-Rich Scene Graph Generation Framework for Zero-Shot Open-Vocabulary Tasks
2025, cites this paper
ChatBEV: A Visual Language Model that Understands BEV Maps
2025, cites this paper
CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era
2025, cites this paper
Aligning Information Capacity Between Vision and Language via Dense-to-Sparse Feature Distillation for Image-Text Matching
2025, cites this paper
Minimizing Disparities between Real and Pseudo Queries for Unsupervised Visual Grounding
2025, influential citation
Astrea: A MOE-based Visual Understanding Model with Progressive Alignment
2025, cites this paper
Towards Maximizing Semantic Coverage for Image-Text Retrieval
2025, cites this paper
Event-Driven Attention Network: A Cross-Modal Framework for Efficient Image-Text Retrieval in Mass Gathering Events
2025, cites this paper
Automated text annotation: a new paradigm for generalizable text-to-image person retrieval
2025, cites this paper
Zero-Shot, But at What Cost? Unveiling the Hidden Overhead of MILS's LLM-CLIP Framework for Image Captioning
2025, cites this paper
Multimodal Aspect-Based Sentiment Analysis with External Knowledge and Multi-granularity Image-Text Features
2025, cites this paper
AC-Lite: A Lightweight Image Captioning Model for Low-Resource Assamese Language
2025, influential citation
Efficient text-to-video retrieval via multi-modal multi-tagger derived pre-screening
2025, cites this paper
OverLoCK: An Overview-first-Look-Closely-next ConvNet with Context-Mixing Dynamic Kernels
2025, cites this paper
A Review on Vision-Language-Based Approaches: Challenges and Applications
2025, influential citation
A Dual-Layer Attention Based CAPTCHA Recognition Approach with Guided Visual Attention
2025, influential citation
RSIC-GMamba: A State-Space Model With Genetic Operations for Remote Sensing Image Captioning
2025, cites this paper
Deep Reciprocal Learning for Image Captioning
2025, cites this paper
Adaptive sparse triple convolutional attention for enhanced visual question answering
2025, cites this paper
Improving Domain Generalization for Image Captioning with Unsupervised Prompt Learning
2025, cites this paper
Robust data augmentation and contrast learning for debiased visual question answering
2025, cites this paper
VTIENet: visual-text information enhancement network for image captioning
2025, influential citation
Exploring Semantic Attributes for Image Caption Synthesis in Low-Resource Assamese Language
2025, influential citation
Predicate Hierarchies Improve Few-Shot State Classification
2025, influential citation
SeaCap: Multi-Sight Embedding and Alignment for One-Stage Image Captioner
2025, influential citation
Seeing Beyond: Enhancing Visual Question Answering with Multi-Modal Retrieval
2025, cites this paper
Enhanced automatic abstractive document summarization using transformers and sentence grouping
2025, cites this paper
Novel cross-dimensional coarse-fine-grained complementary network for image-text matching
2025, cites this paper
A Benchmark for Multi-Lingual Vision-Language Learning in Remote Sensing Image Captioning
2025, influential citation
Error-Aware Generative Reasoning for Zero-Shot Visual Grounding
2025, cites this paper
A Woman with a Knife or A Knife with a Woman? Measuring Directional Bias Amplification in Image Captions
2025, cites this paper
SuperCap: Multi-resolution Superpixel-based Image Captioning
2025, influential citation
A Continual Learning Approach for Embodied Question Answering with Generative Adversarial Imitation Learning
2025, influential citation
Bottleneck-Constrained Contrastive Decoupled Network for Multimodal Aspect-based Sentiment Classification
2025, cites this paper
Feature refinement and rethinking attention for remote sensing image captioning
2025, cites this paper
GeoSCN: A Novel multimodal self-attention to integrate geometric information on spatial-channel network for fine-grained image captioning
2025, cites this paper