Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge

Damien Teney, Peter Anderson, Xiaodong He, Anton van den Hengel

First published in 2017 (arXiv preprint); appeared in the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition

ABSTRACT

Deep Learning has had a transformative impact on Computer Vision, but for all of the success there is also a significant cost. This is that the models and procedures used are so complex and intertwined that it is often impossible to distinguish the impact of the individual design and engineering choices each model embodies. This ambiguity diverts progress in the field, and leads to a situation where developing a state-of-the-art model is as much an art as a science. As a step towards addressing this problem we present a massive exploration of the effects of the myriad architectural and hyperparameter choices that must be made in generating a state-of-the-art model. The model is of particular interest because it won the 2017 Visual Question Answering Challenge. We provide a detailed analysis of the impact of each choice on model performance, in the hope that it will inform others in developing models, but also that it might set a precedent that will accelerate scientific progress in the field.

PUBLICATION RECORD

Publication year: 2017
Venue: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
Publication date: 2017-08-09
Fields of study: Computer Science
Identifiers: DOI 10.1109/CVPR.2018.00444; arXiv 1708.02711
Source metadata: Semantic Scholar

REFERENCES

Inferring and Executing Programs for Visual Reasoning
2017cited by this paper
Show, Ask, Attend, and Answer: A Strong Baseline For Visual Question Answering
2017influential reference
Learning to Reason: End-to-End Module Networks for Visual Question Answering
2017cited by this paper
Bottom-Up and Top-Down Attention for Image Captioning and VQA
2017influential reference
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
2017cited by this paper
MUTAN: Multimodal Tucker Fusion for Visual Question Answering
2017cited by this paper
Learning to Compose Neural Networks for Question Answering
2016cited by this paper
Language Modeling with Gated Convolutional Networks
2016cited by this paper
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
2016influential reference
Zero-Shot Visual Question Answering
2016influential reference
CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning
2016cited by this paper
Graph-Structured Representations for Visual Question Answering
2016cited by this paper
Identity Mappings in Deep Residual Networks
2016cited by this paper
Hierarchical Question-Image Co-Attention for Visual Question Answering
2016cited by this paper
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
2016cited by this paper
Visual Dialog
2016cited by this paper
TrimZero: A Torch Recurrent Module for Efficient Natural Language Processing
2016cited by this paper
Visual question answering: A survey of methods and datasets
2016cited by this paper
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding
2016cited by this paper
VQA: Visual Question Answering
2015influential reference
Ask Your Neurons: A Neural-Based Approach to Answering Questions about Images
2015cited by this paper
Neural Module Networks
2015cited by this paper
Visual7W: Grounded Question Answering in Images
2015cited by this paper
Yin and Yang: Balancing and Answering Binary Visual Questions
2015cited by this paper
Stacked Attention Networks for Image Question Answering
2015cited by this paper
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
2015cited by this paper
Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering
2015cited by this paper
Highway Networks
2015cited by this paper
ABC-CNN: An Attention Based Convolutional Neural Network for Visual Question Answering
2015cited by this paper
Compositional Memory for Visual Question Answering
2015cited by this paper
Deep Residual Learning for Image Recognition
2015influential reference
Microsoft COCO: Common Objects in Context
2014cited by this paper
Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation
2014cited by this paper
Towards a Visual Turing Challenge
2014cited by this paper
GloVe: Global Vectors for Word Representation
2014cited by this paper
From captions to visual concepts and back
2014cited by this paper
Show and tell: A neural image caption generator
2014cited by this paper
ADADELTA: An Adaptive Learning Rate Method
2012cited by this paper
Zipf's law and the Internet
2002cited by this paper

CITED BY

CSAM: Capsule spatial attention mask network for visual question answering
2026cites this paper
(GRAVITY) Graph-Based Reasoning With Attention and Visual Information Using Transformers for Yielding Answers
2025influential citation
Adaptive sparse triple convolutional attention for enhanced visual question answering
2025cites this paper
Hadamard Product in Deep Learning: Introduction, Advances and Challenges
2025cites this paper
OPeMer: One-Shot LLM Prompting and Execution-Driven Multimodal Explainable Reasoning
2025cites this paper
A Systematic Comparison of Text and Image Encoders for Visual Question Answering: From RNN to LLM-Based Representations
2025cites this paper
Non-Autoregressive Multimodal Machine Translation
2025cites this paper
ASAM: Asynchronous self-attention model for visual question answering
2025cites this paper
BERT-VQA: Visual Question Answering on Plots
2025cites this paper
Manager: Aggregating Insights From Unimodal Experts in Two-Tower VLMs and MLLMs
2025cites this paper
In Defense of Character-Level Answer Generation Methods for Text-based Visual Question Answering
2025cites this paper
SAFFNet: self-attention based on Fourier frequency domain filter network for visual question answering
2025cites this paper
IP-VQA Dataset: Empowering Precision Agriculture with Autonomous Insect Pest Management through Visual Question Answering
2025cites this paper
Advanced Visual and Textual Co-context Aware Attention Network with Dependent Multimodal Fusion Block for Visual Question Answering
2024cites this paper
Image Captioning via Dynamic Path Customization
2024cites this paper
Instruction Makes a Difference
2024cites this paper
Unraveling the Black Box: A Review of Explainable Deep Learning Healthcare Techniques
2024cites this paper
ViOCRVQA: novel benchmark dataset and VisionReader for visual question answering by understanding Vietnamese text in images
2024cites this paper
Synergy and Diversity in CLIP: Enhancing Performance Through Adaptive Backbone Ensembling
2024cites this paper
Multimodal Rationales for Explainable Visual Question Answering
2024cites this paper
Bridging the Cross-Modality Semantic Gap in Visual Question Answering
2024cites this paper
Enhanced Textual Feature Extraction for Visual Question Answering: A Simple Convolutional Approach
2024cites this paper
A Transformer-Based Approach for Effective Visual Question Answering
2024cites this paper
Towards bias-aware visual question answering: Rectifying and mitigating comprehension biases
2024cites this paper
Multimodal attention-driven visual question answering for Malayalam
2024cites this paper
Enhanced Visual Question Answering: A Comparative Analysis and Textual Feature Extraction Via Convolutions
2024cites this paper
A Comprehensive Survey on Visual Question Answering Datasets and Algorithms
2024cites this paper
An Improved Medical Visual Question Answering Model Based on CLIP and BERT
2024influential citation
Enhancing Visual Question Answering through Bi-Modal Feature Fusion: Performance Analysis
2024cites this paper
LRCN: Layer-residual Co-Attention Networks for visual question answering
2024cites this paper
ArabicQuest: Enhancing Arabic Visual Question Answering with LLM Fine-Tuning
2024cites this paper
Visual-O1: Understanding Ambiguous Instructions via Multi-modal Multi-turn Chain-of-thoughts Reasoning
2024cites this paper
Image captioning System Using LSTM and VGG16
2024cites this paper
Natural Language Understanding and Inference with MLLM in Visual Question Answering: A Survey
2024cites this paper
Convincing Rationales for Visual Question Answering Reasoning
2024cites this paper
PGCL: Prompt guidance and self-supervised contrastive learning-based method for Visual Question Answering
2024cites this paper
Graph convolutional network for difficulty-controllable visual question generation
2023cites this paper
Multi-Granularity Cross-Attention Network for Visual Question Answering
2023cites this paper
Image Processing Based Intelligent Mini Robotic Face Recognition System
2023cites this paper
Positional Attention Guided Transformer-Like Architecture for Visual Question Answering
2023cites this paper
A Critical Analysis of Benchmarks, Techniques, and Models in Medical Visual Question Answering
2023cites this paper
From Image to Language: A Critical Analysis of Visual Question Answering (VQA) Approaches, Challenges, and Opportunities
2023cites this paper
Integrating multimodal features by a two-way co-attention mechanism for visual question answering
2023influential citation
Multi-Branch Distance-Sensitive Self-Attention Network for Image Captioning
2023cites this paper
Multi-Clue Reasoning with Memory Augmentation for Knowledge-based Visual Question Answering
2023cites this paper
Enhancing visual question answering with a two‐way co‐attention mechanism and integrated multimodal features
2023cites this paper
Generating Context-Aware Natural Answers for Questions in 3D Scenes
2023cites this paper
ViCLEVR: a visual reasoning dataset and hybrid multimodal fusion model for visual question answering in Vietnamese
2023cites this paper
Knowing What it is: Semantic-Enhanced Dual Attention Transformer
2023cites this paper
Multi-modal spatial relational attention networks for visual question answering
2023cites this paper
Deep Residual Weight-Sharing Attention Network With Low-Rank Attention for Visual Question Answering
2023cites this paper
Multi-stage reasoning on introspecting and revising bias for visual question answering
2023cites this paper
VTQAGen: BART-based Generative Model For Visual Text Question Answering
2023cites this paper
MPCCT: Multimodal vision-language learning paradigm with context-based compact Transformer
2023cites this paper
Composed Image Retrieval via Cross Relation Network With Hierarchical Aggregation Transformer
2023cites this paper
OECA-Net: A co-attention network for visual question answering based on OCR scene text feature enhancement
2023cites this paper
Dual-feature collaborative relation-attention networks for visual question answering
2023cites this paper
The multi-modal fusion in visual question answering: a review of attention mechanisms
2023cites this paper
Visual Question Answering: A Survey on Techniques and Common Trends in Recent Literature
2023cites this paper
Attention-Based Methods For Audio Question Answering
2023cites this paper
Generic Attention-model Explainability by Weighted Relevance Accumulation
2023cites this paper
Dual-decoder transformer network for answer grounding in visual question answering
2023cites this paper
EVJVQA Challenge: Multilingual Visual Question Answering
2023cites this paper
Nested Attention Network with Graph Filtering for Visual Question and Answering
2023cites this paper
Learning to Agree on Vision Attention for Visual Commonsense Reasoning
2023cites this paper
Learning visual question answering on controlled semantic noisy labels
2023cites this paper
Improving visual question answering for bridge inspection by pre‐training with external data of image–text pairs
2023cites this paper
SeiT: Storage-Efficient Vision Training with Tokens Using 1% of Pixel Storage
2023cites this paper
Neural Textual Features Composition for CBIR
2023cites this paper
VAQA: Visual Arabic Question Answering
2023cites this paper
OpenViVQA: Task, Dataset, and Multimodal Fusion Models for Visual Question Answering in Vietnamese
2023cites this paper
UNIMO-3: Multi-granularity Interaction for Vision-Language Representation Learning
2023cites this paper
ManagerTower: Aggregating the Insights of Uni-Modal Experts for Vision-Language Representation Learning
2023cites this paper
Improving Selective Visual Question Answering by Learning from Your Peers
2023cites this paper
Unsupervised Dual Modality Prompt Learning for Facial Expression Recognition
2023cites this paper
Co-attention graph convolutional network for visual question answering
2023cites this paper
PAT: Parallel Attention Transformer for Visual Question Answering in Vietnamese
2023cites this paper
LOIS: Looking Out of Instance Semantics for Visual Question Answering
2023cites this paper
Leveraging Graph-based Cross-modal Information Fusion for Neural Sign Language Translation
2022influential citation
Question-Driven Multiple Attention(DQMA) Model for Visual Question Answer
2022cites this paper
JGRCAN: A Visual Question Answering Co-Attention Network via Joint Grid-Region Features
2022influential citation
Modality Eigen-Encodings Are Keys to Open Modality Informative Containers
2022cites this paper
Multimodal Summarization Ph.D. Thesis Proposal
2022cites this paper
Toward Explainable 3D Grounded Visual Question Answering: A New Benchmark and Strong Baseline
2022cites this paper
A Survey on Graph Neural Networks and Graph Transformers in Computer Vision: A Task-Oriented Perspective
2022cites this paper
A Multi-level Mesh Mutual Attention Model for Visual Question Answering
2022influential citation
3DVQA: Visual Question Answering for 3D Environments
2022cites this paper
AMAM: An Attention-based Multimodal Alignment Model for Medical Visual Question Answering
2022cites this paper
Region Collaborative Network for Detection-Based Vision-Language Understanding
2022cites this paper
DoRO: Disambiguation of Referred Object for Embodied Agents
2022cites this paper
LaKo: Knowledge-driven Visual Question Answering via Late Knowledge-to-Text Injection
2022cites this paper
Fine-grained label learning in object detection with weak supervision of captions
2022cites this paper
EBMs vs. CL: Exploring Self-Supervised Visual Pretraining for Visual Question Answering
2022cites this paper
Predicting is not Understanding: Recognizing and Addressing Underspecification in Machine Learning
2022cites this paper
Guiding Visual Question Answering with Attention Priors
2022cites this paper
GLIPv2: Unifying Localization and Vision-Language Understanding
2022cites this paper
Ques-to-Visual Guided Visual Question Answering
2022cites this paper
Learning to Ask Informative Sub-Questions for Visual Question Answering
2022cites this paper
Bridge-Tower: Building Bridges Between Encoders in Vision-Language Representation Learning
2022influential citation
Path-Wise Attention Memory Network for Visual Question Answering
2022cites this paper