Visual Question Generation as Dual Task of Visual Question Answering

Yikang Li, Nan Duan, Bolei Zhou, Xiao Chu, Wanli Ouyang, Xiaogang Wang

Published 2017 in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition

ABSTRACT

Visual question answering (VQA) and visual question generation (VQG) are two trending topics in computer vision, but they are usually explored separately despite their intrinsic complementary relationship. In this paper, we propose an end-to-end unified model, the Invertible Question Answering Network (iQAN), which introduces question generation as a dual task of question answering to improve VQA performance. With our proposed invertible bilinear fusion module and parameter sharing scheme, iQAN can accomplish VQA and its dual task VQG simultaneously. By being jointly trained on the two tasks with our proposed dual regularizers (termed Dual Training), our model gains a better understanding of the interactions among images, questions and answers. After training, iQAN can take either a question or an answer as input and output the counterpart. Evaluated on the CLEVR and VQA2 datasets, iQAN improves the top-1 accuracy of the prior art MUTAN VQA method by 1.33% and 0.88% (absolute increase), respectively. We also show that our proposed dual training framework can consistently improve the performance of many popular VQA architectures.
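
The Dual Training objective described in the abstract can be sketched as a sum of a VQA loss, a VQG loss, and a dual regularizer. The abstract only names "dual regularizers" without defining them, so the feature-matching smooth-L1 penalty below is an assumption made for illustration, as are all function names, argument names, and the `reg_weight` parameter. This is a minimal pure-Python sketch, not the authors' implementation.

```python
import math


def cross_entropy(logits, target_index):
    """Negative log-likelihood of the target class under a softmax."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target_index]


def smooth_l1(predicted, target):
    """Mean smooth-L1 distance between two equally sized feature vectors."""
    total = 0.0
    for p, t in zip(predicted, target):
        diff = abs(p - t)
        total += 0.5 * diff * diff if diff < 1.0 else diff - 0.5
    return total / len(predicted)


def dual_training_loss(answer_logits, answer_index,
                       question_token_logits, question_token_indices,
                       predicted_q_feat, encoded_q_feat,
                       predicted_a_feat, encoded_a_feat,
                       reg_weight=1.0):
    """VQA loss + VQG loss + an ASSUMED feature-matching dual regularizer."""
    # VQA: classify the answer given image and question.
    vqa_loss = cross_entropy(answer_logits, answer_index)
    # VQG: average per-token loss of the generated question.
    token_losses = [
        cross_entropy(logits, index)
        for logits, index in zip(question_token_logits, question_token_indices)
    ]
    vqg_loss = sum(token_losses) / len(token_losses)
    # Assumption: each task's predicted feature should match the encoded
    # feature of the other task's input.
    regularizer = (smooth_l1(predicted_q_feat, encoded_q_feat)
                   + smooth_l1(predicted_a_feat, encoded_a_feat))
    return vqa_loss + vqg_loss + reg_weight * regularizer
```

With this form, the regularizer is zero when the features predicted by one direction coincide with the features encoded for the other, so it only penalizes disagreement between the two tasks.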

PUBLICATION RECORD

Publication year: 2017
Venue: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
Publication date: 2017-09-21
Fields of study: Computer Science
Identifiers: DOI 10.1109/CVPR.2018.00640; arXiv 1709.07192
External record: Semantic Scholar
Source metadata: Semantic Scholar

CLAIMS

On the CLEVR and VQA2 datasets, iQAN improves MUTAN top-1 accuracy by 1.33% and 0.88% absolute, respectively.
Confidence 0.98

Extraction: 박진우 (dztg5apj7m). Review: 뀨 (7c402c1b98), B (s683577b42), AK (4715169a40).

After training, iQAN can accept either a question or an answer as input and produce the corresponding counterpart.
Confidence 0.93

Dual Training jointly learns visual question answering and visual question generation to model interactions among images, questions, and answers more effectively.
Confidence 0.95

iQAN combines visual question answering and visual question generation into a single invertible architecture built around an invertible bilinear fusion module and a parameter sharing scheme.
Confidence 0.97

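
The first claim reports gains in top-1 accuracy as absolute increases. The sketch below shows how top-1 accuracy is computed and how an absolute gain (in percentage points) differs from a relative one. The scores and labels are toy values invented for illustration, not results from the paper, and the function names are hypothetical.

```python
def top1_accuracy(scores, labels):
    """Fraction of examples whose highest-scoring answer equals the label."""
    correct = 0
    for row, label in zip(scores, labels):
        predicted = max(range(len(row)), key=row.__getitem__)  # argmax
        correct += predicted == label
    return correct / len(labels)


def absolute_gain(baseline_acc, new_acc):
    """Difference in percentage points (an absolute increase)."""
    return (new_acc - baseline_acc) * 100.0


def relative_gain(baseline_acc, new_acc):
    """Improvement expressed as a percentage of the baseline accuracy."""
    return (new_acc - baseline_acc) / baseline_acc * 100.0


# Toy scores over three candidate answers; NOT numbers from the paper.
labels = [0, 2, 1, 1]
baseline_scores = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2],
                   [0.1, 0.8, 0.1], [0.6, 0.3, 0.1]]
improved_scores = [[0.7, 0.2, 0.1], [0.2, 0.3, 0.5],
                   [0.1, 0.8, 0.1], [0.6, 0.3, 0.1]]

baseline_acc = top1_accuracy(baseline_scores, labels)
improved_acc = top1_accuracy(improved_scores, labels)
print(absolute_gain(baseline_acc, improved_acc))
print(relative_gain(baseline_acc, improved_acc))
```

The distinction matters when reading the claim: a 1.33% absolute increase means the accuracy itself rose by 1.33 percentage points, regardless of how high the baseline was.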

CONCEPTS

CLEVR and VQA2 datasets
datasets

The benchmark datasets used to evaluate the model on synthetic reasoning and natural-image visual question answering.

Aliases: CLEVR, VQA2

dual training
training framework

A joint optimization setup that trains question answering and question generation together as paired tasks.

invertible bilinear fusion module
module

A bilinear fusion component that can be applied in either direction, so that the same module serves both question answering and question generation.

iQAN
model

The Invertible Question Answering Network used to connect question answering and question generation in one model.

Aliases: Invertible Question Answering Network

MUTAN
baseline model

A visual question answering architecture based on bilinear fusion, used as the comparison baseline.

Aliases: MUTAN VQA method

parameter sharing scheme
scheme

A design that reuses parameters between the answering and generation paths in the model.

visual question answering
task

A task that predicts an answer from an image and a question.

Aliases: VQA

visual question generation
task

A task that generates a question from an image and an answer.

Aliases: VQG

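
One way to picture the "invertible bilinear fusion module" and "parameter sharing scheme" concepts above is a single three-way weight tensor used in both directions: contracted with question and image features to produce an answer feature (VQA), and contracted with answer and image features to produce a question feature (VQG). The sketch below assumes this shared-tensor form purely for illustration; the names `core`, `answer_from_question`, and `question_from_answer` are hypothetical, and the reverse direction here is a transposed use of the same weights rather than an exact mathematical inverse.

```python
def answer_from_question(core, q_feat, v_feat):
    """VQA direction: a[k] = sum over i, j of core[i][j][k] * q[i] * v[j]."""
    dim_a = len(core[0][0])
    return [
        sum(core[i][j][k] * q_feat[i] * v_feat[j]
            for i in range(len(q_feat)) for j in range(len(v_feat)))
        for k in range(dim_a)
    ]


def question_from_answer(core, a_feat, v_feat):
    """VQG direction: q[i] = sum over j, k of core[i][j][k] * a[k] * v[j]."""
    return [
        sum(core[i][j][k] * a_feat[k] * v_feat[j]
            for j in range(len(v_feat)) for k in range(len(a_feat)))
        for i in range(len(core))
    ]


def dot(x, y):
    """Inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(x, y))


# Hypothetical toy sizes: 2-d question, 3-d image, 2-d answer features.
core = [
    [[1.0, 0.0], [0.5, -1.0], [0.0, 2.0]],
    [[-0.5, 1.0], [0.0, 0.5], [1.5, 0.0]],
]
q = [1.0, 2.0]
v = [0.5, -1.0, 2.0]
a = [3.0, -1.0]

a_hat = answer_from_question(core, q, v)  # same weights, VQA direction
q_hat = question_from_answer(core, a, v)  # same weights, VQG direction
```

Because both directions read the same `core` tensor, a gradient step taken for one task changes the other task's mapping as well, which is the practical effect of sharing parameters between the answering and generation paths.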
