Weak Supervision helps Emergence of Word-Object Alignment and improves Vision-Language Tasks

Corentin Kervadec, G. Antipov, M. Baccouche, Christian Wolf

Published 2019 in European Conference on Artificial Intelligence

ABSTRACT

The wide adoption of self-attention (i.e., the transformer model) and BERT-like training principles has recently resulted in a number of high-performing models on a wide range of vision-and-language problems (such as Visual Question Answering (VQA), image retrieval, etc.). In this paper we claim that these State-Of-The-Art (SOTA) approaches perform reasonably well in structuring information inside a single modality but, despite their impressive performance, tend to struggle to identify fine-grained inter-modality relationships. Indeed, such relations are frequently assumed to be implicitly learned during training from application-specific losses, mostly cross-entropy for classification. While most recent works provide inductive bias for inter-modality relationships via cross-attention modules, in this work we demonstrate (1) that the latter assumption does not hold, i.e., modality alignment does not necessarily emerge automatically, and (2) that adding weak supervision for alignment between visual objects and words improves the quality of the learned models on tasks requiring reasoning. In particular, we integrate an object-word alignment loss into SOTA vision-language reasoning models and evaluate it on two tasks: VQA and Language-driven Comparison of Images. We show that the proposed fine-grained inter-modality supervision significantly improves performance on both tasks. Notably, this new learning signal allows obtaining SOTA-level performance on the GQA dataset (VQA task) with pre-trained models without finetuning on the task, and a new SOTA on the NLVR2 dataset (Language-driven Comparison of Images). Finally, we also illustrate the impact of the contribution on the models' reasoning by visualizing attention distributions.
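
The abstract does not specify the form of the object-word alignment loss, so the following is only a minimal sketch of one way such a weakly supervised signal could be computed. Everything in it is an assumption made for illustration, not the authors' method: the function names (`softmax`, `cosine`, `alignment_loss`), the use of cosine similarity between word and object embeddings, the `temperature` parameter, and the cross-entropy against a target distribution over objects. Words without a weak label (target `None`) are skipped, which is what makes the supervision "weak" in this sketch.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def alignment_loss(word_vecs, object_vecs, targets, temperature=0.1):
    """Hypothetical word-object alignment loss (illustrative assumption).

    word_vecs:   one embedding per word
    object_vecs: one embedding per detected visual object
    targets:     for each word, a distribution over objects, or None
                 when no weak alignment label exists for that word
    Returns the mean cross-entropy over the labeled words (0.0 if none).
    """
    losses = []
    for word, target in zip(word_vecs, targets):
        if target is None:
            continue  # unlabeled word: contributes no alignment signal
        scores = [cosine(word, obj) / temperature for obj in object_vecs]
        probs = softmax(scores)
        losses.append(-sum(t * math.log(p + 1e-12) for t, p in zip(target, probs)))
    return sum(losses) / len(losses) if losses else 0.0


# Toy usage: two words, two objects, embeddings already well aligned.
words = [[1.0, 0.0], [0.0, 1.0]]
objects = [[1.0, 0.0], [0.0, 1.0]]
matched = alignment_loss(words, objects, [[1.0, 0.0], [0.0, 1.0]])
mismatched = alignment_loss(words, objects, [[0.0, 1.0], [1.0, 0.0]])
```

In a setup like the one the abstract describes, a term of this kind would be added to the task loss rather than replace it; here `matched` comes out much smaller than `mismatched`, since the embeddings agree with the first set of targets and contradict the second.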

PUBLICATION RECORD

Publication year
2019
Venue
European Conference on Artificial Intelligence
Publication date
2019-12-04
Fields of study
Computer Science
Identifiers
DOI 10.3233/FAIA200412 arXiv 1912.03063
Source metadata
Semantic Scholar

REFERENCES

Understand in 5 Minutes!? Skimming Famous Papers: Jacob Devlin et al.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
2020, cited by this paper
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
2019, influential reference
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
2019, cited by this paper
UNITER: Learning UNiversal Image-TExt Representations
2019, cited by this paper
Learning by Abstraction: The Neural State Machine
2019, influential reference
Deep Modular Co-Attention Networks for Visual Question Answering
2019, cited by this paper
Language-Conditioned Graph Networks for Relational Reasoning
2019, cited by this paper
GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering
2019, influential reference
BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection
2019, cited by this paper
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
2019, influential reference
Learning Conditioned Graph Structures for Interpretable Visual Question Answering
2018, cited by this paper
Object Level Visual Reasoning in Videos
2018, cited by this paper
Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding
2018, cited by this paper
A brief introduction to weakly supervised learning
2018, cited by this paper
Compositional Attention Networks for Machine Reasoning
2018, cited by this paper
Dynamic Fusion With Intra- and Inter-Modality Attention Flow for Visual Question Answering
2018, cited by this paper
Stacked Cross Attention for Image-Text Matching
2018, cited by this paper
A Corpus for Reasoning about Natural Language Grounded in Photographs
2018, influential reference
Attention is All you Need
2017, influential reference
Visual Question Generation as Dual Task of Visual Question Answering
2017, cited by this paper
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
2017, cited by this paper
A simple neural network module for relational reasoning
2017, cited by this paper
MUTAN: Multimodal Tucker Fusion for Visual Question Answering
2017, cited by this paper
FiLM: Visual Reasoning with a General Conditioning Layer
2017, cited by this paper
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
2016, influential reference
Semi-Supervised Classification with Graph Convolutional Networks
2016, cited by this paper
Graph-Structured Representations for Visual Question Answering
2016, cited by this paper
Layer Normalization
2016, cited by this paper
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
2016, influential reference
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
2016, cited by this paper
VQA: Visual Question Answering
2015, influential reference
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
2015, influential reference
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
2015, cited by this paper
Exploring Models and Data for Image Question Answering
2015, cited by this paper
Generating Images from Captions with Attention
2015, cited by this paper
Adam: A Method for Stochastic Optimization
2014, cited by this paper
GloVe: Global Vectors for Word Representation
2014, cited by this paper
Deep visual-semantic alignments for generating image descriptions
2014, cited by this paper
Microsoft COCO: Common Objects in Context
2014, cited by this paper
ReferItGame: Referring to Objects in Photographs of Natural Scenes
2014, cited by this paper

CITED BY

IntraCross: Cross-modality graph matching for intravascular sequence registration.
2026, cites this paper
What do MLLMs hear? Examining reasoning with text and sound components in Multimodal Large Language Models
2024, cites this paper
RG-SAN: Rule-Guided Spatial Awareness Network for End-to-End 3D Referring Expression Segmentation
2024, cites this paper
Webly Supervised Knowledge-Embedded Model for Visual Reasoning
2023, cites this paper
Knowledge-Embedded Mutual Guidance for Visual Reasoning
2023, cites this paper
That’s the Wrong Lung! Evaluating and Improving the Interpretability of Unsupervised Multimodal Encoders for Medical Data
2022, cites this paper
Does Structural Attention Improve Compositional Representations in Vision-Language Models?
2022, cites this paper
An experimental study of the vision-bottleneck in VQA
2022, cites this paper
Unsupervised and Pseudo-Supervised Vision-Language Alignment in Visual Dialog
2022, cites this paper
Supervising the Transfer of Reasoning Patterns in VQA
2021, cites this paper
Weakly Supervised Relative Spatial Reasoning for Visual Question Answering
2021, cites this paper
Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision
2021, cites this paper
Contrastive Learning for Natural Language-Based Vehicle Retrieval
2021, cites this paper
How Transferable are Reasoning Patterns in VQA?
2021, cites this paper
VisQA: X-raying Vision and Language Reasoning in Transformers
2021, cites this paper
Complexité de l’échantillon et systèmes de questions réponses visuelles
year unknown, cites this paper