ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

Jiasen Lu, Dhruv Batra, Devi Parikh, Stefan Lee

Published 2019 in Neural Information Processing Systems

ABSTRACT

We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers. We pretrain our model through two proxy tasks on the large, automatically collected Conceptual Captions dataset and then transfer it to multiple established vision-and-language tasks -- visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval -- by making only minor additions to the base architecture. We observe significant improvements across tasks compared to existing task-specific models -- achieving state-of-the-art on all four tasks. Our work represents a shift away from learning groundings between vision and language only as part of task training and towards treating visual grounding as a pretrainable and transferable capability.
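
The co-attentional interaction described in the abstract can be illustrated with a minimal sketch, in which each stream forms queries from its own features and attends over keys and values computed from the other stream. This is not the authors' implementation. The function names, the single attention head, the shared feature width `d`, the residual update, and the random weights are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(query_stream, context_stream, w_q, w_k, w_v):
    # Single-head attention (assumption): queries come from one stream,
    # keys and values come from the other stream.
    q = query_stream @ w_q
    k = context_stream @ w_k
    v = context_stream @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v, weights

def co_attention(visual, text, params):
    # Each stream is updated with information gathered from the other.
    vis_ctx, vis_w = cross_attend(visual, text, *params["vis"])
    txt_ctx, txt_w = cross_attend(text, visual, *params["txt"])
    # Residual update (assumption); both streams share width d here.
    return visual + vis_ctx, text + txt_ctx, vis_w, txt_w

rng = np.random.default_rng(0)
d = 8                                # hypothetical shared feature width
visual = rng.normal(size=(5, d))     # 5 image-region features (made up)
text = rng.normal(size=(7, d))       # 7 token features (made up)
params = {
    name: [rng.normal(size=(d, d)) * 0.1 for _ in range(3)]
    for name in ("vis", "txt")
}
new_visual, new_text, vis_w, txt_w = co_attention(visual, text, params)
```

In this sketch `vis_w` has one row per image region and one column per token, so each region's attention weights over the text sum to one. `txt_w` is the mirror case for tokens attending over regions.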

PUBLICATION RECORD

Publication year: 2019
Venue: Neural Information Processing Systems
Publication date: 2019-08-06
Fields of study: Linguistics, Computer Science
Identifiers: arXiv 1908.02265
Source metadata: Semantic Scholar


