Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

Gen Li, Nan Duan, Yuejian Fang, Daxin Jiang, Ming Zhou

Published 2019 in AAAI Conference on Artificial Intelligence

ABSTRACT

We propose Unicoder-VL, a universal encoder that aims to learn joint representations of vision and language through pre-training. Borrowing ideas from cross-lingual pre-trained models such as XLM (Lample and Conneau 2019) and Unicoder (Huang et al. 2019), we feed both visual and linguistic content into a multi-layer Transformer (Vaswani et al. 2017) for cross-modal pre-training, using three pre-training tasks: Masked Language Modeling (MLM), Masked Object Classification (MOC), and Visual-linguistic Matching (VLM). The first two tasks learn context-aware representations for input tokens based on linguistic and visual content jointly. The last task predicts whether an image and a text describe each other. After pre-training on large-scale image-caption pairs, we transfer Unicoder-VL to caption-based image-text retrieval and visual commonsense reasoning with just one additional output layer. We achieve state-of-the-art or comparable results on both tasks, demonstrating the effectiveness of cross-modal pre-training.
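
The three pre-training tasks named in the abstract can be illustrated with a minimal data-preparation sketch. It is not the authors' implementation: the function names, the `[MASK]` symbol, the 15% masking rate (borrowed from BERT), the 50% negative-sampling rate, and the use of detector class labels to stand in for image regions are all assumptions made for illustration. The sketch only builds inputs and targets; the Transformer that would consume them is omitted.

```python
import random

def mask_sequence(items, mask_symbol="[MASK]", mask_prob=0.15, rng=random):
    """Hide a random subset of items and record what the model must recover.

    Applied to word tokens this prepares MLM inputs; applied to region class
    labels it prepares MOC inputs. mask_prob=0.15 is an assumption.
    """
    masked, targets = [], []
    for item in items:
        if rng.random() < mask_prob:
            masked.append(mask_symbol)
            targets.append(item)   # scored position: predict the hidden item
        else:
            masked.append(item)
            targets.append(None)   # unscored position
    return masked, targets

def make_matching_example(pairs, index, negative_prob=0.5, rng=random):
    """Build one VLM example as (image, caption, label).

    label 1 means the caption belongs to the image; label 0 means the caption
    was swapped in from a different pair. negative_prob=0.5 is an assumption.
    """
    image, caption = pairs[index]
    if len(pairs) > 1 and rng.random() < negative_prob:
        other = rng.choice([i for i in range(len(pairs)) if i != index])
        return image, pairs[other][1], 0
    return image, caption, 1

if __name__ == "__main__":
    rng = random.Random(0)
    pairs = [("img0", "a dog runs"), ("img1", "a cat sleeps"), ("img2", "a red car")]
    print(mask_sequence("a dog runs on the grass".split(), rng=rng))      # MLM
    print(mask_sequence(["dog", "grass", "sky"], "[MASK_REGION]", rng=rng))  # MOC
    print(make_matching_example(pairs, 1, rng=rng))                        # VLM
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible, which is convenient when checking the masking and matching logic by hand.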

PUBLICATION RECORD

Publication year: 2019
Venue: AAAI Conference on Artificial Intelligence
Publication date: 2019-08-16
Fields of study: Linguistics, Computer Science
Identifiers: DOI 10.1609/AAAI.V34I07.6795; arXiv 1908.06066
Source metadata: Semantic Scholar

REFERENCES

Cross-lingual Language Model Pretraining
2019, cited by this paper
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
2019, influential reference
Fusion of Detected Objects in Text for Visual Question Answering
2019, cited by this paper
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
2019, influential reference
RoBERTa: A Robustly Optimized BERT Pretraining Approach
2019, cited by this paper
Position Focused Attention Network for Image-Text Matching
2019, cited by this paper
XLNet: Generalized Autoregressive Pretraining for Language Understanding
2019, influential reference
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
2019, influential reference
VideoBERT: A Joint Model for Video and Language Representation Learning
2019, cited by this paper
UNITER: Learning UNiversal Image-TExt Representations
2019, influential reference
VisualBERT: A Simple and Performant Baseline for Vision and Language
2019, influential reference
Unicoder: A Universal Language Encoder by Pre-training with Multiple Cross-lingual Tasks
2019, cited by this paper
Knowledge Aware Semantic Concept Expansion for Image-Text Matching
2019, cited by this paper
Improving Language Understanding by Generative Pre-Training
2018, cited by this paper
Deep Contextualized Word Representations
2018, cited by this paper
Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning
2018, influential reference
Pythia-A platform for vision & language research
2018, cited by this paper
Stacked Cross Attention for Image-Text Matching
2018, cited by this paper
From Recognition to Cognition: Visual Commonsense Reasoning
2018, cited by this paper
VSE++: Improving Visual-Semantic Embeddings with Hard Negatives
2017, cited by this paper
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
2017, cited by this paper
Learning Semantic Concepts and Order for Image and Sentence Matching
2017, cited by this paper
Attention is All you Need
2017, cited by this paper
Dual-Path Convolutional Image-Text Embedding
2017, cited by this paper
Dual-path Convolutional Image-Text Embeddings with Instance Loss
2017, cited by this paper
VSE++: Improved Visual-Semantic Embeddings
2017, cited by this paper
SQuAD: 100,000+ Questions for Machine Comprehension of Text
2016, cited by this paper
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
2016, cited by this paper
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
2016, cited by this paper
A large annotated corpus for learning natural language inference
2015, cited by this paper
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
2015, cited by this paper
Learning Deep Structure-Preserving Image-Text Embeddings
2015, cited by this paper
Multimodal Convolutional Neural Networks for Matching Image and Sentence
2015, cited by this paper
VQA: Visual Question Answering
2015, cited by this paper
Microsoft COCO Captions: Data Collection and Evaluation Server
2015, influential reference
Deep Residual Learning for Image Recognition
2015, cited by this paper
Deep visual-semantic alignments for generating image descriptions
2014, influential reference
From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions
2014, cited by this paper
Very Deep Convolutional Networks for Large-Scale Image Recognition
2014, cited by this paper
Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
2013, cited by this paper
Im2Text: Describing Images Using 1 Million Captioned Photographs
2011, cited by this paper
ImageNet: A large-scale hierarchical image database
2009, influential reference
Networks
2007, cited by this paper
of the Association for Computational Linguistics
year unknown, cited by this paper

CITED BY

Sematrack: semantic-driven unified vision-language tracking
2026, cites this paper
Beyond One and Two Tower: Cross-Modal Consensus Learning for Image-Text Retrieval
2026, cites this paper
Multimodal knowledge-guided method for power transmission line fault detection using a vision-language model
2026, cites this paper
Enhance multi-modal structured representations with open information extraction
2026, cites this paper
Fine-Grained Disentanglement for Alleviating Inconsistencies in Cross-Modal Hashing Retrieval
2026, cites this paper
Mathematical Frameworks in Image Captioning: A Comprehensive Survey and Real-Time Processing Analysis
2026, cites this paper
Adversarial supervised contrastive feature learning for cross-modal retrieval
2026, cites this paper
Fine-Grained Multimodal Alignment for Image-Text Retrieval via Graph Learning
2026, cites this paper
Referred Segmentation on Single/ No Target Image
2025, cites this paper
Similarity Shuffled Criss-Cross Transformer With Angle Loss for Image-Text Matching
2025, cites this paper
MAMF-Net: Modality-Adaptive Masked Fusion Network for Speech Emotion Recognition
2025, cites this paper
MultiModal craniocerebral diagnose based on 3D CT and image reports
2025, cites this paper
Precision at scale: Domain-specific datasets on-demand
2025, cites this paper
Enhancing Intermodal Interaction for Unified Vision-Language Understanding and Generation
2025, cites this paper
Generating Vision-Language Navigation Instructions Incorporated Fine-Grained Alignment Annotations
2025, cites this paper
Emergent Corpus Pre-training Benefits Vision Language Models
2025, cites this paper
MDVL-Edit: Mask-assisted highly disentangled text-driven face image editing based on vision-language alignment
2025, cites this paper
Anatomy-guided slice-description interaction for multimodal brain disease diagnosis based on 3D image and radiological report
2025, cites this paper
Transformers in speech processing: Overcoming challenges and paving the future
2025, cites this paper
Beam-Guided Knowledge Replay for Knowledge-Rich Image Captioning using Vision-Language Model
2025, cites this paper
Stepping Out of Similar Semantic Space for Open-Vocabulary Segmentation
2025, cites this paper
HPformer: Low-Parameter Transformer With Temporal Dependency Hierarchical Propagation for Health Informatics
2025, cites this paper
Multimodal named entity recognition in the era of large pre-trained models: A comprehensive survey
2025, cites this paper
Masked Diffusion Captioning for Visual Feature Learning
2025, cites this paper
ScenarioCLIP: Pretrained Transferable Visual Language Models and Action-Genome Dataset for Natural Scene Analysis
2025, cites this paper
Liver Disease Classification in Ultrasound Image with Vision-Language Pretraining
2025, cites this paper
A Cross-Modal Information Retrieval Framework Based on an Interactive Encoder and Re-Ranking Algorithm
2025, cites this paper
Research on Understanding and Improving Deep Hashing Retrieval under Incremental Data
2025, cites this paper
SST-LLM: time series forecasting based on large language models
2025, cites this paper
A Survey of Task-Oriented Knowledge Graph Reasoning: Status, Applications, and Prospects
2025, cites this paper
Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation
2025, cites this paper
Hierarchical Vision–Language Pre-Training with Freezing Strategy for Multi-Level Semantic Alignment
2025, cites this paper
Stress Detection using Multimodal Representation Learning, Fusion Techniques, and Applications
2025, cites this paper
GENIUS: A Generative Framework for Universal Multimodal Search
2025, cites this paper
Multi-granularity relation-aware and conditional query learning for text-based person search
2025, cites this paper
Composed Multi-modal Retrieval: A Survey of Approaches and Applications
2025, cites this paper
Optimal Transport for Brain-Image Alignment: Unveiling Redundancy and Synergy in Neural Information Processing
2025, cites this paper
Injecting Multimodal Information Into Pre-Trained Language Model for Multimodal Sentiment Analysis
2025, cites this paper
Large Vison-Language Foundation Model in Baidu AIGC Image Advertising
2025, cites this paper
A Electric Power Scene Object Detection Algorithm Based on Meta-learning and Category Decoupling
2025, cites this paper
Research on Digital Media Art for Image Caption Generation Based on Integrated Transformer Models in CLIP
2025, cites this paper
CLIP-driven attention network for multimodal sentiment analysis
2025, influential citation
Manager: Aggregating Insights From Unimodal Experts in Two-Tower VLMs and MLLMs
2025, cites this paper
VCRMNER: Visual Cue Refinement in Multimodal NER using CLIP Prompts
2025, cites this paper
T2VWorldBench: A Benchmark for Evaluating World Knowledge in Text-to-Video Generation
2025, cites this paper
Can VLMs Actually See and Read? A Survey on Modality Collapse in Vision-Language Models
2025, cites this paper
Exploring the Enhancement of Transferability of Multimodal Adversarial Examples in Vision-Language Pretraining Models
2025, cites this paper
SATrack: Semantic-Aware Alignment Framework for Visual–Language Tracking
2025, cites this paper
Large-Small Model Synergy with Multimodal Fine-Grained Heuristics for Knowledge-Based Visual Question Answering
2025, cites this paper
Understanding generative AI to harness its potentials and minimize risks: A perspective.
2025, cites this paper
Cross modal recipe retrieval with fine grained modal interaction
2025, cites this paper
TIETracker: A CLIP-based RGB-T Tracking via Feature Interaction and Semantic Enhancement
2025, cites this paper
MKFTracker: An RGBT tracker via multimodal knowledge embedding and feature interaction
2025, cites this paper
Purify Then Guide: A Bi-Directional Bridge Network for Open-Vocabulary Semantic Segmentation
2025, cites this paper
Unleash the Power of Vision-Language Models by Visual Attention Prompt and Multimodal Interaction
2025, cites this paper
Enhanced cross-modal parallel training for improving end-to-end accented speech recognition
2025, cites this paper
Vision-Centric Activation and Coordination for Multimodal Large Language Models
2025, cites this paper
A Comprehensive Review of Multimodal Visual Representation Learning: Tracing the Evolution from CNNs to Transformers and Beyond
2025, cites this paper
Rethinking the Global Knowledge of CLIP in Training-Free Open-Vocabulary Semantic Segmentation
2025, cites this paper
Cross-modal event extraction based on Adaptive Feature Selection and Semantic-Aware Graph
2025, cites this paper
Prototype-Guided Multilayer Alignment Network for Few-Shot Object Detection in Remote Sensing
2025, cites this paper
Composed Query-Based Event Retrieval in Video Corpus with Multimodal Episodic Perceptron
2025, cites this paper
Dynamic semantic prototype perception for text-video retrieval
2025, cites this paper
UPL-Net: Uncertainty-aware prompt learning network for semi-supervised action recognition
2025, cites this paper
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
2025, cites this paper
DreamMask: Boosting Open-vocabulary Panoptic Segmentation with Synthetic Data
2025, cites this paper
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
2025, cites this paper
Instruction-guided path planning with 3D semantic maps for vision-language navigation
2025, cites this paper
Aesthetic multi-attributes network for image captioning
2025, cites this paper
UniTrans: Unified Parameter-Efficient Transfer Learning and Multimodal Alignment for Large Multimodal Foundation Model
2025, cites this paper
ACF-R+: An asymmetry-sensitive method for image-text retrieval enhanced by cross-modal fusion and re-ranking based on contrastive learning
2025, cites this paper
Advancements in Large-Scale Image and Text Representation Learning: A Comprehensive Review and Outlook
2025, cites this paper
Novel cross-dimensional coarse-fine-grained complementary network for image-text matching
2025, cites this paper
CGViT: Cross-image GroupViT for zero-shot semantic segmentation
2025, cites this paper
Distilling vision-language pre-training models with modality-specific meta-learning
2025, cites this paper
Boosting Movie and TV Tag Accuracy with Knowledge Graphs
2025, cites this paper
VLSG-net: Vision-Language Scene Graphs network for Paragraph Video Captioning
2025, cites this paper
Local-Level Feature Aggregation with Attribute Anchors for Text-Guided Image Retrieval
2025, cites this paper
Multimodal Transformer for Indonesian Image Text Retrieval
2025, cites this paper
Cross-Aligned Fusion For Multimodal Understanding
2025, cites this paper
FocalLens: Instruction Tuning Enables Zero-Shot Conditional Image Representations
2025, cites this paper
Affect and Personality Aided Modeling of Transcribed Speech for Depression Severity Estimation
2025, cites this paper
Enhancing Surgical Documentation through Multimodal Visual-Temporal Transformers and Generative AI
2025, cites this paper
High-precision large-aperture single-frame interferometric surface profile measurement method based on deep learning
2025, cites this paper
Large language model augmentation technology for intelligent software development
2025, cites this paper
A Multimodal Large Language Model Framework for Intelligent Perception and Decision-Making in Smart Manufacturing
2025, cites this paper
VADS: Visuo-Adaptive DualStrike attack on visual question answer
2024, cites this paper
Image-text multimodal classification via cross-attention contextual transformer with modality-collaborative learning
2024, cites this paper
VidLPRO: A Video-Language Pre-training Framework for Robotic and Laparoscopic Surgery
2024, cites this paper
Explaining Caption-Image Interactions in CLIP Models with Second-Order Attributions
2024, cites this paper
Vision Semantics Image Captioner
2024, cites this paper
An end-to-end image-text matching approach considering semantic uncertainty
2024, cites this paper
Applying machine learning to optical metrology: a review
2024, cites this paper
Efficient and Versatile Robust Fine-Tuning of Zero-shot Models
2024, cites this paper
Bridging Modalities: A Survey of Cross-Modal Image-Text Retrieval
2024, cites this paper
Learn to Rectify the Bias of CLIP for Unsupervised Semantic Segmentation
2024, cites this paper
Leveraging Customer Feedback for Multi-modal Insight Extraction
2024, cites this paper
Integrating Vision-Language Semantic Graphs in Multi-View Clustering
2024, cites this paper
FlexAttention for Efficient High-Resolution Vision-Language Models
2024, cites this paper
A Survey on Integrated Sensing, Communication, and Computation
2024, cites this paper