Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision

Published 2021 in International Journal of Computer Vision

ABSTRACT

Transformer architectures have brought about fundamental changes to the field of computational linguistics, which had been dominated by recurrent neural networks for many years. Their success also implies drastic changes in cross-modal tasks involving language and vision, and many researchers have already taken up the problem. In this paper, we review some of the most critical milestones in the field, as well as overall trends in how the transformer architecture has been incorporated into visuolinguistic cross-modal tasks. Furthermore, we discuss its current limitations and speculate on some of the prospects that we find imminent.

PUBLICATION RECORD

Publication year
2021
Venue
International Journal of Computer Vision
Publication date
2021-03-06
Fields of study
Linguistics, Computer Science
Identifiers
DOI: 10.1007/s11263-021-01547-8; arXiv: 2103.04037
Source metadata
Semantic Scholar

