Multimodal Few-Shot Learning with Frozen Language Models

M. Tsimpoukelli, Jacob Menick, Serkan Cabi, S. Eslami, O. Vinyals, Felix Hill, Zacharias Janssen

Published 2021 in Neural Information Processing Systems

ABSTRACT

When trained at sufficient scale, auto-regressive language models exhibit the notable ability to learn a new language task after being prompted with just a few examples. Here, we present a simple, yet effective, approach for transferring this few-shot learning ability to a multimodal setting (vision and language). Using aligned image and caption data, we train a vision encoder to represent each image as a sequence of continuous embeddings, such that a pre-trained, frozen language model prompted with this prefix generates the appropriate caption. The resulting system is a multimodal few-shot learner, with the surprising ability to learn a variety of new tasks when conditioned on examples, represented as a sequence of multiple interleaved image and text embeddings. We demonstrate that it can rapidly learn words for new objects and novel visual categories, do visual question-answering with only a handful of examples, and make use of outside knowledge, by measuring a single model on a variety of established and new benchmarks.
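
The prompt construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy linear "vision encoder", the embedding width, the prefix length, the tiny vocabulary, and the placeholder words "dax" and "blicket" are all assumptions made for illustration, and no language model is actually run. The sketch only shows how each image becomes a short sequence of continuous embeddings that is interleaved with frozen text embeddings to form a few-shot prompt.

```python
import random

D_MODEL = 8    # assumed embedding width of the frozen language model
N_PREFIX = 2   # assumed number of embeddings emitted per image

random.seed(0)

# Frozen text embedding table (hypothetical vocabulary). Never updated.
VOCAB = ["this", "is", "a", "dax", "blicket"]
FROZEN_EMBED = {tok: [random.gauss(0, 1) for _ in range(D_MODEL)] for tok in VOCAB}


class ToyVisionEncoder:
    """Trainable stand-in for the vision encoder: one linear map."""

    def __init__(self, n_pixels):
        self.w = [[random.gauss(0, 0.1) for _ in range(n_pixels)]
                  for _ in range(N_PREFIX * D_MODEL)]

    def __call__(self, image):
        flat = [sum(w * x for w, x in zip(row, image)) for row in self.w]
        # Split the output into N_PREFIX vectors of width D_MODEL.
        return [flat[i * D_MODEL:(i + 1) * D_MODEL] for i in range(N_PREFIX)]


def embed_text(text):
    """Look up frozen embeddings for whitespace-separated tokens."""
    return [FROZEN_EMBED[tok] for tok in text.split()]


def build_prompt(encoder, shots, query_image, query_text=""):
    """Interleave image and text embeddings: img, txt, img, txt, ..., query."""
    seq = []
    for image, text in shots:
        seq += encoder(image) + embed_text(text)
    seq += encoder(query_image) + embed_text(query_text)
    return seq


encoder = ToyVisionEncoder(n_pixels=4)
shots = [([0.1, 0.2, 0.3, 0.4], "this is a dax"),
         ([0.9, 0.8, 0.7, 0.6], "this is a blicket")]
prompt = build_prompt(encoder, shots, [0.1, 0.2, 0.3, 0.5], "this is a")
print(len(prompt), len(prompt[0]))  # 17 positions, each of width 8
```

In a real system the resulting sequence would be fed to the frozen language model, and only the vision encoder's parameters (here `encoder.w`) would receive gradient updates from the captioning loss.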

PUBLICATION RECORD

Publication year: 2021
Venue: Neural Information Processing Systems
Publication date: 2021-06-25
Fields of study: Computer Science
Identifiers: arXiv 2106.13884
Source metadata: Semantic Scholar


