ImageBind: One Embedding Space to Bind Them All

Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra

Published 2023 in Computer Vision and Pattern Recognition

ABSTRACT

We present ImageBind, an approach to learn a joint embedding across six different modalities: images, text, audio, depth, thermal, and IMU data. We show that all combinations of paired data are not necessary to train such a joint embedding, and only image-paired data is sufficient to bind the modalities together. ImageBind can leverage recent large-scale vision-language models, and extends their zero-shot capabilities to new modalities just by using their natural pairing with images. It enables novel emergent applications ‘out-of-the-box’, including cross-modal retrieval, composing modalities with arithmetic, cross-modal detection, and generation. The emergent capabilities improve with the strength of the image encoder, and we set a new state-of-the-art on emergent zero-shot recognition tasks across modalities, outperforming specialist supervised models. Finally, we show strong few-shot recognition results outperforming prior work, and that ImageBind serves as a new way to evaluate vision models for visual and non-visual tasks.
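
To make "composing modalities with arithmetic" and "cross-modal retrieval" concrete, here is a minimal sketch of the idea. It is not the authors' implementation. It assumes every modality has already been encoded into one shared vector space, and the three-dimensional vectors, the gallery names, and the helper functions `normalize`, `cosine`, `compose`, and `retrieve` are all hypothetical, chosen only for illustration.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def compose(a, b):
    """Combine two embeddings by adding their unit vectors and renormalizing."""
    return normalize([x + y for x, y in zip(normalize(a), normalize(b))])

def retrieve(query, gallery):
    """Rank gallery keys from most to least similar to the query embedding."""
    return sorted(gallery, key=lambda k: cosine(query, gallery[k]), reverse=True)

# Hypothetical toy embeddings standing in for encoder outputs.
image_bird = [1.0, 0.0, 0.0]    # assumed embedding of a bird photo
audio_waves = [0.0, 1.0, 0.0]   # assumed embedding of a wave sound clip
gallery = {
    "bird_in_forest": [0.9, 0.1, 0.2],
    "bird_at_sea": [0.7, 0.7, 0.0],
    "empty_beach": [0.1, 0.9, 0.1],
}

# Image + audio query: the gallery item sharing both concepts should rank first.
query = compose(image_bird, audio_waves)
ranking = retrieve(query, gallery)
print(ranking[0])
```

In this toy setup the composed query lies between the two source embeddings, so the gallery item that shares both concepts scores highest. With real encoders the vectors would have hundreds of dimensions, but the ranking step would look the same.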

PUBLICATION RECORD

Publication year: 2023
Venue: Computer Vision and Pattern Recognition
Publication date: 2023-05-09
Fields of study: Computer Science
Identifiers: DOI 10.1109/CVPR52729.2023.01457; arXiv 2305.05665
Source metadata: Semantic Scholar

REFERENCES

Detecting Twenty-thousand Classes using Image-level Supervision
2022, influential reference
Reproducible Scaling Laws for Contrastive Language-Image Learning
2022, influential reference
LAION-5B: An open large-scale dataset for training next generation image-text models
2022, cited by this paper
CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Alignment
2022, cited by this paper
Exploring Target Representations for Masked Autoencoders
2022, cited by this paper
BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers
2022, cited by this paper
Frozen CLIP Models are Efficient Video Learners
2022, cited by this paper
Masked Autoencoders that Listen
2022, influential reference
OmniMAE: Single Model Masked Pretraining on Images and Videos
2022, cited by this paper
CLIP-Actor: Text-Driven Recommendation and Stylization for Animating Human Meshes
2022, cited by this paper
Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation
2022, cited by this paper
Masked Autoencoders As Spatiotemporal Learners
2022, cited by this paper
CoCa: Contrastive Captioners are Image-Text Foundation Models
2022, cited by this paper
Flamingo: a Visual Language Model for Few-Shot Learning
2022, cited by this paper
Hierarchical Text-Conditional Image Generation with CLIP Latents
2022, influential reference
MultiMAE: Multi-modal Multi-task Masked Autoencoders
2022, cited by this paper
Learning Audio-Video Modalities from Image Captions
2022, influential reference
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
2022, cited by this paper
HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection
2022, cited by this paper
Omnivore: A Single Model for Many Visual Modalities
2022, cited by this paper
Revisiting Weakly Supervised Pre-Training of Visual Perception Models
2022, cited by this paper
Multiview Transformers for Video Recognition
2022, cited by this paper
Language-driven Semantic Segmentation
2022, cited by this paper
Efficient Training of Audio Transformers with Patchout
2021, cited by this paper
Learning Transferable Visual Models From Natural Language Supervision
2021, cited by this paper
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
2021, cited by this paper
Slow-Fast Auditory Streams for Audio Recognition
2021, cited by this paper
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
2021, cited by this paper
AST: Audio Spectrogram Transformer
2021, cited by this paper
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval
2021, cited by this paper
Emerging Properties in Self-Supervised Vision Transformers
2021, cited by this paper
Audio Retrieval with Natural Language Queries
2021, cited by this paper
CLIP2Video: Mastering Video-Text Retrieval via Image CLIP
2021, cited by this paper
AudioCLIP: Extending CLIP to Image, Text and Audio
2021, influential reference
Attention Bottlenecks for Multimodal Fusion
2021, cited by this paper
LLVIP: A Visible-infrared Paired Dataset for Low-light Vision
2021, cited by this paper
Open-vocabulary Object Detection via Vision and Language Knowledge Distillation
2021, cited by this paper
Ego4D: Around the World in 3,000 Hours of Egocentric Video
2021, cited by this paper
LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs
2021, cited by this paper
LiT: Zero-Shot Transfer with Locked-image text Tuning
2021, cited by this paper
Florence: A New Foundation Model for Computer Vision
2021, cited by this paper
PolyViT: Co-training Vision Transformers on Images, Videos and Audio
2021, cited by this paper
BEVT: BERT Pretraining of Video Transformers
2021, cited by this paper
PointCLIP: Point Cloud Understanding by CLIP
2021, cited by this paper
VGGSound: A Large-Scale Audio-Visual Dataset
2020, influential reference
Multi-modal Self-Supervision from Generalized Data Transformations
2020, cited by this paper
Training data-efficient image transformers & distillation through attention
2020, cited by this paper
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
2020, cited by this paper
Support-set bottlenecks for video-text representation learning
2020, cited by this paper
A Simple Framework for Contrastive Learning of Visual Representations
2020, cited by this paper
Self-Supervised MultiModal Versatile Networks
2020, cited by this paper
Audio-Visual Instance Discrimination with Cross-Modal Agreement
2020, cited by this paper
Contrastive Multiview Coding
2019, influential reference
Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer
2019, cited by this paper
End-to-End Learning of Visual Representations From Uncurated Instructional Videos
2019, cited by this paper
Randaugment: Practical automated data augmentation with a reduced search space
2019, cited by this paper
AudioCaps: Generating Captions for Audios in The Wild
2019, cited by this paper
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
2019, cited by this paper
Unsupervised Feature Learning via Non-parametric Instance Discrimination
2018, cited by this paper
Representation Learning with Contrastive Predictive Coding
2018, influential reference
Exploring the Limits of Weakly Supervised Pretraining
2018, cited by this paper
Audio-Visual Scene Analysis with Self-Supervised Multisensory Features
2018, cited by this paper
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
2017, cited by this paper
Attention is All you Need
2017, cited by this paper
Look, Listen and Learn
2017, cited by this paper
VSE++: Improving Visual-Semantic Embeddings with Hard Negatives
2017, cited by this paper
Unsupervised Machine Translation Using Monolingual Corpora Only
2017, cited by this paper
Random Erasing Data Augmentation
2017, cited by this paper
The Kinetics Human Action Video Dataset
2017, cited by this paper
Audio Set: An ontology and human-labeled dataset for audio events
2017, cited by this paper
Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation
2016, cited by this paper
MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
2016, cited by this paper
Deep Networks with Stochastic Depth
2016, cited by this paper
ESC: Dataset for Environmental Sound Classification
2015, cited by this paper
SUN RGB-D: A RGB-D scene understanding benchmark suite
2015, cited by this paper
Learning Visual Features from Large Weakly Supervised Data
2015, cited by this paper
Learning Rich Features from RGB-D Images for Object Detection and Segmentation
2014, cited by this paper
Learning Deep Features for Scene Recognition using Places Database
2014, cited by this paper
Grounded Compositional Semantics for Finding and Describing Images with Sentences
2014, cited by this paper
ImageNet Large Scale Visual Recognition Challenge
2014, cited by this paper
Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models
2014, cited by this paper
Freesound technical demo
2013, cited by this paper
DeViSE: A Deep Visual-Semantic Embedding Model
2013, cited by this paper
Perceptual Organization and Recognition of Indoor Scenes from RGB-D Images
2013, cited by this paper
Indoor Segmentation and Support Inference from RGBD Images
2012, cited by this paper
Dimensionality Reduction by Learning an Invariant Mapping
2006, cited by this paper

CITED BY

TherA: Thermal-Aware Visual-Language Prompting for Controllable RGB-to-Thermal Infrared Translation
2026, cites this paper
SeisBind: Physics-Aware Tri-Modal Representation Binding for Seismic Data via Contrastive Learning
2026, cites this paper
A Lightweight Radar–Camera Fusion Deep Learning Model for Human Activity Recognition
2026, cites this paper
JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation
2026, influential citation
Exploring Physical Intelligence Emergence via Omni-Modal Architecture and Physical Data Engine
2026, cites this paper
SPGDD-GPT: Image-Text-Driven Generic Defect Diagnosis Using a Self-Prompted Large Vision-Language Model
2026, cites this paper
AviationLMM: A Large Multimodal Foundation Model for Civil Aviation
2026, cites this paper
DR-VAD: Definition-guided reasoning for training-free video anomaly detection
2026, cites this paper
BrokenBind: Universal Modality Exploration beyond Dataset Boundaries
2026, influential citation
Time Series, Vision, and Language: Exploring the Limits of Alignment in Contrastive Representation Spaces
2026, cites this paper
Steering and Rectifying Latent Representation Manifolds in Frozen Multi-modal LLMs for Video Anomaly Detection
2026, cites this paper
MOVA: Towards Scalable and Synchronized Video-Audio Generation
2026, cites this paper
Omni-Judge: Can Omni-LLMs Serve as Human-Aligned Judges for Text-Conditioned Audio-Video Generation?
2026, influential citation
Closing the Modality Gap Aligns Group-Wise Semantics
2026, cites this paper
Omni2Sound: Towards Unified Video-Text-to-Audio Generation
2026, cites this paper
HMS2Net: Heterogeneous Multimodal State Space Network via CLIP for Dynamic Scene Classification in Livestreaming
2026, cites this paper
SpecBridge: Bridging Mass Spectrometry and Molecular Representations via Cross-Modal Alignment
2026, cites this paper
Token Entropy Regularization for Multi-modal Antenna Affiliation Identification
2026, cites this paper
Asymmetric Hierarchical Anchoring for Audio-Visual Joint Representation: Resolving Information Allocation Ambiguity for Robust Cross-Modal Generalization
2026, cites this paper
AnyThermal: Towards Learning Universal Representations for Thermal Perception
2026, influential citation
Towards Uniformity and Alignment for Multimodal Representation Learning
2026, cites this paper
MoBind: Motion Binding for Fine-Grained IMU-Video Pose Alignment
2026, cites this paper
RA-Det: Towards Universal Detection of AI-Generated Images via Robustness Asymmetry
2026, cites this paper
A Mixed Diet Makes DINO An Omnivorous Vision Encoder
2026, cites this paper
Art2Mus: Artwork-to-Music Generation via Visual Conditioning and Large-Scale Cross-Modal Alignment
2026, cites this paper
Towards Training-free Multimodal Hate Localisation with Large Language Models
2026, cites this paper
Hierarchical vision-language model with comprehensive language description for video anomaly detection
2026, cites this paper
Conditional Flow Matching for Visually-Guided Acoustic Highlighting
2026, cites this paper
CoVA: Text-Guided Composed Video Retrieval for Audio-Visual Content
2026, cites this paper
GMS-CAVP: Improving Audio-Video Correspondence with Multi-Scale Contrastive and Generative Pretraining
2026, cites this paper
Evaluating the Adversarial Robustness of Vision-Language Models via Internal Feature Perturbations
2026, cites this paper
StaProDyn: A unified framework for multimodal sentiment analysis with stability-aware filtering, prompt learning enhancement, and dynamic fusion
2026, cites this paper
A Vision for Multisensory Intelligence: Sensing, Science, and Synergy
2026, cites this paper
Model and Algorithms for Classifying Anomalous Phenomena based on the Convergence of Acoustic-Visual Signals
2026, cites this paper
Not all Blends are Equal: The BLEMORE Dataset of Blended Emotion Expressions with Relative Salience Annotations
2026, cites this paper
Acoustic Field Video for Multimodal Scene Understanding
2026, influential citation
SAM Audio Judge: A Unified Multimodal Framework for Perceptual Evaluation of Audio Separation
2026, influential citation
DKMap: Interactive Exploration of Vision-Language Alignment in Multimodal Embeddings via Dynamic Kernel Enhanced Projection
2026, cites this paper
Modality as Heterogeneity: Node Splitting and Graph Rewiring for Multimodal Graph Learning
2026, cites this paper
TextME: Bridging Unseen Modalities Through Text Descriptions
2026, influential citation
CyIN: Cyclic Informative Latent Space for Bridging Complete and Incomplete Multimodal Learning
2026, cites this paper
Generalist multimodal AI: A review of architectures, challenges and opportunities
2026, cites this paper
Efficient Table Retrieval and Understanding with Multimodal Large Language Models
2026, cites this paper
AVERE: Improving Audiovisual Emotion Reasoning with Preference Optimization
2026, cites this paper
OmniCustom: Sync Audio-Video Customization Via Joint Audio-Video Generation Model
2026, cites this paper
Scaling Audio-Text Retrieval with Multimodal Large Language Models
2026, cites this paper
Modal-aware Diffusion-enhanced with Multi-level Negative Sampling for Multimodal-based Recommendation
2026, cites this paper
OmniRet: Efficient and High-Fidelity Omni Modality Retrieval
2026, influential citation
OpenMarcie: Dataset for Multimodal Action Recognition in Industrial Environments
2026, cites this paper
Unifying Heterogeneous Multi-Modal Remote Sensing Detection Via Language-Pivoted Pretraining
2026, cites this paper
Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models
2026, cites this paper
LAMMI-Pathology: A Tool-Centric Bottom-Up LVLM-Agent Framework for Molecularly Informed Medical Intelligence in Pathology
2026, cites this paper
sleep2vec: Unified Cross-Modal Alignment for Heterogeneous Nocturnal Biosignals
2026, cites this paper
Motion Capture is Not the Target Domain: Scaling Synthetic Data for Learning Motion Representations
2026, cites this paper
Reasoning-Augmented Representations for Multimodal Retrieval
2026, cites this paper
TiFRe: Text-guided Video Frame Reduction for Efficient Video Multi-modal Large Language Models
2026, cites this paper
DiffMusic: Efficient Music Generation From a Single Image Using Diffusion-Based Representations
2026, cites this paper
OpenMAG: A Comprehensive Benchmark for Multimodal-Attributed Graph
2026, cites this paper
Audio-to-Image Bird Species Retrieval without Audio-Image Pairs via Text Distillation
2026, cites this paper
Toward Enhancing Representation Learning in Federated Multi-Task Settings
2026, cites this paper
SFQA: A Comprehensive Perceptual Quality Assessment Dataset for Singing Face Generation
2026, cites this paper
A Survey on Video Captioning in the Era of Large Language Models
2026, cites this paper
Toward General Industrial Intelligence: A Survey of Large Models as a Service in Industrial IoT
2026, cites this paper
Voice2Visage: Deciphering Faces From Voices
2026, cites this paper
Affection-Guided Bottleneck Diffusion for Missing Modality Issue in Multimodal Affective Computing
2026, cites this paper
Multi-Modal Data-Enhanced Foundation Models for Prediction and Control in Wireless Networks: A Survey
2026, cites this paper
RadDiff: Describing Differences in Radiology Image Sets with Natural Language
2026, cites this paper
Prompting Underestimates LLM Capability for Time Series Classification
2026, cites this paper
OnomaCompass: A Texture Exploration Interface that Shuttles between Words and Images
2026, cites this paper
Early-stage architecture design assistance by LLMs and knowledge graphs
2026, cites this paper
QuaFT: Quality-guided semantic fault intelligence under corrupted industrial data
2026, cites this paper
Contextual Fusion Strategies for Multimodal GNN-Based Reasoning: Performance and Computational Trade-Offs
2026, cites this paper
Universal Scene Graph Generation
2025, cites this paper
Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene
2025, cites this paper
Towards Achieving Perfect Multimodal Alignment
2025, cites this paper
Advancing Medical Representation Learning Through High-Quality Data
2025, cites this paper
Code-Driven Inductive Synthesis: Enhancing Reasoning Abilities of Large Language Models with Sequences
2025, cites this paper
Tracking Meets Large Multimodal Models for Driving Scenario Understanding
2025, cites this paper
Unlocking the Capabilities of Large Vision-Language Models for Generalizable and Explainable Deepfake Detection
2025, cites this paper
Ferret: An Efficient Online Continual Learning Framework under Varying Memory Constraints
2025, cites this paper
How Can Time Series Analysis Benefit From Multiple Modalities? A Survey and Outlook
2025, cites this paper
GeoRSMLLM: A Multimodal Large Language Model for Vision-Language Tasks in Geoscience and Remote Sensing
2025, cites this paper
Cross-Modal Learning for Music-to-Music-Video Description Generation
2025, cites this paper
SUVAD: Semantic Understanding Based Video Anomaly Detection Using MLLM
2025, influential citation
TikZero: Zero-Shot Text-Guided Graphics Program Synthesis
2025, cites this paper
Long-Video Audio Synthesis with Multi-Agent Collaboration
2025, cites this paper
EgoDTM: Towards 3D-Aware Egocentric Video-Language Pretraining
2025, cites this paper
AudioX: A Unified Framework for Anything-to-Audio Generation
2025, influential citation
MACS: Multi-source Audio-to-image Generation with Contextual Significance and Semantic Alignment
2025, influential citation
DocVideoQA: Towards Comprehensive Understanding of Document-Centric Videos through Question Answering
2025, cites this paper
DAVE: Diagnostic benchmark for Audio Visual Evaluation
2025, cites this paper
Divide and Conquer Self-Supervised Learning for High-Content Imaging
2025, cites this paper
Quality Over Quantity? LLM-Based Curation for a Data-Efficient Audio-Video Foundation Model
2025, cites this paper
Vision-text Enhancement Network For Weakly Supervised Video Anomaly Detection
2025, cites this paper
Continual Learning for Multiple Modalities
2025, influential citation
Exploring Multimodal Perception in Large Language Models Through Perceptual Strength Ratings
2025, cites this paper
Synchronized Video-to-Audio Generation via Mel Quantization-Continuum Decomposition
2025, cites this paper
FilmComposer: LLM-Driven Music Production for Silent Film Clips
2025, cites this paper
Generalization abilities of foundation models in waste classification
2025, cites this paper
AnyAnomaly: Zero-Shot Customizable Video Anomaly Detection with LVLM
2025, cites this paper