Perceptual Grouping in Contrastive Vision-Language Models

Kanchana Ranasinghe, Brandon McKinzie, S. Ravi, Yinfei Yang, Alexander Toshev, Jonathon Shlens

Published 2022 on arXiv; conference version in the IEEE International Conference on Computer Vision (ICCV), 2023

ABSTRACT

Recent advances in zero-shot image recognition suggest that vision-language models learn generic visual representations with a high degree of semantic information that may be arbitrarily probed with natural language phrases. Understanding an image, however, is not just about understanding what content resides within an image, but importantly, where that content resides. In this work we examine how well vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery. We demonstrate how contemporary vision and language representation learning models based on contrastive losses and large web-based data capture limited object localization information. We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information. We measure this performance in terms of zero-shot image recognition, unsupervised bottom-up and top-down semantic segmentations, as well as robustness analyses. We find that the resulting model achieves state-of-the-art results in terms of unsupervised segmentation, and demonstrate that the learned representations are uniquely robust to spurious correlations in datasets designed to probe the causal behavior of vision models.

PUBLICATION RECORD

Publication year
2022
Venue
IEEE International Conference on Computer Vision
Publication date
2022-10-18
Fields of study
Computer Science
Identifiers
DOI: 10.1109/ICCV51070.2023.00513; arXiv: 2210.09996
Source metadata
Semantic Scholar

REFERENCES

Associating Spatially-Consistent Grouping with Text-supervised Semantic Segmentation
2023cited by this paper
Learning Open-Vocabulary Semantic Segmentation Models From Natural Language Supervision
2023cited by this paper
Diffusion Models for Open-Vocabulary Segmentation
2023cited by this paper
SAVi++: Towards End-to-End Object-Centric Learning from Real-World Videos
2022cited by this paper
Position Prediction as an Effective Pretraining Strategy
2022cited by this paper
Learning Viewpoint-Agnostic Visual Representations by Recovering Tokens in 3D Space
2022cited by this paper
Patch-level Representation Learning for Self-supervised Vision Transformers
2022cited by this paper
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
2022cited by this paper
GLIPv2: Unifying Localization and Vision-Language Understanding
2022cited by this paper
Does Self-supervised Learning Really Improve Reinforcement Learning from Pixels?
2022cited by this paper
Self-Supervised Visual Representation Learning with Semantic Grouping
2022cited by this paper
Deep Spectral Methods: A Surprisingly Strong Baseline for Unsupervised Semantic Segmentation and Localization
2022cited by this paper
CoCa: Contrastive Captioners are Image-Text Foundation Models
2022cited by this paper
ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension
2022cited by this paper
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality
2022cited by this paper
Adapting CLIP For Phrase Localization Without Further Training
2022cited by this paper
Discovering Objects that Can Move
2022cited by this paper
Object discovery and representation networks
2022cited by this paper
Unsupervised Semantic Segmentation by Distilling Feature Correspondences
2022influential reference
Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision
2022cited by this paper
GroupViT: Semantic Segmentation Emerges from Text Supervision
2022cited by this paper
Peekaboo: Text to Image Diffusion Models are Zero-Shot Segmentors
2022cited by this paper
Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive Learning
2022cited by this paper
Scaling Language-Image Pre-Training via Masking
2022cited by this paper
Learning to Generate Text-Grounded Mask for Open-World Semantic Segmentation from Only Image-Text Pairs
2022cited by this paper
A Comprehensive Study of Image Classification Model Sensitivity to Foregrounds, Backgrounds, and Visual Attributes
2022cited by this paper
SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation
2022cited by this paper
Language-driven Semantic Segmentation
2022influential reference
Unsupervised Semantic Segmentation with Self-supervised Object-centric Representations
2022cited by this paper
MDETR - Modulated Detection for End-to-End Multi-Modal Understanding
2021cited by this paper
Extract Free Dense Labels from CLIP
2021cited by this paper
A Closer Look at Self-training for Zero-Label Semantic Segmentation
2021cited by this paper
Learning Transferable Visual Models From Natural Language Supervision
2021cited by this paper
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
2021cited by this paper
Unsupervised Semantic Segmentation by Contrasting Object Mask Proposals
2021cited by this paper
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
2021influential reference
Revisiting ResNets: Improved Training and Scaling Strategies
2021cited by this paper
PiCIE: Unsupervised Semantic Segmentation using Invariance and Equivariance in Clustering
2021cited by this paper
SimCSE: Simple Contrastive Learning of Sentence Embeddings
2021cited by this paper
Open-Vocabulary Image Segmentation
2021influential reference
High-Resolution Image Synthesis with Latent Diffusion Models
2021cited by this paper
Decoupling Zero-Shot Semantic Segmentation
2021influential reference
Grounded Language-Image Pre-training
2021cited by this paper
Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts
2021cited by this paper
FILIP: Fine-grained Interactive Language-Image Pre-Training
2021cited by this paper
Masked Autoencoders Are Scalable Vision Learners
2021cited by this paper
LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs
2021cited by this paper
Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models
2021influential reference
Exploiting a Joint Embedding Space for Generalized Zero-Shot Semantic Segmentation
2021cited by this paper
Self-Supervised Disentangled Representation Learning for Third-Person Imitation Learning
2021cited by this paper
Open World Entity Segmentation
2021cited by this paper
Per-Pixel Classification is Not All You Need for Semantic Segmentation
2021influential reference
Just Train Twice: Improving Group Robustness without Training Group Information
2021cited by this paper
Partial success in closing the gap between human and machine vision
2021cited by this paper
Conterfactual Generative Zero-Shot Semantic Segmentation
2021cited by this paper
Intriguing Properties of Vision Transformers
2021cited by this paper
Emerging Properties in Self-Supervised Vision Transformers
2021influential reference
Zero-Shot Detection via Vision and Language Knowledge Distillation
2021cited by this paper
WILDS: A Benchmark of in-the-Wild Distribution Shifts
2020cited by this paper
Language Models are Few-Shot Learners
2020cited by this paper
VirTex: Learning Visual Representations from Textual Annotations
2020cited by this paper
Noise or Signal: The Role of Image Backgrounds in Object Recognition
2020cited by this paper
Object-Centric Learning with Slot Attention
2020cited by this paper
Context-aware Feature Generation For Zero-shot Semantic Segmentation
2020cited by this paper
From Pixel to Patch: Synthesize Context-Aware Features for Zero-Shot Semantic Segmentation
2020cited by this paper
Cap2Seg: Inferring Semantic and Spatial Context from Captions for Zero-Shot Image Segmentation
2020cited by this paper
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
2020influential reference
Learning the Best Pooling Strategy for Visual Semantic Embedding
2020cited by this paper
Uncertainty-Aware Learning for Zero-Shot Semantic Segmentation
2020cited by this paper
Learning from Failure: De-biasing Classifier from Biased Classifier
2020cited by this paper
Consistent Structural Relation Learning for Zero-Shot Segmentation
2020cited by this paper
Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation
2020cited by this paper
Environment Inference for Invariant Learning
2020cited by this paper
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
2019cited by this paper
Invariant Risk Minimization
2019cited by this paper
Zero-Shot Semantic Segmentation
2019cited by this paper
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
2019cited by this paper
Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
2019cited by this paper
Horse
2019cited by this paper
SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
2019cited by this paper
Semantic Projection Network for Zero- and Few-Label Semantic Segmentation
2019cited by this paper
Do ImageNet Classifiers Generalize to ImageNet?
2019cited by this paper
Bipartite Conditional Random Fields for Panoptic Segmentation
2019cited by this paper
Scalability in Perception for Autonomous Driving: Waymo Open Dataset
2019cited by this paper
Causality
2019cited by this paper
Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization
2019influential reference
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
2019cited by this paper
PyTorch: An Imperative Style, High-Performance Deep Learning Library
2019influential reference
Deep Clustering for Unsupervised Learning of Visual Features
2018cited by this paper
Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input
2018cited by this paper
The Book of Why: The New Science of Cause and Effect
2018cited by this paper
Excessive Invariance Causes Adversarial Vulnerability
2018cited by this paper
Do Better ImageNet Models Transfer Better?
2018cited by this paper
End-to-End Joint Semantic Segmentation of Actors and Actions in Video
2018cited by this paper
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
2018cited by this paper
Invariant Information Clustering for Unsupervised Image Classification and Segmentation
2018cited by this paper
Non-local Neural Networks
2017cited by this paper
Attention is All you Need
2017cited by this paper
What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?
2017cited by this paper
Open Vocabulary Scene Parsing
2017cited by this paper

CITED BY

Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation?
2026cites this paper
Open-Vocabulary Domain Generalization in Urban-Scene Segmentation
2026cites this paper
Enabling Training-Free Text-Based Remote Sensing Segmentation
2026cites this paper
Robust Vision-Language Alignment Using Multi-Modal Large Language Models for Open-Vocabulary Semantic Segmentation
2026cites this paper
Online Monitoring Framework for Automotive Time Series Data using JEPA Embeddings
2026cites this paper
SHED Light on Segmentation for Dense Prediction
2026cites this paper
IVRA: Improving Visual-Token Relations for Robot Action Policy with Training-Free Hint-Based Guidance
2026cites this paper
Insight: Interpretable Semantic Hierarchies in Vision-Language Encoders
2026cites this paper
Interpretable Motion-Attentive Maps: Spatio-Temporally Localizing Concepts in Video Diffusion Transformers
2026cites this paper
Unveiling the Complementary Synergy of CLIP and Diffusion Models for Weakly Supervised Semantic Segmentation
2026cites this paper
A Training-Free Framework for Open-Vocabulary Image Segmentation and Recognition with EfficientNet and CLIP
2025cites this paper
Cross-View Open-Vocabulary Object Detection in Aerial Imagery
2025cites this paper
Unsupervised Transformer Pre-Training for Images: Self-Distillation, Mean Teachers, and Random Crops
2025cites this paper
PnP-SAM: Enhancing Open-Vocabulary Semantic Segmentation Through Hierarchical Aggregation and Prompt Optimization
2025cites this paper
Learning Egocentric In-Hand Object Segmentation through Weak Supervision from Human Narrations
2025cites this paper
Test-Time Optimization for Domain Adaptive Open Vocabulary Segmentation
2025influential citation
Fast Vision Mamba: Pooling Spatial Dimensions for Accelerated Processing
2025cites this paper
microCLIP: Unsupervised CLIP Adaptation via Coarse-Fine Token Fusion for Fine-Grained Image Classification
2025cites this paper
LPOSS: Label Propagation Over Patches and Pixels for Open-vocabulary Semantic Segmentation
2025cites this paper
Falcon: Fractional Alternating Cut with Overcoming Minima in Unsupervised Segmentation
2025cites this paper
Scaling Laws for Native Multimodal Models
2025cites this paper
FLOSS: Free Lunch in Open-vocabulary Semantic Segmentation
2025cites this paper
Visually Guided Decoding: Gradient-Free Hard Prompt Inversion with Language Models
2025cites this paper
Segment Anyword: Mask Prompt Inversion for Open-Set Grounded Segmentation
2025cites this paper
SAB3R: Semantic-Augmented Backbone in 3D Reconstruction
2025influential citation
FastSeg: Efficient Training-Free Open-Vocabulary Segmentation via Hierarchical Attention Refinement Method
2025cites this paper
VidSeg: Training-free Video Semantic Segmentation based on Diffusion Models
2025cites this paper
FA-Seg: A fast and accurate diffusion-based method for open-vocabulary segmentation
2025cites this paper
Meta CLIP 2: A Worldwide Scaling Recipe
2025cites this paper
Training-Free Class Purification for Open-Vocabulary Semantic Segmentation
2025cites this paper
Annotation-Free Open-Vocabulary Segmentation for Remote-Sensing Images
2025cites this paper
Synthetic Captions for Open-Vocabulary Zero-Shot Segmentation
2025cites this paper
One Patch to Caption Them All: A Unified Zero-Shot Captioning Framework
2025cites this paper
NERVE: Neighbourhood & Entropy-guided Random-walk for training free open-Vocabulary sEgmentation
2025cites this paper
Moving Off-the-Grid: Scene-Grounded Video Representations
2024cites this paper
Improving fine-grained understanding in image-text pre-training
2024cites this paper
EmerDiff: Emerging Pixel-level Semantic Knowledge in Diffusion Models
2024influential citation
Open-Vocabulary Segmentation with Unpaired Mask-Text Supervision
2024cites this paper
SInViG: A Self-Evolving Interactive Visual Agent for Human-Robot Interaction
2024cites this paper
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
2024cites this paper
TTD: Text-Tag Self-Distillation Enhancing Image-Text Alignment in CLIP to Alleviate Single Tag Bias
2024cites this paper
Is Clip the Main Roadblock for Fine-Grained Open-World Perception?
2024cites this paper
Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs
2024influential citation
HYPE: Hyperbolic Entailment Filtering for Underspecified Images and Texts
2024cites this paper
Zero-Shot Video Semantic Segmentation based on Pre-Trained Diffusion Models
2024cites this paper
DiffCut: Catalyzing Zero-Shot Semantic Segmentation with Diffusion Features and Recursive Normalized Cut
2024cites this paper
Too Many Frames, not all Useful: Efficient Strategies for Long-Form Video QA
2024cites this paper
A Simple Framework for Open-Vocabulary Zero-Shot Segmentation
2024cites this paper
PDiscoFormer: Relaxing Part Discovery Constraints with Vision Transformers
2024cites this paper
Test-time Contrastive Concepts for Open-world Semantic Segmentation with Vision-Language Models
2024cites this paper
DEAL: Disentangle and Localize Concept-level Explanations for VLMs
2024cites this paper
Large-vocabulary forensic pathological analyses via prototypical cross-modal contrastive learning
2024cites this paper
GLASS: Guided Latent Slot Diffusion for Object-Centric Learning
2024cites this paper
In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation
2024influential citation
Learn to Rectify the Bias of CLIP for Unsupervised Semantic Segmentation
2024cites this paper
Image Segmentation in Foundation Model Era: A Survey
2024cites this paper
Generalization Boosted Adapter for Open-Vocabulary Segmentation
2024cites this paper
iSeg: An Iterative Refinement-based Framework for Training-free Segmentation
2024cites this paper
Large-Scale Visual Language Model Boosted by Contrast Domain Adaptation for Intelligent Industrial Visual Monitoring
2024cites this paper
Towards Open-Vocabulary Semantic Segmentation Without Semantic Labels
2024cites this paper
SegEarth-OV: Towards Training-Free Open-Vocabulary Segmentation for Remote Sensing Images
2024cites this paper
ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements
2024cites this paper
CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation
2024cites this paper
Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation
2024cites this paper
Grounding Descriptions in Images informs Zero-Shot Visual Recognition
2024cites this paper
Unbiased Region-Language Alignment for Open-Vocabulary Dense Prediction
2024cites this paper
DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment
2024influential citation
FineCLIP: Self-distilled Region-based CLIP for Better Fine-grained Understanding
2024cites this paper
Masked Autoencoding Does Not Help Natural Language Supervision at Scale
2023cites this paper
TagAlign: Improving Vision-Language Alignment with Multi-Tag Classification
2023cites this paper
CLIP-DINOiser: Teaching CLIP a few DINO tricks
2023influential citation
SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference
2023cites this paper
Learning Open-Vocabulary Semantic Segmentation Models From Natural Language Supervision
2023influential citation
Emergent Open-Vocabulary Semantic Segmentation from Off-the-Shelf Vision-Language Models
2023cites this paper
Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation
2023cites this paper
SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding
2023influential citation
CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement
2023cites this paper
SILC: Improving Vision Language Pretraining with Self-Distillation
2023cites this paper
Data Filtering Networks
2023cites this paper
CLIP-DIY: CLIP Dense Inference Yields Open-Vocabulary Semantic Segmentation For-Free
2023influential citation
Diffusion Model is Secretly a Training-Free Open Vocabulary Semantic Segmenter
2023cites this paper
Exploring Limits of Diffusion-Synthetic Training with Weakly Supervised Semantic Segmentation
2023cites this paper
Language-based Action Concept Spaces Improve Video Self-Supervised Learning
2023cites this paper
Diffusion Models for Open-Vocabulary Segmentation
2023cites this paper
SHED Light on Segmentation for Depth Estimation
year unknowncites this paper