ProposalCLIP: Unsupervised Open-Category Object Proposal Generation via Exploiting CLIP Cues

Hengcan Shi, Munawar Hayat, Yicheng Wu, Jianfei Cai

Published 2022 in Computer Vision and Pattern Recognition

ABSTRACT

Object proposal generation is an important and fundamental task in computer vision. In this paper, we propose ProposalCLIP, a method for unsupervised open-category object proposal generation. Unlike previous works, which require a large number of bounding box annotations and/or can only generate proposals for a limited set of object categories, ProposalCLIP predicts proposals for a large variety of object categories without annotations by exploiting CLIP (contrastive language-image pre-training) cues. First, we analyze CLIP for unsupervised open-category proposal generation and design an objectness score for proposal selection based on our empirical analysis. Second, we propose a graph-based merging module that addresses the limitations of CLIP cues and merges fragmented proposals. Finally, we present a proposal regression module that extracts pseudo labels based on CLIP cues and trains a lightweight network to further refine the proposals. Extensive experiments on the PASCAL VOC, COCO and Visual Genome datasets show that ProposalCLIP generates better proposals than previous state-of-the-art methods. ProposalCLIP also benefits downstream tasks such as unsupervised object detection.
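
The abstract does not give the details of the graph-based merging module, so the following is only a minimal sketch of the general idea of merging fragmented proposals over a graph, not the authors' implementation. It assumes that two proposals are linked when their boxes overlap and their feature vectors are similar, and that each connected group is replaced by its enclosing box. The function names (`iou`, `cosine`, `merge_proposals`), the thresholds `iou_thr` and `sim_thr`, and the toy feature vectors standing in for CLIP embeddings are all hypothetical.

```python
from itertools import combinations


def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def cosine(u, v):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sum(x * x for x in u) ** 0.5
    norm_v = sum(y * y for y in v) ** 0.5
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def merge_proposals(boxes, feats, iou_thr=0.1, sim_thr=0.9):
    """Merge proposals that are connected in an overlap-and-similarity graph.

    Assumption (not from the paper): an edge links two proposals when their
    IoU is at least iou_thr and their feature cosine similarity is at least
    sim_thr. Each connected component becomes one enclosing box.
    """
    parent = list(range(len(boxes)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(boxes)), 2):
        if iou(boxes[i], boxes[j]) >= iou_thr and cosine(feats[i], feats[j]) >= sim_thr:
            parent[find(i)] = find(j)

    groups = {}
    for i, box in enumerate(boxes):
        groups.setdefault(find(i), []).append(box)

    # Replace every group by the smallest box that encloses all its members.
    return sorted(
        (
            min(b[0] for b in g),
            min(b[1] for b in g),
            max(b[2] for b in g),
            max(b[3] for b in g),
        )
        for g in groups.values()
    )


# Two overlapping fragments with similar features, plus one unrelated box.
boxes = [(0, 0, 10, 10), (5, 0, 15, 10), (50, 50, 60, 60)]
feats = [(1.0, 0.0), (0.99, 0.1), (0.0, 1.0)]
merged = merge_proposals(boxes, feats)
```

In this toy input the first two boxes overlap and have nearly parallel feature vectors, so they collapse into one enclosing box, while the third box is left unchanged. In practice the features would come from an image encoder, and the thresholds would need tuning on real data.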

PUBLICATION RECORD

Publication year: 2022
Venue: Computer Vision and Pattern Recognition
Publication date: 2022-01-18
Fields of study: Computer Science
Identifiers: DOI 10.1109/CVPR52688.2022.00939; arXiv 2201.06696
Source metadata: Semantic Scholar
