Active Learning for Vision-Language Models
Published 2024 in IEEE Workshop/Winter Conference on Applications of Computer Vision
ABSTRACT
Pre-trained vision-language models (VLMs) such as CLIP have demonstrated impressive zero-shot performance on a wide range of downstream computer vision tasks. However, a considerable performance gap remains between these models and supervised models trained on the downstream dataset. To bridge this gap, we propose a novel active learning (AL) framework that enhances the zero-shot classification performance of VLMs by selecting only a few informative samples from the unlabeled data for annotation during training. To achieve this, our approach first calibrates the predicted entropy of VLMs and then combines self-uncertainty with neighbor-aware uncertainty to compute a reliable uncertainty measure for active sample selection. Our extensive experiments show that the proposed approach outperforms existing AL approaches on several image classification datasets and significantly enhances the zero-shot performance of VLMs.
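The abstract only sketches the selection criterion, so the following is a minimal illustrative sketch of how a calibrated-entropy score combined with a neighbor-aware term might drive active sample selection. This is not the authors' implementation: the function names (`select_for_annotation`), the temperature `tau`, the mixing weight `alpha`, and the choice of k-nearest-neighbor mean entropy are all assumptions made for illustration.

```python
import numpy as np

def softmax(logits, tau=1.0):
    # Temperature-scaled softmax; tau > 1 softens over-confident zero-shot
    # predictions (a simple stand-in for entropy calibration; tau is assumed).
    z = (logits / tau) - (logits / tau).max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def entropy(probs, eps=1e-12):
    # Shannon entropy of each row of class probabilities.
    return -(probs * np.log(probs + eps)).sum(axis=1)

def select_for_annotation(logits, features, budget, k=10, alpha=0.5, tau=2.0):
    """Pick `budget` unlabeled samples with the highest combined uncertainty.

    logits:   (N, C) zero-shot image-text similarity scores from the VLM
    features: (N, D) L2-normalized image embeddings
    alpha:    weight mixing self- and neighbor-aware uncertainty (assumed)
    """
    probs = softmax(logits, tau=tau)           # calibrated predictions
    self_unc = entropy(probs)                  # self-uncertainty per sample

    # Neighbor-aware uncertainty: mean entropy over each sample's k nearest
    # neighbors in embedding space (cosine similarity on unit vectors).
    sims = features @ features.T
    np.fill_diagonal(sims, -np.inf)            # exclude each sample itself
    nn_idx = np.argsort(-sims, axis=1)[:, :k]
    neighbor_unc = self_unc[nn_idx].mean(axis=1)

    score = alpha * self_unc + (1.0 - alpha) * neighbor_unc
    return np.argsort(-score)[:budget]         # indices to send for labeling
```

The neighbor term down-weights isolated high-entropy outliers and favors samples whose whole neighborhood is uncertain, which is one plausible reading of "neighbor-aware uncertainty"; the paper's actual formulation may differ.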
PUBLICATION RECORD
- Publication year: 2024
- Venue: IEEE Workshop/Winter Conference on Applications of Computer Vision
- Publication date: 2024-10-29
- Fields of study: Computer Science
- Source metadata: Semantic Scholar