CountGD++: Generalized Prompting for Open-World Counting

Published 2025 in arXiv.org

ABSTRACT

The flexibility and accuracy of methods for automatically counting objects in images and videos are limited by how the target object can be specified. Existing methods let users describe the target object with text and visual examples, but the visual examples must be manually annotated inside the image, and there is no way to specify what not to count. To address these gaps, we introduce novel capabilities that expand how the target object can be specified. Specifically, we extend the prompt so that what not to count can be described with text and/or visual examples, introduce the concept of 'pseudo-exemplars' that automate the annotation of visual examples at inference, and extend counting models to accept visual examples from both natural and synthetic external images. We also use our new counting model, CountGD++, as a vision expert agent for an LLM. Together, these contributions expand the prompt flexibility of multi-modal open-world counting and lead to significant improvements in accuracy, efficiency, and generalization across multiple datasets. Code is available at https://github.com/niki-amini-naieni/CountGDPlusPlus.
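To make the prompt extensions in the abstract concrete, the following is a minimal sketch of how a counting prompt with positive and negative text and visual examples might be represented, and how pseudo-exemplars could be filled in automatically at inference. It is not the CountGD++ API. The names `CountingPrompt`, `Detection`, `select_pseudo_exemplars`, and `with_pseudo_exemplars` are hypothetical. The abstract does not say how pseudo-exemplars are chosen, so picking the highest-scoring boxes from a first pass is purely an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Bounding box as (x_min, y_min, x_max, y_max); the format is an assumption.
Box = Tuple[float, float, float, float]


@dataclass
class Detection:
    """One candidate object from a hypothetical first counting pass."""
    box: Box
    score: float


@dataclass
class CountingPrompt:
    """Hypothetical prompt: what to count and what not to count."""
    positive_text: str = ""
    negative_text: str = ""
    positive_exemplars: List[Box] = field(default_factory=list)
    negative_exemplars: List[Box] = field(default_factory=list)


def select_pseudo_exemplars(
    detections: List[Detection], k: int = 3, min_score: float = 0.5
) -> List[Box]:
    """Pick up to k boxes with the highest scores at or above min_score.

    Assumption: confident first-pass detections can stand in for
    manually annotated visual examples.
    """
    confident = [d for d in detections if d.score >= min_score]
    confident.sort(key=lambda d: d.score, reverse=True)
    return [d.box for d in confident[:k]]


def with_pseudo_exemplars(
    prompt: CountingPrompt, first_pass: List[Detection], k: int = 3
) -> CountingPrompt:
    """Return a new prompt whose positive exemplars include pseudo-exemplars."""
    return CountingPrompt(
        positive_text=prompt.positive_text,
        negative_text=prompt.negative_text,
        positive_exemplars=prompt.positive_exemplars
        + select_pseudo_exemplars(first_pass, k),
        negative_exemplars=list(prompt.negative_exemplars),
    )


# Usage: a text-only prompt is enriched with two pseudo-exemplars.
first_pass = [
    Detection((0, 0, 10, 10), 0.9),
    Detection((20, 0, 30, 10), 0.4),
    Detection((40, 0, 50, 10), 0.7),
    Detection((60, 0, 70, 10), 0.8),
]
prompt = CountingPrompt(positive_text="red apple", negative_text="green apple")
enriched = with_pseudo_exemplars(prompt, first_pass, k=2)
```

In this sketch the enriched prompt would be passed to a second counting pass, so the user supplies only text while the model still receives visual examples.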

PUBLICATION RECORD

Publication year: 2025
Venue: arXiv.org
Publication date: 2025-12-29
Fields of study: Computer Science
Identifiers: DOI 10.48550/arXiv.2512.23351; arXiv 2512.23351
Source metadata: Semantic Scholar
