Open-Set Object Counting
- Open-set object counting is the task of estimating counts for arbitrarily specified object classes, using exemplars, text, or both.
- It employs strategies such as exemplar-based matching, self-organizing methods, and density map regression to achieve robust performance.
- The field leverages multi-modal fusion and foundation model techniques to address challenges like occlusion, semantic ambiguity, and domain transfer.
Open-set object counting, also known as open-world or class-agnostic object counting, is the task of estimating the number of object instances of arbitrary, potentially unseen categories in images (or videos). The target class may be specified by exemplars, bounding boxes, masks, or textual prompts, including categories not present during training. This paradigm transcends the limitations of traditional closed-set counting, which restricts inference to a predefined set of object classes, and has emerged as a critical challenge for robust, scalable visual recognition systems.
1. Formal Task Definition and Foundational Principles
Open-set object counting requires the model to generate an accurate count of instances for any user-specified category, given only an image and a prompt encoding the target class. The prompt modality can be:
- A set of reference exemplars (bounding boxes or crops)
- A free-form text description
- Both visual exemplars and text (multi-modal fusion)
The general goal is to produce a count estimate without explicit per-category model retraining. This broad scope includes:
- Reference-based counting (few-shot, guided by exemplars)
- Reference-less counting (unsupervised or dominant class discovery)
- Open-world, text-guided counting (language-driven, category-unconstrained) (Ciampi et al., 31 Jan 2025)
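The three regimes above differ only in what the prompt contains. A minimal sketch of such a prompt interface (the class and method names are hypothetical, not drawn from any cited system):

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

@dataclass
class CountingPrompt:
    """Target-class specification for an open-set counter (hypothetical interface)."""
    exemplar_boxes: Optional[Sequence[Tuple[int, int, int, int]]] = None  # (x1, y1, x2, y2)
    text: Optional[str] = None

    def mode(self) -> str:
        if self.exemplar_boxes and self.text:
            return "multi-modal"        # text + exemplars
        if self.exemplar_boxes:
            return "reference-based"    # few-shot, exemplar-guided
        if self.text:
            return "text-guided"        # open-world, language-driven
        return "reference-less"         # dominant-class discovery

print(CountingPrompt(text="green apples").mode())                # → text-guided
print(CountingPrompt(exemplar_boxes=[(10, 10, 40, 40)]).mode())  # → reference-based
print(CountingPrompt().mode())                                   # → reference-less
```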
Open-set counting is evaluated on benchmarks such as FSC-147 (147 object classes, single-class per image, with boxed exemplars and dot maps), CARPK (vehicle counting, drone imagery), and omnibus datasets like OmniCount-191 and VideoCount (Mondal et al., 8 Mar 2024, Amini-Naieni et al., 18 Jun 2025).
2. Methodological Paradigms
Open-set object counting research organizes approaches into three principal categories (Ciampi et al., 31 Jan 2025):
(a) Reference-Based Counting
These methods receive exemplars that visually encode the novel class. The architectures fuse image features with exemplar features (via cross-correlation, bilinear attention, or pixel-wise similarity) and decode instance density maps:
- Two-stream backbones (image and exemplar streams) with feature matching (e.g., BMNet+, FamNet)
- Mask-augmented pipelines (MACnet: replaces boxes with high-fidelity masks)
- Test-time adaptation losses (perturbation consistency, Min-Count) (Ciampi et al., 31 Jan 2025)
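The matching step shared by these architectures is a pixel-wise similarity between the image feature map and a pooled exemplar embedding. A minimal numpy sketch, assuming generic (H, W, C) features rather than any specific published backbone:

```python
import numpy as np

def similarity_map(image_feats: np.ndarray, exemplar_feat: np.ndarray) -> np.ndarray:
    """Pixel-wise cosine similarity between an (H, W, C) image feature map and a
    C-dim pooled exemplar embedding; the resulting (H, W) map is what a decoder
    would turn into a density map. Schematic, not any specific published model."""
    img = image_feats / (np.linalg.norm(image_feats, axis=-1, keepdims=True) + 1e-8)
    ex = exemplar_feat / (np.linalg.norm(exemplar_feat) + 1e-8)
    return img @ ex

rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 16, 32))
exemplar = feats[4, 7]               # pretend an exemplar crop was pooled here
sim = similarity_map(feats, exemplar)
print(sim.shape, divmod(int(np.argmax(sim)), 16))  # peak at the exemplar's own location
```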
(b) Reference-Less (Self-Organizing) Counting
These systems discover and select exemplars from the dominant repeating pattern in the input:
- Self-attention repetitive RPNs predict region proposals with repetition scores (RepRPN-Counter) (Ranjan et al., 2022)
- Clustering-based unsupervised models group generic object detections by learned similarity (SIMCO) (Godi et al., 2019)
- Weakly and self-supervised approaches regress counts or density maps without explicit object class knowledge or exemplars (Ciampi et al., 31 Jan 2025)
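The clustering idea behind reference-less counters such as SIMCO can be illustrated with a toy greedy grouping of detection embeddings; this is a schematic of the principle, not the published algorithm:

```python
import numpy as np

def count_dominant_class(det_feats: np.ndarray, thresh: float = 0.8) -> int:
    """Greedily cluster class-agnostic detection embeddings by cosine similarity
    and return the size of the largest cluster, i.e., the dominant repeating
    pattern. An illustrative toy, not the published SIMCO procedure."""
    feats = det_feats / (np.linalg.norm(det_feats, axis=1, keepdims=True) + 1e-8)
    unassigned = list(range(len(feats)))
    best = 0
    while unassigned:
        seed = unassigned.pop(0)
        members = [seed] + [i for i in unassigned if feats[i] @ feats[seed] > thresh]
        unassigned = [i for i in unassigned if i not in members]
        best = max(best, len(members))
    return best

# Seven detections: five instances of one object type plus two distractors.
one_hot = np.eye(8)
dets = np.stack([one_hot[0]] * 5 + [one_hot[1], one_hot[2]])
print(count_dominant_class(dets))  # → 5
```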
(c) Open-World Text-Guided Counting
With the availability of large vision-language models, these methods link natural language prompts to visual regions:
- Language-image fused counting networks (CounTX, CLIP-Count, VLCounter) (Amini-Naieni et al., 2023, Zhao et al., 24 Apr 2025)
- Open-vocabulary detectors (GroundingDINO) as counting proxies (CountGD, YOLO-Count) (Amini-Naieni et al., 5 Jul 2024, Zeng et al., 1 Aug 2025)
- Training-free and prompt-based approaches using foundation models (SAM, CLIP) for segmentation and matching (Shi et al., 2023, Mondal et al., 8 Mar 2024)
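A training-free pipeline of this kind can be sketched as: score patch features against a text embedding, binarize, and treat connected regions as instances. The embeddings below are placeholders standing in for frozen CLIP encoders, and real pipelines add mask refinement on top:

```python
import numpy as np
from scipy import ndimage

def text_guided_count(patch_feats: np.ndarray, text_emb: np.ndarray,
                      thresh: float = 0.5) -> int:
    """Score (H, W, C) patch features against a text embedding, binarize, and
    count connected components as instances. The embeddings are placeholders
    standing in for frozen CLIP encoders; real pipelines refine the masks."""
    feats = patch_feats / (np.linalg.norm(patch_feats, axis=-1, keepdims=True) + 1e-8)
    text = text_emb / np.linalg.norm(text_emb)
    mask = (feats @ text) > thresh
    _, n_components = ndimage.label(mask)  # each connected region ≈ one instance
    return n_components

# Toy scene: background patches align with a distractor embedding, while two
# separated 2x2 regions align with the "target" text embedding.
H, W, C = 8, 8, 16
target, background = np.eye(C)[0], np.eye(C)[1]
scene = np.tile(background, (H, W, 1))
scene[1:3, 1:3] = target  # object 1
scene[5:7, 5:7] = target  # object 2
print(text_guided_count(scene, target))  # → 2
```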
3. Model Architectures and Algorithmic Innovations
Recent advances integrate various foundational architectures and prompt-fusion strategies:
- Feature Fusion: Cross-attention, bilinear matching, and prompt conditioning modules optimally align exemplar/image/text features (Zhao et al., 24 Apr 2025, Amini-Naieni et al., 5 Jul 2024).
- Density Map Regression: The output is often a high-resolution density map, with the count obtained by integrating over the map (post-processed to an integer) (Xu et al., 2023, Amini-Naieni et al., 2023).
- Cardinality Maps: Alternative to density, YOLO-Count regresses uniform “cardinality” spreads over instance regions, improving control and integration with generative models (Zeng et al., 1 Aug 2025).
- Prompt Tuning: Visual prompt banks and dynamic prompt synthesis, e.g., Semantic-Driven Visual Prompt Tuning (SDVPT), bridge seen/unseen categories by aligning the geometry of learned prompts with text-embedding topology (Zhao et al., 24 Apr 2025).
- Self-Attention Region Proposal: RepRPN introduces attention-based proposal generators trained to predict repetition counts, supporting exemplar-free counting (Ranjan et al., 2022).
- Occlusion Handling: Amodal counting architectures such as CountOCC reconstruct occluded object features by hierarchical multi-modal guidance and attention-space equivalence (Arib et al., 16 Nov 2025).
- Training-Free Segmentation: Many methods leverage frozen foundation models (SAM, CLIP) to generate instance masks and compose the final count by mask disambiguation (Shi et al., 2023, Mondal et al., 8 Mar 2024, Ting et al., 12 Mar 2024).
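The density-map decoding listed above reduces counting to integration: each instance contributes unit mass, so summing the map and rounding yields the count. A minimal illustration with synthetic Gaussian blobs:

```python
import numpy as np

def density_to_count(density: np.ndarray) -> int:
    """Integrate (sum) a predicted density map and round to an integer count."""
    return int(round(float(density.sum())))

# Each object contributes one unit of mass as a small Gaussian blob; with three
# blobs the map integrates to exactly 3.
y, x = np.mgrid[0:32, 0:32]
density = np.zeros((32, 32))
for cy, cx in [(5, 5), (16, 20), (28, 9)]:
    blob = np.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2 * 1.5 ** 2))
    density += blob / blob.sum()  # normalize each blob to unit mass
print(density_to_count(density))  # → 3
```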
Table: Typical Prompt Modalities and Model Types
| Prompt Mode | Example Method | Key Component |
|---|---|---|
| Box/crop exemplars | BMNet+, RepRPN | Exemplar-image fusion |
| Text description | CounTX, SDVPT | Language-image fusion |
| Text + exemplars | CountGD, CountOCC | Multi-modal attention |
| No prompt | SIMCO, RepRPN | Unsupervised discovery |
4. Benchmarks, Evaluation Metrics, and Empirical Results
Open-set counting methods are rigorously evaluated using absolute and relative errors:
- MAE: Mean Absolute Error
- RMSE: Root Mean Square Error
- NAE: Normalized Absolute Error
- SRE: Square root of the count-normalized squared error
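Under the definitions commonly reported on FSC-147-style benchmarks (the normalized variants divide the error by the ground-truth count), these metrics can be computed as:

```python
import numpy as np

def counting_metrics(pred, gt):
    """MAE/RMSE plus the count-normalized NAE and SRE, following the
    definitions commonly reported on FSC-147-style benchmarks."""
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    err = pred - gt
    return {
        "MAE": float(np.mean(np.abs(err))),
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "NAE": float(np.mean(np.abs(err) / gt)),
        "SRE": float(np.sqrt(np.mean(err ** 2 / gt))),
    }

metrics = counting_metrics(pred=[10, 52, 7], gt=[12, 50, 7])
print({k: round(v, 3) for k, v in metrics.items()})
```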
Key findings:
- Multi-modal models (CountGD, YOLO-Count) using both text and exemplars set the current state of the art on FSC-147 (MAE ≈ 11–12 for text+exemplar) and achieve strong performance on CARPK (MAE ≈ 4–5) (Amini-Naieni et al., 5 Jul 2024, Zeng et al., 1 Aug 2025).
- Training-free, segmentation-driven approaches (TFCounter, OmniCount) exceed prior zero-shot methods and are competitive with fully trained counterparts, particularly in multi-label and dense scenarios (Ting et al., 12 Mar 2024, Mondal et al., 8 Mar 2024).
- Text-only counting, though extremely flexible, incurs higher error (e.g., CounTX, MAE ≈ 15.7 on FSC-147) due to semantic ambiguity (Amini-Naieni et al., 2023).
- Amodal counting frameworks such as CountOCC improve occlusion robustness, yielding MAE reductions of up to 26.7% on occlusion benchmarks (Arib et al., 16 Nov 2025).
- Annotation-free approaches (AFreeCA) leverage synthetic data, achieving state-of-the-art unsupervised performance by combining sorting signal, approximate counts, and patch-based inference (D'Alessandro et al., 7 Mar 2024).
5. Limitations, Current Challenges, and Failure Modes
Despite rapid progress, open-set object counting faces several persistent challenges:
- Semantic Ambiguity and Grounding: Text-only systems are limited by the frozen language encoder’s inability to fully disentangle semantically close categories or handle compositional queries ("green apples" vs. "apples") (Zhao et al., 24 Apr 2025).
- Scale and Density Generalization: Extreme object densities lead to merged masks (under-count) or over-segmentation (double-count), especially with training-free or naive segmentation backbones (Shi et al., 2023, Ting et al., 12 Mar 2024).
- Small or Occluded Instances: Tiny, distant, or highly occluded objects often evade current proposals or segmentation prompts, limiting recall in realistic scenes (Mondal et al., 8 Mar 2024, Arib et al., 16 Nov 2025).
- Cross-Domain Transfer: Although recent models such as CountGD, YOLO-Count report improved generalization to new domains (e.g., vehicles, fruits, medical), synthetic-to-real and fine-grained domain adaptation remain open problems (Amini-Naieni et al., 5 Jul 2024, Zeng et al., 1 Aug 2025).
- Computational Overhead: Iterative mask decoding (in segmentation-driven methods) and multi-stage spatial division can substantially increase inference time on dense or large-scale images (Ting et al., 12 Mar 2024, Xiong et al., 2019).
- Lack of Uncertainty Estimates: Standard evaluation restricts to point error metrics; uncertainty quantification and error bars remain rare (Amini-Naieni et al., 5 Jul 2024).
6. Emerging Directions and Future Research
The field is moving towards:
- Plug-and-Play and Adapter Modules: Lightweight prompt tuning modules that align with both visual and textual structures, enabling continuous adaptation without retraining the entire model (Zhao et al., 24 Apr 2025).
- Richer Prompt Languages: Support for open-ended queries (multi-label, complex attribute constraints) and video object counting via natural language (Amini-Naieni et al., 18 Jun 2025).
- Hybrid Detection-Regression Pipelines: Integrations like DAVE and CountGD blend proposal-based detection with density map regression for improved performance in both sparse and crowded regions (Amini-Naieni et al., 5 Jul 2024, Ciampi et al., 31 Jan 2025).
- Synthetic Data and Unsupervised Training: Models like AFreeCA demonstrate the utility of generative models for annotation-free learning and cross-category adaptation (D'Alessandro et al., 7 Mar 2024).
- Amodal and 3D Counting: Open-set frameworks are moving toward robust amodal counting (occluded, partially visible objects) and multi-view/3D-awareness for volumetric domains (Arib et al., 16 Nov 2025, Amini-Naieni et al., 18 Jun 2025).
- Benchmark Development: New benchmarks (OmniCount-191, VideoCount) are designed to test multi-label, multi-domain, and long-tail open-set performance (Mondal et al., 8 Mar 2024, Amini-Naieni et al., 18 Jun 2025).
7. Theoretical Underpinnings: Open-Set to Closed-Set via Decomposition
Spatial divide-and-conquer paradigms (S-DCNet, SS-DCNet) formally argue that, for decomposable tasks like counting, open-set generalization can be reduced to repeated closed-set inference on subdivided regions. If a local region’s count exceeds the training range, recursive subdivision ensures all predictions occur within the regime seen during training, and the sum restores global coverage (Xiong et al., 2019, Xiong et al., 2020). This approach is mathematically justified to reduce error in high-density patches and has demonstrated empirical success in both crowd and vehicle domains.
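The recursion is simple to state: if a region's predicted count exceeds the closed-set range, split it into quadrants and sum the sub-counts. A sketch of the S-DCNet idea, with an oracle counter standing in for the trained closed-set CNN:

```python
def sdc_count(patch, closed_set_counter, c_max=20.0):
    """If the closed-set counter's estimate for a region exceeds the count
    range seen in training (c_max), split into quadrants and recurse; summing
    the in-range sub-counts restores the global count. Sketch of the S-DCNet
    idea with a callable counter standing in for the trained CNN."""
    c = closed_set_counter(patch)
    h, w = len(patch), len(patch[0])
    if c <= c_max or min(h, w) < 2:  # in-range, or too small to subdivide
        return c
    mh, mw = h // 2, w // 2
    quadrants = [
        [row[:mw] for row in patch[:mh]], [row[mw:] for row in patch[:mh]],
        [row[:mw] for row in patch[mh:]], [row[mw:] for row in patch[mh:]],
    ]
    return sum(sdc_count(q, closed_set_counter, c_max) for q in quadrants)

# Oracle counter over a binary occupancy grid: 100 objects, far beyond c_max,
# so every final prediction is made on a subdivided, in-range region.
grid = [[1] * 10 for _ in range(10)]
print(sdc_count(grid, lambda p: sum(map(sum, p))))  # → 100
```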
Open-set object counting synthesizes advances from few-shot learning, vision-language models, prompt engineering, and foundational detection/segmentation architectures. The field's ongoing focus is on prompt flexibility, generalization under distribution shift, and the elimination of reliance on intensive category-level annotation, all of which are fundamental barriers to scalable scene understanding (Ciampi et al., 31 Jan 2025, Amini-Naieni et al., 5 Jul 2024, Mondal et al., 8 Mar 2024).