
Multi-Modal Open-World Counting

Updated 31 December 2025
  • Multi-modal open-world counting is a framework that uses flexible text descriptions and visual exemplars to specify targets dynamically in images or videos.
  • It integrates vision–language modeling and transformer-based architectures to support category-agnostic counting in cluttered, occluded, and ambiguous environments.
  • Evaluation metrics like MAE, Acc, and RMSE on benchmarks such as FSC-147 highlight both its promising results and its challenges in high-density and ambiguous scenarios.

Multi-modal open-world counting refers to the automated enumeration of object instances in images or videos, where the set of target object categories is not fixed in advance and the specification of “what to count” is provided at inference time through flexible multi-modal prompts: textual descriptions, visual exemplars, or both. This paradigm integrates vision–language modeling, prompt engineering, and robust visual reasoning to enable category-agnostic, user-driven, and context-adaptive counting across arbitrary domains and densely cluttered, occluded, or visually ambiguous scenes. The field encompasses zero-shot, few-shot, and open-set settings, and spans techniques from purely prompt-driven large vision–language models (VLMs) to transformer-based detectors augmented with advanced fusion and adaptation modules.

1. Problem Definition and Metrics

The open-world counting problem is formally posed as follows: given an image $I$ (or video $V$) and a target category $T$ specified by text, visual exemplars, or both, predict the number of visually discernible instances $C$ of $T$ in $I$. The model's estimate is denoted $\hat{C}$. Counting is considered open-world when $T$ may be drawn from an unbounded vocabulary that is either unseen during training or specified only by a prompt at test time (Hou et al., 17 Dec 2025, Amini-Naieni et al., 2024, Amini-Naieni et al., 2023, Amini-Naieni et al., 29 Dec 2025).

Evaluation metrics include:

  • Mean Absolute Error (MAE):

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} \left| C_i - \hat{C}_i \right|$$

where $N$ is the number of test instances.

  • Enumeration Accuracy (Acc):

$$\mathrm{Acc} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\left[\hat{C}_i = C_i\right]$$

measuring the fraction of images with perfectly correct counts.

  • Root Mean Square Error (RMSE) and related dataset-specific or multi-label metrics (e.g., mRMSE for multi-label counts in OmniCount).

Typical experimental scopes limit $C \leq 40$ to avoid regimes where both human and model subitizing fail (Hou et al., 17 Dec 2025).
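For concreteness, the three metrics can be computed as in the short NumPy sketch below; the function and variable names are illustrative and not taken from any cited codebase.

```python
import numpy as np

def counting_metrics(pred_counts, true_counts):
    """MAE, RMSE, and exact-match accuracy for predicted vs. ground-truth counts."""
    pred = np.asarray(pred_counts, dtype=float)
    true = np.asarray(true_counts, dtype=float)
    abs_err = np.abs(true - pred)
    mae = abs_err.mean()                          # mean absolute error
    rmse = np.sqrt(((true - pred) ** 2).mean())   # root mean square error
    acc = float((pred == true).mean())            # fraction of exactly correct counts
    return {"MAE": mae, "RMSE": rmse, "Acc": acc}

# Three test images with ground-truth counts [11, 7, 35] and predictions [12, 7, 40]
print(counting_metrics([12, 7, 40], [11, 7, 35]))
# -> MAE = 2.0, RMSE ≈ 2.94, Acc ≈ 0.33
```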

2. Architectural Approaches

Approaches to multi-modal open-world counting fall into the following technical classes:

  • Specialized Counting Architectures: Traditional models such as PseCo, TFOC, and T2ICount rely on domain-specific representations—e.g., point-level supervision, SAM-inspired segmentation decoders, clustering of mask centroids, or direct regression on global features. These models are typically constrained to a fixed vocabulary and are brittle under open-vocabulary or cluttered conditions (Hou et al., 17 Dec 2025).
  • Vision–Language Models (VLMs): Large-scale multi-modal transformers (e.g., Claude 4.5 Sonnet, Gemini 2.5 Pro, GPT-5, Qwen3-VL) perform counting in a zero-shot regime by jointly processing visual and textual inputs. VLMs exhibit strong transfer to unseen classes, with enumeration accuracy improving when prompted to generate intermediate representations such as locations (bounding boxes) and labels for each counted item (Hou et al., 17 Dec 2025, Füzesséry et al., 2 Dec 2025); a sketch of this structured prompting strategy appears after this list.
  • Unified Open-World Counting Models:
    • CountGD/CountGD++ extend open-vocabulary foundation models (GroundingDINO+Swin–BERT) with promptable fusion modules, allowing target specification by text, exemplars, or combinations, and supporting positive (“what to count”) and negative (“what not to count”) constraints (Amini-Naieni et al., 2024, Amini-Naieni et al., 29 Dec 2025).
    • OmniCount applies a training-free semantic-geometric refinement pipeline combining Side Adapter Networks (CLIP-based), monocular depth priors (Marigold), and Segment Anything Model (SAM)-guided instance mask generation, supporting multi-label counting of arbitrary text-specified categories (Mondal et al., 2024).
    • CounTX implements a single-stage, end-to-end transformer decoder built on pre-trained joint text-image representations (CLIP ViT–B/16), directly regressing density maps from free-form description + image pairs (Amini-Naieni et al., 2023).
  • Prompt Engineering and Adaptation: Structured prompting strategies (e.g., “point, label, & count”) and semantic-driven visual prompt tuning (SDVPT) that adapt generic VLMs or open-vocabulary detectors to the counting task (Hou et al., 17 Dec 2025, Zhao et al., 24 Apr 2025).

These systems are complemented by architectures targeting complex reasoning (e.g., Relational Counting Network in TallyQA (Acharya et al., 2018)) and by adaptations for amodal (occluded) counting (CountOCC (Arib et al., 16 Nov 2025)) and long-form video inputs (CountVid (Amini-Naieni et al., 18 Jun 2025)).
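The structured “point, label, & count” prompting strategy referenced above can be sketched as follows. This is a minimal illustration under assumed interfaces: `query_vlm` is a hypothetical stand-in for whichever VLM API is used, and the JSON schema is illustrative rather than taken from the cited papers.

```python
import json

def count_with_structured_prompt(image_path, target, query_vlm):
    """Ask a VLM to enumerate instances (label + box) before reporting a total.

    `query_vlm(image_path, prompt)` is an assumed callable returning the
    model's raw text response; it is not a real client API.
    """
    prompt = (
        f"Count every instance of '{target}' in the image. "
        "First list each instance as a JSON object with a 'label' and a 'box' "
        "in [x0, y0, x1, y1] pixel coordinates, then report the total. "
        'Answer strictly as JSON: {"items": [...], "count": <int>}.'
    )
    raw = query_vlm(image_path, prompt)
    result = json.loads(raw)
    # Derive the count from the enumerated items rather than the model's
    # self-reported total, making the serial counting strategy explicit.
    return len(result.get("items", []))
```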

3. Prompt Modalities and Fusion Mechanisms

Prompt specification in multi-modal open-world counting utilizes three main modalities:

  • Text Prompts: Natural-language object or attribute descriptors, tokenized and embedded by BERT, CLIP, or similar encoders; these allow access to unseen, arbitrary, or fine-grained categories (Amini-Naieni et al., 2024, Amini-Naieni et al., 2023).
  • Visual Exemplars: Cropped patches or bounding boxes of the target object, drawn from the current image or from external images, processed via shared image encoders and projected to compact “exemplar tokens” (Amini-Naieni et al., 2024, Amini-Naieni et al., 29 Dec 2025).
  • Negative Prompts (CountGD++): Explicit instructions for “what not to count,” specified via negative text and/or negative exemplars, enabling fine-grained, disambiguated counting (Amini-Naieni et al., 29 Dec 2025).
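The three modalities can be bundled into a single query object. The sketch below is a hypothetical data structure for such a multi-modal prompt, not the actual CountGD++ input format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in image coordinates

@dataclass
class CountingPrompt:
    """Illustrative container for a multi-modal counting query."""
    positive_text: str = ""                                    # e.g. "red apples"
    positive_exemplars: List[Box] = field(default_factory=list)
    negative_text: str = ""                                    # e.g. "green apples"
    negative_exemplars: List[Box] = field(default_factory=list)

# Count red apples while explicitly excluding green ones
prompt = CountingPrompt(
    positive_text="red apples",
    positive_exemplars=[(34.0, 120.0, 70.0, 158.0)],
    negative_text="green apples",
)
```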

Fusion pipelines consist of:

  • Transformer-based feature enhancers: self-attention over the joint (text + exemplar) token sequence and cross-attention with image features; a minimal sketch appears after this list.
  • Late-stage cross-modality decoders: top-k query selection by prompt similarity, Hungarian-matched confidence matrices, instance mask or bounding box heads, and density-map regression.
  • Occlusion-robust amodal extensions (CountOCC): hierarchical reconstruction of occluded features via pyramid-level FRMs, spatial context from visible fragments, and visual-equivalence losses enforcing attention alignment across occluded/unoccluded pairs (Arib et al., 16 Nov 2025).
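A minimal PyTorch sketch of the feature-enhancer pattern in the first bullet is given below: self-attention over the concatenated prompt tokens followed by cross-attention from the prompt to the image features. It is a schematic of the general pattern, not a reimplementation of any specific published model.

```python
import torch
import torch.nn as nn

class PromptImageFusion(nn.Module):
    """Schematic fusion block: self-attention over (text + exemplar) tokens,
    then cross-attention from the prompt tokens to flattened image features."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, text_tokens, exemplar_tokens, image_tokens):
        # Joint prompt sequence: [text tokens ; exemplar tokens]
        prompt = torch.cat([text_tokens, exemplar_tokens], dim=1)
        p, _ = self.self_attn(prompt, prompt, prompt)
        prompt = self.norm1(prompt + p)
        # Prompt tokens (queries) attend to image features (keys/values)
        fused, _ = self.cross_attn(prompt, image_tokens, image_tokens)
        return self.norm2(prompt + fused)

# All inputs have shape (batch, num_tokens, dim), e.g.:
# fused = PromptImageFusion()(text_tokens, exemplar_tokens, image_tokens)
```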

Advanced adaptation techniques include pseudo-exemplar bootstrapping (automatically harvesting exemplars from initial text-only passes) and LLM-synthesized exemplars for rare classes (Amini-Naieni et al., 29 Dec 2025).
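Pseudo-exemplar bootstrapping can be summarized by the loop below. The `model.predict` interface and the detection attributes are assumptions for illustration, not the CountGD++ API: a text-only pass produces detections, the most confident of which are recycled as visual exemplars for subsequent passes.

```python
def bootstrap_pseudo_exemplars(image, text_prompt, model, rounds=2,
                               conf_thresh=0.5, max_exemplars=3):
    """Iteratively harvest high-confidence detections as visual exemplars.

    Assumes `model.predict(image, text=..., exemplars=...)` returns detections
    with `.box` and `.score` attributes (a hypothetical interface).
    """
    exemplars = []
    for _ in range(rounds):
        detections = model.predict(image, text=text_prompt, exemplars=exemplars)
        confident = sorted((d for d in detections if d.score >= conf_thresh),
                           key=lambda d: d.score, reverse=True)
        exemplars = [d.box for d in confident[:max_exemplars]]
    final = model.predict(image, text=text_prompt, exemplars=exemplars)
    return len(final)
```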

4. Benchmarks and Datasets

Key datasets underpinning the evaluation of multi-modal open-world counting:

| Dataset | Modalities | Categories | Annotation | Density / Clutter | Notable Use |
|---|---|---|---|---|---|
| FSC-147 | Exemplar + Text | 147 | Points | Varied | Standard open-world image counting (Hou et al., 17 Dec 2025, Amini-Naieni et al., 2024) |
| FSCD-LVIS | Text | 800+ | Points | Extreme | Long-tailed, occlusion & ambiguity (Hou et al., 17 Dec 2025) |
| SolidCount | N/A (synthetic) | 8 shapes × 8 colors | Points | Synthetic, controlled | Visual factor analysis: shape, color, clutter (Hou et al., 17 Dec 2025) |
| OmniCount-191 | Text / Multi | 191 | Points, boxes, segmentation, VQA | Urban, satellite, retail, etc. | Multi-label, open-vocabulary (Mondal et al., 2024) |
| CARPK; PUCPR+ | Text / Exemplar | 1 | Boxes | Densely parked cars | Drone aerial, domain transfer (Zhao et al., 24 Apr 2025) |
| VideoCount | Text / Exemplar | 100+ | Per-frame tracks | Crowded, scientific | Video instance-level, occlusion (Amini-Naieni et al., 18 Jun 2025) |
| FSC-147-D | Text | 147 | Free-form descriptions | Varied | Text prompt variation (Amini-Naieni et al., 2023) |
| TallyQA | Text | VQA splits | QA | Simple / Complex | Attribute / relational queries (Acharya et al., 2018) |

5. Quantitative and Qualitative Results

Recent findings highlight:

  • VLMs (Gemini 2.5 Pro, GPT-5) with structured prompting (“point, label, & count”) achieve SOTA MAE and accuracy on FSC-147 and SolidCount, outperforming specialized architectures, with MAE as low as 2.43 on FSC-147 (Gemini) and structured prompt enumeration accuracy up to 0.80 (SolidCount, Gemini) (Hou et al., 17 Dec 2025).
  • Specialist counters, e.g. PseCo, exhibit catastrophic failure under high background clutter (SolidCount, checkerboard: MAE explodes from 101 to 406), while VLMs and unified multi-modal models are largely robust.
  • Ablations confirm that enumeration accuracy in VLMs is significantly boosted by inducing the generation of intermediate object locations and labels—an explicit serial counting strategy.
  • Generalization beyond the human subitizing range ($C > 40$) remains challenging for all models.
  • Negative prompt specification and iterative pseudo-exemplar selection in CountGD++ enhance discrimination and reduce false positives, yielding improvements for both images and cross-frame video tasks (Amini-Naieni et al., 29 Dec 2025).
  • Occlusion-robust frameworks (CountOCC) deliver up to 26.7% and 49.9% MAE reduction on occluded FSC-147 and CARPK, respectively (Arib et al., 16 Nov 2025).
  • Multi-label frameworks (OmniCount) enable efficient single-pass, open-vocabulary counting for hundreds of categories with competitive or better mRMSE than prior methods (e.g., 0.70 on OmniCount-191) (Mondal et al., 2024).

Examples of model-specific performance:

| Model | FSC-147 MAE (test) | SolidCount Acc | FSCD-LVIS MAE |
|---|---|---|---|
| Gemini (VLM) | 2.43 | 0.80 | 14.66 |
| PseCo (specialist) | 31.77 | 0.06 | 108.46 |
| GPT-5 (VLM) | 3.46 | 0.21 | 15.86 |
| CountGD (text + exemplar) | 5.74 (corr.) | — | — |

6. Limitations and Open Challenges

  • VLMs and multi-modal counting systems degrade significantly for highly ambiguous categories, heavy occlusion, or dense, repetitive patterns (e.g., thousands of Lego pieces) (Füzesséry et al., 2 Dec 2025, Hou et al., 17 Dec 2025).
  • Contemporary approaches struggle with counts beyond the subitizing regime ($C > 40$), with enumeration accuracy dropping sharply (Hou et al., 17 Dec 2025).
  • Negative prompt mechanisms require clear semantic distinctions; ambiguous or composite negative prompts may reduce recall or misclassify borderline cases (Amini-Naieni et al., 29 Dec 2025).
  • Occlusion amodal counting models (CountOCC) are optimized for total counts rather than pixel-exact localization, so dense maps under occluders may not precisely match real object centers (Arib et al., 16 Nov 2025).
  • Prompt- and LLM-based counting is sensitive to prompt construction, including verbosity, description specificity, and the inclusion of structured enumeration instructions (Füzesséry et al., 2 Dec 2025).
  • Models relying solely on text prompts often cannot resolve fine-scale visual ambiguities or attribute-based distinctions, motivating multi-modal fusion and promptable patch guidance (Amini-Naieni et al., 2024, Zhao et al., 24 Apr 2025).
  • Pseudo- and synthetic-exemplar pipelines depend on accurate initial detections and style-aligned exemplar generation, respectively; catastrophic first-pass errors can propagate through the iterative process (Amini-Naieni et al., 29 Dec 2025).

7. Future Directions

Ongoing and suggested research includes:

  • Construction of new open-world counting benchmarks with clear, unambiguous annotations, tunable visual complexity, and dual video–image modes (Hou et al., 17 Dec 2025, Amini-Naieni et al., 18 Jun 2025).
  • Integration of explicit symbolic counting procedures or memory buffers within VLMs or transformer decoders to more closely emulate human enumeration (Hou et al., 17 Dec 2025).
  • Systematic development and optimization of prompt engineering strategies (e.g., chain-of-thought, iterative refinement) for visual enumeration tasks (Hou et al., 17 Dec 2025, Füzesséry et al., 2 Dec 2025).
  • End-to-end training of unified architectures that jointly fuse prompt specification, detection, segmentation, tracking, and counting in both static and dynamic settings (Amini-Naieni et al., 18 Jun 2025).
  • Research into uncertainty quantification for open-world counts, critical for deployment in ambiguous or critical domains (Amini-Naieni et al., 2024).
  • Advances in semantic-driven visual prompt tuning (e.g., SDVPT), non-linear prompt aggregation, and cross-modality attention as means to robustly transfer knowledge from seen to unseen categories (Zhao et al., 24 Apr 2025).
  • Incorporation of efficient and robust modalities (super-resolution, low-light enhancement, improved monocular/multi-view depth) for enhanced counting under challenging conditions (Mondal et al., 2024).

In summary, multi-modal open-world counting synthesizes advances in vision–language modeling, prompt adaptation, robust enumeration, and multi-modal scene understanding to provide scalable, category-agnostic, and context-adaptive solutions for counting in unconstrained real-world environments. Continued progress depends on principled integration of multi-modal prompts, robust fusion architectures, and open-domain evaluation methodologies.
