
Multi-Modal Open-World Counting

Updated 31 December 2025
  • Multi-modal open-world counting is a framework that uses flexible text descriptions and visual exemplars to specify targets dynamically in images or videos.
  • It integrates vision–language modeling and transformer-based architectures to support category-agnostic counting in cluttered, occluded, and ambiguous environments.
  • Evaluation metrics like MAE, Acc, and RMSE on benchmarks such as FSC-147 highlight both its promising results and its challenges in high-density and ambiguous scenarios.

Multi-modal open-world counting refers to the automated enumeration of object instances in images or videos, where the set of target object categories is not fixed in advance and the specification of “what to count” is provided at inference time through flexible multi-modal prompts: textual descriptions, visual exemplars, or both. This paradigm integrates vision–language modeling, prompt engineering, and robust visual reasoning to enable category-agnostic, user-driven, and context-adaptive counting across arbitrary domains and densely cluttered, occluded, or visually ambiguous scenes. The field encompasses zero-shot, few-shot, and open-set settings, and spans techniques from purely prompt-driven large vision–language models (VLMs) to transformer-based detectors augmented with advanced fusion and adaptation modules.

1. Problem Definition and Metrics

The open-world counting problem is formally posed as follows: given an image $I$ (or video $V$) and a target category $T$ specified by text, visual exemplars, or both, predict the number of visually discernible instances $C$ of $T$ in $I$. The model's estimate is denoted $\hat{C}$. Counting is considered open-world when $T$ may be drawn from an unbounded vocabulary that is either unseen during training or specified only by a prompt at test time (Hou et al., 17 Dec 2025, Amini-Naieni et al., 2024, Amini-Naieni et al., 2023, Amini-Naieni et al., 29 Dec 2025).

Evaluation metrics include:

  • Mean Absolute Error (MAE):

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} \left| C_i - \hat{C}_i \right|$$

where $N$ is the number of test instances.

  • Enumeration Accuracy (Acc):

$$\mathrm{Acc} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\left[\hat{C}_i = C_i\right]$$

measuring the fraction of images with perfectly correct counts.

  • Root Mean Square Error (RMSE) and related dataset-specific or multi-label metrics (e.g., mRMSE for multi-label counts in OmniCount).

Typical experimental scopes limit $C \leq 40$ to avoid regimes where both human and model subitizing fail (Hou et al., 17 Dec 2025).
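For concreteness, the three metrics can be computed as in the short NumPy sketch below; the function and variable names are illustrative and not taken from any cited codebase.

```python
import numpy as np

def counting_metrics(pred_counts, true_counts):
    """MAE, RMSE, and exact-match accuracy for predicted vs. ground-truth counts."""
    pred = np.asarray(pred_counts, dtype=float)
    true = np.asarray(true_counts, dtype=float)
    abs_err = np.abs(true - pred)
    mae = abs_err.mean()                          # mean absolute error
    rmse = np.sqrt(((true - pred) ** 2).mean())   # root mean square error
    acc = float((pred == true).mean())            # fraction of exactly correct counts
    return {"MAE": mae, "RMSE": rmse, "Acc": acc}

# Three test images with ground-truth counts [11, 7, 35] and predictions [12, 7, 40]
print(counting_metrics([12, 7, 40], [11, 7, 35]))
# -> MAE = 2.0, RMSE ≈ 2.94, Acc ≈ 0.33
```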

2. Architectural Approaches

Approaches to multi-modal open-world counting fall into the following technical classes:

  • Specialized Counting Architectures: Traditional models such as PseCo, TFOC, and T2ICount rely on domain-specific representations—e.g., point-level supervision, SAM-inspired segmentation decoders, clustering of mask centroids, or direct regression on global features. These models are typically constrained to a fixed vocabulary and are brittle under open-vocabulary or cluttered conditions (Hou et al., 17 Dec 2025).
  • Vision–Language Models (VLMs): Large-scale multi-modal transformers (e.g., Claude 4.5 Sonnet, Gemini 2.5 Pro, GPT-5, Qwen3-VL) perform counting in a zero-shot regime by jointly processing visual and textual inputs. VLMs exhibit strong transfer to unseen classes, with enumeration accuracy improving when prompted to generate intermediate representations such as locations (bounding boxes) and labels for each counted item (Hou et al., 17 Dec 2025, Füzesséry et al., 2 Dec 2025); a sketch of this structured prompting strategy appears after this list.
  • Unified Open-World Counting Models:
    • CountGD/CountGD++ extend open-vocabulary foundation models (GroundingDINO+Swin–BERT) with promptable fusion modules, allowing target specification by text, exemplars, or combinations, and supporting positive (“what to count”) and negative (“what not to count”) constraints (Amini-Naieni et al., 2024, Amini-Naieni et al., 29 Dec 2025).
    • OmniCount applies a training-free semantic-geometric refinement pipeline combining Side Adapter Networks (CLIP-based), monocular depth priors (Marigold), and Segment Anything Model (SAM)-guided instance mask generation, supporting multi-label counting of arbitrary text-specified categories (Mondal et al., 2024).
    • CounTX implements a single-stage, end-to-end transformer decoder built on pre-trained joint text-image representations (CLIP ViT–B/16), directly regressing density maps from free-form description + image pairs (Amini-Naieni et al., 2023).
  • Prompt Engineering and Adaptation: Structured prompting strategies (e.g., “point, label, & count”) and semantic-driven visual prompt tuning (SDVPT) that adapt generic VLMs or open-vocabulary detectors to the counting task (Hou et al., 17 Dec 2025, Zhao et al., 24 Apr 2025).

These systems are complemented by architectures targeting complex reasoning (e.g., Relational Counting Network in TallyQA (Acharya et al., 2018)) and by adaptations for amodal (occluded) counting (CountOCC (Arib et al., 16 Nov 2025)) and long-form video inputs (CountVid (Amini-Naieni et al., 18 Jun 2025)).
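The structured “point, label, & count” prompting strategy referenced above can be sketched as follows. This is a minimal illustration under assumed interfaces: `query_vlm` is a hypothetical stand-in for whichever VLM API is used, and the JSON schema is illustrative rather than taken from the cited papers.

```python
import json

def count_with_structured_prompt(image_path, target, query_vlm):
    """Ask a VLM to enumerate instances (label + box) before reporting a total.

    `query_vlm(image_path, prompt)` is an assumed callable returning the
    model's raw text response; it is not a real client API.
    """
    prompt = (
        f"Count every instance of '{target}' in the image. "
        "First list each instance as a JSON object with a 'label' and a 'box' "
        "in [x0, y0, x1, y1] pixel coordinates, then report the total. "
        'Answer strictly as JSON: {"items": [...], "count": <int>}.'
    )
    raw = query_vlm(image_path, prompt)
    result = json.loads(raw)
    # Derive the count from the enumerated items rather than the model's
    # self-reported total, making the serial counting strategy explicit.
    return len(result.get("items", []))
```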

3. Prompt Modalities and Fusion Mechanisms

Prompt specification in multi-modal open-world counting utilizes three main modalities:

  • Text Prompts: Natural-language object or attribute descriptors, tokenized and embedded by BERT, CLIP, or similar encoders; these allow access to unseen, arbitrary, or fine-grained categories (Amini-Naieni et al., 2024, Amini-Naieni et al., 2023).
  • Visual Exemplars: Cropped patches or bounding boxes of the target object, drawn from the current image or from external images, processed via shared image encoders and projected to compact “exemplar tokens” (Amini-Naieni et al., 2024, Amini-Naieni et al., 29 Dec 2025).
  • Negative Prompts (CountGD++): Explicit instructions for “what not to count,” specified via negative text and/or negative exemplars, enabling fine-grained, disambiguated counting (Amini-Naieni et al., 29 Dec 2025).
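The three modalities can be bundled into a single query object. The sketch below is a hypothetical data structure for such a multi-modal prompt, not the actual CountGD++ input format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in image coordinates

@dataclass
class CountingPrompt:
    """Illustrative container for a multi-modal counting query."""
    positive_text: str = ""                                    # e.g. "red apples"
    positive_exemplars: List[Box] = field(default_factory=list)
    negative_text: str = ""                                    # e.g. "green apples"
    negative_exemplars: List[Box] = field(default_factory=list)

# Count red apples while explicitly excluding green ones
prompt = CountingPrompt(
    positive_text="red apples",
    positive_exemplars=[(34.0, 120.0, 70.0, 158.0)],
    negative_text="green apples",
)
```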

Fusion pipelines consist of:

  • Transformer-based feature enhancers: self-attention over the joint (text + exemplar) token sequence and cross-attention with image features; a minimal sketch appears after this list.
  • Late-stage cross-modality decoders: top-k query selection by prompt similarity, Hungarian-matched confidence matrices, instance mask or bounding box heads, and density-map regression.
  • Occlusion-robust amodal extensions (CountOCC): hierarchical reconstruction of occluded features via pyramid-level FRMs, spatial context from visible fragments, and visual-equivalence losses enforcing attention alignment across occluded/unoccluded pairs (Arib et al., 16 Nov 2025).
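A minimal PyTorch sketch of the feature-enhancer pattern in the first bullet is given below: self-attention over the concatenated prompt tokens followed by cross-attention from the prompt to the image features. It is a schematic of the general pattern, not a reimplementation of any specific published model.

```python
import torch
import torch.nn as nn

class PromptImageFusion(nn.Module):
    """Schematic fusion block: self-attention over (text + exemplar) tokens,
    then cross-attention from the prompt tokens to flattened image features."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, text_tokens, exemplar_tokens, image_tokens):
        # Joint prompt sequence: [text tokens ; exemplar tokens]
        prompt = torch.cat([text_tokens, exemplar_tokens], dim=1)
        p, _ = self.self_attn(prompt, prompt, prompt)
        prompt = self.norm1(prompt + p)
        # Prompt tokens (queries) attend to image features (keys/values)
        fused, _ = self.cross_attn(prompt, image_tokens, image_tokens)
        return self.norm2(prompt + fused)

# All inputs have shape (batch, num_tokens, dim), e.g.:
# fused = PromptImageFusion()(text_tokens, exemplar_tokens, image_tokens)
```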

Advanced adaptation techniques include pseudo-exemplar bootstrapping (automatically harvesting exemplars from initial text-only passes) and LLM-synthesized exemplars for rare classes (Amini-Naieni et al., 29 Dec 2025).
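Pseudo-exemplar bootstrapping can be summarized by the loop below. The `model.predict` interface and the detection attributes are assumptions for illustration, not the CountGD++ API: a text-only pass produces detections, the most confident of which are recycled as visual exemplars for subsequent passes.

```python
def bootstrap_pseudo_exemplars(image, text_prompt, model, rounds=2,
                               conf_thresh=0.5, max_exemplars=3):
    """Iteratively harvest high-confidence detections as visual exemplars.

    Assumes `model.predict(image, text=..., exemplars=...)` returns detections
    with `.box` and `.score` attributes (a hypothetical interface).
    """
    exemplars = []
    for _ in range(rounds):
        detections = model.predict(image, text=text_prompt, exemplars=exemplars)
        confident = sorted((d for d in detections if d.score >= conf_thresh),
                           key=lambda d: d.score, reverse=True)
        exemplars = [d.box for d in confident[:max_exemplars]]
    final = model.predict(image, text=text_prompt, exemplars=exemplars)
    return len(final)
```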

4. Benchmarks and Datasets

Key datasets underpinning the evaluation of multi-modal open-world counting:

| Dataset | Modalities | Categories | Annotation | Density / Clutter | Notable Use |
|---|---|---|---|---|---|
| FSC-147 | Exemplar + Text | 147 | Points | Varied | Standard open-world image counting (Hou et al., 17 Dec 2025, Amini-Naieni et al., 2024) |
| FSCD-LVIS | Text | 800+ | Points | Extreme | Long-tailed, occlusion & ambiguity (Hou et al., 17 Dec 2025) |
| SolidCount | N/A (synthetic) | 8 shapes × 8 colors | Points | Synthetic, controlled | Visual factor analysis: shape, color, clutter (Hou et al., 17 Dec 2025) |
| OmniCount-191 | Text / Multi | 191 | Points, boxes, segmentation, VQA | Urban, satellite, retail, etc. | Multi-label, open-vocabulary (Mondal et al., 2024) |
| CARPK; PUCPR+ | Text / Exemplar | 1 | Boxes | Densely parked cars | Drone aerial, domain transfer (Zhao et al., 24 Apr 2025) |
| VideoCount | Text / Exemplar | 100+ | Per-frame tracks | Crowded, scientific | Video instance-level, occlusion (Amini-Naieni et al., 18 Jun 2025) |
| FSC-147-D | Text | 147 | Free-form descriptions | Varied | Text prompt variation (Amini-Naieni et al., 2023) |
| TallyQA | Text | VQA splits | QA | Simple / Complex | Attribute / relational queries (Acharya et al., 2018) |

5. Quantitative and Qualitative Results

Recent findings highlight:

  • VLMs (Gemini 2.5 Pro, GPT-5) with structured prompting (“point, label, & count”) achieve SOTA MAE and accuracy on FSC-147 and SolidCount, outperforming specialized architectures, with MAE as low as 2.43 on FSC-147 (Gemini) and structured prompt enumeration accuracy up to 0.80 (SolidCount, Gemini) (Hou et al., 17 Dec 2025).
  • Specialist counters, e.g. PseCo, exhibit catastrophic failure under high background clutter (SolidCount, checkerboard: MAE explodes from 101 to 406), while VLMs and unified multi-modal models are largely robust.
  • Ablations confirm that enumeration accuracy in VLMs is significantly boosted by inducing the generation of intermediate object locations and labels—an explicit serial counting strategy.
  • Generalization beyond the human subitizing range ($C > 40$) remains challenging for all models.
  • Negative prompt specification and iterative pseudo-exemplar selection in CountGD++ enhance discrimination and reduce false positives, yielding improvements for both images and cross-frame video tasks (Amini-Naieni et al., 29 Dec 2025).
  • Occlusion-robust frameworks (CountOCC) deliver up to 26.7% and 49.9% MAE reduction on occluded FSC-147 and CARPK, respectively (Arib et al., 16 Nov 2025).
  • Multi-label frameworks (OmniCount) enable efficient single-pass, open-vocabulary counting for hundreds of categories with competitive or better mRMSE than prior methods (e.g., 0.70 on OmniCount-191) (Mondal et al., 2024).

Examples of model-specific performance:

| Model | FSC-147 MAE (test) | SolidCount Acc | FSCD-LVIS MAE |
|---|---|---|---|
| Gemini (VLM) | 2.43 | 0.80 | 14.66 |
| PseCo (specialist) | 31.77 | 0.06 | 108.46 |
| GPT-5 (VLM) | 3.46 | 0.21 | 15.86 |
| CountGD (text + exemplar) | 5.74 (corr.) | — | — |

6. Limitations and Open Challenges

  • VLMs and multi-modal counting systems degrade significantly for highly ambiguous categories, heavy occlusion, or dense, repetitive patterns (e.g., thousands of Lego pieces) (Füzesséry et al., 2 Dec 2025, Hou et al., 17 Dec 2025).
  • Contemporary approaches struggle with counts beyond the subitizing regime ($C > 40$), with enumeration accuracy dropping sharply (Hou et al., 17 Dec 2025).
  • Negative prompt mechanisms require clear semantic distinctions; ambiguous or composite negative prompts may reduce recall or misclassify borderline cases (Amini-Naieni et al., 29 Dec 2025).
  • Occlusion amodal counting models (CountOCC) are optimized for total counts rather than pixel-exact localization, so dense maps under occluders may not precisely match real object centers (Arib et al., 16 Nov 2025).
  • Prompt- and LLM-based counting is sensitive to prompt construction, including verbosity, description specificity, and the inclusion of structured enumeration instructions (Füzesséry et al., 2 Dec 2025).
  • Models relying solely on text prompts often cannot resolve fine-scale visual ambiguities or attribute-based distinctions, motivating multi-modal fusion and promptable patch guidance (Amini-Naieni et al., 2024, Zhao et al., 24 Apr 2025).
  • Pseudo- and synthetic-exemplar pipelines depend on accurate initial detections and style-aligned exemplar generation, respectively; catastrophic first-pass errors can propagate through the iterative process (Amini-Naieni et al., 29 Dec 2025).

7. Future Directions

Ongoing and suggested research includes:

  • Construction of new open-world counting benchmarks with clear, unambiguous annotations, tunable visual complexity, and dual video–image modes (Hou et al., 17 Dec 2025, Amini-Naieni et al., 18 Jun 2025).
  • Integration of explicit symbolic counting procedures or memory buffers within VLMs or transformer decoders to more closely emulate human enumeration (Hou et al., 17 Dec 2025).
  • Systematic development and optimization of prompt engineering strategies (e.g., chain-of-thought, iterative refinement) for visual enumeration tasks (Hou et al., 17 Dec 2025, Füzesséry et al., 2 Dec 2025).
  • End-to-end training of unified architectures that jointly fuse prompt specification, detection, segmentation, tracking, and counting in both static and dynamic settings (Amini-Naieni et al., 18 Jun 2025).
  • Research into uncertainty quantification for open-world counts, critical for deployment in ambiguous or critical domains (Amini-Naieni et al., 2024).
  • Advances in semantic-driven visual prompt tuning (e.g., SDVPT), non-linear prompt aggregation, and cross-modality attention as means to robustly transfer knowledge from seen to unseen categories (Zhao et al., 24 Apr 2025).
  • Incorporation of efficient and robust modalities (super-resolution, low-light enhancement, improved monocular/multi-view depth) for enhanced counting under challenging conditions (Mondal et al., 2024).

In summary, multi-modal open-world counting synthesizes advances in vision–language modeling, prompt adaptation, robust enumeration, and multi-modal scene understanding to provide scalable, category-agnostic, and context-adaptive solutions for counting in unconstrained real-world environments. Continued progress depends on principled integration of multi-modal prompts, robust fusion architectures, and open-domain evaluation methodologies.
