Count Anything

Published 29 May 2026 in cs.CV | (2605.30846v1)

Abstract: Object counting remains fragmented across domain-specific datasets and task formulations, despite rapid progress in generalist vision models. Existing counting models are often tailored to scenarios such as crowds, vehicles, cells, crops, or remote-sensing objects, and thus struggle to generalize across categories, visual domains, object scales, and density distributions. In this paper, we study text-guided object counting across domains, where a model takes an image and a natural-language query as input and returns an instance-grounded set of target points whose cardinality gives the count. This formulation unifies category-conditioned counting with interpretable spatial localization. To support this setting, we construct CLOC, a Cross-domain Large-scale Object Counting dataset that reorganizes diverse public data sources into a unified benchmark. CLOC covers six visual domains: General Scene, Remote Sensing, Histopathology, Cellular Microscopy, Agriculture, and Microbiology, with about 220K images, 619 categories, and 15M object instances. Based on CLOC, we propose Count Anything, a generalist model for text-guided object counting. Unlike density-map-based methods, which dominate counting models, Count Anything adopts discrete instance points and performs dual-granularity instance enumeration. A Region-level Sparse Counter provides object-level anchors for large and sparse targets, while a Pixel-level Dense Counter handles small, crowded, and weakly bounded targets via dense point prediction. A point-centric supervision strategy enables learning from heterogeneous annotations, and Complementary Count Fusion combines both counters in a parameter-free manner. Extensive experiments show that Count Anything achieves strong accuracy and multi-domain generalization, outperforming existing open-world counting methods. Code is available at: https://github.com/Mengqi-Lei/count-anything.

Abstract PDF Upgrade to Chat

Authors (7)

Summary

The paper presents a unified text-guided counting task using a dual-branch model that combines region-level sparse and pixel-level dense counting to reduce error.
The paper introduces CLOC, a large-scale benchmark with 220,000 images, 619 categories, and 15 million instances for robust cross-domain evaluation.
The paper demonstrates that Complementary Count Fusion effectively merges outputs from both branches, enhancing interpretability and cross-domain stability.

Count Anything: A Generalist Framework for Cross-Domain Text-Guided Object Counting

Introduction and Motivation

"Count Anything" (2605.30846) addresses the persistent fragmentation in object counting research caused by the prevalence of domain-specific datasets and specialized models. While vision foundation models have demonstrated robust cross-domain generalization in detection and segmentation, object counting remains tied to closed scenarios, limiting transferability across scales, categories, and density regimes. This work proposes a unified text-guided instance-grounded counting task, coupled with CLOC—a large-scale, multi-source, cross-domain object counting benchmark designed to span General Scene, Remote Sensing, Histopathology, Cellular Microscopy, Agriculture, and Microbiology. The framework introduces a dual-branch modeling strategy, integrating region-level sparse enumeration and pixel-level dense recall, to bridge the representation gap encountered in current counting paradigms.

CLOC: Cross-Domain Large-Scale Object Counting Benchmark

The CLOC dataset is constructed via systematic reorganization of diverse public detection, segmentation, and counting datasets into a unified category-conditional counting format. CLOC comprises approximately 220,000 images, 619 categories, and 15 million annotated object instances, covering a broad array of visual domains and target-count densities.

Figure 1: Representative visual examples from the six visual domains of CLOC, illustrating substantial heterogeneity in domain appearance, density, and annotation.

Notably, CLOC harmonizes heterogeneous annotation protocols through rigorous instance audit and conversion, ensuring one-instance-one-count consistency. Its statistical distribution exhibits a marked long-tail in both category occurrence and target counts, critical for robust generalization evaluation.

Figure 2: Category-level image frequency distribution for the 80 most common categories, color-coded by domain and plotted on a logarithmic scale.

CLOC encodes text-guided counting as inference over (image, query) pairs, expecting the model to enumerate all instances matching the provided text query—a concrete instantiation of open-vocabulary counting. The dataset construction pipeline encompasses data aggregation, countability auditing, instance format normalization, category unification, stratified sample augmentation (cropping, stitching), and a split strategy explicitly enforcing non-overlapping semantic category groups for unseen evaluation scenarios.

Figure 3: Overview of the CLOC construction pipeline, visualizing sequential stages of annotation harmonization, category consolidation, and splitting.

Model Architecture: The Count Anything Framework

The proposed model employs a dual-granularity instance enumeration strategy. Rather than relying on monolithic density maps or bounding-box regression, it predicts a set of spatially grounded instance points, supporting both cardinality estimation and interpretable localization.

Figure 4: Overall framework of Count Anything, integrating a Text-Conditioned Encoder with dual counting branches and Complementary Count Fusion.

Text-Conditioned Encoder: Adopts the SAM3 image-language backbone, extracting category-conditioned multi-scale visual features.
Region-level Sparse Counter (RSC): DETR-style region-query decoder, suitable for large, sparse, and well-bounded objects; outputs candidate boxes whose centers serve as instance points.
Pixel-level Dense Counter (PDC): Pixelwise anchor-grid-based dense enumeration, suited to small, crowded, and weakly bounded objects; directly regresses offset instance points.
Complementary Count Fusion (CCF): Parameter-free merging rule that reconciles the output of both branches, suppressing duplicates via local confidence-based comparison, and preserving complementary detections (see Figure 5).
Figure 5: Qualitative visualization of Complementary Count Fusion showing the interplay and suppression of overlapping duplicate predictions.

Training utilizes point-centric supervision to accommodate heterogeneous input annotation regimes—leveraging available bounding boxes for regression, relying on point annotations otherwise, and employing soft quality-based classification targets in region-specific branches.

Empirical Evaluation and Analysis

Count Anything is evaluated on CLOC and benchmarked against state-of-the-art text-guided and few-shot counting approaches as well as generalist detection/segmentation models. Experimental results demonstrate a pronounced margin in accuracy and cross-domain stability.

Figure 6: Qualitative comparison of counting predictions by Count Anything and baselines, illustrating closer alignment to ground truth across domains and densities.

On the full CLOC test set, Count Anything achieves MAE 9.34, RMSE 33.34, and NAE 0.75—representing a drastic reduction in error compared to competitive methods including CLIP-Count, VLCounter, and foundation model-driven detectors. This advantage persists across challenging domains (e.g., Histopathology, Microbiology) and high-density crowd counting, where pure detection and density-estimation models experience marked degradation.

Ablations underscore the necessity of both the RSC and PDC branches, as omitting either leads to significant performance drops. The effectiveness of Complementary Count Fusion is empirically supported—where direct union of prediction sets yields increased duplication errors.

Scaling analysis reveals that data size and domain diversity in CLOC linearly enhance generalist model robustness, affirming CLOC as an essential resource for future research.

Implications and Future Directions

The unified paradigm and architectural design in "Count Anything" represent a decisive shift toward open-world, interpretable, and generalizable counting. The point-set output format enables integrated support for downstream tasks such as instance selection, localization error diagnosis, and interactive counting correction. The decoupling of region-level and pixel-level predictors positions the model to adapt to the inherently multi-scale and heterogeneous nature of real counting applications.

Practically, this framework enables robust, user-customizable counting in domains previously inaccessible to specialist models, notably in scientific imaging and remote observation. Theoretically, the formulation highlights the necessity for hybrid spatial representations and revisits instance definition and supervision in multi-source, weakly labeled settings.

Potential directions include:

Enhanced prompt conditioning leveraging large multimodal LLMs to better resolve semantic ambiguity,
Incorporation of uncertainty estimates and active error-correction,
More adaptive fusion mechanisms, potentially leveraging learned cross-branch attention rather than parameter-free heuristics,
Expansion of CLOC with additional fine-grained categories, advanced domain balancing, or more natural-language-rich query scenarios.

Conclusion

"Count Anything" establishes both a methodological and data-centric foundation for generalist, instance-grounded text-guided counting. Through CLOC and its associated dual-granularity framework, the work sets a new standard for cross-domain, category-agnostic counting and paves the way for developments in open-world visual understanding, real-world automation, and systematic benchmarking of foundation models for counting tasks.

Markdown Report Issue