Region-level Sparse Counter (RSC) Overview
- Region-level Sparse Counter (RSC) is a sparse, transformer-based module that anchors object counting by detecting discrete, high-confidence spatial points for large objects.
- It employs a DETR-style region decoding pipeline with learnable queries and cross-attention to predict object centers without using handcrafted anchors.
- Integrated with a dense counter via parameter-free complementary fusion, RSC enhances precision and generalizability across diverse counting scenarios.
The Region-level Sparse Counter (RSC) is a key architectural component within the dual-granularity enumeration design of the Count Anything model, developed for text-guided object counting across diverse domains. RSC operates as a sparse, object-level branch that anchors the counting process for large and well-separated objects by producing discrete, high-confidence spatial points tied to individual object instances. The method employs a DETR-style region decoding pipeline leveraging learnable queries, transformer-based cross-attention, and multi-headed prediction to yield interpretable, instance-grounded outputs. RSC is paired with a Pixel-level Dense Counter (PDC), and their predictions are reconciled through a parameter-free Complementary Count Fusion step to improve generalization and flexibility over heterogeneous counting scenarios (Lei et al., 29 May 2026).
1. Architectural Overview
The RSC module forms one half of Count Anything's dual-branch architecture for object counting. Given an input image and a text query, a text-conditioned visual backbone is built from a frozen SAM3 encoder for vision features and learned LoRA adapters for lightweight cross-modal fusion. The visual backbone and a text encoder yield multi-scale features and a query embedding, which a cross-modal encoder fuses into an object-level feature map.
Within RSC, a set of learnable region queries, represented as vector embeddings, is processed by Transformer decoder layers with cross-attention over the object-level feature map. This yields one decoder output per query, and each output is fed individually to parallel prediction heads:
- The classification head predicts a foreground logit
- The regression head yields a normalized bounding-box vector via elementwise sigmoid activation
- Both heads are parameterized by MLPs or linear layers
The confidence score of each prediction determines filtering at inference time. The predicted object location is the center of the predicted bounding box. No hand-crafted anchors are used; the region queries act as anchors that are learned during training.
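The decoding step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the confidence is the sigmoid of the foreground logit and that the box vector is ordered (cx, cy, w, h), neither of which is confirmed by the text. The names `decode_queries` and `score_threshold` are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_queries(logits, raw_boxes, score_threshold=0.5):
    """Turn per-query head outputs into filtered point predictions.

    logits:    one foreground logit per query.
    raw_boxes: one 4-vector of pre-sigmoid box values per query,
               assumed (hypothetically) to be ordered (cx, cy, w, h).
    """
    kept = []
    for logit, raw in zip(logits, raw_boxes):
        score = sigmoid(logit)  # assumption: confidence = sigmoid(logit)
        if score < score_threshold:
            continue  # low-confidence queries are filtered out
        cx, cy, w, h = (sigmoid(v) for v in raw)  # normalized to (0, 1)
        kept.append({"center": (cx, cy), "box": (cx, cy, w, h), "score": score})
    return kept


predictions = decode_queries([2.0, -2.0], [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]])
```

Only the first query survives the filter here, and its counted location is the box center rather than the box itself.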
2. One-to-One Supervision and Matching
Supervision for region queries applies a bipartite matching strategy between the predicted centers and the ground-truth counting points. Hungarian matching enforces the assignment by minimizing a pairwise matching cost governed by two weighting coefficients. The resulting assignment set ensures that each ground-truth instance maps to at most one predicted region and vice versa. This one-to-one assignment enables precise object-level supervision, reducing duplicate counts and improving spatial correspondence.
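To make the one-to-one assignment concrete, here is a small sketch. The cost used below (a weighted sum of a confidence term and an L1 center distance) is an assumption chosen for illustration, since the exact cost is not recoverable from the text. An exhaustive search over assignments is swapped in for the Hungarian algorithm: it returns the same minimum-cost matching but is only practical for tiny inputs. `match_one_to_one`, `w_cls`, and `w_pt` are hypothetical names.

```python
from itertools import permutations


def match_one_to_one(pred_centers, pred_scores, gt_points, w_cls=1.0, w_pt=1.0):
    """Return the minimum-cost list of (pred_index, gt_index) pairs.

    Brute force stands in for the Hungarian algorithm; both find the
    optimal one-to-one assignment, but this version is exponential.
    """
    n_pred, n_gt = len(pred_centers), len(gt_points)
    if n_gt > n_pred:
        raise ValueError("this sketch assumes at least as many queries as ground-truth points")

    def cost(i, j):
        # Hypothetical cost: low confidence and large L1 distance are both penalized.
        (px, py), (gx, gy) = pred_centers[i], gt_points[j]
        return w_cls * (1.0 - pred_scores[i]) + w_pt * (abs(px - gx) + abs(py - gy))

    best_pairs, best_cost = [], float("inf")
    for perm in permutations(range(n_pred), n_gt):
        total = sum(cost(i, j) for j, i in enumerate(perm))
        if total < best_cost:
            best_cost = total
            best_pairs = [(i, j) for j, i in enumerate(perm)]
    return best_pairs


matches = match_one_to_one(
    pred_centers=[(0.1, 0.1), (0.9, 0.9), (0.5, 0.5)],
    pred_scores=[0.9, 0.8, 0.7],
    gt_points=[(0.88, 0.92), (0.12, 0.10)],
)
```

The third query is left unmatched, so under the scheme described above it would be supervised as background.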
3. Region-level Losses and Optimization
RSC employs three principal losses for matched predictions:
- Point Localization Loss: Penalizes the distance between predicted and ground-truth centers.
- Box Regression and GIoU Loss (when a ground-truth box exists): Both terms are gated by an indicator of ground-truth box presence, so instances annotated only with points contribute nothing to them.
- Classification Loss with Quality-aware Soft Targets: For each matched pair, an "effective" region is either the ground-truth box (if available) or an auxiliary box centered on the ground-truth point (if not). The coverage quality is the fractional intersection of the predicted box with this effective region. A soft foreground target is constructed from the coverage quality using two constants, and the classification loss is binary cross-entropy against this target.
Unmatched queries receive background labels. The total RSC loss is a weighted sum of the point, box regression, GIoU, and classification terms, with one weight per term.
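The box regression and GIoU terms above can be illustrated with the standard generalized IoU definition. This sketch assumes boxes in normalized (cx, cy, w, h) form with positive width and height, and it assumes an L1 penalty for the regression term (as in DETR); the source does not confirm either choice. `box_losses` and `has_box` are hypothetical names.

```python
def to_corners(box):
    """Convert (cx, cy, w, h) to (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2


def giou(box_a, box_b):
    """Generalized IoU of two boxes with positive area; value lies in (-1, 1]."""
    ax0, ay0, ax1, ay1 = to_corners(box_a)
    bx0, by0, bx1, by1 = to_corners(box_b)
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(ax1, bx1) - min(ax0, bx0)) * (max(ay1, by1) - min(ay0, by0))
    return inter / union - (hull - union) / hull


def box_losses(pred_box, gt_box, has_box):
    """Return (regression_loss, giou_loss), both zero when no ground-truth box exists."""
    if not has_box:
        return 0.0, 0.0  # the presence indicator gates both terms
    l1 = sum(abs(p - g) for p, g in zip(pred_box, gt_box))  # assumed L1 penalty
    return l1, 1.0 - giou(pred_box, gt_box)


same = giou((0.5, 0.5, 0.2, 0.2), (0.5, 0.5, 0.2, 0.2))
apart = giou((0.25, 0.5, 0.1, 0.1), (0.75, 0.5, 0.1, 0.1))
```

Unlike plain IoU, the GIoU of two disjoint boxes is negative and grows as they move closer, so the loss still provides a useful signal before the boxes overlap.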
4. Inference, Filtering, and Complementary Fusion
At inference, RSC outputs are filtered by a confidence threshold. Duplicate regions are merged via intersection-over-minimum (IoM)-based non-maximum suppression (NMS) with a fixed IoM threshold. The remaining RSC-detected centers are then combined with the independent, densely predicted points from PDC.
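A minimal sketch of IoM-based suppression, assuming the usual greedy NMS procedure and corner-format boxes (x0, y0, x1, y1). The threshold value is not given in the text, so `iom_threshold` is left as a required, hypothetical parameter.

```python
def iom(a, b):
    """Intersection over the smaller box's area, for (x0, y0, x1, y1) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    smaller = min((a[2] - a[0]) * (a[3] - a[1]), (b[2] - b[0]) * (b[3] - b[1]))
    if smaller <= 0.0:
        return 0.0  # degenerate boxes never suppress anything
    return (inter_w * inter_h) / smaller


def iom_nms(boxes, scores, iom_threshold):
    """Greedy NMS: visit boxes by descending score, drop those overlapping a kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iom(boxes[i], boxes[k]) <= iom_threshold for k in keep):
            keep.append(i)
    return keep


kept = iom_nms(
    boxes=[(0.0, 0.0, 1.0, 1.0), (0.2, 0.2, 0.6, 0.6), (2.0, 2.0, 3.0, 3.0)],
    scores=[0.9, 0.8, 0.7],
    iom_threshold=0.5,
)
```

The second box is fully nested inside the first, so its IoM is 1 even though its IoU is only about 0.16. This is why IoM is a natural criterion for removing part-of-object duplicates that IoU-based NMS would keep.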
Complementary Count Fusion processes each RSC box in turn: it locates the PDC point inside the box that lies nearest the box center, compares its confidence with that of the RSC detection, and suppresses whichever of the two has the lower confidence. Other PDC points in the region are retained. The union of retained RSC centers and surviving PDC points forms the final set of object instance locations, whose cardinality gives the predicted count. This fusion is designed so that clearly bounded large objects are counted exactly once, while recall is preserved in ambiguous, crowded scenes.
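The fusion rule can be sketched as below. Several details are assumptions not fixed by the text: ties go to the RSC detection, an RSC box containing no PDC point is kept as-is, and PDC points already suppressed by an earlier box are ignored by later ones. The function name `fuse_counts` and the dictionary layout are hypothetical.

```python
def fuse_counts(rsc_dets, pdc_points):
    """Merge sparse and dense predictions into one list of counted locations.

    rsc_dets:   dicts with "box" as (x0, y0, x1, y1) and "score".
    pdc_points: dicts with "xy" as (x, y) and "score".
    """
    alive = [True] * len(pdc_points)
    kept_centers = []
    for det in rsc_dets:
        x0, y0, x1, y1 = det["box"]
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        inside = [
            j for j, p in enumerate(pdc_points)
            if alive[j] and x0 <= p["xy"][0] <= x1 and y0 <= p["xy"][1] <= y1
        ]
        if not inside:
            kept_centers.append((cx, cy))  # assumption: nothing to reconcile
            continue
        nearest = min(
            inside,
            key=lambda j: (pdc_points[j]["xy"][0] - cx) ** 2 + (pdc_points[j]["xy"][1] - cy) ** 2,
        )
        if det["score"] >= pdc_points[nearest]["score"]:
            alive[nearest] = False         # RSC wins: drop the duplicate PDC point
            kept_centers.append((cx, cy))
        # Otherwise the PDC point wins and the RSC detection is dropped.
    return kept_centers + [p["xy"] for j, p in enumerate(pdc_points) if alive[j]]


locations = fuse_counts(
    rsc_dets=[
        {"box": (0, 0, 10, 10), "score": 0.9},
        {"box": (20, 20, 30, 30), "score": 0.3},
    ],
    pdc_points=[
        {"xy": (5, 5), "score": 0.6},
        {"xy": (1, 1), "score": 0.5},
        {"xy": (25, 25), "score": 0.8},
        {"xy": (50, 50), "score": 0.7},
    ],
)
count = len(locations)
```

In this example the first RSC detection replaces its nearest PDC point, the second loses to a more confident PDC point, and the other dense points pass through untouched, so no learned parameters are involved at any step.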
5. Training Regime and Hyperparameters
RSC employs a fixed set of learnable queries and the same number of Transformer decoder layers as DETR. Box heads are three-layer MLPs; classification heads are single linear layers. Training uses AdamW with its default β parameters, weight decay, a fixed per-GPU batch size, and cosine learning-rate decay to 10% of the initial rate over the training schedule. The box parameterization constrains all predicted coordinates and sizes to the unit interval, normalized to the image. No hand-crafted anchors are introduced; region queries are learned directly for the counting task. These settings follow the architectural details and experimental settings specified in Count Anything (Lei et al., 29 May 2026).
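The learning-rate schedule mentioned above, cosine decay down to 10% of the initial rate, can be written in a few lines. This is a generic sketch of that schedule shape; the `base_lr` and `total_epochs` values used in the example are placeholders and are not the paper's settings.

```python
import math


def cosine_lr(epoch, total_epochs, base_lr, final_ratio=0.1):
    """Cosine decay from base_lr at epoch 0 to final_ratio * base_lr at the end."""
    progress = min(max(epoch / total_epochs, 0.0), 1.0)  # clamp to [0, 1]
    min_lr = final_ratio * base_lr
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Placeholder values chosen only to show the shape of the curve.
schedule = [cosine_lr(e, total_epochs=100, base_lr=1.0) for e in (0, 50, 100)]
```

The rate starts at the base value, passes through the midpoint between the base and the floor halfway through training, and ends exactly at the 10% floor.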
6. Context, Role, and Significance within Generalist Counting Models
The introduction of RSC as the sparse branch in dual-granularity enumeration addresses a limitation of prior density-map methods, which struggle to scale across object size, density, and heterogeneity. By learning sparse, interpretable object-level anchors alongside the PDC's dense grid-based recall, Count Anything improves stability and spatial correspondence for large and sparsely distributed objects while remaining adaptable to crowded or small-object domains. The modular design, cross-modal conditioning, and parameter-free prediction fusion together enable multi-domain object counting from heterogeneous annotations. This reflects a move toward unified, interpretable, and scalable counting architectures for open-world settings (Lei et al., 29 May 2026).