Region-level Sparse Counter (RSC) Overview

Updated 5 June 2026
  • Region-level Sparse Counter (RSC) is a sparse, transformer-based module that anchors object counting by detecting discrete, high-confidence spatial points for large objects.
  • It employs a DETR-style region decoding pipeline with learnable queries and cross-attention to predict object centers without using handcrafted anchors.
  • Integrated with a dense counter via parameter-free complementary fusion, RSC enhances precision and generalizability across diverse counting scenarios.

The Region-level Sparse Counter (RSC) is a key architectural component within the dual-granularity enumeration design of the Count Anything model, developed for text-guided object counting across diverse domains. RSC operates as a sparse, object-level branch that anchors the counting process for large and well-separated objects by producing discrete, high-confidence spatial points tied to individual object instances. The method employs a DETR-style region decoding pipeline leveraging learnable queries, transformer-based cross-attention, and multi-headed prediction to yield interpretable, instance-grounded outputs. RSC is paired with a Pixel-level Dense Counter (PDC), and their predictions are reconciled through a parameter-free Complementary Count Fusion step to improve generalization and flexibility over heterogeneous counting scenarios (Lei et al., 29 May 2026).

1. Architectural Overview

The RSC module forms one half of Count Anything's dual-branch architecture for object counting. Given an input image $I$ and text query $T$, a text-conditioned visual backbone is constructed using a frozen SAM3 encoder for vision features and learned LoRA adapters for lightweight cross-modal fusion. The visual backbone and a text encoder yield multi-scale features $\{V_l\}_{l=1}^L$ and a query embedding $e_T$, which a cross-modal encoder $\Phi_\text{obj}$ fuses into an object-level feature map $F_\text{obj} \in \mathbb{R}^{H \times W \times C}$.

Within RSC, a set of $Q_r$ learnable region queries $\{q_i^0\}_{i=1}^{Q_r}$, each a vector embedding in $\mathbb{R}^d$, is processed by $L_d$ Transformer decoder layers with cross-attention on the object-level feature map $F_\text{obj}$. Each resulting decoder output is fed independently to parallel prediction heads:

  • The classification head predicts a foreground logit.
  • The regression head yields a normalized bounding-box vector via elementwise sigmoid activation.
  • Both heads are parameterized by MLPs or linear layers.

The confidence score of each prediction determines filtering at inference time. The predicted object location is the center of the predicted bounding box. No hand-crafted anchors are used; the region queries that take their place are learned.
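
The decoding step described above can be illustrated with a minimal sketch. It assumes the confidence score is the sigmoid of the foreground logit and that the box vector is ordered as (cx, cy, w, h); neither detail is confirmed by the text, and `decode_queries` and `score_thresh` are hypothetical names chosen for illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def decode_queries(logits, raw_boxes, score_thresh=0.5):
    """Turn per-query head outputs into filtered point candidates.

    logits:       one foreground logit per region query.
    raw_boxes:    one 4-vector of pre-sigmoid box values per query,
                  assumed here to be ordered (cx, cy, w, h).
    score_thresh: hypothetical confidence threshold (not from the source).
    """
    kept = []
    for logit, raw in zip(logits, raw_boxes):
        score = sigmoid(logit)  # assumption: confidence = sigmoid(logit)
        cx, cy, w, h = (sigmoid(v) for v in raw)  # normalized to [0, 1]
        if score >= score_thresh:
            kept.append({"center": (cx, cy), "box": (cx, cy, w, h), "score": score})
    return kept


# Two queries: the first is confident, the second is filtered out.
candidates = decode_queries([2.0, -3.0], [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]])
```

Only the box center is used as the counted location; the width and height matter later, for duplicate suppression and fusion.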

2. One-to-One Supervision and Matching

Supervision for region queries uses bipartite matching between the predicted centers and the ground-truth counting points. Hungarian matching computes the assignment that minimizes a pairwise matching cost built from two weighted terms. The resulting assignment maps each ground-truth instance to at most one predicted region and vice versa. This one-to-one assignment enables precise object-level supervision, reducing duplicate counts and tightening the spatial correspondence between predictions and instances.
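
To make the one-to-one assignment concrete, here is a small sketch. The cost below (negative confidence plus an L1 point distance, with weights `w_cls` and `w_pt`) is an assumed stand-in, not the cost used in the paper. For brevity the optimum is found by exhaustive search rather than the Hungarian algorithm; both return a minimum-cost assignment, but exhaustive search is only practical for tiny inputs.

```python
from itertools import permutations


def match_cost(pred, gt_point, w_cls=1.0, w_pt=1.0):
    """Assumed cost: reward confidence, penalize L1 distance to the point."""
    (cx, cy), score = pred
    gx, gy = gt_point
    return -w_cls * score + w_pt * (abs(cx - gx) + abs(cy - gy))


def one_to_one_match(preds, gt_points):
    """Return (pred_index, gt_index) pairs with minimum total cost.

    Assumes len(preds) >= len(gt_points), so every ground-truth point is
    matched to a distinct prediction; returns [] otherwise.
    """
    best_pairs, best_cost = [], float("inf")
    for pred_ids in permutations(range(len(preds)), len(gt_points)):
        cost = sum(match_cost(preds[i], gt_points[j]) for j, i in enumerate(pred_ids))
        if cost < best_cost:
            best_cost = cost
            best_pairs = [(i, j) for j, i in enumerate(pred_ids)]
    return best_pairs


preds = [((0.2, 0.2), 0.9), ((0.8, 0.8), 0.8), ((0.5, 0.5), 0.1)]
gt_points = [(0.8, 0.8), (0.2, 0.2)]
pairs = one_to_one_match(preds, gt_points)
unmatched = sorted(set(range(len(preds))) - {i for i, _ in pairs})
```

Predictions left in `unmatched` are the ones that would receive background labels during training.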

3. Region-level Losses and Optimization

RSC employs three principal losses for matched predictions:

  • Point Localization Loss: Penalizes the distance between predicted and ground-truth centers over the matched pairs.

  • Box Regression and GIoU Loss: Applied only to matches for which a ground-truth box exists. An indicator of ground-truth box presence gates both the box regression term and the GIoU term.

  • Classification Loss with Quality-aware Soft Targets: For each match, an "effective" region is either the ground-truth box (if available) or an auxiliary box centered on the ground-truth point (if not). The coverage quality is the fractional intersection of the predicted box with this effective region. A soft foreground target is constructed from the coverage quality using two fixed constants, and the classification loss is the binary cross-entropy between the predicted foreground probability and this soft target.

Unmatched queries receive background labels. The total RSC loss is a weighted sum of the point localization, box regression, GIoU, and classification losses, with one scalar weight per term.
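
Two of the ingredients named above have standard definitions and can be sketched directly: generalized IoU for axis-aligned boxes and binary cross-entropy against a soft target. The corner format (x1, y1, x2, y2), the function names, and the equal default weights in `weighted_total` are assumptions made for illustration; the paper's actual soft-target formula and loss weights are not reproduced here.

```python
import math


def giou(a, b):
    """Generalized IoU of two positive-area boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # Area of the smallest box enclosing both inputs.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (hull - union) / hull


def giou_loss(a, b):
    return 1.0 - giou(a, b)


def soft_bce(prob, target, eps=1e-7):
    """Binary cross-entropy where target may be any value in [0, 1]."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def weighted_total(losses, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of (point, box, GIoU, classification) losses.

    The equal default weights are placeholders, not the paper's values.
    """
    return sum(w * l for w, l in zip(weights, losses))


identical = giou_loss((0, 0, 2, 2), (0, 0, 2, 2))   # perfectly aligned boxes
disjoint = giou_loss((0, 0, 1, 1), (2, 0, 3, 1))    # no overlap at all
```

Unlike plain IoU, the GIoU term still varies for disjoint boxes, which is why it is commonly paired with a coordinate regression loss in DETR-style training.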

4. Inference, Filtering, and Complementary Fusion

At inference, RSC outputs are first filtered by a confidence threshold. Duplicate regions are then merged via non-maximum suppression (NMS) based on intersection-over-minimum (IoM) with a fixed overlap threshold. The remaining RSC-detected centers are combined with the points that PDC predicts independently and densely.
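
The duplicate-removal step can be sketched as a greedy NMS that uses IoM instead of IoU. The threshold value, the corner box format (x1, y1, x2, y2), and the function names are assumptions for illustration.

```python
def iom(a, b):
    """Intersection area divided by the area of the smaller box."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    smaller = min(area_a, area_b)
    return (inter_w * inter_h) / smaller if smaller > 0 else 0.0


def iom_nms(dets, iom_thresh=0.5):
    """Greedy NMS over (box, score) pairs; iom_thresh is a hypothetical value.

    Detections are visited in descending score order, and a detection is
    kept only if its IoM with every already-kept box is at most iom_thresh.
    """
    kept = []
    for box, score in sorted(dets, key=lambda d: d[1], reverse=True):
        if all(iom(box, kept_box) <= iom_thresh for kept_box, _ in kept):
            kept.append((box, score))
    return kept


dets = [
    ((0, 0, 10, 10), 0.9),    # large object
    ((2, 2, 4, 4), 0.6),      # small box nested inside the large one
    ((20, 20, 30, 30), 0.7),  # separate object
]
survivors = iom_nms(dets)
```

The nested box has an IoU of only 0.04 with the large one but an IoM of 1.0, so an IoM criterion removes it where a typical IoU threshold would not.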

Complementary Count Fusion proceeds per RSC box: it locates the PDC point inside the box that lies nearest the box center and suppresses whichever member of that RSC/PDC pair has the lower confidence. Other PDC points in the box are retained. The union of retained RSC centers and surviving PDC points forms the final set of object instance locations, and its cardinality is the predicted count. The fusion is designed so that clearly bounded large objects are counted exactly once while recall is preserved in ambiguous, crowded scenes.
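
The fusion rule above can be sketched as follows. Several details are assumptions rather than facts from the source: boxes use corner format (x1, y1, x2, y2), the two branches' confidences are treated as directly comparable, ties favor the RSC center, and a PDC point already suppressed by one box is ignored by later boxes. `fuse_counts` is a hypothetical name.

```python
def fuse_counts(rsc_dets, pdc_points):
    """Merge sparse and dense predictions into one list of locations.

    rsc_dets:   list of (box, score), box as (x1, y1, x2, y2).
    pdc_points: list of ((x, y), score).
    """
    rsc_keep = [True] * len(rsc_dets)
    pdc_keep = [True] * len(pdc_points)
    for r, (box, r_score) in enumerate(rsc_dets):
        x1, y1, x2, y2 = box
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        # Indices of still-active PDC points that fall inside this box.
        inside = [
            p for p, ((px, py), _) in enumerate(pdc_points)
            if pdc_keep[p] and x1 <= px <= x2 and y1 <= py <= y2
        ]
        if not inside:
            continue
        nearest = min(
            inside,
            key=lambda p: (pdc_points[p][0][0] - cx) ** 2 + (pdc_points[p][0][1] - cy) ** 2,
        )
        # Suppress the lower-confidence member of the pair (ties keep RSC).
        if pdc_points[nearest][1] <= r_score:
            pdc_keep[nearest] = False
        else:
            rsc_keep[r] = False
    locations = [
        ((b[0] + b[2]) / 2, (b[1] + b[3]) / 2)
        for (b, _), keep in zip(rsc_dets, rsc_keep) if keep
    ]
    locations += [pt for (pt, _), keep in zip(pdc_points, pdc_keep) if keep]
    return locations


rsc_dets = [((0, 0, 10, 10), 0.9)]
pdc_points = [((5, 6), 0.4), ((1, 1), 0.8), ((20, 20), 0.7)]
locations = fuse_counts(rsc_dets, pdc_points)
count = len(locations)
```

The procedure has no learned parameters, which matches the description of the fusion as parameter-free.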

5. Training Regime and Hyperparameters

RSC uses a fixed set of learnable queries and the same number of transformer decoder layers as DETR. Box heads are three-layer MLPs; classification heads are single linear layers. Training uses AdamW with its default $\beta$ parameters, weight decay, a fixed per-GPU batch size, and cosine learning-rate decay to 10% of the initial value over the training epochs. The box parameterization constrains all predicted coordinates and sizes to $[0, 1]$, normalized to the image. No hand-crafted anchors are introduced; region queries are learned directly for the counting task. These settings follow the architectural and experimental details specified in Count Anything (Lei et al., 29 May 2026).
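
A cosine schedule that decays to 10% of the initial learning rate can be written in a few lines. The base learning rate and epoch count used in the example are placeholders, not the values from the paper, and `cosine_lr` is a hypothetical helper name.

```python
import math


def cosine_lr(epoch, total_epochs, base_lr, final_ratio=0.1):
    """Cosine decay from base_lr down to final_ratio * base_lr."""
    min_lr = base_lr * final_ratio
    progress = min(max(epoch / total_epochs, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Placeholder settings for illustration only.
base_lr, total_epochs = 1e-4, 100
start = cosine_lr(0, total_epochs, base_lr)
middle = cosine_lr(50, total_epochs, base_lr)
end = cosine_lr(100, total_epochs, base_lr)
```

At the halfway point the rate sits midway between the initial value and the 10% floor, and it never drops below that floor.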

6. Context, Role, and Significance within Generalist Counting Models

The introduction of RSC as the sparse branch in dual-granularity enumeration addresses limitations of prior density-map methods, which struggle to scale across object size, density, and heterogeneity. By learning sparse, interpretable object-level anchors alongside the PDC's dense, grid-based recall, Count Anything achieves more stable counts and tighter spatial correspondence for large and sparsely distributed objects while remaining adaptable to crowded or small-object domains. The modular design, cross-modal conditioning, and parameter-free prediction fusion together support multi-domain object counting from heterogeneous annotations, in line with a broader move toward unified, interpretable, and scalable counting architectures for open-world settings (Lei et al., 29 May 2026).

References

1. Lei et al. Count Anything (2026).
