
Video Individual Counting (VIC) Overview

Updated 10 January 2026
  • Recent papers illustrate how advanced matching paradigms, such as differentiable optimal transport and one-to-many assignment, effectively tackle occlusions and ID-switch issues in crowded scenes.
  • Video Individual Counting is the task of accurately tallying unique objects in video sequences, addressing temporal association and context challenges through density and matching methods.
  • Modern VIC methods integrate context generators, displacement priors, and weak supervision to provide robust performance in diverse scenarios from surveillance to open-world object counting.

Video Individual Counting (VIC) is a visual recognition task focused on enumerating the unique instances of a given object (most commonly pedestrians or vehicles) appearing in a fixed-length video sequence, such that each object is counted exactly once, irrespective of occlusions, scene clutter, or repeated appearances. VIC generalizes conventional crowd or frame-level counting by explicitly requiring per-video identification of all unique objects, posing stringent correspondence, association, and temporal aggregation challenges in scenarios ranging from dense pedestrian surveillance to fine-grained open-world object counting in arbitrary, dynamic environments (Han et al., 2022, Lu et al., 3 Jan 2026, Amini-Naieni et al., 18 Jun 2025).

1. Formal Task Definition and Core Challenges

Given a video sequence $I = \{I_0, I_1, \ldots, I_T\}$, VIC aims to output the total number of distinct object instances $N_{\text{total}}$ that appear at any time in the video, i.e.,

$$N_{\text{total}} = \left| \bigcup_{t=0}^{T} S(t) \right|$$

where $S(t)$ is the set of detected objects (e.g., head centers, bounding boxes) in frame $t$. The defining challenge is resolving temporal correspondences: detecting when an object is new (inflow), persisting, or has exited (outflow), while remaining robust to occlusions, appearance changes, and scene density. This contrasts sharply with Video Crowd Counting (VCC), which requires only per-frame scalar or density-map outputs and does not resolve cross-frame identities (Zhu et al., 16 Jun 2025, Lu et al., 3 Jan 2026).
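
The definition above reduces to a set union over per-frame identity sets. The following minimal sketch makes this explicit under an assumed oracle that already assigns consistent IDs (which is precisely what real VIC methods must approximate); all names are illustrative.

```python
# Minimal sketch of the VIC objective: count each identity once across the video.
from typing import Dict, Set

def video_individual_count(per_frame_ids: Dict[int, Set[int]]) -> int:
    """N_total = |union_t S(t)|: each identity contributes once, regardless of
    how many frames it appears in."""
    unique_ids: Set[int] = set()
    for t, ids in per_frame_ids.items():
        unique_ids |= ids
    return len(unique_ids)

# Example: an object seen across frames 0-2 is still counted once.
print(video_individual_count({0: {1, 2}, 1: {2, 3}, 2: {3, 4}}))  # -> 4
```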

The task is compounded in open-world settings, where the target object class is specified only at inference and may not have been seen during training, and in dense scenarios featuring highly dynamic occlusion and appearance ambiguity (Amini-Naieni et al., 18 Jun 2025, Lu et al., 3 Jan 2026).

2. Methodological Paradigms: From Localization/Association to Matching-Based Inference

Early VIC systems were predicated on explicit detection/localization followed by trajectory-level Multi-Object Tracking (MOT). However, full-video tracking is prone to ID-switches and error propagation in dense situations (Han et al., 2022). The field has rapidly progressed to matching-based decompositions and association-free density models.

Decomposition and Reasoning: Decomposition-based pipelines, as introduced in DRNet (Han et al., 2022), segment the problem into (i) density-based counting in the initial frame, and (ii) per-pair inflow estimation via differentiable optimal transport (OT) between object proposals in temporally sampled frames. Inflow is defined as objects appearing in $I_t$ but not in $I_{t-\tau}$; summing detected inflow across sampled pairs plus the initial count yields $N_{\text{total}}$:

$$N_{\text{total}} = N(0) + \sum_{k=1}^{\lfloor (T-1)/\tau \rfloor} N_{\text{in}}^{\tau}(k\tau)$$

with OT used to softly match detected descriptors between frames. DRNet integrates an augmented cost matrix over appearance and class-specific priors with entropic Sinkhorn regularization to encourage global consistency and robust matching (Han et al., 2022).
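
A schematic numpy re-implementation of this idea is sketched below. It uses a SuperGlue-style dustbin row/column and cosine-distance costs, which is a simplification rather than DRNet's exact augmented cost matrix; thresholds and marginals are assumptions.

```python
# Schematic Sinkhorn-based inflow estimation between a sampled frame pair.
import numpy as np

def sinkhorn(cost, a, b, eps=0.1, iters=200):
    """Entropic OT between row marginals a and column marginals b; returns the plan."""
    K = np.exp(-cost / eps)
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

def count_inflow(feat_prev, feat_curr, dustbin_cost=0.7):
    """Inflow = current-frame detections whose transport mass goes mostly to the
    dustbin row, i.e. they have no plausible counterpart in the previous frame."""
    fp = feat_prev / np.linalg.norm(feat_prev, axis=1, keepdims=True)
    fc = feat_curr / np.linalg.norm(feat_curr, axis=1, keepdims=True)
    cost = 1.0 - fp @ fc.T                               # (Np, Nc) cosine distance
    Np, Nc = cost.shape
    # Augment with a dustbin row/column so unmatched detections have a sink.
    aug = np.full((Np + 1, Nc + 1), dustbin_cost)
    aug[:Np, :Nc] = cost
    a = np.concatenate([np.ones(Np), [float(Nc)]])       # row marginals
    b = np.concatenate([np.ones(Nc), [float(Np)]])       # column marginals
    plan = sinkhorn(aug, a, b)
    inflow_mask = plan[Np, :Nc] > 0.5                    # mostly matched to dustbin
    return int(inflow_mask.sum())
```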

Density Map Flow Methods: Alternatives eschew explicit identity correspondence in favor of inflow/outflow density-map estimation. Notably, density-based models with cross-frame attention (e.g., DCFA (Fan et al., 12 Mar 2025)) directly regress inflow and outflow maps for each frame pair, decomposing conservation as

$$D_t(x, y) = D_{t-1}(x, y) + I_t(x, y) - O_t(x, y)$$

and calculating the unique total as the sum of initial density plus all inflows.
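
The counting step that follows the conservation equation is simple accumulation: the unique total is the initial density mass plus the mass of every predicted inflow map. A toy sketch (with dummy arrays standing in for maps predicted by a model such as DCFA) is shown below.

```python
# Counting from density flows: N_total = sum(D_0) + sum_t sum(I_t).
import numpy as np

def total_from_density_flows(initial_density, inflow_maps):
    total = initial_density.sum()
    for inflow in inflow_maps:
        total += inflow.sum()
    return float(total)

# Toy 2x2 maps: 3 people in the first frame, then 1 new arrival.
d0 = np.array([[1.0, 1.0], [1.0, 0.0]])
i1 = np.array([[0.0, 0.0], [0.0, 1.0]])
print(total_from_density_flows(d0, [i1]))  # -> 4.0
```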

Matching Paradigms: The central research axis has evolved from strict one-to-one (O2O) matching (Hungarian assignment) to one-to-many (O2M) matching, motivated by empirical observations of pedestrian social grouping in crowd scenarios. O2M matching, as formalized in OMAN and OMAN++ (Zhu et al., 16 Jun 2025, Lu et al., 3 Jan 2026), allows detections in one frame to be matched to a (bounded) set of candidates in the adjacent frame, typically implemented as a soft, entropy-regularized OT plan with additional context and prior injection, leading to increased robustness under high occlusion and detection noise.
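
To make the O2O-versus-O2M distinction concrete, the sketch below contrasts Hungarian assignment with a simple top-k relaxation on a similarity matrix. OMAN/OMAN++ learn a soft, OT-based version of the latter, so this is only an illustration of the relaxation itself; the threshold and k are assumptions.

```python
# Schematic one-to-one vs. one-to-many matching on a (N_prev, N_curr) similarity matrix.
import numpy as np
from scipy.optimize import linear_sum_assignment

def o2o_matches(sim):
    """Each detection in frame t-1 is paired with at most one in frame t."""
    rows, cols = linear_sum_assignment(-sim)          # maximise similarity
    return list(zip(rows.tolist(), cols.tolist()))

def o2m_matches(sim, k=3, thresh=0.5):
    """Each detection may keep up to k sufficiently similar candidates, which is
    more forgiving under occlusion and detector noise."""
    matches = []
    for i, row in enumerate(sim):
        top = np.argsort(row)[::-1][:k]
        matches.extend((i, int(j)) for j in top if row[j] >= thresh)
    return matches
```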

3. Modern VIC Architectures: Context, Priors, and Displacement Modeling

Recent state-of-the-art VIC systems adopt modular, attention-driven pipelines featuring:

Implicit Context Generator (ICG): Concatenation of per-person descriptors from adjacent frames, followed by multi-head self-attention to propagate context, capture group structure, and form enhanced representations for pairwise matching (Zhu et al., 16 Jun 2025, Lu et al., 3 Jan 2026).
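
A minimal PyTorch sketch of such a context generator is given below: descriptors from both frames are concatenated along the token axis and refined with multi-head self-attention plus a residual connection. Layer sizes and the residual/norm placement are placeholders, not OMAN's exact design.

```python
# Minimal implicit-context-generator sketch (self-attention over concatenated tokens).
import torch
import torch.nn as nn

class ImplicitContextGenerator(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feat_prev, feat_curr):
        # feat_prev: (B, N_prev, dim), feat_curr: (B, N_curr, dim)
        tokens = torch.cat([feat_prev, feat_curr], dim=1)
        ctx, _ = self.attn(tokens, tokens, tokens)    # propagate group context
        tokens = self.norm(tokens + ctx)              # residual refinement
        n_prev = feat_prev.shape[1]
        return tokens[:, :n_prev], tokens[:, n_prev:]
```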

Pairwise Matcher (OMPM): For every pair of proposals in the frame pair, a Hadamard product of descriptors is fed through an MLP, yielding a soft correspondence probability $p^{ij}$. The O2M relaxation is operationalized by allowing each detection to distribute its matching mass over multiple targets (subject to constraints derived from social grouping priors) (Lu et al., 3 Jan 2026).
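
A sketch of this pairwise scoring is shown below: each (i, j) pair is scored independently with a sigmoid, so a single detection can match several candidates, consistent with the O2M view. The MLP width and the handling of grouping constraints are assumptions.

```python
# Sketch of a Hadamard-product pairwise matcher producing soft probabilities p_ij.
import torch
import torch.nn as nn

class PairwiseMatcher(nn.Module):
    def __init__(self, dim=256, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, feat_prev, feat_curr):
        # feat_prev: (B, Np, dim), feat_curr: (B, Nc, dim)
        prod = feat_prev.unsqueeze(2) * feat_curr.unsqueeze(1)   # (B, Np, Nc, dim)
        logits = self.mlp(prod).squeeze(-1)                      # (B, Np, Nc)
        return torch.sigmoid(logits)   # independent scores allow one-to-many matches
```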

Displacement Prior Injection (DPI) and Displacement-Aware Self-Attention (DASA): Physical priors are embedded by modeling expected inter-frame displacement vectors and using these as additional inputs to attention and matcher modules, enforcing that plausible matchings are consistent with the dynamics of real-world motion. Priors enter into matching via modulating the attention or pairwise cost, and are incorporated into the loss via a displacement-informed OT objective, which blends appearance and motion cost components (Lu et al., 3 Jan 2026).
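
The sketch below illustrates the cost-blending idea in its simplest form: an appearance distance is mixed with a penalty for deviating from an expected displacement vector. The blending weight lam and the hand-set prior are assumptions; OMAN++ injects and learns these quantities differently.

```python
# Displacement-informed matching cost: blend appearance and motion terms.
import numpy as np

def blended_cost(feat_prev, feat_curr, pos_prev, pos_curr, expected_disp, lam=0.3):
    fp = feat_prev / np.linalg.norm(feat_prev, axis=1, keepdims=True)
    fc = feat_curr / np.linalg.norm(feat_curr, axis=1, keepdims=True)
    appearance = 1.0 - fp @ fc.T                            # (Np, Nc) cosine distance
    # Motion term: deviation of the observed displacement i -> j from the prior.
    disp = pos_curr[None, :, :] - pos_prev[:, None, :]      # (Np, Nc, 2)
    motion = np.linalg.norm(disp - expected_disp, axis=-1)
    motion = motion / (motion.max() + 1e-8)                 # scale to [0, 1]
    return (1 - lam) * appearance + lam * motion
```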

Weak-Supervision and Group-Level Learning: Weakly supervised VIC (WVIC) leverages only inflow/outflow indicators (without full identity trajectories) to drive contrastive learning over candidate sets—pushing feature representations to separate incoming/outgoing from persisting objects, using OT-based soft permutation losses (Liu et al., 2023).

Open-World and Promptable Models: The CountVid system (Amini-Naieni et al., 18 Jun 2025) applies open-world visual grounding and mask propagation (CountGD-Box + SAM 2.1) to enumerate unique instances of any target object class specified by text/image prompt. Temporal filtering and masklet-long tracklet management are used to avoid double-counting across occlusions and maintain global instance consistency.
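
The toy sketch below illustrates only the double-counting-avoidance principle: a detection is counted as a new instance only if it does not overlap a currently active instance. The greedy IoU bookkeeping here is a deliberate simplification and not CountVid's actual masklet and tracklet management.

```python
# Toy unique-instance counting by suppressing detections that continue an active instance.
def box_iou(a, b):
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def count_unique(per_frame_boxes, iou_thresh=0.5):
    active, total = [], 0
    for boxes in per_frame_boxes:              # boxes: list of (x1, y1, x2, y2)
        next_active = []
        for box in boxes:
            if any(box_iou(box, prev) >= iou_thresh for prev in active):
                next_active.append(box)        # continues an existing instance
            else:
                total += 1                     # genuinely new instance
                next_active.append(box)
        active = next_active
    return total
```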

4. Evaluation Benchmarks and Quantitative Results

Quantitative evaluation of VIC is standardized on metrics including Mean Absolute Error (MAE), Mean Squared Error (MSE), and Weighted Relative Absolute Error (WRAE) computed on the global per-video count.
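
For reference, the sketch below computes these metrics over per-video counts, with WRAE weighting each video's relative error by its sequence length; this follows the common convention but should be checked against each benchmark's exact protocol.

```python
# MAE, MSE, and length-weighted relative absolute error (WRAE) over per-video counts.
import numpy as np

def vic_metrics(pred, gt, lengths):
    pred, gt, lengths = map(np.asarray, (pred, gt, lengths))
    mae = np.mean(np.abs(pred - gt))
    mse = np.mean((pred - gt) ** 2)
    weights = lengths / lengths.sum()
    wrae = np.sum(weights * np.abs(pred - gt) / np.maximum(gt, 1)) * 100.0
    return {"MAE": mae, "MSE": mse, "WRAE(%)": wrae}

print(vic_metrics(pred=[90, 12], gt=[100, 10], lengths=[300, 150]))
```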

Key Benchmarks:

| Method | Dataset | MAE | WRAE (%) | Key Innovations |
| --- | --- | --- | --- | --- |
| DRNet | CroHD | 141.1 | 27.4 | Density + OT matching |
| OMAN | SenseCrowd | 8.3 | 11.1 | Implicit O2M context/matching |
| OMAN++ | WuhanMetroCrowd | 87.1 | 19.8 | Social + displacement priors |
| CGNet | SenseCrowd | 8.9 | 12.6 | Group-level contrastive OT |
| DCFA | MovingDroneCrowd | 41.0 | 19.3 | Density + cross-frame attention |
| CountVid | TAO-Count | 2.6 | — | Prompted open-world counting |

OMAN++ demonstrates up to a 38.12% WRAE reduction over the previous state of the art on WuhanMetroCrowd, attributed to the joint modeling of social grouping and displacement priors (Lu et al., 3 Jan 2026). In highly dynamic drone footage, density-based DCFA remains robust as crowd density increases (Fan et al., 12 Mar 2025). In open-world settings, the prompt-driven CountVid model reaches an MAE of 2.6 on TAO-Count (Amini-Naieni et al., 18 Jun 2025).

5. Domain-Specific Applications and System Designs

Retail Checkout (Multi-Class VIC): Systems such as VISTA tailor the VIC protocol to retail product differentiation, using a unified U-Net for hand-plus-item segmentation, ViT-based multi-class classification, and frame selection via a Colorfulness-Binarization-Threshold (CBT) metric. Temporal aggregation and near-duplicate suppression recover the video-level count, with macro-F1 as the principal evaluation score. The combination of instance segmentation, entropy masking, and ViT classification yielded an F1 of 0.4545 in real-world video challenges (Shihab et al., 2022).

Vehicle Counting: Key-frame-based solutions (e.g., Visual Rhythm + YOLOv8) eschew dense tracking in favor of line-crossing cue extraction and selection of informative frames, achieving >99% counting accuracy at up to 3x speedup vs. conventional tracking (Ribeiro et al., 8 Jan 2025). Such pipelines are optimal for unidirectional, static-camera, moderate-flow regimes, but break down in severe occlusion or directionally heterogeneous traffic.
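
The sketch below illustrates only the line-crossing counting principle for a static camera; note that the actual Visual Rhythm pipeline avoids per-object tracking by detecting on a spatio-temporal slice image, so the track-based bookkeeping here is an assumption for illustration.

```python
# Toy line-crossing counter: a trajectory is counted once when it crosses a virtual line.
def count_line_crossings(tracks, line_y=0.5):
    """tracks: dict of track_id -> list of (frame, x, y), y normalised to [0, 1].
    Only downward-to-upward crossings of line_y are checked; the reverse is symmetric."""
    count = 0
    for track_id, points in tracks.items():
        ys = [y for _, _, y in sorted(points)]
        crossed = any(y0 < line_y <= y1 for y0, y1 in zip(ys, ys[1:]))
        if crossed:
            count += 1
    return count

tracks = {1: [(0, 0.2, 0.3), (1, 0.2, 0.6)], 2: [(0, 0.8, 0.1), (1, 0.8, 0.2)]}
print(count_line_crossings(tracks))  # -> 1
```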

Multi-Camera and Indirect Approaches: Hybrid multi-view designs combine per-view SVM/AdaBoost head detection and indirect corner-based statistical estimation (e.g., Harris corners and region weighting), the latter being markedly more robust in high-density/occlusion settings (Dittrich et al., 2017).

6. Limitations, Open Problems, and Research Directions

Current SOTA models report several limitations:

  • Persistent errors in extremely dense or highly occluded scenes, particularly when detection or localization quality drops or in "lobby" scenarios (WRAE ≈ 43%) (Lu et al., 3 Jan 2026).
  • Reliance on accurate pedestrian or object localization—propagating errors from initial detection into the matching and counting stages.
  • Fixed frame-pair intervals may miss correspondences at highly varying temporal rates.
  • Hand-tuned prior weights (λ) and hyperparameter schedules that lack full adaptivity (Lu et al., 3 Jan 2026, Zhu et al., 16 Jun 2025).
  • For open-world VIC, failure modes include missed small/faint instances, drift under long occlusion, and susceptibility to distractor matches (Amini-Naieni et al., 18 Jun 2025).

This suggests there is substantial value in dynamic scheduling of temporal aggregation windows, adaptive motion priors, memory-augmented re-id modules for re-entrant individuals, and joint end-to-end learning of all system components (including detection, encoding, temporal association, and aggregation) (Lu et al., 3 Jan 2026, Liu et al., 2023). Multi-camera extensions and robust, cross-domain evaluation in arbitrary object classes remain major unsolved challenges (Amini-Naieni et al., 18 Jun 2025, Dittrich et al., 2017).

7. Summary Table: Representative Models

| Model | Key Mechanism | Scenario | Notable Results / Datasets | Reference |
| --- | --- | --- | --- | --- |
| DRNet | Density + differentiable OT inflow | Pedestrian | MAE=12.3, WRAE=12.7% (SenseCrowd) | (Han et al., 2022) |
| OMAN++ | O2M matching + grouping/displacement priors | Crowded pedestrian | MAE=87.1, WRAE=19.8% (WuhanMetro) | (Lu et al., 3 Jan 2026) |
| DCFA | Density + cross-frame attention | Drone crowds | MAE=41.0, WRAE=19.3% (MDC) | (Fan et al., 12 Mar 2025) |
| CGNet | Weak group-level, contrastive OT | Pedestrian | MAE=8.9, WRAE=12.6% (SenseCrowd) | (Liu et al., 2023) |
| CountVid | Prompt-based masklet propagation | Open-world | MAE=2.6 (TAO-Count), 50 (MOT20) | (Amini-Naieni et al., 18 Jun 2025) |
| VISTA | U-Net + ViT + CBT + duplicate removal | Retail | F1=0.4545 (AICITY22) | (Shihab et al., 2022) |

These works collectively formalize the methodological landscape and establish robust, context- and prior-informed protocols for accurate, agile, and generalizable Video Individual Counting in diverse real-world scenarios.
