Video Individual Counting (VIC) Overview
- Recent work illustrates how advanced matching paradigms, such as differentiable optimal transport and one-to-many assignment, effectively tackle occlusions and ID-switch issues in crowded scenes.
- Video Individual Counting is the task of accurately tallying unique objects in video sequences, addressing temporal association and context challenges through density and matching methods.
- Modern VIC methods integrate context generators, displacement priors, and weak supervision to provide robust performance in diverse scenarios from surveillance to open-world object counting.
Video Individual Counting (VIC) is a visual recognition task focused on enumerating the unique instances of a given object (most commonly pedestrians or vehicles) appearing in a fixed-length video sequence, such that each object is counted exactly once, irrespective of occlusions, scene clutter, or repeated appearances. VIC generalizes conventional crowd or frame-level counting by explicitly requiring per-video identification of all unique objects, posing stringent correspondence, association, and temporal aggregation challenges in scenarios ranging from dense pedestrian surveillance to fine-grained open-world object counting in arbitrary, dynamic environments (Han et al., 2022, Lu et al., 3 Jan 2026, Amini-Naieni et al., 18 Jun 2025).
1. Formal Task Definition and Core Challenges
Given a video sequence $\{I_1, \dots, I_T\}$, VIC aims to output the total number of distinct object instances that appear at any time in the video, i.e.,

$$N = \Big|\, \bigcup_{t=1}^{T} \mathcal{O}_t \,\Big|,$$

where $\mathcal{O}_t$ is the set of detected objects (e.g., head centers, bounding boxes) in frame $I_t$. The definitive challenge is resolving temporal correspondences: detecting when an object is new (inflow), persisting, or has exited (outflow), while being robust to occlusions, appearance changes, and scene density. This contrasts sharply with Video Crowd Counting (VCC), which seeks only per-frame scalar or density-map outputs and does not require resolving cross-frame identities (Zhu et al., 16 Jun 2025, Lu et al., 3 Jan 2026).
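As a minimal illustration of the task objective (not of any particular method), the sketch below computes the unique count from hypothetical per-frame identity annotations; in practice these identities are exactly what a VIC system must infer.

```python
# Minimal sketch of the VIC objective: count instances that appear in at
# least one frame, each exactly once. The per-frame identity sets are
# hypothetical ground-truth annotations; a VIC model must recover this
# quantity without being given the identities.

def unique_count(per_frame_ids):
    """per_frame_ids: list of sets, one set of instance IDs per frame."""
    seen = set()
    for ids in per_frame_ids:
        seen |= ids          # union over all frames
    return len(seen)

# Example: 5 distinct pedestrians across 3 frames, despite re-appearances.
frames = [{1, 2, 3}, {2, 3, 4}, {4, 5, 1}]
assert unique_count(frames) == 5
```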
The task is compounded in open-world settings, where the target object class is prompted at inference and may not have been seen in training, and in dense scenarios featuring high dynamic occlusions and appearance ambiguity (Amini-Naieni et al., 18 Jun 2025, Lu et al., 3 Jan 2026).
2. Methodological Paradigms: From Localization/Association to Matching-Based Inference
Early VIC systems were predicated on explicit detection/localization followed by trajectory-level Multi-Object Tracking (MOT). However, full-video tracking is prone to ID-switches and error propagation in dense situations (Han et al., 2022). The field has rapidly progressed to matching-based decompositions and association-free density models.
Decomposition and Reasoning: Decomposition-based pipelines, as introduced in DRNet (Han et al., 2022), segment the problem into: (i) density-based counting in the initial frame, and (ii) per-pair inflow estimation via differentiable optimal transport (OT) between object proposals in temporally sampled frames. Inflow is defined as objects appearing in a sampled frame $I_{t+1}$ but not in the preceding sampled frame $I_t$; summing detected inflow across sampled pairs plus the initial count yields the video-level total $N$:

$$N = C_1 + \sum_{k=2}^{K} C^{\text{in}}_{k},$$

where $C_1$ is the density-based count in the first sampled frame and $C^{\text{in}}_{k}$ is the inflow estimated for the $k$-th sampled frame, with OT used to softly match detected descriptors between frames. DRNet integrates an augmented cost matrix over appearance and class-specific priors with entropic Sinkhorn regularization to encourage global consistency and robust matching (Han et al., 2022).
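The following is a minimal sketch of entropic (Sinkhorn-regularized) optimal transport between two frames' object descriptors, in the spirit of the DRNet matching step; the cost construction and uniform marginals here are simplified placeholders rather than the paper's exact augmented formulation.

```python
import numpy as np

def sinkhorn(cost, reg=0.1, n_iters=100):
    """Entropy-regularized OT with uniform marginals.

    cost: (n, m) pairwise matching cost (e.g., descriptor distance).
    Returns a soft transport plan P of shape (n, m) whose row sums
    approach 1/n and column sums approach 1/m after n_iters iterations.
    """
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)   # uniform marginals
    K = np.exp(-cost / reg)                           # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]                # diag(u) K diag(v)

# Toy usage: appearance descriptors of detections in two sampled frames.
rng = np.random.default_rng(0)
feat_t  = rng.normal(size=(4, 16))   # 4 detections in frame t
feat_t1 = rng.normal(size=(6, 16))   # 6 detections in frame t+1
cost = np.linalg.norm(feat_t[:, None] - feat_t1[None, :], axis=-1)
P = sinkhorn(cost)   # soft correspondences; weakly matched columns in
                     # frame t+1 indicate inflow candidates
```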
Density Map Flow Methods: Alternatives eschew explicit identity correspondence in favor of inflow/outflow density-map estimation. Notably, density-based models with cross-frame attention (e.g., DCFA (Fan et al., 12 Mar 2025)) directly regress inflow and outflow maps for each frame pair, decomposing conservation as

$$D_{t} = D^{\text{share}}_{t} + D^{\text{out}}_{t}, \qquad D_{t+1} = D^{\text{share}}_{t} + D^{\text{in}}_{t+1},$$

where $D^{\text{share}}_{t}$ denotes the density of objects present in both frames of the pair, and calculating the unique total as the sum of the initial density plus all inflows.
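A sketch of how this decomposition turns per-pair inflow density maps into a video-level count, assuming the inflow maps have already been regressed by some model (the map values below are placeholders).

```python
import numpy as np

def video_count_from_densities(initial_density, inflow_maps):
    """initial_density: (H, W) density map of the first frame.
    inflow_maps: list of (H, W) inflow density maps, one per frame pair.
    Each map integrates (sums) to an expected object count, so the
    video-level total is the initial count plus all inflow counts.
    """
    total = initial_density.sum()
    for inflow in inflow_maps:
        total += inflow.sum()
    return float(total)

# Toy usage with placeholder maps: 10 people initially, 3 then 2 entering.
H, W = 64, 64
d0 = np.full((H, W), 10.0 / (H * W))
ins = [np.full((H, W), 3.0 / (H * W)), np.full((H, W), 2.0 / (H * W))]
print(video_count_from_densities(d0, ins))   # ~15.0
```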
Matching Paradigms: The central research axis has evolved from strict one-to-one (O2O) matching (Hungarian assignment) to one-to-many (O2M) matching, motivated by empirical observations of pedestrian social grouping in crowd scenarios. O2M matching, as formalized in OMAN and OMAN++ (Zhu et al., 16 Jun 2025, Lu et al., 3 Jan 2026), allows detections in one frame to be matched to a (bounded) set of candidates in the adjacent frame, typically implemented as a soft, entropy-regularized OT plan with additional context and prior injection, leading to increased robustness under high occlusion and detection noise.
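To make the contrast concrete, the sketch below compares a hard one-to-one assignment (Hungarian algorithm) with a simple one-to-many relaxation that lets each detection spread bounded matching mass over its nearest candidates; the relaxation shown is an illustrative stand-in for the entropy-regularized plans used by OMAN/OMAN++, not their exact formulation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def o2o_match(cost):
    """Hard one-to-one matching: each row is matched to at most one column."""
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist()))

def o2m_match(cost, k=3, temperature=0.1):
    """Soft one-to-many matching: each row distributes its mass over its
    k lowest-cost columns via a temperature-scaled softmax."""
    n, m = cost.shape
    plan = np.zeros_like(cost, dtype=float)
    for i in range(n):
        nearest = np.argsort(cost[i])[:k]
        logits = -cost[i, nearest] / temperature
        weights = np.exp(logits - logits.max())
        plan[i, nearest] = weights / weights.sum()
    return plan   # rows sum to 1; several columns may share one row's mass

cost = np.array([[0.1, 0.9, 0.2],
                 [0.8, 0.15, 0.25]])
print(o2o_match(cost))       # e.g., [(0, 0), (1, 1)]
print(o2m_match(cost, k=2))  # row-stochastic plan over the 2 nearest candidates
```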
3. Modern VIC Architectures: Context, Priors, and Displacement Modeling
Recent state-of-the-art VIC systems adopt modular, attention-driven pipelines featuring:
Implicit Context Generator (ICG): Concatenation of per-person descriptors from adjacent frames, followed by multi-head self-attention to propagate context, capture group structure, and form enhanced representations for pairwise matching (Zhu et al., 16 Jun 2025, Lu et al., 3 Jan 2026).
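A minimal PyTorch sketch, under assumed tensor shapes, of the idea behind an implicit context generator: concatenate the per-person descriptors from a frame pair and let multi-head self-attention propagate group context across them. Dimensions and module layout are illustrative, not the published architecture.

```python
import torch
import torch.nn as nn

class ImplicitContextGenerator(nn.Module):
    """Propagates context across all detections of a frame pair via
    multi-head self-attention (illustrative layer sizes)."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats_t, feats_t1):
        # feats_t: (B, N, D), feats_t1: (B, M, D) per-person descriptors
        tokens = torch.cat([feats_t, feats_t1], dim=1)   # (B, N+M, D)
        ctx, _ = self.attn(tokens, tokens, tokens)       # self-attention
        tokens = self.norm(tokens + ctx)                 # residual update
        N = feats_t.shape[1]
        return tokens[:, :N], tokens[:, N:]              # context-enhanced

icg = ImplicitContextGenerator()
a, b = icg(torch.randn(1, 5, 256), torch.randn(1, 7, 256))
```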
Pairwise Matcher (OMPM): For every pair of proposals $(i, j)$ in the frame pair, a Hadamard product of descriptors is fed through an MLP, yielding a soft correspondence probability $p_{ij}$. The O2M relaxation is operationalized by allowing each detection to distribute its matching mass over multiple targets (subject to constraints derived from social grouping priors) (Lu et al., 3 Jan 2026).
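A sketch of the pairwise-matcher idea: the Hadamard (element-wise) product of two context-enhanced descriptors is scored by a small MLP to give $p_{ij}$. Layer sizes and the sigmoid head are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PairwiseMatcher(nn.Module):
    """Scores every (i, j) proposal pair with p_ij = MLP(f_i * g_j)."""
    def __init__(self, dim=256, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, feats_t, feats_t1):
        # feats_t: (B, N, D), feats_t1: (B, M, D)
        pairs = feats_t.unsqueeze(2) * feats_t1.unsqueeze(1)   # (B, N, M, D)
        logits = self.mlp(pairs).squeeze(-1)                   # (B, N, M)
        return torch.sigmoid(logits)   # soft correspondence probabilities

matcher = PairwiseMatcher()
p = matcher(torch.randn(1, 5, 256), torch.randn(1, 7, 256))   # (1, 5, 7)
```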
Displacement Prior Injection (DPI) and Displacement-Aware Self-Attention (DASA): Physical priors are embedded by modeling expected inter-frame displacement vectors and using these as additional inputs to attention and matcher modules, enforcing that plausible matchings are consistent with the dynamics of real-world motion. Priors enter into matching via modulating the attention or pairwise cost, and are incorporated into the loss via a displacement-informed OT objective, which blends appearance and motion cost components (Lu et al., 3 Jan 2026).
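A sketch of how a displacement prior can enter the matching cost: an appearance term is blended with a motion term that penalizes candidates far from each detection's predicted displaced position. The blending weight λ and the linear motion prior are assumptions for illustration, not the exact OMAN++ formulation.

```python
import numpy as np

def displacement_informed_cost(feat_t, feat_t1, pos_t, pos_t1,
                               pred_disp, lam=0.5):
    """Blend appearance and motion costs for pairwise matching.

    feat_t:    (N, D) descriptors at time t;  feat_t1: (M, D) at t+1
    pos_t:     (N, 2) positions at time t;    pos_t1:  (M, 2) at t+1
    pred_disp: (N, 2) predicted per-object displacement (the prior)
    """
    app = np.linalg.norm(feat_t[:, None] - feat_t1[None, :], axis=-1)
    expected = pos_t + pred_disp                  # where each object should be
    motion = np.linalg.norm(expected[:, None] - pos_t1[None, :], axis=-1)
    # Normalize each term to [0, 1] before blending (illustrative choice).
    app = app / (app.max() + 1e-8)
    motion = motion / (motion.max() + 1e-8)
    return lam * app + (1.0 - lam) * motion       # (N, M) cost matrix
```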
Weak-Supervision and Group-Level Learning: Weakly supervised VIC (WVIC) leverages only inflow/outflow indicators (without full identity trajectories) to drive contrastive learning over candidate sets—pushing feature representations to separate incoming/outgoing from persisting objects, using OT-based soft permutation losses (Liu et al., 2023).
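A minimal sketch of group-level contrastive supervision under weak labels: features of persisting detections are pulled toward their soft-matched counterparts while inflow candidates are pushed away. This is an illustrative InfoNCE-style loss, not the exact CGNet objective.

```python
import torch
import torch.nn.functional as F

def group_contrastive_loss(feats_t, feats_t1, soft_plan, inflow_mask, tau=0.1):
    """Illustrative contrastive objective for weakly supervised VIC.

    feats_t:     (N, D) detections in frame t
    feats_t1:    (M, D) detections in frame t+1
    soft_plan:   (N, M) OT-style soft assignment (rows sum to ~1)
    inflow_mask: (M,) bool tensor, True for candidates treated as inflow
    Pulls each frame-t detection toward its soft-assigned persisting
    counterparts and pushes it away from inflow candidates.
    """
    f_t = F.normalize(feats_t, dim=-1)
    f_t1 = F.normalize(feats_t1, dim=-1)
    sim = f_t @ f_t1.T / tau                         # scaled similarities
    log_prob = F.log_softmax(sim, dim=-1)
    # Positive mass: soft plan restricted to non-inflow (persisting) candidates.
    pos_weight = soft_plan * (~inflow_mask).float().unsqueeze(0)
    pos_weight = pos_weight / (pos_weight.sum(dim=-1, keepdim=True) + 1e-8)
    return -(pos_weight * log_prob).sum(dim=-1).mean()

# Toy usage with an uninformative soft plan and two inflow candidates.
N, M, D = 4, 5, 32
loss = group_contrastive_loss(
    torch.randn(N, D), torch.randn(M, D),
    torch.full((N, M), 1.0 / M),
    torch.tensor([False, False, False, True, True]),
)
```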
Open-World and Promptable Models: The CountVid system (Amini-Naieni et al., 18 Jun 2025) applies open-world visual grounding and mask propagation (CountGD-Box + SAM 2.1) to enumerate unique instances of any target object class specified by text/image prompt. Temporal filtering and masklet-long tracklet management are used to avoid double-counting across occlusions and maintain global instance consistency.
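A sketch of the bookkeeping behind promptable open-world counting: once a grounding-and-propagation model has assigned each detection to a masklet (tracklet) ID, the unique count is the number of masklets surviving a temporal filter. The segmentation and propagation calls themselves are outside this sketch, and `min_frames` is a hypothetical filtering parameter.

```python
from collections import defaultdict

def count_unique_masklets(frame_detections, min_frames=2):
    """frame_detections: list over frames, each a list of masklet IDs
    produced by an upstream grounding + mask-propagation model.
    Temporal filtering discards masklets seen in fewer than `min_frames`
    frames (spurious detections); each surviving masklet is counted once
    regardless of how often it reappears after occlusions.
    """
    frames_seen = defaultdict(int)
    for ids in frame_detections:
        for mid in set(ids):          # count each masklet once per frame
            frames_seen[mid] += 1
    return sum(1 for n in frames_seen.values() if n >= min_frames)

# Toy usage: masklet 3 flickers for a single frame and is filtered out.
dets = [[1, 2], [1, 2, 3], [2, 1], [1, 2, 4], [4, 1]]
print(count_unique_masklets(dets))    # 3 unique objects (IDs 1, 2, 4)
```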
4. Evaluation Benchmarks and Quantitative Results
Quantitative evaluation of VIC is standardized on metrics including Mean Absolute Error (MAE), Mean Squared Error (MSE), and Weighted Relative Absolute Error (WRAE) computed on the global per-video count.
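For reference, a sketch of these per-video metrics as they are commonly defined in VIC papers; WRAE here weights each video's relative error by its number of frames, which follows DRNet-style evaluations, so treat the exact weighting as an assumption when comparing across papers.

```python
import numpy as np

def vic_metrics(gt_counts, pred_counts, frame_lengths):
    """gt_counts, pred_counts: per-video ground-truth / predicted totals.
    frame_lengths: number of frames per video (used to weight WRAE)."""
    gt = np.asarray(gt_counts, dtype=float)
    pred = np.asarray(pred_counts, dtype=float)
    T = np.asarray(frame_lengths, dtype=float)
    mae = np.mean(np.abs(gt - pred))
    mse = np.mean((gt - pred) ** 2)
    wrae = np.sum(T / T.sum() * np.abs(gt - pred) / gt) * 100.0   # percent
    return {"MAE": mae, "MSE": mse, "WRAE%": wrae}

print(vic_metrics([100, 50], [90, 60], [300, 150]))
```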
Key Benchmarks:
- SenseCrowd, CroHD: Surveillance and drone-based datasets with dense, tracked pedestrian annotations and explicit inflow/outflow or identity labels (Han et al., 2022, Zhu et al., 16 Jun 2025, Lu et al., 3 Jan 2026).
- MovingDroneCrowd: Drone-captured, highly dynamic and crowded videos, providing scenarios with large viewpoint and density variability (Fan et al., 12 Mar 2025).
- WuhanMetroCrowd: High-density, real-world metro commuting scenes with extreme occlusion and flow variation (Lu et al., 3 Jan 2026).
- VideoCount: General object counting with text/image prompting, including open-world categories with fine-grained ground truth (Amini-Naieni et al., 18 Jun 2025).
| Method | Dataset | MAE | WRAE (%) | Key Innovations |
|---|---|---|---|---|
| DRNet | CroHD | 141.1 | 27.4 | Density+OT matching |
| OMAN | SenseCrowd | 8.3 | 11.1 | Implicit O2M context/match |
| OMAN++ | WuhanMetroCrowd | 87.1 | 19.8 | Social + displacement priors |
| CGNet | SenseCrowd | 8.9 | 12.6 | Group-level contrastive OT |
| DCFA | MovingDroneCrowd | 41.0 | 19.3 | Density+cross-frame attention |
| CountVid | TAO-Count | 2.6 | — | Prompted open-world counting |
OMAN++ demonstrates up to a 38.12% WRAE reduction over the previous state of the art on WuhanMetroCrowd, attributed to the joint modeling of social grouping and displacement priors (Lu et al., 3 Jan 2026). In highly dynamic drone setups, density-based DCFA retains its advantage as crowd density increases (Fan et al., 12 Mar 2025). In open-world settings, the prompt-driven CountVid system reaches an MAE of 2.6 on TAO-Count (Amini-Naieni et al., 18 Jun 2025).
5. Domain-Specific Applications and System Designs
Retail Checkout (Multi-Class VIC): Systems such as VISTA tailor the VIC protocol to retail product differentiation, using a unified U-Net for hand-plus-item segmentation, ViT-based multi-class classification, and frame selection via a Colorfulness-Binarization-Threshold (CBT) metric. Temporal aggregation and near-duplicate suppression mediate video-level count recovery, with macro-F1 as the principal evaluation score. The combination of instance segmentation, entropy masking, and ViT has yielded an F1 of 0.4545 in real-world video challenges (Shihab et al., 2022).
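A sketch of the video-level aggregation idea for multi-class retail counting: per-frame class predictions are grouped into temporal runs, and runs of the same class separated by only a short gap are merged so that an item held in front of the camera across many frames is counted once. The gap threshold and run logic are assumptions for illustration, not VISTA's exact procedure.

```python
def aggregate_video_counts(frame_labels, max_gap=5):
    """frame_labels: per-frame predicted class label (or None if no item).
    Consecutive frames with the same class form one run; runs of the same
    class separated by at most `max_gap` frames are merged (near-duplicate
    suppression). Returns class -> number of distinct scanned items."""
    counts = {}
    last_seen = {}          # class -> frame index where it was last observed
    prev = None
    for t, label in enumerate(frame_labels):
        if label is not None and label != prev:
            # A new run starts unless it is a near-duplicate of a recent run.
            if label not in last_seen or t - last_seen[label] > max_gap:
                counts[label] = counts.get(label, 0) + 1
        if label is not None:
            last_seen[label] = t
        prev = label
    return counts

labels = ["soda", "soda", None, "soda", None, None, None, None, None, "chips"]
print(aggregate_video_counts(labels, max_gap=3))   # {'soda': 1, 'chips': 1}
```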
Vehicle Counting: Key-frame-based solutions (e.g., Visual Rhythm + YOLOv8) eschew dense tracking in favor of line-crossing cue extraction and selection of informative frames, achieving >99% counting accuracy at up to 3x speedup vs. conventional tracking (Ribeiro et al., 8 Jan 2025). Such pipelines are optimal for unidirectional, static-camera, moderate-flow regimes, but break down in severe occlusion or directionally heterogeneous traffic.
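A sketch of the line-crossing cue that key-frame vehicle counters rely on: a vehicle is counted the first time its tracked centroid crosses a virtual counting line between consecutive observations. This illustrates only the counting rule, not the Visual Rhythm key-frame selection itself.

```python
def count_line_crossings(tracks, line_y=240.0):
    """tracks: dict track_id -> list of (frame, x, y) centroid observations,
    assumed sorted by frame. A vehicle is counted once, the first time its
    centroid moves from one side of the horizontal line y = line_y to the
    other (unidirectional traffic assumed, as in the static-camera setup)."""
    count = 0
    for _, obs in tracks.items():
        for (_, _, y_prev), (_, _, y_curr) in zip(obs, obs[1:]):
            if (y_prev - line_y) * (y_curr - line_y) < 0:   # sign change
                count += 1
                break                                       # count each track once
    return count

tracks = {
    1: [(0, 100, 200), (1, 102, 230), (2, 104, 260)],   # crosses y = 240
    2: [(0, 300, 210), (1, 301, 225), (2, 303, 235)],   # never crosses
}
print(count_line_crossings(tracks))   # 1
```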
Multi-Camera and Indirect Approaches: Hybrid multi-view designs combine per-view SVM/AdaBoost head detection and indirect corner-based statistical estimation (e.g., Harris corners and region weighting), the latter being markedly more robust in high-density/occlusion settings (Dittrich et al., 2017).
6. Limitations, Open Problems, and Research Directions
Current SOTA models report several limitations:
- Persistent errors in extremely dense or highly occluded scenes, particularly when detection or localization quality drops or in "lobby" scenarios (WRAE ≈ 43%) (Lu et al., 3 Jan 2026).
- Reliance on accurate pedestrian or object localization—propagating errors from initial detection into the matching and counting stages.
- Fixed frame-pair sampling intervals may miss correspondences when object speeds or scene dynamics vary strongly over time.
- Hand-tuned prior weighting (λ) and hyperparameter schedules that lack full adaptivity (Lu et al., 3 Jan 2026, Zhu et al., 16 Jun 2025).
- For open-world VIC, failure modes include missed small/faint instances, drift under long occlusion, and susceptibility to distractor matches (Amini-Naieni et al., 18 Jun 2025).
This suggests there is substantial value in dynamic scheduling of temporal aggregation windows, adaptive motion priors, memory-augmented re-id modules for re-entrant individuals, and joint end-to-end learning of all system components (including detection, encoding, temporal association, and aggregation) (Lu et al., 3 Jan 2026, Liu et al., 2023). Multi-camera extensions and robust, cross-domain evaluation in arbitrary object classes remain major unsolved challenges (Amini-Naieni et al., 18 Jun 2025, Dittrich et al., 2017).
7. Summary Table: Representative Models
| Model | Key Mechanism | Scenario | Notable Results / Datasets | Reference |
|---|---|---|---|---|
| DRNet | Density + differentiable OT inflow | Pedestrian | MAE=12.3, WRAE=12.7% (SenseCrowd) | (Han et al., 2022) |
| OMAN++ | O2M match + grouping/displacement priors | Crowded Ped. | MAE=87.1, WRAE=19.8% (WuhanMetro) | (Lu et al., 3 Jan 2026) |
| DCFA | Density + cross-frame attention | Drone Crowds | MAE=41.0, WRAE=19.3% (MDC) | (Fan et al., 12 Mar 2025) |
| CGNet | Weak group-level, contrastive OT | Pedestrian | MAE=8.9, WRAE=12.6% (SenseCrowd) | (Liu et al., 2023) |
| CountVid | Prompt-based masklet propagation | Open-world | MAE=2.6 (TAO-Count), 50 (MOT20) | (Amini-Naieni et al., 18 Jun 2025) |
| VISTA | U-Net+ViT+CBT+duplicate removal | Retail | F1=0.4545 (AICITY22) | (Shihab et al., 2022) |
References
- (Han et al., 2022) "DR.VIC: Decomposition and Reasoning for Video Individual Counting"
- (Zhu et al., 16 Jun 2025) "Video Individual Counting With Implicit One-to-Many Matching"
- (Fan et al., 12 Mar 2025) "Video Individual Counting for Moving Drones"
- (Liu et al., 2023) "Weakly Supervised Video Individual Counting"
- (Lu et al., 3 Jan 2026) "Crowded Video Individual Counting Informed by Social Grouping and Spatial-Temporal Displacement Priors"
- (Amini-Naieni et al., 18 Jun 2025) "Open-World Object Counting in Videos"
- (Shihab et al., 2022) "VISTA: Vision Transformer enhanced by U-Net and ... for Automatic Retail Checkout"
- (Ribeiro et al., 8 Jan 2025) "Combining YOLO and Visual Rhythm for Vehicle Counting"
- (Dittrich et al., 2017) "People Counting in Crowded and Outdoor Scenes using a Hybrid Multi-Camera Approach"
These works collectively formalize the methodological landscape and establish robust, context- and prior-informed protocols for accurate, agile, and generalizable Video Individual Counting in diverse real-world scenarios.