WuhanMetroCrowd Dataset
- WuhanMetroCrowd is a benchmark video dataset comprising 80 surveillance clips from diverse metro stations, covering a wide range of pedestrian densities and flow patterns.
- The dataset offers detailed head-based instance annotations with temporal correspondence, enabling robust analysis of individual movement in congested and occluded environments.
- It serves as a benchmark for VIC evaluation; the accompanying OMAN++ baseline reports substantial improvements in MAE, MSE, and WRAE over previous methods.
WuhanMetroCrowd is a large-scale, high-density video dataset specifically designed for the Video Individual Counting (VIC) task under challenging metro commuting scenarios. VIC extends the conventional Video Crowd Counting (VCC) paradigm by not only estimating pedestrian counts per frame but also explicitly addressing the temporal correspondence problem: distinguishing co-existent individuals between consecutive frames in highly dynamic, occluded, and crowded urban transport environments (Lu et al., 3 Jan 2026).
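In the VIC formulation commonly used in prior work, and consistent with the label taxonomy described below, the video-level individual count over $T$ sampled frames decomposes into the pedestrians present in the first frame plus all subsequent inflows:

$N_{\text{video}} = N_1 + \sum_{t=2}^{T} I_t,$

where $N_1$ is the number of individuals visible in the first frame and $I_t$ is the inflow (newly appearing individuals) in frame $t$. Accurate video-level counting therefore hinges on resolving temporal correspondence between consecutive frames rather than on per-frame density estimation alone.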
1. Data Collection and Acquisition Protocol
WuhanMetroCrowd is constructed from 80 surveillance video clips recorded at 15 geographically and functionally diverse stations within the Wuhan Metro network. The sampled locations span transfer stations, major train stations, commercial districts, and dense traffic hubs. Each video is captured by standard ceiling-mounted surveillance cameras at native resolutions of 720×576, 1280×720, or 1920×1080 pixels. Frames are uniformly sampled at 0.5 Hz (one frame every 2 seconds) for all annotation and matching tasks.
The dataset comprises 11,925 frames in total (reflecting an average of approximately 149 frames per clip), with individual clip durations ranging from a few frames to a maximum of 864 frames (approximately 29 minutes at 0.5 Hz). Recording conditions vary across canonical metro scene types—platforms, transfer corridors, fare gates, escalators, security areas, lobbies, and entrances/exits—covering both day and night, as well as peak hours, holidays, and festival periods.
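As a practical illustration of the sampling protocol, the following minimal sketch extracts one frame every 2 seconds from a raw clip with OpenCV; the file paths and output naming are illustrative and not part of the official release.

```python
# Minimal sketch: sample frames at 0.5 Hz (one frame every 2 s), mirroring
# the dataset's sampling protocol. Paths and naming are illustrative only.
import cv2

def sample_frames(video_path: str, out_pattern: str, interval_s: float = 2.0) -> int:
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0   # fall back to 25 fps if metadata is missing
    step = max(1, round(fps * interval_s))    # native frames per sampled frame
    idx = kept = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:                   # keep one frame every `interval_s` seconds
            cv2.imwrite(out_pattern.format(kept), frame)
            kept += 1
        idx += 1
    cap.release()
    return kept

# e.g. sample_frames("station_clip.mp4", "frames/{:05d}.jpg")
```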
2. Density, Flow, and Scene Complexity
WuhanMetroCrowd presents extensive variability in pedestrian densities, flow rates, and crowd movement patterns. The dataset stratifies clips into five discrete density levels for each label type ("inflow", "outflow", "pedestrian/co-existent", and "total count"):
$\begin{array}{l|ccccc} \text{Density Level} & \text{Sparse} & \text{Normal} & \text{Crowded} & \text{Packed} & \text{Jam} \\ \hline \text{Inflow (}\rho\text{)} & (0,1] & (1,2] & (2,3] & (3,4] & (4,\,\infty) \end{array}$
For the “pedestrian” class, bin intervals are 5 persons per frame; for “total counts,” 10 persons per frame. Flow variation is defined for each clip as the average absolute change in count between consecutive sampled frames,

$\Delta_{\ell} = \frac{1}{T-1} \sum_{t=2}^{T} \left| c_t^{\ell} - c_{t-1}^{\ell} \right|,$

where $c_t^{\ell}$ is the count of a given label $\ell$ in frame $t$ and $T$ is the number of sampled frames in the clip. Bin sizes for flow variation are 4 (“inflow/outflow”), 10 (“pedestrian”), and 12 (“total”). Some clips exhibit extremely abrupt flow transitions, with frame-wise inflow/outflow differences reaching up to 28 individuals within a single 2-second interval.
Average relative flow variation metrics across all clips (Table I) are: inflow 0.75, outflow 0.88, pedestrian 1.87, total 2.37. The dataset further includes scenes with substantial appearance variability (camera viewpoint: front, side, back; strong scale changes due to perspective). Occlusion severity ranges from light to heavy; severely crowded or motion-blurred regions where identity is indiscernible are excluded by annotated binary masks and not evaluated.
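Both clip-level statistics can be computed directly from per-frame counts. The sketch below assumes the mean-absolute-change form of flow variation given above and the inflow bin edges from the table; the exact normalization used in the paper should be checked against the official code, and the helper names are illustrative.

```python
# Minimal sketch: density level and flow variation of a clip from its
# per-frame counts. Bin edges follow the inflow table above; the flow
# variation follows the mean-absolute-change form stated in the text.
from statistics import mean

INFLOW_BINS = [(1, "Sparse"), (2, "Normal"), (3, "Crowded"), (4, "Packed"), (float("inf"), "Jam")]

def density_level(counts, bins=INFLOW_BINS):
    """Map a clip's mean per-frame count to a density level."""
    rho = mean(counts)
    for upper, name in bins:
        if rho <= upper:
            return name

def flow_variation(counts):
    """Mean absolute change of the count between consecutive sampled frames."""
    return mean(abs(b - a) for a, b in zip(counts, counts[1:]))

# e.g. per-frame inflow counts of one short clip
inflow = [0, 2, 1, 3, 0, 4, 2]
print(density_level(inflow), round(flow_variation(inflow), 2))  # -> Normal 2.33
```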
3. Manual Annotation Protocol
Head-based instance annotation is performed with the open-source GUI X-AnyLabeling. Ten trained annotators, following an iterative three-frame comparison protocol, assign every visible head instance one of four categorical labels:
- “Pedestrian”: visible in both the current and the next frame.
- “Inflow”: appears in the current frame but not in the previous one.
- “Outflow”: present in the current frame but absent from the next (exits after the current frame).
- “Both”: appears only in the current frame, i.e., enters and exits within the 2 s sampling interval.
The annotation scope extends to head centers and identity-preserving instance association, with binary masks explicitly drawn over ambiguous or indiscernible regions (e.g., heavy crowding, severe motion blur), which are then disregarded for downstream benchmarking. All labels and masks are cross-validated by two independent senior checkers; no explicit inter-annotator agreement statistics are reported in the paper, but the two-checker validation represents the stated quality-control mechanism.
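The labelling rule can be expressed mechanically over identity-consistent head IDs in three consecutive sampled frames. The sketch below treats the four labels as mutually exclusive based on presence in the previous and next frames; the per-frame ID sets, function name, and precedence are assumptions for illustration, not the dataset's official tooling or on-disk format.

```python
# Minimal sketch of the three-frame labelling rule, assuming each sampled
# frame provides the set of identity IDs visible in it.
def label_instances(prev_ids: set, curr_ids: set, next_ids: set) -> dict:
    """Assign each head instance in the current frame one of the four labels."""
    labels = {}
    for pid in curr_ids:
        in_prev, in_next = pid in prev_ids, pid in next_ids
        if in_prev and in_next:
            labels[pid] = "pedestrian"   # co-existent with the neighbouring frames
        elif not in_prev and in_next:
            labels[pid] = "inflow"       # newly appeared
        elif in_prev and not in_next:
            labels[pid] = "outflow"      # about to leave
        else:
            labels[pid] = "both"         # enters and exits within the 2 s interval
    return labels

# e.g. IDs 1 and 2 persist, 3 just entered, 4 is leaving, 5 appears only here
print(label_instances({1, 2, 4}, {1, 2, 3, 4, 5}, {1, 2, 3}))
```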
4. Dataset Splits, Accessibility, and Intended Usage
The dataset is partitioned into non-overlapping training, validation, and test sets, with no repetition of stations/scenes across splits:
| Subset | Number of Sequences | Proportion (%) |
|---|---|---|
| Train | 45 | 56.3 |
| Validation | 15 | 18.8 |
| Test | 20 | 25.0 |
Splits are performed so that each subset remains representative in terms of clip duration, scene type, and the full range of density and flow variation. The full dataset, including code and pretrained models, is publicly available for research purposes at https://github.com/tiny-smart/OMAN. No formal license is stated; access is granted for academic and non-commercial use.
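One straightforward way to realize such a split is to assign whole stations to subsets so that no scene recurs across them; the sketch below uses the 45/15/20 clip targets from the table above. The greedy assignment only approximates the target sizes and is illustrative, not the authors' partitioning script; the manifest fields are assumed.

```python
# Minimal sketch: station-disjoint train/val/test partition of clip manifests.
import random
from collections import defaultdict

def station_disjoint_split(clips, sizes=(45, 15, 20), seed=0):
    """clips: iterable of dicts with at least 'clip_id' and 'station' keys."""
    by_station = defaultdict(list)
    for clip in clips:
        by_station[clip["station"]].append(clip)
    stations = list(by_station)
    random.Random(seed).shuffle(stations)

    subsets = [[], [], []]                    # train, val, test
    for station in stations:
        group = by_station[station]
        # place the whole station into the first subset with room; fall back
        # to the last subset so every clip is assigned exactly once
        for i, target in enumerate(sizes):
            if len(subsets[i]) + len(group) <= target or i == len(sizes) - 1:
                subsets[i].extend(group)
                break
    train, val, test = subsets
    return train, val, test
```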
5. Dataset Scale and Quantitative Summary
WuhanMetroCrowd comprises:
- Total clips: 80
- Total frames: 11,925
- Total head annotations: 223,662
- Average frames per clip: 149.06
- Average annotations per clip: ≈2,796
Density levels span extremely sparse to maximum-congestion (“jam”) settings, and the stratified density and flow-variation distributions cover both routine and challenging crowd-counting conditions, supporting VIC benchmarking from easy to highly congested scenarios.
6. Benchmarking Protocol and Baseline Evaluations
The dataset is used to benchmark several VIC models, with Mean Absolute Error (MAE), Mean Squared Error (MSE), and Weighted Relative Absolute Error (WRAE, %) reported on the canonical test split. Performance on the WuhanMetroCrowd test set is as follows:
$\begin{array}{l|rrr} \text{Method} & \text{MAE}\downarrow & \text{MSE}\downarrow & \text{WRAE (\%)}\downarrow \\ \hline \text{ByteTrack} & 215.5 & 477.8 & 48.3 \\ \text{OC-SORT} & 287.9 & 549.6 & 66.8 \\ \text{CGNet} & 172.4 & 368.5 & 37.5 \\ \text{MDC} & 166.0 & 439.8 & 32.0 \\ \text{OMAN} & 135.9 & 284.1 & 31.8 \\ \textbf{OMAN++} & \mathbf{87.1} & \mathbf{160.2} & \mathbf{19.8} \end{array}$
Scene-type-wise WRAE (%) for OMAN++:
$\begin{array}{l|cccccc} \text{Scene} & \text{Platform} & \text{Transfer} & \text{Fare Gate} & \text{Lobby} & \text{Escalator} & \text{Entrance/Exit} \\ \hline \text{WRAE (\%)} & 15.4 & 28.9 & 11.5 & 43.5 & 23.3 & 45.4 \end{array}$
Relative to MDC, OMAN++ achieves a 47.5% reduction in MAE, 63.6% in MSE, and 38.1% in WRAE. This suggests that models leveraging social grouping and spatial-temporal displacement priors, as introduced in OMAN++, can provide substantial accuracy improvements in heavy pedestrian congestion (Lu et al., 3 Jan 2026).
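For reference, the three sequence-level metrics can be computed from per-clip predicted and ground-truth video-level counts. The sketch below follows the convention of earlier VIC benchmarks, reporting the root of the mean squared error as “MSE” and weighting each clip's relative error by its frame count for WRAE; the exact definitions used for the table above should be taken from the official repository.

```python
# Minimal sketch of the VIC evaluation metrics (MAE, MSE, WRAE) over a test split.
import math

def vic_metrics(gt_counts, pred_counts, frame_counts):
    """Per-clip ground-truth totals, predicted totals, and numbers of frames."""
    errors = [abs(p - g) for g, p in zip(gt_counts, pred_counts)]
    mae = sum(errors) / len(errors)
    mse = math.sqrt(sum(e * e for e in errors) / len(errors))  # root mean squared error
    total_frames = sum(frame_counts)
    wrae = 100.0 * sum((t / total_frames) * abs(p - g) / g     # length-weighted relative error
                       for g, p, t in zip(gt_counts, pred_counts, frame_counts))
    return mae, mse, wrae

# e.g. three clips: ground-truth totals, predictions, and frame counts
print(vic_metrics([320, 150, 90], [300, 165, 88], [200, 120, 60]))
```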
7. Significance and Application Domain
WuhanMetroCrowd is one of the first datasets to support rigorous evaluation on the temporal correspondence problem inherent in VIC for metro-scale pedestrian flows. Its design—characterizing sparse-to-jammed densities, temporally abrupt crowd dynamics, occlusion extremes, and diverse scene typologies—positions it as a reference benchmark for developing robust individual counting models in realistic, high-density crowd environments. A plausible implication is its utility for research in public safety analytics, station design, transportation planning, and algorithmic benchmarking for crowd-related computer vision systems.
For further methodological details, density definitions, and baseline code, refer to the official repository: https://github.com/tiny-smart/OMAN (Lu et al., 3 Jan 2026).