TrackletGait: Robust Gait Recognition
- TrackletGait is a robust gait recognition framework that processes short, variable-length tracklets to identify individuals in unconstrained settings.
- It employs innovative techniques such as Random Tracklet Sampling, Haar Wavelet–based Downsampling, and Hardness Exclusion Triplet Loss to enhance discriminative power.
- Experimental results demonstrate improved recognition rates on benchmarks like Gait3D, GREW, OU-MVLP, and CASIA-B, even under occlusion and noisy conditions.
TrackletGait is a robust state-of-the-art framework designed for gait recognition in unconstrained “in-the-wild” scenarios characterized by non-periodic motion, frequent occlusions, background clutter, and highly variable video quality. Unlike traditional gait recognition pipelines that rely on well-aligned, multi-cycle walking sequences in controlled environments, TrackletGait specifically targets the practical challenges of real surveillance data, where silhouettes may be fragmentary, occluded, or captured only once as an individual passes a camera. TrackletGait operates on fragmented “tracklets”—short contiguous subsequences of silhouettes—using an advanced suite of sampling, feature extraction, and loss strategies to optimize robustness and discriminative power for person identification based on gait (Zhang et al., 4 Aug 2025).
1. Framework Overview and Problem Formulation
TrackletGait addresses the limitations of earlier methods by (a) decoupling gait representation from reliance on full walking cycles, (b) maximizing the effective use of unreliable, short, or noisy silhouette sequences, and (c) structurally filtering out detrimental training samples. The pipeline comprises three major modules:
- Random Tracklet Sampling (RTS): Stochastic sampling of variable-length tracklets from each sequence, providing temporal coverage and diversity by drawing short consecutive silhouette fragments at random offsets.
- Haar Wavelet–based Downsampling (HWD): A substitution for conventional strided convolutions or pooling in spatial downsampling units, using a lossless two-dimensional Haar discrete wavelet transform to retain both low- and high-frequency cues.
- Hardness Exclusion Triplet Loss (HE-Triplet): An adaptation of batch-all triplet mining, which excludes extremely hard anchor–positive samples (defined by excessive intra-class distance), mitigating the effect of low-quality, noisy, or occluded silhouettes in optimization.
This architecture is instantiated as a 22-layer ResNet-style backbone with P3D (pseudo-3D) residual units, facilitating both temporal and spatial modeling (Zhang et al., 4 Aug 2025).
2. Random Tracklet Sampling (RTS)
Random Tracklet Sampling generalizes prior sequence subsampling methods to balance local motion detail and global temporal diversity. For a silhouette sequence, a fixed budget of frames is selected by repeatedly sampling short consecutive fragments (“tracklets”) at random offsets until the budget is filled.
The length of each sampled tracklet is drawn from a discrete probability mass function over a small set of candidate lengths.
This sampling scheme encompasses prior approaches as special cases: a pmf concentrated on length one recovers GaitSet-style random frame sampling, while a pmf concentrated on a single long length recovers purely consecutive sampling. RTS ensures that both brief, accurate fragments and long, temporally diverse fragments are utilized, enhancing robustness to partial occlusions and complex walking patterns (Zhang et al., 4 Aug 2025).
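As a concrete illustration, the sampling loop described above can be sketched as follows. The candidate tracklet lengths and their probabilities here are illustrative placeholders, not the paper's configuration, and the sequence is assumed to be at least as long as the largest tracklet:

```python
import numpy as np

def random_tracklet_sampling(seq_len, num_frames, tracklet_lengths, probs, rng=None):
    """Sample `num_frames` indices from a sequence of length `seq_len` by
    repeatedly drawing short consecutive fragments ("tracklets").

    `tracklet_lengths` and `probs` define the discrete pmf over fragment
    lengths. Illustrative sketch only, not the authors' reference code.
    """
    if rng is None:
        rng = np.random.default_rng()
    indices = []
    while len(indices) < num_frames:
        # Draw a tracklet length from the discrete pmf, capped by the
        # remaining frame budget and the sequence length.
        k = int(rng.choice(tracklet_lengths, p=probs))
        k = min(k, num_frames - len(indices), seq_len)
        # Draw a random start offset so the whole fragment fits.
        start = int(rng.integers(0, seq_len - k + 1))
        indices.extend(range(start, start + k))
    return np.array(indices[:num_frames])
```

Setting every tracklet length to one degenerates to random frame sampling, while a single length equal to the full budget degenerates to consecutive sampling, matching the special cases noted above.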
3. Haar Wavelet–based Downsampling (HWD)
TrackletGait replaces strided convolutions in spatial downsampling with a one-stage, lossless 2D Haar discrete wavelet transform (DWT). Given a convolutional feature map, HWD decomposes each channel into four half-resolution subbands: a low-frequency approximation (LL) and three high-frequency detail components (LH, HL, HH).
The four subbands are concatenated along the channel dimension, quadrupling the channel count, then projected back to the target width with a pointwise (1×1) convolution. This approach preserves edge and texture cues that are critical for modeling silhouette boundaries, without the signal loss or blurring associated with learned strided convolutions or pooling. HWD is integrated into each downsampling P3D block of the backbone, maintaining detail throughout the network’s spatial hierarchy (Zhang et al., 4 Aug 2025).
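A minimal sketch of the Haar decomposition step, written with plain array slicing rather than a wavelet library and omitting the learned channel projection; even spatial dimensions are assumed:

```python
import numpy as np

def haar_downsample(x):
    """Single-level 2D Haar DWT downsampling of a feature map x of shape (C, H, W).

    Each 2x2 spatial block is mapped to four subband coefficients (LL, LH,
    HL, HH), halving spatial resolution and quadrupling channels. The
    transform is lossless (orthonormal), so the input can be reconstructed
    exactly from the subbands. Sketch only; H and W must be even.
    """
    a = x[:, 0::2, 0::2]  # top-left of each 2x2 block
    b = x[:, 0::2, 1::2]  # top-right
    c = x[:, 1::2, 0::2]  # bottom-left
    d = x[:, 1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0  # low-frequency approximation
    lh = (a - b + c - d) / 2.0  # horizontal detail
    hl = (a + b - c - d) / 2.0  # vertical detail
    hh = (a - b - c + d) / 2.0  # diagonal detail
    return np.concatenate([ll, lh, hl, hh], axis=0)  # shape (4C, H/2, W/2)
```

Because the transform is invertible (for example, the top-left pixel of each block satisfies a = (LL + LH + HL + HH) / 2), no spatial information is discarded before the subsequent convolution, in contrast to strided convolution or pooling.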
4. Hardness Exclusion Triplet Loss
The Hardness Exclusion Triplet Loss modifies batch-all triplet mining by excluding anchor–positive pairs whose intra-class distance exceeds a dynamic threshold $\tau$.
Within every batch, only triplets $(a, p, n)$ whose anchor–positive distance satisfies $d(f_a, f_p) \le \tau$ contribute gradient signal:

$$\mathcal{L}_{\text{HE}} = \frac{1}{|\mathcal{T}|} \sum_{(a,p,n) \in \mathcal{T}} \left[\, d(f_a, f_p) - d(f_a, f_n) + m \,\right]_{+},$$

where $\mathcal{T}$ is the set of surviving triplets and $m$ is the margin parameter. A suitably chosen threshold $\tau$ yields optimal performance on benchmark datasets. This mechanism discards the most corrupted or ambiguous positives, often heavily occluded or nearly blank silhouettes, while retaining challenging but informative cases, leading to more stable convergence and higher recognition rates (Zhang et al., 4 Aug 2025).
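The exclusion rule can be sketched as follows over a precomputed pairwise distance matrix; the margin and threshold values are illustrative assumptions, and the loop form favors clarity over vectorized speed:

```python
import numpy as np

def he_triplet_loss(dist, labels, margin=0.2, tau=1.0):
    """Batch-all triplet loss with hardness exclusion (illustrative sketch).

    dist:   (N, N) pairwise distance matrix for a batch of embeddings
    labels: (N,) identity labels
    Anchor-positive pairs with d(a, p) > tau are excluded entirely, so
    severely corrupted positives never generate gradients; margin and
    tau defaults here are assumptions, not the paper's values.
    """
    n = len(labels)
    losses = []
    for a in range(n):
        for p in range(n):
            if p == a or labels[p] != labels[a]:
                continue
            if dist[a, p] > tau:  # hardness exclusion
                continue
            for q in range(n):
                if labels[q] == labels[a]:
                    continue
                # Standard triplet hinge over the surviving triplets.
                losses.append(max(dist[a, p] - dist[a, q] + margin, 0.0))
    return float(np.mean(losses)) if losses else 0.0
```

Lowering `tau` toward zero excludes ever more positives (approaching a degenerate loss of zero), while `tau` above the largest intra-class distance recovers plain batch-all mining.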
5. Network Architecture and Implementation
TrackletGait’s backbone comprises a ResNet-22 with P3D units incorporating HWD-based spatial downsampling. Key architectural details:
- Base channel width: 64, for a total of ≈10.3M parameters.
- Temporal aggregation: Max-pooling over the input sequence length, providing robustness to sporadically missing or corrupted frames.
- Horizontal Pooling: Features are divided into 16 horizontal bins, each aggregated with global average- and max-pooling, then projected to a 256-dimensional vector.
- Embedding head: Batch normalization neck ("BNNeck") yielding final 256-d gait descriptors.
- Training: SGD optimizer, lr = 0.1, momentum = 0.9, weight decay = 5e-4; learning rate scheduled at 40k/80k/100k steps; typical training for 120–180k iterations.
Augmentations include random horizontal flip and silhouette normalization. Batch formation uses 32 identities × 4 sequences per batch, exploiting diverse intra-class pairings for HE-Triplet mining (Zhang et al., 4 Aug 2025).
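The horizontal pooling head described above can be sketched as follows; the per-bin projection to 256 dimensions is omitted, and combining average- and max-pooling by summation is an illustrative assumption:

```python
import numpy as np

def horizontal_pooling(feat, num_bins=16):
    """Horizontal-bin pooling sketch for a (C, H, W) feature map.

    The height axis is split into `num_bins` horizontal strips; each strip
    is summarized by global average- plus max-pooling over its pixels,
    yielding one descriptor column per strip. A learned projection to a
    256-d vector per bin (not shown) would follow. H must divide evenly.
    """
    c, h, w = feat.shape
    assert h % num_bins == 0, "sketch assumes H divisible by num_bins"
    strips = feat.reshape(c, num_bins, h // num_bins, w)
    # Average and max statistics per strip, combined additively.
    pooled = strips.mean(axis=(2, 3)) + strips.max(axis=(2, 3))
    return pooled  # shape (C, num_bins)
```

Binning along the height axis exploits the fact that different body parts (head, torso, legs) occupy distinct horizontal bands of a normalized silhouette.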
6. Experimental Results and Comparative Analysis
TrackletGait demonstrates state-of-the-art (SOTA) performance on established wild and lab gait recognition benchmarks. Notable outcomes include:
| Method | Params (M) | Gait3D R1 (%) | GREW R1 (%) |
|---|---|---|---|
| DeepGaitV2-P3D-64 | 11.1 | 74.4 | 77.7 |
| TrackletGait-64 | 10.3 | 77.8 | 80.4 |
TrackletGait also achieves 91.9% and 94.1% Rank-1 accuracy on OU-MVLP and CASIA-B, respectively, closely matching SOTA on controlled lab datasets. Component ablations confirm the interplay between modules: introducing RTS increases Gait3D Rank-1 from 75.9% to 77.0%; adding HWD brings it to 77.2%; and including HE-Triplet further elevates to 77.8% (Zhang et al., 4 Aug 2025).
Analysis of sampling-length trade-offs reveals that shorter tracklets are advantageous for high-variance datasets like Gait3D, while longer tracklets benefit datasets with more consistent lateral walking, such as GREW. With a suitably chosen exclusion threshold, HE-Triplet outperforms both batch-all and batch-hard mining strategies.
Qualitative batch analysis indicates that excluded pairs under HE-Triplet reflect severely degraded input (blank/occluded frames), supporting the premise that strategic “exclusion” enhances discriminative learning (Zhang et al., 4 Aug 2025).
7. Related Approaches and Distinctions
TrackletGait’s tracklet-based representation contrasts with skeleton-based pipelines such as WildGait (Cosma et al., 2021), which model skeleton dynamics using spatiotemporal graph convolutions and operate on automatically annotated joint sequences from surveillance streams. While both target unconstrained settings and privacy-sensitive data, WildGait’s pipeline employs (a) automatic pose extraction, (b) weak “pseudo-identity” labels assigned via intra-camera tracking, and (c) a supervised contrastive loss optimized for cross-domain transfer (Cosma et al., 2021). TrackletGait, in contrast, directly leverages silhouette fragments and explicitly rejects detrimental training samples via loss design.
A plausible implication is that future frameworks may hybridize tracklet-based silhouette strategies with graph-based skeleton modeling to unite the privacy, robustness, and transfer characteristics observed in both research lines.