
TrackletGait: Robust Gait Recognition

Updated 23 February 2026
  • TrackletGait is a robust gait recognition framework that processes short, variable-length tracklets to identify individuals in unconstrained settings.
  • It employs innovative techniques such as Random Tracklet Sampling, Haar Wavelet–based Downsampling, and Hardness Exclusion Triplet Loss to enhance discriminative power.
  • Experimental results demonstrate improved recognition rates on benchmarks like Gait3D, GREW, OU-MVLP, and CASIA-B, even under occlusion and noisy conditions.

TrackletGait is a robust state-of-the-art framework designed for gait recognition in unconstrained “in-the-wild” scenarios characterized by non-periodic motion, frequent occlusions, background clutter, and highly variable video quality. Unlike traditional gait recognition pipelines that rely on well-aligned, multi-cycle walking sequences in controlled environments, TrackletGait specifically targets the practical challenges of real surveillance data, where silhouettes may be fragmentary, occluded, or captured only once as an individual passes a camera. TrackletGait operates on fragmented “tracklets” (short contiguous subsequences of silhouettes), using an advanced suite of sampling, feature extraction, and loss strategies to optimize robustness and discriminative power for person identification based on gait (Zhang et al., 4 Aug 2025).

1. Framework Overview and Problem Formulation

TrackletGait addresses the limitations of earlier methods by (a) decoupling gait representation from reliance on full walking cycles, (b) maximizing the effective use of unreliable, short, or noisy silhouette sequences, and (c) structurally filtering out detrimental training samples. The pipeline comprises three major modules:

  • Random Tracklet Sampling (RTS): Stochastic sampling of variable-length tracklets from each sequence, providing temporal coverage and diversity by drawing short consecutive silhouette fragments at random offsets.
  • Haar Wavelet–based Downsampling (HWD): A substitution for conventional strided convolutions or pooling in spatial downsampling units, using a lossless two-dimensional Haar discrete wavelet transform to retain both low- and high-frequency cues.
  • Hardness Exclusion Triplet Loss (HE-Triplet): An adaptation of batch-all triplet mining, which excludes extremely hard anchor–positive samples (defined by excessive intra-class distance), mitigating the effect of low-quality, noisy, or occluded silhouettes in optimization.

This architecture is instantiated as a 22-layer ResNet-style backbone with P3D (pseudo-3D) residual units, facilitating both temporal and spatial modeling (Zhang et al., 4 Aug 2025).

2. Random Tracklet Sampling (RTS)

Random Tracklet Sampling generalizes prior sequence subsampling methods to balance local motion detail and global diversity. For a silhouette sequence $X = \{x_1, \dots, x_M\}$, a fixed number $N$ of frames is selected by repeatedly sampling short consecutive fragments (“tracklets”):

$$\mathcal{X}_r = [\,x_r,\, x_{r+s},\, \dots,\, x_{r+(l-1)s}\,], \qquad r \sim \mathrm{Uniform}\bigl(1,\; M-(l-1)s\bigr)$$

$$u = N/l, \qquad \mathcal{X}_{\text{RTS-}l} = \bigcup_{i=1}^{u} \mathcal{X}_{r_i}$$

The tracklet length $l$ is drawn from a discrete probability mass function:

$$P(l) = \begin{cases} p_8, & l = 8 \\ p_{16}, & l = 16 \\ p_{32}, & l = 32 \end{cases}, \qquad p_8 + p_{16} + p_{32} = 1$$

Typical configuration: $N = 32$, $p_8 = 0.2$, $p_{16} = 0.3$, $p_{32} = 0.5$. This sampling scheme subsumes prior approaches as special cases (e.g., GaitSet’s random frame sampling and consecutive sampling). RTS ensures that both short, accurate fragments and long, temporally diverse sequences are utilized, enhancing robustness to partial occlusions and complex walking patterns (Zhang et al., 4 Aug 2025).
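
As an illustration, the index-level sampling procedure above can be sketched in a few lines of Python. This is a simplified 0-indexed sketch under the paper's typical configuration, not the authors' implementation; it assumes the sequence is long enough to hold each tracklet.

```python
import random

def random_tracklet_sampling(seq_len, n_frames=32, stride=1,
                             lengths=(8, 16, 32), probs=(0.2, 0.3, 0.5)):
    """Sample n_frames indices from a sequence of seq_len silhouettes by
    concatenating randomly placed consecutive tracklets of stochastic length."""
    # Draw one tracklet length l for this sequence from the discrete pmf P(l).
    l = random.choices(lengths, weights=probs, k=1)[0]
    u = n_frames // l  # number of tracklets needed to reach n_frames total
    span = (l - 1) * stride
    indices = []
    for _ in range(u):
        # Random start offset so the whole tracklet fits inside the sequence
        # (0-indexed analogue of r ~ Uniform(1, M - (l-1)s)).
        r = random.randint(0, max(0, seq_len - 1 - span))
        indices.extend(r + k * stride for k in range(l))
    return indices

# Example: sample 32 frame indices from a 100-frame silhouette sequence.
idx = random_tracklet_sampling(seq_len=100)
```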

3. Haar Wavelet–based Downsampling (HWD)

TrackletGait replaces strided convolutions in spatial downsampling with a one-stage, lossless 2D Haar discrete wavelet transform (DWT). Given a convolutional feature map $F \in \mathbb{R}^{C \times H \times W}$, HWD decomposes the signal into four subbands:

$$\text{LL},\ \text{LH},\ \text{HL},\ \text{HH} \in \mathbb{R}^{C \times \frac{H}{2} \times \frac{W}{2}}$$

The four subbands are concatenated along the channel axis into $\mathbb{R}^{4C \times \frac{H}{2} \times \frac{W}{2}}$, then projected back to $C$ channels with a $1 \times 1$ convolution. This preserves edge and texture cues, which are critical for modeling silhouette boundaries, without the signal loss or blurring associated with learned strided convolutions or pooling. HWD is integrated into each downsampling P3D block of the backbone, maintaining detail throughout the network’s spatial hierarchy (Zhang et al., 4 Aug 2025).
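
A minimal NumPy sketch of the decomposition step (the orthonormal 2×2 Haar transform applied per spatial block; the subsequent learned $1 \times 1$ projection back to $C$ channels is omitted, and LH/HL naming conventions vary between libraries):

```python
import numpy as np

def haar_downsample(F):
    """Lossless single-level 2D Haar DWT of a feature map F of shape
    (C, H, W) with even H and W. Returns shape (4C, H/2, W/2): the
    LL, LH, HL, HH subbands concatenated along the channel axis."""
    a = F[:, 0::2, 0::2]  # top-left pixel of each 2x2 block
    b = F[:, 0::2, 1::2]  # top-right
    c = F[:, 1::2, 0::2]  # bottom-left
    d = F[:, 1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0  # low-frequency average
    lh = (a - b + c - d) / 2.0  # horizontal detail
    hl = (a + b - c - d) / 2.0  # vertical detail
    hh = (a - b - c + d) / 2.0  # diagonal detail
    return np.concatenate([ll, lh, hl, hh], axis=0)
```

Because the transform is orthogonal, the original 2×2 block is exactly recoverable from the four subband coefficients, which is the sense in which this downsampling is lossless.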

4. Hardness Exclusion Triplet Loss

The Hardness Exclusion Triplet Loss modifies batch-all triplet mining by excluding anchor–positive pairs whose intra-class distance $d_{ap}$ exceeds a dynamic threshold:

$$d_{\text{thresh}} = d_{\text{mean}} + \alpha\,\bigl(d_{\max} - d_{\text{mean}}\bigr), \qquad \alpha \in (0, 1)$$

Within every batch, only triplets satisfying $d_{ap} \leq d_{\text{thresh}}$ contribute gradient signal:

$$\mathcal{L}_{\mathrm{HE}} = \sum_{(a,p,n)} \max\{0,\; d_{ap} - d_{an} + \xi\}\;\cdot\; \mathbf{1}\bigl[\,d_{ap} \le d_{\text{thresh}}\,\bigr]$$

with margin parameter $\xi$. Setting $\alpha = 2/3$ yields the best performance on benchmark datasets. This mechanism discards the most corrupted or ambiguous positives, often heavily occluded or nearly blank silhouettes, while retaining challenging but informative cases, leading to more stable convergence and higher recognition rates (Zhang et al., 4 Aug 2025).
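
The exclusion rule can be sketched as follows. This is a plain NumPy version with Euclidean distances; the loop-based batch-all enumeration and averaging over valid triplets are illustrative choices, not necessarily the paper's exact normalization.

```python
import numpy as np

def he_triplet_loss(emb, labels, alpha=2.0 / 3.0, margin=0.2):
    """Batch-all triplet loss with hardness exclusion: anchor-positive
    pairs with d_ap > d_mean + alpha * (d_max - d_mean) are masked out."""
    n = len(labels)
    # Pairwise Euclidean distance matrix over the batch embeddings.
    d = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    same = labels[:, None] == labels[None, :]
    pos_mask = same & ~np.eye(n, dtype=bool)
    d_pos = d[pos_mask]
    # Dynamic threshold computed from batch anchor-positive distances.
    d_thresh = d_pos.mean() + alpha * (d_pos.max() - d_pos.mean())
    total, count = 0.0, 0
    for a in range(n):
        for p in range(n):
            if not pos_mask[a, p] or d[a, p] > d_thresh:
                continue  # exclude too-hard (likely corrupted) positives
            for g in range(n):
                if same[a, g]:
                    continue
                total += max(0.0, d[a, p] - d[a, g] + margin)
                count += 1
    return total / max(count, 1)
```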

5. Network Architecture and Implementation

TrackletGait’s backbone comprises a ResNet-22 with P3D units incorporating HWD-based spatial downsampling. Key architectural details:

  • Base channel width: 64, for a total of ≈10.3M parameters.
  • Temporal aggregation: Max-pooling over the input sequence length (typically $N = 32$), providing robustness to sporadic missing or corrupted frames.
  • Horizontal Pooling: Features are divided into 16 horizontal bins, each aggregated with global average- and max-pooling, then projected to a 256-dimensional vector.
  • Embedding head: Batch normalization neck ("BNNeck") yielding final 256-d gait descriptors.
  • Training: SGD optimizer, lr = 0.1, momentum = 0.9, weight decay = 5e-4; learning rate scheduled at 40k/80k/100k steps; typical training for 120–180k iterations.
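
The horizontal pooling step above might be sketched as follows. Combining average and max pooling by summation is a common GaitSet-style choice assumed here; the learned per-bin projection to 256 dimensions is omitted.

```python
import numpy as np

def horizontal_pooling(F, bins=16):
    """Split a (C, H, W) feature map into `bins` horizontal strips and
    aggregate each strip with global average + max pooling.
    Returns shape (bins, C); a learned linear layer would then map
    each C-dim bin feature to the final 256-dim descriptor."""
    strips = np.split(F, bins, axis=1)  # each strip: (C, H/bins, W)
    feats = [s.mean(axis=(1, 2)) + s.max(axis=(1, 2)) for s in strips]
    return np.stack(feats)
```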

Augmentations include random horizontal flip and silhouette normalization. Batch formation uses 32 identities × 4 sequences per batch, exploiting diverse intra-class pairings for HE-Triplet mining (Zhang et al., 4 Aug 2025).
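
This batch layout is the standard P×K sampling scheme; a hypothetical sketch (`pk_batch` and `ids_to_seqs` are illustrative names, not from the paper):

```python
import random

def pk_batch(ids_to_seqs, p=32, k=4):
    """Form one P x K batch: p identities, k sequences each. Identities
    with fewer than k sequences are sampled with replacement."""
    chosen = random.sample(list(ids_to_seqs), p)
    batch = []
    for pid in chosen:
        seqs = ids_to_seqs[pid]
        picks = (random.sample(seqs, k) if len(seqs) >= k
                 else random.choices(seqs, k=k))
        batch.extend((pid, s) for s in picks)
    return batch  # list of (identity, sequence) pairs, length p * k
```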

6. Experimental Results and Comparative Analysis

TrackletGait demonstrates state-of-the-art (SOTA) performance on established wild and lab gait recognition benchmarks. Notable outcomes include:

| Method | Params (M) | Gait3D R1 (%) | GREW R1 (%) |
|---|---|---|---|
| DeepGaitV2-P3D-64 | 11.1 | 74.4 | 77.7 |
| TrackletGait-64 | 10.3 | 77.8 | 80.4 |

TrackletGait also achieves 91.9% and 94.1% Rank-1 accuracy on OU-MVLP and CASIA-B, respectively, closely matching SOTA on controlled lab datasets. Component ablations confirm the interplay between modules: introducing RTS increases Gait3D Rank-1 from 75.9% to 77.0%; adding HWD brings it to 77.2%; and including HE-Triplet further elevates to 77.8% (Zhang et al., 4 Aug 2025).

Analysis of sampling-length trade-offs reveals that shorter tracklets (e.g., $l = 8$) are advantageous for high-variance datasets like Gait3D, while longer tracklets ($l = 32$) benefit datasets with more consistent lateral walking, such as GREW. Optimal HE-Triplet performance is obtained with $\alpha = 2/3$, outperforming both batch-all and batch-hard mining strategies.

Qualitative batch analysis indicates that excluded pairs under HE-Triplet reflect severely degraded input (blank/occluded frames), supporting the premise that strategic “exclusion” enhances discriminative learning (Zhang et al., 4 Aug 2025).

TrackletGait’s tracklet-based representation contrasts with skeleton-based pipelines such as WildGait (Cosma et al., 2021), which model skeleton dynamics using spatiotemporal graph convolutions and operate on automatically annotated joint sequences from surveillance streams. While both target unconstrained settings and privacy-sensitive data, WildGait’s pipeline employs (a) automatic pose extraction, (b) weak “pseudo-identity” labels assigned via intra-camera tracking, and (c) a supervised contrastive loss optimized for cross-domain transfer (Cosma et al., 2021). TrackletGait, in contrast, directly leverages silhouette fragments and explicitly rejects detrimental training samples via loss design.

A plausible implication is that future frameworks may hybridize tracklet-based silhouette strategies with graph-based skeleton modeling to unite the privacy, robustness, and transfer characteristics observed in both research lines.
