
Deep Gait Recognition

Updated 19 September 2025
  • Deep gait recognition is a biometric approach that extracts unique spatiotemporal patterns from human locomotion for identification in unobtrusive and long-range scenarios.
  • It employs diverse sensing modalities such as silhouettes, skeletons, depth maps, and inertial data, combined with CNNs, GCNs, and transformer architectures to boost robustness.
  • Advanced training strategies, including supervised and metric losses with multi-modal fusion, tackle challenges like occlusion, view variation, and domain shifts for improved real-world performance.

Deep gait recognition is a subfield of biometric person identification that utilizes deep learning to extract, encode, and match spatiotemporal patterns of human locomotion. Gait, defined as the unique sequence of movements that individuals exhibit during walking or running, is recognized for its suitability in unobtrusive and long-range identification scenarios, such as surveillance and secure access. Deep learning has fundamentally transformed this area, enabling systems to achieve high accuracy under challenging conditions by jointly modeling spatial and temporal dynamics in a variety of sensing modalities.

1. Body Representation and Sensing Modalities

Deep gait recognition methods are largely defined by their input representation, which directly influences robustness to covariates such as view angle, clothing, and environment. The most prevalent representations are:

  • Silhouette-based: Binary masks of the body extracted from video sequences. These offer invariance to color and texture but can lose fine-grained joint information and are sensitive to occlusions (Sepas-Moghaddam et al., 2021, Shen et al., 2022).
  • Skeleton-based: Graphs of body joints obtained via pose estimation from RGB, depth, or LiDAR data. Graph convolutional networks (GCNs) leverage the topology of skeletons, capturing both spatial configuration and temporal motion (Teepe et al., 2022). These are robust to background and attire but rely on the accuracy of pose estimation.
  • Depth and 3D: RGB-derived or direct depth maps augment silhouettes, providing explicit geometric cues invariant to texture and lighting. Recent frameworks such as DepthGait use foundation models for depth estimation, fusing depth maps with silhouettes via multi-scale cross-level fusion mechanisms, significantly enhancing recognition especially under viewpoint or clothing variation (Li et al., 5 Aug 2025).
  • Inertial and Channel State Information (CSI): Wearable sensors (accelerometers, gyroscopes) and passive WiFi CSI enable gait recognition in environments where video is impractical, using 1D CNNs or hybrid CNN-RNN architectures to process raw time-series or frequency spectrum data (Zou et al., 2018, Jakkala et al., 2019, Nemes et al., 2020).
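As a concrete illustration of the silhouette modality, the sketch below shows the kind of minimal preprocessing (bounding-box crop plus fixed-size resize) typically applied before silhouettes enter a CNN. This is a simplified NumPy sketch, not the pipeline of any specific paper; the function name and the 64×44 output size (a common but not universal choice) are illustrative assumptions.

```python
import numpy as np

def preprocess_silhouette(mask, out_h=64, out_w=44):
    """Crop a binary silhouette to the subject's bounding box and resize it
    to a fixed network input size using nearest-neighbour index mapping
    (avoids an OpenCV dependency). Returns a float32 (out_h, out_w) array."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:  # empty frame: return a blank canvas
        return np.zeros((out_h, out_w), dtype=np.float32)
    crop = mask[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    h, w = crop.shape
    row_idx = np.arange(out_h) * h // out_h  # nearest-neighbour row mapping
    col_idx = np.arange(out_w) * w // out_w  # nearest-neighbour column mapping
    return crop[row_idx][:, col_idx].astype(np.float32)
```

In practice, silhouette extraction itself (background subtraction or segmentation) precedes this step; the sketch assumes the binary mask is already available.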

2. Deep Network Architectures

The field has evolved rapidly from shallow discriminative CNNs to architectures that jointly address spatial, temporal, and viewpoint modeling:

  • 2D/3D Convolutional Neural Networks (CNNs): Early architectures such as VGR-Net use C3D backbones to extract spatiotemporal features from clips of silhouettes, often employing two-stage designs for explicit view angle and identity classification (Thapar et al., 2017).
  • Partial and Multi-Scale Representation Networks: State-of-the-art models partition silhouette or GCEM (Gait Convolutional Energy Map) feature maps along the vertical axis (horizontal bins) to extract local body-part representations at multiple scales. Hierarchical aggregation (e.g., through GaitSet (Chao et al., 2021) and capsule-based models (Sepas-Moghaddam et al., 2020)) enables resilience against occlusion and appearance variation.
  • Set-based Aggregation: Treating a gait sequence as an unordered set (deep set), models like GaitSet and TrackletGait employ permutation-invariant pooling (mean, max) or tracklet sampling to capture the statistical distribution of gait frames, which enhances robustness to missing or corrupted data (Chao et al., 2021, Zhang et al., 4 Aug 2025).
  • Recurrent and Attention Mechanisms: RNNs (LSTM, GRU) and attention modules are used to process either sequences of global features or split partial representations, enabling modeling of long-term dependencies and selective focus on the most discriminative regions under occlusion or covariate variation (Sepas-Moghaddam et al., 2020).
  • Capsule Networks and Transformer Architectures: Capsule routing and transformer models (e.g., SwinGait) model part-whole and long-range dependencies, capturing hierarchical structure and improving view and condition invariance, especially in outdoor and wild scenarios (Sepas-Moghaddam et al., 2020, Fan et al., 2023).
  • Graph Neural Networks: For skeleton data, multi-branch residual GCNs process joint positions, velocities, and bone angles, capitalizing on pose topology and temporal evolution (Teepe et al., 2022).
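The set-based aggregation and horizontal-binning ideas above can be sketched in a few lines. The following is a simplified NumPy illustration of GaitSet-style permutation-invariant pooling combined with part-based horizontal bins; it is a sketch of the general technique, not the published implementation, and the function name and bin count are placeholders.

```python
import numpy as np

def set_pool_with_bins(frame_feats, num_bins=4):
    """Permutation-invariant aggregation of per-frame feature maps.

    frame_feats: (T, C, H, W) array of convolutional features, one per frame.
    Returns a (num_bins, C) part-based gait descriptor:
      1. max over the frame (set) dimension -> order-independent template;
      2. split the height axis into horizontal bins and average-pool each
         bin, yielding local body-part representations.
    """
    set_feat = frame_feats.max(axis=0)                 # (C, H, W); frame order ignored
    bins = np.array_split(set_feat, num_bins, axis=1)  # split along the height axis
    # Global average pooling inside each horizontal strip.
    return np.stack([b.mean(axis=(1, 2)) for b in bins])  # (num_bins, C)
```

Because the max is taken over an unordered set of frames, shuffling or dropping frames leaves the descriptor largely unchanged, which is the source of the robustness to missing or corrupted data noted above.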

3. Training Strategies, Losses, and Domain Robustness

Training paradigms and losses play critical roles:

  • Supervised Learning with Cross-Entropy/Triplet Losses: Classification losses (cross-entropy), combined with metric learning (triplet, contrastive, batch-hard triplet, etc.), enforce intra-class compactness and inter-class separation, especially important for retrieval protocols (Sepas-Moghaddam et al., 2021, Zhang et al., 4 Aug 2025).
  • Hardness Exclusion and Robust Losses: Under wild conditions, systems like TrackletGait discard hard triplet samples deemed unreliable due to noise or occlusion by applying a dynamic threshold based on intra-batch statistics, thus improving model stability and generalization (Zhang et al., 4 Aug 2025).
  • Disentanglement and Generative Models: Autoencoder-based architectures separate appearance from gait cues, enforcing invariance via cross-reconstruction and similarity losses to reduce confounding by clothing or carried objects (Zhang et al., 2019).
  • Domain Generalization and Transfer Learning: Domain adaptation, transfer learning, and feature augmentation with synthetic data are used to mitigate domain shifts across datasets, view angles, or sensing modalities (Sokolova et al., 2017, Nemes et al., 2020). However, performance can degrade under cross-dataset evaluation due to dataset-specific biases and covariate mismatches.
  • Multi-Modal Fusion: Incorporating depth, silhouettes, and other biometric cues in a cross-level, attention-guided manner leverages complementary characteristics and improves recognition under diverse acquisition conditions (Li et al., 5 Aug 2025).
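A minimal sketch of batch-hard triplet mining with a hardness-exclusion rule follows. The statistics-based cutoff here is a simplified stand-in for TrackletGait's dynamic threshold, assumed for illustration only; the function name, margin, and `drop_std` parameter are not taken from any specific paper.

```python
import numpy as np

def batch_hard_triplet_loss(emb, labels, margin=0.2, drop_std=2.0):
    """Batch-hard triplet loss with a simple hardness-exclusion rule.

    For each anchor: hardest positive = farthest same-ID sample,
    hardest negative = closest different-ID sample. Anchors whose raw
    triplet term lies more than `drop_std` standard deviations above the
    batch mean are treated as likely label noise / occlusion and excluded.
    """
    d = np.linalg.norm(emb[:, None] - emb[None, :], axis=-1)  # (N, N) distances
    same = labels[:, None] == labels[None, :]
    pos = np.where(same, d, -np.inf).max(axis=1)              # hardest positive
    neg = np.where(same, np.inf, d).min(axis=1)               # hardest negative
    raw = pos - neg + margin
    keep = raw <= raw.mean() + drop_std * raw.std()           # hardness exclusion
    return np.maximum(raw[keep], 0.0).mean()
```

With well-separated identities the hinge term vanishes and the loss is zero; overlapping identities contribute up to the margin, and extreme outlier anchors are discarded before averaging.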

4. Empirical Performance and Benchmarks

Empirical comparison is usually based on rank-1 identification accuracy, Correct Classification Rate (CCR), mAP, and error rates. Representative state-of-the-art results are summarized below:

Method / Dataset        CASIA-B (NM)  SUSTech1K  Gait3D  GREW   CCPG   Parameters (M)
GaitSet                 96.1%         –          –       –      –      24.5
TrackletGait            –             –          77.8%   80.4%  –      10.3
DepthGait (sil+depth)   91.2%         87.6%      –       –      87.6%  –
DeepGaitV2-3D           –             –          72.8%   79.4%  –      –
SwinGait-3D             –             –          75.0%   79.3%  –      –

Recognition rates frequently exceed 90% for in-lab databases (CASIA-B, OUMVLP), but drop to 75–80% for challenging in-the-wild datasets (Gait3D, GREW). Depth and multi-modal fusion further boost performance in cross-condition scenarios. Efficient architectures with low parameter counts (e.g., TrackletGait) demonstrate that robust performance is achievable without large models by careful architectural design.
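Rank-1 identification accuracy, the primary metric in these comparisons, can be computed as follows. This is a hedged NumPy sketch assuming nearest-neighbour matching under Euclidean distance between probe and gallery embeddings; the function and variable names are illustrative.

```python
import numpy as np

def rank1_accuracy(probe_feats, probe_ids, gallery_feats, gallery_ids):
    """Rank-1 identification: match each probe to the gallery sample with
    the smallest Euclidean distance, then count identity agreements."""
    d = np.linalg.norm(probe_feats[:, None] - gallery_feats[None, :], axis=-1)
    nearest = gallery_ids[d.argmin(axis=1)]  # identity of the closest gallery sample
    return (nearest == probe_ids).mean()
```

Protocol details vary across datasets (e.g., which walking conditions enter the gallery versus the probe set), so reported numbers are only comparable within a fixed evaluation protocol.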

5. Applications, Security, and Privacy

  • Surveillance and Security: Gait is advantageous for non-intrusive long-range identification, suitable for public security, access control, and person re-identification in multi-camera networks (Thapar et al., 2017, Chao et al., 2021, Han et al., 2022).
  • Healthcare and Well-Being: Monitoring gait dynamics enables fall detection, activity monitoring, and clinical assessment of neurological or mobility disorders, especially with sensor- or LiDAR-based recognition (Zou et al., 2018, Han et al., 2022).
  • Privacy and Adversarial Concerns: Gait data is sensitive and widely accessible; privacy-preserving transformations, template protection, and defense against adversarial and presentation attacks are recognized as active challenges (Shen et al., 2022, Hanisch et al., 2022). The redundancy and interdependence of gait features complicate anonymization strategies, as both macro and micro motion cues contribute to identifiability.

6. Limitations and Future Directions

  • Generalization and Domain Shift: Models trained on controlled datasets degrade in wild settings due to non-periodicity, occlusion, and uncontrolled dynamics. Addressing this calls for domain adaptation, large-scale in-the-wild datasets, and better data augmentation or self-supervised learning (Zhang et al., 4 Aug 2025, Shen et al., 2022).
  • Multi-Modal and 3D Fusion: Incorporation of depth, LiDAR, or skeletal data, and development of multi-modal fusion strategies are critical for robustness to appearance, view, and environmental variability (Han et al., 2022, Li et al., 5 Aug 2025).
  • View-Invariance and Partial Representation: Explicit modeling of view angle, multi-scale partial features, and attention or capsule routing are key to achieving view invariance as well as resilience to occlusion and soft-biometric confounds (Sepas-Moghaddam et al., 2020).
  • Efficient and Scalable Models: Given the practical constraints of surveillance and mobile computing, efficient backbones, adaptive downsampling (e.g., Haar wavelets), and sparsity-aware transformers are increasingly emphasized as essential (Zhang et al., 4 Aug 2025, Fan et al., 2023).
  • Trustworthiness and Security: Fairness, bias mitigation, and protection against adversarial and template inversion attacks constitute open issues, requiring advances in system-level security and privacy regulation (Shen et al., 2022, Hanisch et al., 2022).

Deep gait recognition is now characterized by robust, multi-scale, multi-modal architectures, attuned to both the complexity and variability inherent in human locomotion and the diverse requirements of modern biometric applications. Continued progress hinges on cross-domain generalization, privacy-preservation, and efficient, flexible deployment in real-world scenarios.
