- The paper introduces a novel framework leveraging multi-branch deep residual networks and attention-based fusion to overcome covariate interference in gait recognition.
- It employs HRNet for precise keypoint estimation and decomposes pose sequences into body proportion, gait velocity, and skeletal motion streams using ResNet-50.
- Experimental results demonstrate significant accuracy gains, with up to 97.36% NM accuracy and reduced model complexity for real-world gait analysis.
Skeleton-Based Multi-Feature Fusion for Robust Gait Recognition
Overview and Motivation
The paper "Gait Recognition via Deep Residual Networks and Multi-Branch Feature Fusion" (2604.27353) addresses biometric gait identification under covariate interference, including viewpoint variations, clothing changes, and carrying conditions. It presents a framework that integrates multiple discriminative gait features through a multi-branch deep residual network architecture. The workflow uses HRNet for robust keypoint estimation even in low-resolution imagery, then decomposes pose sequences into three semantically complementary branches (body proportion, gait velocity, skeletal motion). ResNet-50 performs hierarchical feature extraction on each branch, and a channel-attention-inspired Multi-Branch Feature Fusion (MFF) module weights the branches dynamically to exploit their complementarity.
Gait recognition modalities are categorized into appearance-based and model-based paradigms. Silhouette-based approaches (e.g., GaitSet [chao2022gaitset], GaitPart [fan2020gaitpart], GaitDAN [huang2024gaitdan]) excel in controlled settings but exhibit susceptibility to occlusion, clothing, and carried objects due to the mutable nature of outer contours. Model-based methods utilizing skeletal representations (PTSN [liao2017ptsn], GaitGraph [teepe2021gaitgraph], GPGait [fu2023gpgait], SkeletonGait++ [fan2024skeletongait]) mitigate these vulnerabilities but mostly focus on either spatial or temporal information, underutilizing full biometric potential.
The presented framework builds on advances in structured feature extraction, attention-based fusion, and hierarchical deep representation found in document intelligence and cross-modal learning. The MFF module reflects principles from channel-wise attention and adaptive feature selection [tang2022fewcould, tang2024textsquare], enabling efficient mid-level integration of heterogeneous feature streams and overcoming limitations of late fusion strategies prevalent in prior biometric recognition architectures.
Methodological Framework
Skeletal Keypoint Estimation via HRNet
Keypoint localization is performed using HRNet [sun2019hrnet], which preserves high-resolution spatial information through parallel multi-resolution subnetworks. This design avoids the spatial degradation typical of deep convolutional backbones and yields robust keypoint accuracy even for surveillance images below 256×256 pixels. Accurate lower-limb localization matters most here, because the reliability of the extracted gait sequence depends on it.
Temporal segmentation of walking sequences utilizes a Hamming distance-based similarity metric over lower-limb keypoints. Oscillatory patterns in frame-to-frame similarity waveforms demarcate canonical gait cycles, informing optimal temporal window selection for downstream feature extraction. Mean cycle durations across training batches are used to calibrate stride length hyperparameters for adaptive coverage.
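The segmentation step above can be sketched as follows. This is a minimal illustration rather than the paper's implementation. It assumes that keypoint coordinates are quantized onto a coarse grid so that a Hamming distance is well defined, that the first frame serves as the reference, and that cycle boundaries are local minima of the resulting waveform. The helper names, the `cell` size, and the synthetic ankle trajectory are all hypothetical.

```python
from statistics import mean

def quantize(frame, cell=5):
    """Flatten a frame's (x, y) keypoints into a tuple of integer grid indices."""
    return tuple(int(v // cell) for point in frame for v in point)

def hamming(a, b):
    """Number of positions at which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))

def similarity_waveform(frames, cell=5):
    """Hamming distance from every frame's code to the first (reference) frame."""
    ref = quantize(frames[0], cell)
    return [hamming(ref, quantize(f, cell)) for f in frames]

def cycle_boundaries(waveform):
    """Frame 0 plus every interior local minimum of the waveform."""
    bounds = [0]
    for t in range(1, len(waveform) - 1):
        if waveform[t] < waveform[t - 1] and waveform[t] <= waveform[t + 1]:
            bounds.append(t)
    return bounds

# Synthetic data (assumption): two ankles swinging in opposition, period 8 frames.
offsets = [20, 10, 0, -10, -20, -10, 0, 10]
frames = [[(50 + offsets[t % 8], 100), (50 - offsets[t % 8], 100)]
          for t in range(24)]

waveform = similarity_waveform(frames)
bounds = cycle_boundaries(waveform)
# Mean cycle duration, the quantity used to calibrate the temporal window.
mean_cycle = mean(b - a for a, b in zip(bounds, bounds[1:]))
```

On real sequences the waveform would be noisier, so some smoothing or a minimum spacing between boundaries would likely be needed before averaging cycle durations.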
Multi-Branch Feature Modeling
Three branches are constructed:
- Body Proportion: Encodes anthropometric stability via adjacency features and relative joint positions against the skeleton center.
- Gait Velocity: Captures temporal dynamics, fusing instantaneous per-frame differentials and six-frame interval displacements for multi-scale rhythm representation.
- Skeletal Motion: Represents kinematic properties with bone length vectors and articulation angles computed from spatial keypoints.
Each stream is processed independently through a ResNet-50 backbone, leveraging skip connections for gradient stability and bottleneck structures for parameter efficiency.
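As a rough illustration of how the three input streams might be derived from 2D keypoints before they reach the backbones, consider the sketch below. The five-joint skeleton, the `BONES` and `KNEES` index tables, and the function names are hypothetical. The adjacency features of the body proportion branch are omitted, and articulation angles are interpreted here as the angle at each knee, which is an assumption.

```python
import math

# Hypothetical 5-joint skeleton: 0 hip, 1 left knee, 2 left ankle,
# 3 right knee, 4 right ankle.
BONES = [(0, 1), (1, 2), (0, 3), (3, 4)]
KNEES = [(0, 1, 2), (0, 3, 4)]  # (parent, joint, child) triples

def body_proportion(frame):
    """Joint positions relative to the skeleton center (mean of all joints)."""
    cx = sum(x for x, _ in frame) / len(frame)
    cy = sum(y for _, y in frame) / len(frame)
    return [(x - cx, y - cy) for x, y in frame]

def displacement(seq, t, gap):
    """Per-joint displacement between frame t and frame t + gap."""
    return [(x2 - x1, y2 - y1)
            for (x1, y1), (x2, y2) in zip(seq[t], seq[t + gap])]

def gait_velocity(seq, t, gap=6):
    """Instantaneous (1-frame) and interval (gap-frame) displacements."""
    return displacement(seq, t, 1), displacement(seq, t, gap)

def joint_angle(a, b, c):
    """Unsigned angle in radians at joint b between segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    return abs(math.atan2(cross, dot))

def skeletal_motion(frame):
    """Bone lengths and knee articulation angles for one frame."""
    lengths = [math.dist(frame[i], frame[j]) for i, j in BONES]
    angles = [joint_angle(frame[a], frame[b], frame[c]) for a, b, c in KNEES]
    return lengths, angles

# Synthetic sequence (assumption): a rigid pose drifting 2 px per frame in x.
pose = [(0, 0), (3, 4), (3, 9), (-3, 4), (-3, 9)]
seq = [[(x + 2 * t, y) for x, y in pose] for t in range(8)]

relative = body_proportion(seq[0])
instant, interval = gait_velocity(seq, t=0)
lengths, knee_angles = skeletal_motion(seq[0])
```

In this toy sequence the pose only translates, so the body proportion and skeletal motion features are identical across frames while the velocity features carry all of the temporal signal. That separation is the reason for modeling the streams independently.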
Multi-Branch Feature Fusion
The MFF module aggregates spatial and temporal feature maps via concatenation and global average pooling, followed by learned dimensionality reduction. Branch-wise excitation is computed through sigmoid activations, resulting in dynamic attention weighting per feature branch. Element-wise recalibration of branch outputs is followed by global pooling and concatenation, filtering informative channels and suppressing redundant or uninformative features. This mechanism enables targeted exploitation of body stability, temporal rhythm, and kinematic detail per subject.
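The fusion steps can be approximated by a squeeze-and-excitation style gate applied across branches. The sketch below is one plausible reading of the description, not the authors' code. The weight matrices `w_reduce` and `w_excite` stand in for learned parameters and are fixed by hand here. The ReLU between the two linear maps and the use of one scalar gate per branch are assumptions.

```python
import math

def gap(feature_map):
    """Global average pooling: one mean per channel."""
    return [sum(channel) / len(channel) for channel in feature_map]

def linear(weights, vec):
    """Matrix-vector product; weights is a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mff(branches, w_reduce, w_excite):
    """Fuse branch feature maps with per-branch sigmoid gates.

    branches: list of feature maps, each a list of channels (flat value lists).
    Returns the gates and the fused descriptor.
    """
    # Squeeze: pool every channel of every branch and concatenate.
    descriptor = [v for fmap in branches for v in gap(fmap)]
    # Reduce: stand-in for learned dimensionality reduction (ReLU assumed).
    hidden = [max(0.0, h) for h in linear(w_reduce, descriptor)]
    # Excite: one sigmoid gate per branch.
    gates = [sigmoid(g) for g in linear(w_excite, hidden)]
    # Recalibrate each branch element-wise, then pool and concatenate.
    fused = []
    for gate, fmap in zip(gates, branches):
        recalibrated = [[gate * v for v in channel] for channel in fmap]
        fused.extend(gap(recalibrated))
    return gates, fused

# Toy inputs (assumption): 3 branches, 2 channels each, 2 values per channel.
branches = [
    [[1.0, 3.0], [2.0, 2.0]],   # body proportion
    [[0.0, 2.0], [4.0, 0.0]],   # gait velocity
    [[1.0, 1.0], [3.0, 5.0]],   # skeletal motion
]
w_reduce = [[0.1] * 6, [-0.1] * 6]                 # 6 -> 2
w_excite = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # 2 -> 3 (one per branch)

gates, fused = mff(branches, w_reduce, w_excite)
```

Because the gates are computed from the pooled descriptor of all three branches, each branch's weight depends on what the other branches contain. This is what distinguishes such mid-level fusion from fixed late-fusion weighting.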
Experimental Evaluation
Keypoint Detection
HRNet keypoint accuracy, evaluated on MPII using the head-normalized Probability of Correct Keypoint (PCKh), exceeds 95% for the lower extremities, which is sufficient for fine-grained gait analysis.
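For reference, PCKh counts a predicted keypoint as correct when its distance to the ground truth is within a fraction of the head segment length. The sketch below scores a single pose under that definition. The threshold factor `alpha=0.5` is the commonly used setting, but the source does not state which value was applied, and the coordinates are made up for illustration.

```python
import math

def pckh(pred, gt, head_size, alpha=0.5):
    """Fraction of predicted keypoints within alpha * head_size of ground truth."""
    threshold = alpha * head_size
    correct = sum(math.dist(p, g) <= threshold for p, g in zip(pred, gt))
    return correct / len(gt)

# Hypothetical lower-limb keypoints for one person (pixels).
gt = [(10, 10), (20, 40), (30, 80), (40, 120)]
pred = [(11, 10), (20, 44), (30, 95), (42, 121)]

# With head_size=10 the threshold is 5 px; the third keypoint is 15 px off.
score = pckh(pred, gt, head_size=10)
```

A benchmark figure such as the one quoted above would aggregate this per joint over the whole evaluation set, using each person's own head size.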
Component Contribution and Ablation Analysis
Ablation on CASIA-B shows that multi-feature fusion outperforms the reduced configurations. The velocity-only baseline performs poorly (overall accuracy 65.55%, CL accuracy 56.84%). Dual-feature combinations incrementally improve robustness, but full three-branch fusion achieves 94.52% NM accuracy and 87.81% overall, a 4.50% absolute gain over the best dual-feature configuration.
Cross-View and Covariate Robustness
Across 11 viewing angles, the framework maintains NM accuracy between 89.14% and 97.36%. Degradation under the more challenging BG and CL conditions is mitigated by the stability of the body proportion branch and by compensatory velocity dynamics.
Benchmarking Against State-of-the-Art
On CASIA-B, the proposed model achieves the best overall accuracy (89.4%), notably outperforming prior skeleton and silhouette methods under clothing variation. On a self-collected outdoor dataset, the framework exhibits a 4.1% absolute gain over SkeletonGait++ [fan2024skeletongait] in overall accuracy. This indicates effective generalization from controlled to unconstrained environments, a setting in which comparable methods have shown performance drops of more than 15%.
Model Efficiency
The parameter count is small (3.72M) compared with other recognized approaches (e.g., BigGait's 30.82M [ye2024biggait]), with no reported compromise in recognition accuracy. Single-frame inference latency is 0.18 s (about 5.6 frames per second), which is presented as sufficient for real-time deployment.
Implications and Future Directions
This research indicates that skeleton-based recognition augmented by multi-feature fusion and attention mechanisms narrows the performance gap between indoor and outdoor settings, supporting practical viability for surveillance and forensic applications. The fusion of stable anthropometry and robust temporal dynamics enhances resilience to covariate interference, particularly under clothing variation and occlusion.
Future directions include:
- Extension to multi-modal fusion architectures integrating RGB, depth, and infrared streams.
- Domain adaptation strategies for broader generalization to unseen environments and demographics.
- Incorporation of large pretrained vision-language models and unsupervised paradigms for further robustness.
- Real-time deployment scalability with edge hardware and distributed inference frameworks.
Conclusion
This paper provides a rigorous skeleton-based gait recognition pipeline with deep multi-feature fusion, achieving superior accuracy and robustness under real-world variations. The combination of HRNet-driven keypoint localization, ResNet-50 hierarchical feature extraction, and channel-attention-based mid-level fusion sets a new standard for practical gait identification systems suitable for intelligent surveillance and biometric contexts. The framework exhibits minimal parameter footprint and excellent generalization across indoor and outdoor scenarios, offering broad implications for advanced behavioral biometrics and open-world recognition systems.