Gait Recognition via Deep Residual Networks and Multi-Branch Feature Fusion

Published 30 Apr 2026 in cs.CV | (2604.27353v1)

Abstract: Gait recognition has emerged as a compelling biometric modality for surveillance and security applications, offering inherent advantages such as non-intrusiveness, resistance to disguise, and long-range identification capability. However, prevailing approaches struggle to comprehensively capture and exploit the rich biometric cues embedded in human locomotion, particularly under covariate interference including viewpoint variation, clothing change, and carrying conditions. In this paper, we present a high-precision gait recognition framework that deeply extracts and synergistically fuses gait dynamics with body shape characteristics through a multi-branch architecture grounded in deep residual learning. Specifically, we first employ the High-Resolution Network (HRNet) to perform robust skeletal keypoint estimation, preserving fine-grained spatial information even under low-resolution inputs. We then construct three complementary feature branches -- body proportion, gait velocity, and skeletal motion -- from the extracted pose sequences. A 50-layer Residual Network (ResNet-50) backbone is leveraged within a deep feature extraction module to capture hierarchically rich and discriminative representations. To effectively integrate heterogeneous feature streams, we design a Multi-Branch Feature Fusion (MFF) module inspired by channel-wise attention mechanisms, which dynamically allocates contribution weights across branches through learned activation parameters. Extensive experiments on the cross-view multi-condition CASIA-B benchmark demonstrate that our method achieves a Rank-1 accuracy of 94.52% under normal walking, with the best recognition performance among skeleton-based methods for the coat-wearing condition.


Authors (3)

Summary

The paper introduces a novel framework leveraging multi-branch deep residual networks and attention-based fusion to overcome covariate interference in gait recognition.
It employs HRNet for precise keypoint estimation and decomposes pose sequences into body proportion, gait velocity, and skeletal motion streams using ResNet-50.
Experiments on CASIA-B report normal-walking (NM) accuracy of up to 97.36% at the best viewing angle, together with a reduced parameter count suited to real-world gait analysis.

Skeleton-Based Multi-Feature Fusion for Robust Gait Recognition

Overview and Motivation

The paper "Gait Recognition via Deep Residual Networks and Multi-Branch Feature Fusion" (2604.27353) addresses biometric gait identification under covariate interference, including viewpoint variation, clothing changes, and carrying conditions. Its framework integrates multiple discriminative gait features through a multi-branch deep residual network. HRNet first estimates skeletal keypoints and remains robust on low-resolution imagery. The resulting pose sequences are then decomposed into three semantically complementary branches (body proportion, gait velocity, skeletal motion). ResNet-50 extracts hierarchical features from each branch, and a channel-attention-inspired Multi-Branch Feature Fusion (MFF) module weights the branches dynamically to exploit their complementarity.

Gait recognition modalities are categorized into appearance-based and model-based paradigms. Silhouette-based approaches (e.g., GaitSet [chao2022gaitset], GaitPart [fan2020gaitpart], GaitDAN [huang2024gaitdan]) excel in controlled settings but exhibit susceptibility to occlusion, clothing, and carried objects due to the mutable nature of outer contours. Model-based methods utilizing skeletal representations (PTSN [liao2017ptsn], GaitGraph [teepe2021gaitgraph], GPGait [fu2023gpgait], SkeletonGait++ [fan2024skeletongait]) mitigate these vulnerabilities but mostly focus on either spatial or temporal information, underutilizing full biometric potential.

The presented framework builds on advances in structured feature extraction, attention-based fusion, and hierarchical deep representation found in document intelligence and cross-modal learning. The MFF module reflects principles from channel-wise attention and adaptive feature selection [tang2022fewcould, tang2024textsquare], enabling efficient mid-level integration of heterogeneous feature streams and overcoming limitations of late fusion strategies prevalent in prior biometric recognition architectures.

Methodological Framework

Skeletal Keypoint Estimation via HRNet

Keypoint localization is performed using HRNet [sun2019hrnet], which preserves high-resolution spatial information through parallel multi-resolution subnetworks. This design avoids the loss of spatial detail typical of deep convolutional backbones and yields robust keypoint accuracy even for surveillance images below $256\times256$ pixels. Accurate lower-limb localization matters most here, because the reliability of the extracted gait sequence depends on it.

Gait Cycle Extraction

Temporal segmentation of walking sequences utilizes a Hamming distance-based similarity metric over lower-limb keypoints. Oscillatory patterns in frame-to-frame similarity waveforms demarcate canonical gait cycles, informing optimal temporal window selection for downstream feature extraction. Mean cycle durations across training batches are used to calibrate stride length hyperparameters for adaptive coverage.
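The segmentation idea can be illustrated with a minimal sketch; it is not the authors' implementation. It assumes that each frame's lower-limb keypoints are binarized into per-joint one-hot grid codes (the hypothetical `binarize`), that similarity is the Hamming distance to the first frame, and that a cycle starts whenever that distance drops back to a tolerance `tol`. The grid size, tolerance, and synthetic two-ankle sequence are illustrative choices only.

```python
def binarize(frame, grid=8):
    """Concatenate one one-hot occupancy code per joint on a grid x grid lattice.

    frame: list of (x, y) keypoints normalized to [0, 1).
    """
    code = []
    for x, y in frame:
        cell = min(int(y * grid), grid - 1) * grid + min(int(x * grid), grid - 1)
        code.extend(1 if i == cell else 0 for i in range(grid * grid))
    return code


def hamming(a, b):
    """Number of positions at which two equal-length binary codes differ."""
    return sum(u != v for u, v in zip(a, b))


def distance_waveform(frames, grid=8):
    """Hamming distance of every frame to the first (reference) frame."""
    ref = binarize(frames[0], grid)
    return [hamming(ref, binarize(f, grid)) for f in frames]


def cycle_starts(waveform, tol=0):
    """Frames where the distance returns to <= tol after having exceeded it."""
    return [t for t, d in enumerate(waveform)
            if d <= tol and (t == 0 or waveform[t - 1] > tol)]


def mean_cycle_length(starts):
    """Average gap (in frames) between consecutive cycle starts."""
    gaps = [b - a for a, b in zip(starts, starts[1:])]
    return sum(gaps) / len(gaps) if gaps else None


# Synthetic sequence: left and right ankle swing in opposite phase, period 8.
LEFT_X = [0.2, 0.35, 0.5, 0.65, 0.8, 0.65, 0.5, 0.35]
frames = [[(LEFT_X[t % 8], 0.9), (1.0 - LEFT_X[t % 8], 0.9)] for t in range(32)]

waveform = distance_waveform(frames)
starts = cycle_starts(waveform)
print(starts, mean_cycle_length(starts))
```

On real pose estimates the distance rarely returns to exactly zero, so `tol` would need tuning, and a smoothed waveform with peak detection may be a more robust alternative.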

Multi-Branch Feature Modeling

Three branches are constructed:

Body Proportion: Encodes anthropometric stability via adjacency features and relative joint positions against the skeleton center.
Gait Velocity: Captures temporal dynamics, fusing instantaneous per-frame differentials and six-frame interval displacements for multi-scale rhythm representation.
Skeletal Motion: Represents kinematic properties with bone length vectors and articulation angles computed from spatial keypoints.
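The three branch inputs can be sketched from a 2D pose sequence as follows. This is a hedged illustration under stated assumptions, not the paper's feature definition: the five-joint `BONES` topology is hypothetical, the skeleton center is taken as the mean of all joints, and the adjacency features of the body proportion branch are omitted.

```python
import math

# Hypothetical topology: 0 hip, 1 left knee, 2 left ankle, 3 right knee, 4 right ankle.
BONES = [(0, 1), (1, 2), (0, 3), (3, 4)]


def body_proportion(frame):
    """Joint positions relative to the skeleton center (mean of all joints)."""
    cx = sum(x for x, _ in frame) / len(frame)
    cy = sum(y for _, y in frame) / len(frame)
    return [(x - cx, y - cy) for x, y in frame]


def displacement(seq, gap):
    """Per-joint displacement between frame t and frame t + gap."""
    return [[(bx - ax, by - ay) for (ax, ay), (bx, by) in zip(seq[t], seq[t + gap])]
            for t in range(len(seq) - gap)]


def gait_velocity(seq, long_gap=6):
    """Frame-to-frame differentials plus six-frame interval displacements."""
    return displacement(seq, 1), displacement(seq, long_gap)


def bone_vectors(frame, bones=BONES):
    """Vector from parent joint to child joint for every bone."""
    return [(frame[c][0] - frame[p][0], frame[c][1] - frame[p][1]) for p, c in bones]


def bone_lengths(frame, bones=BONES):
    """Euclidean length of every bone."""
    return [math.hypot(dx, dy) for dx, dy in bone_vectors(frame, bones)]


def joint_angle(a, b, c):
    """Angle in radians at joint b between segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def make_frame(t):
    """Toy skeleton translating one unit per frame along x."""
    t = float(t)
    return [(t, 0.0), (t, 1.0), (t + 1.0, 1.0), (t, 1.0), (t - 1.0, 1.0)]


seq = [make_frame(t) for t in range(8)]
short, long_ = gait_velocity(seq)
knee_angle = joint_angle(seq[0][0], seq[0][1], seq[0][2])  # hip-knee-ankle
print(len(short), len(long_), bone_lengths(seq[0]), knee_angle)
```

Pairing a one-frame and a six-frame gap gives the velocity branch two time scales: the first captures instantaneous motion, the second captures slower rhythm within a stride.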

Each stream is processed independently through a ResNet-50 backbone, leveraging skip connections for gradient stability and bottleneck structures for parameter efficiency.
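The parameter-efficiency argument for bottleneck structures can be checked with a short count of convolution weights. The sketch below compares a two-layer 3x3 block with a 1x1, 3x3, 1x1 bottleneck at the same input width. The channel width of 256 and reduction factor of 4 follow the standard ResNet bottleneck design and are not figures from this paper; biases and normalization parameters are ignored.

```python
def conv_weights(c_in, c_out, k):
    """Weight count of a k x k convolution without bias."""
    return c_in * c_out * k * k


def basic_block_weights(channels):
    """Two stacked 3x3 convolutions at full width."""
    return 2 * conv_weights(channels, channels, 3)


def bottleneck_weights(channels, reduction=4):
    """1x1 reduce, 3x3 at reduced width, 1x1 expand."""
    mid = channels // reduction
    return (conv_weights(channels, mid, 1)
            + conv_weights(mid, mid, 3)
            + conv_weights(mid, channels, 1))


basic = basic_block_weights(256)
bottleneck = bottleneck_weights(256)
print(basic, bottleneck, round(basic / bottleneck, 1))
```

At this width the bottleneck needs roughly one seventeenth of the weights of the plain block, which is why deep residual backbones can add layers without a proportional growth in parameters.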

Multi-Branch Feature Fusion

The MFF module aggregates spatial and temporal feature maps via concatenation and global average pooling, followed by learned dimensionality reduction. Branch-wise excitation is computed through sigmoid activations, resulting in dynamic attention weighting per feature branch. Element-wise recalibration of branch outputs is followed by global pooling and concatenation, filtering informative channels and suppressing redundant or uninformative features. This mechanism enables targeted exploitation of body stability, temporal rhythm, and kinematic detail per subject.
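A squeeze-and-excitation style reading of this description is sketched below. It is an interpretation, not the published module: it assumes one scalar gate per branch, a single ReLU reduction layer, and randomly initialized placeholder matrices `w_reduce` and `w_excite` standing in for learned parameters. Feature maps are plain nested lists (channels of flattened activations) to keep the example dependency-free.

```python
import math
import random


def global_avg_pool(feature_map):
    """Mean activation per channel; feature_map is a list of flat channels."""
    return [sum(ch) / len(ch) for ch in feature_map]


def linear(weights, vec):
    """Matrix-vector product with weights given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mff(branches, w_reduce, w_excite):
    """Return (per-branch gates, fused descriptor)."""
    # Squeeze: pool each branch and concatenate into one descriptor.
    descriptor = [v for fm in branches for v in global_avg_pool(fm)]
    # Dimensionality reduction followed by ReLU.
    hidden = [max(0.0, h) for h in linear(w_reduce, descriptor)]
    # Excitation: one sigmoid gate per branch (assumption).
    gates = [sigmoid(z) for z in linear(w_excite, hidden)]
    # Recalibrate each branch, then pool and concatenate.
    fused = []
    for gate, fm in zip(gates, branches):
        recalibrated = [[gate * v for v in ch] for ch in fm]
        fused.extend(global_avg_pool(recalibrated))
    return gates, fused


rng = random.Random(0)
n_branches, n_channels, n_positions, n_hidden = 3, 4, 6, 2
branches = [[[rng.uniform(-1, 1) for _ in range(n_positions)]
             for _ in range(n_channels)] for _ in range(n_branches)]
w_reduce = [[rng.uniform(-1, 1) for _ in range(n_branches * n_channels)]
            for _ in range(n_hidden)]
w_excite = [[rng.uniform(-1, 1) for _ in range(n_hidden)]
            for _ in range(n_branches)]

gates, fused = mff(branches, w_reduce, w_excite)
print(gates, len(fused))
```

Because the gates depend on the pooled content of all three branches, the weighting changes per input sequence rather than being fixed after training, which is the property that distinguishes this kind of mid-level fusion from a static weighted sum.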

Experimental Evaluation

Keypoint Detection

HRNet keypoint accuracy, evaluated with the Head-normalized Probability of Correct Keypoint (PCKh) on MPII, exceeds 95% for the lower extremities, which the paper treats as sufficient for fine-grained gait analysis.
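For reference, PCKh counts a predicted keypoint as correct when it lies within a fraction of the head size from the ground truth. The sketch below assumes the commonly used threshold factor of 0.5 (the summary does not state which threshold was applied), and the coordinates are made up for illustration.

```python
import math


def pckh(pred, gt, head_size, alpha=0.5):
    """Fraction of predicted keypoints within alpha * head_size of ground truth."""
    threshold = alpha * head_size
    correct = sum(math.dist(p, g) <= threshold for p, g in zip(pred, gt))
    return correct / len(gt)


gt = [(10, 10), (20, 20), (30, 30), (40, 40)]
pred = [(10, 11), (20, 26), (32, 33), (40, 40)]  # second keypoint is 6 px off
score = pckh(pred, gt, head_size=10)
print(score)
```

Normalizing by head size makes the score comparable across subjects who appear at different scales in the image.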

Component Contribution and Ablation Analysis

Ablation on CASIA-B shows that multi-feature fusion outperforms every reduced configuration. The velocity-only baseline is weakest (overall accuracy 65.55%, coat-wearing (CL) accuracy 56.84%). Dual-feature combinations incrementally improve robustness, but full three-branch fusion achieves 94.52% accuracy under normal walking (NM) and 87.81% overall, an absolute gain of 4.50 percentage points over the best dual-feature configuration.

Cross-View and Covariate Robustness

Across 11 viewing angles, the framework maintains NM accuracy between 89.14% and 97.36%. Degradation under the more challenging bag-carrying (BG) and CL conditions is mitigated by the stability of the body proportion branch and by compensatory velocity dynamics.
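The accuracy figures reported here are Rank-1 scores: a probe sequence counts as correct when its nearest gallery embedding belongs to the same subject. A minimal sketch follows, assuming Euclidean distance between embeddings and toy two-dimensional vectors; the per-view averaging used in CASIA-B protocols is not modeled.

```python
import math


def rank1_accuracy(probe, gallery):
    """probe, gallery: lists of (subject_id, embedding) pairs."""
    hits = 0
    for subject_id, embedding in probe:
        nearest_id, _ = min(gallery, key=lambda item: math.dist(embedding, item[1]))
        hits += nearest_id == subject_id
    return hits / len(probe)


gallery = [("A", (0, 0)), ("B", (10, 0)), ("C", (0, 10))]
probe = [("A", (1, 1)), ("B", (9, 1)), ("C", (4, 4)), ("C", (1, 9))]
accuracy = rank1_accuracy(probe, gallery)  # the probe at (4, 4) is closest to "A"
print(accuracy)
```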

Benchmarking Against State-of-the-Art

On CASIA-B, the proposed model achieves the best overall accuracy (89.4%), notably outperforming prior skeleton and silhouette methods under clothing variation. On a self-collected outdoor dataset, the framework exhibits a 4.1% absolute gain over SkeletonGait++ [fan2024skeletongait] in overall accuracy. This indicates effective generalization from controlled to unconstrained environments, a transition in which comparable methods have previously lost more than 15% accuracy.

Model Efficiency

The model uses 3.72M parameters, far fewer than other recognized approaches (e.g., BigGait's 30.82M [ye2024biggait]), with no compromise in recognition accuracy. Single-frame inference latency is 0.18 s, or roughly 5.6 frames per second, which the paper presents as sufficient for real-time deployment.

Implications and Future Directions

This research indicates that skeleton-based recognition augmented by multi-feature fusion and attention mechanisms narrows the performance gap between indoor and outdoor settings, supporting practical viability for surveillance and forensic applications. The fusion of stable anthropometry and robust temporal dynamics enhances resilience to covariate interference, particularly under clothing change and occlusion.

Future directions include:

Extension to multi-modal fusion architectures integrating RGB, depth, and infrared streams.
Domain adaptation strategies for broader generalization to unseen environments and demographics.
Incorporation of large pretrained vision-language models and unsupervised paradigms for further robustness.
Real-time deployment scalability with edge hardware and distributed inference frameworks.

Conclusion

This paper provides a rigorous skeleton-based gait recognition pipeline with deep multi-feature fusion, achieving superior accuracy and robustness under real-world variations. The combination of HRNet-driven keypoint localization, ResNet-50 hierarchical feature extraction, and channel-attention-based mid-level fusion sets a new standard for practical gait identification systems suitable for intelligent surveillance and biometric contexts. The framework exhibits minimal parameter footprint and excellent generalization across indoor and outdoor scenarios, offering broad implications for advanced behavioral biometrics and open-world recognition systems.
