
BigGait: Learning Gait Representation You Want by Large Vision Models (2402.19122v2)

Published 29 Feb 2024 in cs.CV

Abstract: Gait recognition stands as one of the most pivotal remote identification technologies and progressively expands across research and industry communities. However, existing gait recognition methods heavily rely on task-specific upstream driven by supervised learning to provide explicit gait representations like silhouette sequences, which inevitably introduce expensive annotation costs and potential error accumulation. Escaping from this trend, this work explores effective gait representations based on the all-purpose knowledge produced by task-agnostic Large Vision Models (LVMs) and proposes a simple yet efficient gait framework, termed BigGait. Specifically, the Gait Representation Extractor (GRE) within BigGait draws upon design principles from established gait representations, effectively transforming all-purpose knowledge into implicit gait representations without requiring third-party supervision signals. Experiments on CCPG, CASIA-B* and SUSTech1K indicate that BigGait significantly outperforms the previous methods in both within-domain and cross-domain tasks in most cases, and provides a more practical paradigm for learning the next-generation gait representation. Finally, we delve into prospective challenges and promising directions in LVMs-based gait recognition, aiming to inspire future work in this emerging topic. The source code is available at https://github.com/ShiqiYu/OpenGait.


Summary

  • The paper introduces BigGait, a framework that utilizes task-agnostic large vision models to extract gait features without relying on manual annotations.
  • The method employs a Gait Representation Extractor with mask, appearance, and denoising branches to filter noise and emphasize gait-relevant information.
  • Experimental results on the CCPG, CASIA-B*, and SUSTech1K datasets demonstrate superior performance, although challenges in feature interpretability remain.

BigGait: Leveraging Large Vision Models for Gait Recognition

Gait recognition has emerged as a key area of biometric identification due to its non-invasive nature and its ability to identify individuals at a distance. Traditional gait recognition pipelines rely heavily on task-specific upstream models, such as pedestrian segmentation or pose estimation, which require supervised learning and painstaking manual annotation. The paper "BigGait: Learning Gait Representation You Want by Large Vision Models" diverges from this convention by leveraging task-agnostic Large Vision Models (LVMs) to derive gait representations, removing the need for task-specific upstream models. The proposed framework, BigGait, translates the all-purpose knowledge of LVMs into usable gait features without relying on third-party supervision signals.

The BigGait framework consists of three principal components: the upstream model, the central Gait Representation Extractor (GRE), and the downstream gait metric learning model. The paper employs DINOv2 as the upstream model for its strong all-purpose features, and GaitBase as the downstream model. The core innovation, the GRE module, bridges the two by transforming the upstream features into gait-relevant information through three branches: a mask branch, an appearance branch, and a denoising branch. These branches respectively remove background clutter, compress the masked features into a compact gait-oriented representation, and suppress texture-related noise.
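
To make the data flow concrete, the following is a minimal PyTorch sketch of this three-stage pipeline, assuming the frozen upstream model yields dense patch features of shape (N, C, h, w). The branch architectures, channel sizes, class names (GaitRepresentationExtractor, BigGaitSketch), and the stand-in upstream/downstream modules in the toy usage are illustrative assumptions, not the paper's exact implementation; the GRE's auxiliary objectives (e.g. mask and smoothness losses) are omitted.

```python
import torch
import torch.nn as nn


class GaitRepresentationExtractor(nn.Module):
    """Illustrative stand-in for the GRE: a mask branch that softly separates
    foreground from background, plus appearance and denoising branches that
    compress the masked features into compact gait-oriented maps."""

    def __init__(self, in_dim=384, out_dim=16):
        super().__init__()
        self.mask_branch = nn.Conv2d(in_dim, 2, kernel_size=1)        # bg/fg logits
        self.appearance_branch = nn.Conv2d(in_dim, out_dim, kernel_size=1)
        self.denoising_branch = nn.Conv2d(in_dim, out_dim, kernel_size=1)

    def forward(self, feats):                      # feats: (N, C, h, w) patch features
        fg = self.mask_branch(feats).softmax(dim=1)[:, 1:2]   # soft foreground mask
        masked = feats * fg                                   # suppress background
        # The paper additionally regularizes the denoising branch toward
        # smooth, texture-free maps; that loss is omitted in this sketch.
        return self.appearance_branch(masked), self.denoising_branch(masked)


class BigGaitSketch(nn.Module):
    """Frozen upstream LVM -> GRE -> trainable downstream gait recognizer."""

    def __init__(self, upstream, downstream, feat_dim=384):
        super().__init__()
        self.upstream = upstream.eval()            # frozen feature extractor
        for p in self.upstream.parameters():
            p.requires_grad = False
        self.gre = GaitRepresentationExtractor(feat_dim)
        self.downstream = downstream               # e.g. a GaitBase-style network

    def forward(self, frames):                     # frames: (B, T, 3, H, W)
        B, T = frames.shape[:2]
        with torch.no_grad():
            feats = self.upstream(frames.flatten(0, 1))       # (B*T, C, h, w)
        app, den = self.gre(feats)
        gait_maps = torch.cat([app, den], dim=1)              # implicit gait representation
        gait_maps = gait_maps.view(B, T, *gait_maps.shape[1:])
        return self.downstream(gait_maps)                     # sequence-level embedding


# Toy usage with stand-in modules (a real setup would plug in DINOv2 and GaitBase):
upstream = nn.Conv2d(3, 384, kernel_size=14, stride=14)       # dummy patch encoder
downstream = lambda x: x.mean(dim=(1, 3, 4))                  # dummy temporal/spatial pooling
model = BigGaitSketch(upstream, downstream)
embedding = model(torch.randn(2, 8, 3, 224, 224))             # -> shape (2, 32)
```

In the paper's actual setup, DINOv2 supplies the upstream patch features and a GaitBase-style recognizer consumes the resulting gait maps; only the GRE and the downstream model are trained, which is what keeps the pipeline free of third-party supervision signals.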

Experimental results on the CCPG, CASIA-B*, and SUSTech1K datasets show that BigGait surpasses previous methods in most within-domain and cross-domain settings. These results illustrate the effectiveness of LVM-derived gait representations and suggest a new paradigm for biometric recognition that minimizes reliance on task-specific upstream models and manual annotation.

However, the paper also identifies challenges in employing LVMs for gait recognition. One major concern is the interpretability of the learned representations, which, unlike traditional gait representations such as silhouette sequences, lack explicit physical meaning. Another is maintaining the purity of the representation, i.e., ensuring it emphasizes gait-related cues while excluding unrelated appearance noise. These challenges point toward critical directions for future research, particularly in developing interpretation techniques and refining the feature extraction process to improve the fidelity of gait representations.

Theoretically, this work points toward more generalized feature extraction techniques in computer vision, decreasing dependency on domain-specific knowledge and manual annotation. Practically, the approach could reduce data-labeling costs across biometric applications, making it easier to build recognition systems in domains where annotated data is scarce or cumbersome to obtain.

As the exploration of LVMs in gait recognition progresses, future research could focus on addressing challenges identified in this paper and exploring diverse LVM architectures. There is also scope for applying the insights from BigGait to broader areas in image-based recognition tasks, advocating for the use of task-agnostic features in enhancing model generalization and performance. The paper provides both a promising direction for future exploration and a practical contribution in reducing the resource constraints typically associated with traditional gait and biometric recognition methods.