Multi-scale Context-aware Network with Transformer for Gait Recognition (2204.03270v3)
Abstract: Gait recognition has drawn increasing research attention recently; because silhouette differences are quite subtle in the spatial domain, temporal feature representation is crucial for gait recognition. Inspired by the observation that humans can distinguish the gaits of different subjects by adaptively focusing on clips of varying time scales, we propose a multi-scale context-aware network with transformer (MCAT) for gait recognition. MCAT generates temporal features at three scales and adaptively aggregates them using contextual information from both local and global perspectives. Specifically, MCAT contains an adaptive temporal aggregation (ATA) module that performs local relation modeling followed by global relation modeling to fuse the multi-scale features. In addition, to remedy the spatial feature corruption caused by temporal operations, MCAT incorporates a salient spatial feature learning (SSFL) module that selects groups of discriminative spatial features. Extensive experiments on three datasets demonstrate state-of-the-art performance: we achieve rank-1 accuracies of 98.7%, 96.2%, and 88.7% under the normal walking, bag-carrying, and coat-wearing conditions on CASIA-B, 97.5% on OU-MVLP, and 50.6% on GREW. The source code will be available at https://github.com/zhuduowang/MCAT.git.
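To make the multi-scale aggregation idea concrete, the following is a minimal PyTorch sketch of adaptive fusion over three temporal scales. It is an illustration only, not the authors' implementation (which lives in the linked repository): the window sizes, the single linear scoring head, and the use of average pooling as the scale generator are all assumptions, and the scoring head stands in for the paper's richer local-plus-global relation modeling.

```python
# Illustrative sketch of multi-scale temporal aggregation in the spirit of
# MCAT's ATA module. All names and hyperparameters here are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveTemporalAggregation(nn.Module):
    """Fuse frame-level features across three temporal scales with learned weights."""

    def __init__(self, dim: int, windows=(1, 3, 5)):
        super().__init__()
        self.windows = windows  # assumed scales: frame-level, short clip, longer clip
        # One score per scale; a softmax over scales yields the fusion weights.
        self.score = nn.Linear(dim, len(windows))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) sequence of frame-level features.
        scales = []
        for w in self.windows:
            if w == 1:
                scales.append(x)
            else:
                # Average-pool over a local temporal window, keeping length T.
                pooled = F.avg_pool1d(
                    x.transpose(1, 2), kernel_size=w, stride=1, padding=w // 2
                ).transpose(1, 2)
                scales.append(pooled)
        stacked = torch.stack(scales, dim=2)                   # (B, T, S, D)
        # Context-dependent weights per frame and scale (a simplified stand-in
        # for the local and global relation modeling described in the paper).
        weights = self.score(x).softmax(dim=-1)                # (B, T, S)
        fused = (stacked * weights.unsqueeze(-1)).sum(dim=2)   # (B, T, D)
        return fused
```

In this sketch, each frame's own feature decides how much to trust each temporal scale at that time step; the paper's ATA module replaces this per-frame scoring with explicit local and global relation modeling before fusion.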