
GestFormer: Multiscale Wavelet Pooling Transformer Network for Dynamic Hand Gesture Recognition (2405.11180v1)

Published 18 May 2024 in cs.CV and cs.HC

Abstract: Transformer models have achieved state-of-the-art results in many applications, such as NLP and classification, but their use in gesture recognition remains limited. We therefore propose GestFormer, a novel architecture for dynamic hand gesture recognition. The design is motivated by resource efficiency, since transformers are computationally expensive and complex. To that end, we use a pooling-based token mixer, PoolFormer, which replaces quadratic attention with a pooling layer, a non-parametric operation. The proposed model also leverages the space-invariant features of the wavelet transform and selects multiscale features using multi-scale pooling. Further, a gated mechanism helps the model focus on fine details of the gesture together with contextual information. As a result, the proposed model outperforms the traditional transformer while using fewer parameters when evaluated on two dynamic hand gesture datasets, NVidia Dynamic Hand Gesture and Briareo. To demonstrate the efficacy of the proposed model, we have experimented on single as well as multimodal inputs such as infrared, normals, depth, optical flow and color images. We have also compared the proposed GestFormer in terms of resource efficiency and number of operations. The source code is available at https://github.com/mallikagarg/GestFormer.
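
To make the pooling-based token mixing mentioned in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a 1D sequence of tokens (for example, one feature vector per frame) and follows the original PoolFormer formulation, in which each token is replaced by the average of its neighbours and the input is then subtracted. The function name `pool_token_mixer` and the default `pool_size=3` are illustrative assumptions; the wavelet, multi-scale pooling and gating components of GestFormer are not shown.

```python
import numpy as np


def pool_token_mixer(tokens, pool_size=3):
    """Mix tokens with stride-1 average pooling instead of attention.

    tokens: array of shape (num_tokens, dim).
    Returns an array of the same shape: pooled tokens minus the input,
    as in the PoolFormer token mixer. The operation has no learnable
    parameters and costs O(num_tokens * pool_size * dim).
    """
    tokens = np.asarray(tokens, dtype=float)
    num_tokens = tokens.shape[0]
    half = pool_size // 2
    pooled = np.empty_like(tokens)
    for i in range(num_tokens):
        # Clip the window at the sequence borders, so padding is never
        # counted in the average.
        lo = max(0, i - half)
        hi = min(num_tokens, i + half + 1)
        pooled[i] = tokens[lo:hi].mean(axis=0)
    # Subtract the input because the surrounding block is assumed to
    # add it back through a residual connection.
    return pooled - tokens


if __name__ == "__main__":
    frames = np.array([[0.0], [3.0], [6.0]])  # three 1-dimensional tokens
    print(pool_token_mixer(frames))
```

Because the mixer only averages neighbouring tokens, its cost grows linearly with the number of tokens, in contrast to the quadratic cost of self-attention that the abstract refers to.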
