
Just Add $π$! Pose Induced Video Transformers for Understanding Activities of Daily Living (2311.18840v1)

Published 30 Nov 2023 in cs.CV

Abstract: Video transformers have become the de facto standard for human action recognition, yet their exclusive reliance on the RGB modality still limits their adoption in certain domains. One such domain is Activities of Daily Living (ADL), where RGB alone is not sufficient to distinguish between visually similar actions, or actions observed from multiple viewpoints. To facilitate the adoption of video transformers for ADL, we hypothesize that the augmentation of RGB with human pose information, known for its sensitivity to fine-grained motion and multiple viewpoints, is essential. Consequently, we introduce the first Pose Induced Video Transformer: PI-ViT (or $\pi$-ViT), a novel approach that augments the RGB representations learned by video transformers with 2D and 3D pose information. The key elements of $\pi$-ViT are two plug-in modules, the 2D Skeleton Induction Module and the 3D Skeleton Induction Module, which are responsible for inducing 2D and 3D pose information into the RGB representations. These modules operate by performing pose-aware auxiliary tasks, a design choice that allows $\pi$-ViT to discard the modules during inference. Notably, $\pi$-ViT achieves state-of-the-art performance on three prominent ADL datasets, encompassing both real-world and large-scale RGB-D datasets, without requiring poses or additional computational overhead at inference.
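The core design idea of the abstract, auxiliary pose-supervision modules that shape the RGB features during training and are then discarded at inference, can be illustrated with a minimal, hypothetical sketch. This is not the paper's architecture: the backbone stand-in, the regression-style auxiliary task, and all names and shapes below are invented for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def rgb_backbone(clip, W):
    """Stand-in for a video transformer: maps an RGB clip to a feature vector."""
    return np.tanh(clip @ W)

def skeleton_induction_loss(features, pose_target, P):
    """Hypothetical pose-aware auxiliary task: regress a skeleton target
    from RGB features. The projection P exists only during training and
    is discarded at inference, mirroring the plug-in-module design."""
    pose_pred = features @ P
    return float(np.mean((pose_pred - pose_target) ** 2))

# Toy dimensions: 8-dim "clip", 4-dim features, 6-dim pose target, 3 classes.
W = rng.normal(size=(8, 4))   # backbone weights (kept at inference)
P = rng.normal(size=(4, 6))   # induction-module weights (training only)
clip = rng.normal(size=8)
pose = rng.normal(size=6)

# Training step: the auxiliary loss injects pose information into the features.
feats = rgb_backbone(clip, W)
train_loss = skeleton_induction_loss(feats, pose, P)

# Inference: only the backbone runs -- no pose input, no induction module.
logits = rgb_backbone(clip, W) @ rng.normal(size=(4, 3))
```

The point of the sketch is the asymmetry: the pose branch contributes only a training-time loss, so the inference path has exactly the cost of the plain RGB backbone.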
