Seeing the Pose in the Pixels: Learning Pose-Aware Representations in Vision Transformers (2306.09331v1)
Abstract: Human perception of surroundings is often guided by the various poses present within the environment. Many computer vision tasks, such as human action recognition and robot imitation learning, rely on pose-based entities like human skeletons or robotic arms. However, conventional Vision Transformer (ViT) models uniformly process all patches, neglecting valuable pose priors in input videos. We argue that incorporating poses alongside RGB data is advantageous for learning fine-grained and viewpoint-agnostic representations. Consequently, we introduce two strategies for learning pose-aware representations in ViTs. The first, the Pose-Aware Attention Block (PAAB), is a plug-and-play ViT block that performs localized attention on pose regions within videos. The second, the Pose-Aware Auxiliary Task (PAAT), introduces an auxiliary pose prediction task optimized jointly with the primary ViT task. Although their mechanisms differ, both methods learn pose-aware representations and improve performance on diverse downstream tasks. Our experiments across seven datasets demonstrate the efficacy of both pose-aware methods on three video analysis tasks, with PAAT holding a slight edge over PAAB. Both PAAT and PAAB surpass their respective backbone transformers by up to 9.8% in real-world action recognition and 21.8% in multi-view robotic video alignment. Code is available at https://github.com/dominickrei/PoseAwareVT.
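To make the two strategies concrete, here is a minimal PyTorch sketch of both ideas. It is not the authors' implementation: the names (`pose_masked_attention`, `PAATModel`), tensor shapes, head sizes, and the loss weight `lambda_pose` are assumptions for illustration only; see the linked repository for the actual code.

```python
# Illustrative sketches only; shapes, names, and hyperparameters are
# assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


def pose_masked_attention(q, k, v, pose_mask):
    """PAAB-like localized attention: queries attend only to tokens whose
    patches overlap pose regions.

    q, k, v: (B, heads, N, d); pose_mask: (B, N) boolean, assumed to have
    at least one True token per sample (otherwise softmax yields NaNs).
    """
    scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
    # Broadcast the (B, N) key mask over heads and query positions.
    scores = scores.masked_fill(~pose_mask[:, None, None, :], float("-inf"))
    return scores.softmax(dim=-1) @ v


class PAATModel(nn.Module):
    """PAAT-like wrapper: a shared encoder with a primary classification
    head and an auxiliary 2D pose-regression head."""

    def __init__(self, encoder: nn.Module, embed_dim: int,
                 num_classes: int, num_joints: int):
        super().__init__()
        self.encoder = encoder  # any ViT-style video backbone
        self.cls_head = nn.Linear(embed_dim, num_classes)
        self.pose_head = nn.Linear(embed_dim, num_joints * 2)

    def forward(self, video):
        feat = self.encoder(video)  # (B, embed_dim) pooled features
        return self.cls_head(feat), self.pose_head(feat)


def joint_loss(logits, pose_pred, labels, pose_gt, lambda_pose=0.5):
    # Primary task loss plus a weighted auxiliary pose loss;
    # pose_gt: (B, num_joints, 2) ground-truth 2D keypoints.
    task_loss = F.cross_entropy(logits, labels)
    pose_loss = F.mse_loss(pose_pred, pose_gt.flatten(1))
    return task_loss + lambda_pose * pose_loss
```

In the PAAB sketch, the mask restricts attention to pose-region tokens, mirroring the localized attention described above; in the PAAT sketch, the auxiliary head steers the shared features toward pose awareness during training and can be discarded at inference.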