Transformer-based Fusion of 2D-pose and Spatio-temporal Embeddings for Distracted Driver Action Recognition (2403.06577v1)
Abstract: Classification and localization of driving actions over time is important for advanced driver-assistance systems and naturalistic driving studies. Temporal localization is challenging because it requires robustness, reliability, and accuracy. In this study, we aim to improve the temporal localization and classification accuracy performance by adapting video action recognition and 2D human-pose estimation networks to one model. Therefore, we design a transformer-based fusion architecture to effectively combine 2D-pose features and spatio-temporal features. The model uses 2D-pose features as the positional embedding of the transformer architecture and spatio-temporal features as the main input to the encoder of the transformer. The proposed solution is generic and independent of the camera numbers and positions, giving frame-based class probabilities as output. Finally, the post-processing step combines information from different camera views to obtain final predictions and eliminate false positives. The model performs well on the A2 test set of the 2023 NVIDIA AI City Challenge for naturalistic driving action recognition, achieving the overlap score of the organizer-defined distracted driver behaviour metric of 0.5079.
- National Highway Traffic Safety Administration. Traffic safety facts research note: Distracted driving 2019 (dot hs 813 111). National Highway Traffic Safety Administration., 2021.
- Density-guided label smoothing for temporal localization of driving actions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3174–3182, 2022.
- Skelemotion: A new representation of skeleton joint sequences based on motion information for 3d action recognition. CoRR, abs/1907.13025, 2019.
- Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
- Potion: Pose motion representation for action recognition. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7024–7033, 2018.
- An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- Revisiting skeleton-based action recognition. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2959–2968, 2022.
- Driver distraction identification with an ensemble of convolutional neural networks. Journal of Advanced Transportation, 2019, 2019.
- Multiscale vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6824–6835, 2021.
- Christoph Feichtenhofer. X3d: Expanding architectures for efficient video recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 203–213, 2020.
- Slowfast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6202–6211, 2019.
- Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
- Long short-term memory. Neural computation, 9:1735–80, 12 1997.
- Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.
- Mmtm: Multimodal transfer module for cnn fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13289–13299, 2020.
- Large-scale video classification with convolutional neural networks. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014.
- The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
- Driver anomaly detection: A dataset and contrastive learning approach. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 91–100, 2021.
- Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation. arXiv preprint arXiv:1804.06055, 2018.
- Semantic–structural graph convolutional networks for whole-body human pose estimation. Information, 13:109, 02 2022.
- Tsm: Temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7083–7093, 2019.
- Microsoft COCO: common objects in context. CoRR, abs/1405.0312, 2014.
- Two-stream 3d convolutional neural network for skeleton-based action recognition. arXiv preprint arXiv:1705.08106, 2017.
- Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12009–12019, 2022.
- 2d/3d pose estimation and action recognition using multitask deep learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5137–5146, 2018.
- Pose estimation for augmented reality: A hands-on survey. IEEE Transactions on Visualization and Computer Graphics, 22(12):2633–2651, 2016.
- Drive&act: A multi-modal dataset for fine-grained driver behavior recognition in autonomous vehicles. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2801–2810, 2019.
- The 7th AI City Challenge. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2023.
- Synthetic Distracted Driving (SynDD2) dataset for analyzing distracted behaviors and various gaze zones of a driver, 2022. arXiv:2204.08096.
- Faster R-CNN: towards real-time object detection with region proposal networks. CoRR, abs/1506.01497, 2015.
- Two-stream convolutional networks for action recognition in videos. Advances in Neural Information Processing Systems, 1, 06 2014.
- Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- Deep high-resolution representation learning for human pose estimation. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5686–5696, 2019.
- Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
- Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 4489–4497, 2015.
- A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 6450–6459, 2018.
- Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence, AAAI’18/IAAI’18/EAAI’18. AAAI Press, 2018.
- Temporal pyramid network for action recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 591–600, 2020.
- Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
- Trajectory convolution for action recognition. Advances in neural information processing systems, 31, 2018.
- Robust graph convolutional networks against adversarial attacks. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pages 1399–1407, 2019.
Sponsored by Paperpile, the PDF & BibTeX manager trusted by top AI labs.
Get 30 days freePaper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.
Top Community Prompts
Collections
Sign up for free to add this paper to one or more collections.