Language Model Guided Interpretable Video Action Reasoning (2404.01591v1)
Abstract: While neural networks have excelled in video action recognition tasks, their black-box nature often obscures the understanding of their decision-making processes. Recent approaches have used inherently interpretable models to analyze video actions in a manner akin to human reasoning; these models, however, usually fall short in performance compared to their black-box counterparts. In this work, we present a new framework named Language-guided Interpretable Action Recognition (LaIAR). LaIAR leverages knowledge from large language models (LLMs) to enhance both the recognition capabilities and the interpretability of video models. In essence, we redefine the problem of understanding video model decisions as a task of aligning the video model with the language model. Using the logical reasoning captured by the language model, we steer the training of the video model. This integrated approach not only improves the video model's adaptability to different domains but also boosts its overall performance. Extensive experiments on two complex video action datasets, Charades and CAD-120, validate the improved performance and interpretability of our LaIAR framework. The code of LaIAR is available at https://github.com/NingWang2049/LaIAR.
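The abstract frames LaIAR as aligning a video model with a language model, with the language model's reasoning steering the video model's training. The paper's actual objective is not spelled out here, so the following is a minimal sketch of one plausible reading, assuming a distillation-style alignment in which the language model's action predictions serve as soft targets for the video model alongside the usual supervised loss. All names, tensor shapes, and the loss weighting are hypothetical, not the authors' implementation.

```python
# Hypothetical sketch of LLM-guided training for a video action model.
# NOT the authors' implementation: the alignment loss, shapes, and all
# names below are assumptions based only on the abstract.
import torch
import torch.nn.functional as F

def alignment_loss(video_logits: torch.Tensor,
                   lm_logits: torch.Tensor,
                   labels: torch.Tensor,
                   temperature: float = 2.0,
                   alpha: float = 0.5) -> torch.Tensor:
    """Combine supervised cross-entropy with a KL term that pulls the
    video model's action distribution toward the language model's
    reasoning-derived distribution (distillation-style alignment)."""
    # Standard recognition loss on ground-truth action labels.
    ce = F.cross_entropy(video_logits, labels)
    # Soft alignment: KL(video || LM) on temperature-smoothed logits.
    # The LM branch is detached so gradients only update the video model.
    log_p_video = F.log_softmax(video_logits / temperature, dim=-1)
    p_lm = F.softmax(lm_logits.detach() / temperature, dim=-1)
    kl = F.kl_div(log_p_video, p_lm, reduction="batchmean") * temperature ** 2
    return alpha * ce + (1.0 - alpha) * kl

# Usage with placeholder tensors (batch of 4 clips, 157 Charades classes):
video_logits = torch.randn(4, 157, requires_grad=True)
lm_logits = torch.randn(4, 157)  # language model's action predictions
labels = torch.randint(0, 157, (4,))
loss = alignment_loss(video_logits, lm_logits, labels)
loss.backward()
```

Under this reading, interpretability comes from the language model's explicit reasoning chain, while the KL term transfers that decision structure into the faster video model at training time; at inference, only the video model is needed.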