
Exploring Explainability in Video Action Recognition (2404.09067v1)

Published 13 Apr 2024 in cs.CV and cs.AI

Abstract: Image Classification and Video Action Recognition are perhaps the two most foundational tasks in computer vision. Consequently, explaining the inner workings of trained deep neural networks is of prime importance. While numerous efforts focus on explaining the decisions of trained deep neural networks in image classification, exploration in the domain of its temporal version, video action recognition, has been scant. In this work, we take a deeper look at this problem. We begin by revisiting Grad-CAM, one of the popular feature attribution methods for Image Classification, and its extension to Video Action Recognition tasks, and examine the method's limitations. To address these, we introduce Video-TCAV, which builds on TCAV for Image Classification and aims to quantify the importance of specific concepts in the decision-making process of Video Action Recognition models. As the scalable generation of concepts is still an open problem, we propose a machine-assisted approach to generate spatial and spatiotemporal concepts relevant to Video Action Recognition for testing Video-TCAV. We then establish the importance of temporally-varying concepts by demonstrating the superiority of dynamic spatiotemporal concepts over trivial spatial concepts. In conclusion, we introduce a framework for investigating hypotheses in action recognition and quantitatively testing them, thus advancing research in the explainability of deep neural networks used in video action recognition.
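Since Video-TCAV extends TCAV to video models, the core mechanics can be illustrated with a minimal TCAV-style sketch: fit a linear classifier separating "concept" activations from "random" activations at a chosen layer, take its normal vector as the concept activation vector (CAV), and score how often class-logit gradients point along it. This is a rough illustration under assumptions (pre-extracted, flattened activations and gradients from an arbitrary video backbone; synthetic data; all names are illustrative), not the paper's implementation.

```python
# Minimal TCAV-style sketch (assumed setup: activations and class-logit
# gradients at one layer of a frozen video model, flattened to vectors).
import numpy as np
from sklearn.linear_model import LogisticRegression

def concept_activation_vector(concept_acts, random_acts):
    """Fit a linear classifier separating concept vs. random activations;
    the unit normal of its decision boundary is the CAV."""
    X = np.concatenate([concept_acts, random_acts])                  # (N, D)
    y = np.concatenate([np.ones(len(concept_acts)),                  # concept = 1
                        np.zeros(len(random_acts))])                 # random  = 0
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    v = clf.coef_.ravel()
    return v / np.linalg.norm(v)

def tcav_score(class_grads, cav):
    """Fraction of examples whose class-logit gradient (w.r.t. the layer
    activations) has a positive directional derivative along the CAV."""
    return float(np.mean(class_grads @ cav > 0))

# Toy example with synthetic 256-dim activations and gradients.
rng = np.random.default_rng(0)
cav = concept_activation_vector(rng.normal(1.0, 1.0, (50, 256)),
                                rng.normal(0.0, 1.0, (50, 256)))
print("TCAV score:", tcav_score(rng.normal(0.2, 1.0, (100, 256)), cav))
```

In the video setting described in the abstract, the "concept" examples would be clips (or clip segments) embodying a spatial or spatiotemporal concept, and the score would be compared across static and dynamic concepts.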

References (18)
  1. Local explanation methods for deep neural networks lack sensitivity to parameter values. arXiv preprint arXiv:1810.03307, 2018.
  2. On the robustness of interpretability methods. CoRR, abs/1806.08049, 2018.
  3. Assessing the trustworthiness of saliency maps for localizing abnormalities in medical imaging. Radiology: Artificial Intelligence, 3(6), 2021.
  4. Stable Video Diffusion: Scaling latent video diffusion models to large datasets, 2023.
  5. Brittle interpretations: The vulnerability of TCAV and other concept-based explainability tools to adversarial attack. CoRR, abs/2110.07120, 2021.
  6. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  7. Scene perception in the human brain. Annual Review of Vision Science, 5(1):373–397, 2019.
  8. SlowFast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6202–6211, 2019.
  9. Interpretation of neural networks is fragile. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 3681–3688, 2019.
  10. The Kinetics human action video dataset. CoRR, abs/1705.06950, 2017.
  11. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In International Conference on Machine Learning, pages 2668–2677. PMLR, 2018.
  12. The (un)reliability of saliency methods. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, pages 267–280. Springer, 2019.
  13. Video Swin Transformer, 2021.
  14. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 618–626, 2017.
  15. Axiomatic attribution for deep networks. In International Conference on Machine Learning, pages 3319–3328. PMLR, 2017.
  16. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors, 2022.
  17. Area V5 of the human brain: Evidence from a combined study using positron emission tomography and magnetic resonance imaging. Cerebral Cortex, 3(2):79–94, 1993.
  18. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2921–2929, 2016.
Authors (5)
  1. Avinab Saha (8 papers)
  2. Shashank Gupta (57 papers)
  3. Sravan Kumar Ankireddy (11 papers)
  4. Karl Chahine (4 papers)
  5. Joydeep Ghosh (74 papers)