
An Integrated Framework for Multi-Granular Explanation of Video Summarization

Published 16 May 2024 in cs.CV and cs.AI | (2405.10082v1)

Abstract: In this paper, we propose an integrated framework for multi-granular explanation of video summarization. The framework integrates methods for producing explanations at both the fragment level (indicating which video fragments most influenced the summarizer's decisions) and the more fine-grained visual object level (highlighting which visual objects were the most influential for the summarizer). To build this framework, we extend our previous work in this field by investigating the use of a model-agnostic, perturbation-based approach for fragment-level explanation of the video summarization results, and by introducing a new method that combines the results of video panoptic segmentation with an adaptation of a perturbation-based explanation approach to produce object-level explanations. The performance of the developed framework is evaluated using a state-of-the-art summarization method and two benchmark datasets for video summarization. The findings of the quantitative and qualitative evaluations demonstrate the ability of our framework to identify the most and least influential fragments and visual objects of the video for the summarizer, and to provide a comprehensive set of visual explanations of the output of the summarization process.
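The fragment-level idea in the abstract (perturb fragments, observe how the summarizer's output changes, attribute influence) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `explain_fragments`, the zero-masking of fragment features, the negative L2 distance used as the response, and the toy summarizer are all assumptions chosen for illustration. The sketch fits a simple linear surrogate to random keep/mask patterns, in the spirit of perturbation-based, model-agnostic explainers.

```python
import numpy as np

def explain_fragments(features, fragments, summarizer, n_samples=200, seed=0):
    """Hypothetical perturbation-based fragment attribution (illustrative only).

    features:   (n_frames, d) array of frame features
    fragments:  list of (start, end) frame ranges, end exclusive
    summarizer: callable mapping (n_frames, d) -> (n_frames,) importance scores
    Returns one weight per fragment; a larger weight means that keeping the
    fragment does more to preserve the summarizer's original output.
    """
    rng = np.random.default_rng(seed)
    n_frag = len(fragments)
    base = summarizer(features)

    # Random binary patterns: 1 = keep the fragment, 0 = mask it out.
    masks = rng.integers(0, 2, size=(n_samples, n_frag))
    responses = np.empty(n_samples)
    for i, mask in enumerate(masks):
        perturbed = features.copy()
        for keep, (start, end) in zip(mask, fragments):
            if not keep:
                perturbed[start:end] = 0.0  # assumption: masking = zeroing features
        # Assumption: response = negative distance to the unperturbed output.
        responses[i] = -np.linalg.norm(summarizer(perturbed) - base)

    # Linear surrogate (with intercept) fitted by ordinary least squares.
    design = np.column_stack([masks, np.ones(n_samples)])
    coef, *_ = np.linalg.lstsq(design, responses, rcond=None)
    return coef[:n_frag]

# Toy setup: three 10-frame fragments; the middle one carries the largest features.
features = np.concatenate([
    np.full((10, 4), 0.1),
    np.full((10, 4), 1.0),
    np.full((10, 4), 0.3),
])
fragments = [(0, 10), (10, 20), (20, 30)]

def toy_summarizer(x):
    # Stand-in for a real summarizer: frame score = sum of its features.
    return x.sum(axis=1)

weights = explain_fragments(features, fragments, toy_summarizer)
most_influential = int(np.argmax(weights))
print("fragment weights:", weights)
print("most influential fragment:", most_influential)
```

In this toy case, masking the middle fragment changes the output the most, so it receives the largest weight. A real summarizer, a different masking strategy (for example, replacing frames rather than features), or a different response measure would change the details.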

