Understanding Video Transformers via Universal Concept Discovery (2401.10831v3)

Published 19 Jan 2024 in cs.CV, cs.AI, cs.LG, and cs.RO

Abstract: This paper studies the problem of concept-based interpretability of transformer representations for videos. Concretely, we seek to explain the decision-making process of video transformers based on high-level, spatiotemporal concepts that are automatically discovered. Prior research on concept-based interpretability has concentrated solely on image-level tasks. Comparatively, video models deal with the added temporal dimension, increasing complexity and posing challenges in identifying dynamic concepts over time. In this work, we systematically address these challenges by introducing the first Video Transformer Concept Discovery (VTCD) algorithm. To this end, we propose an efficient approach for unsupervised identification of units of video transformer representations - concepts, and ranking their importance to the output of a model. The resulting concepts are highly interpretable, revealing spatio-temporal reasoning mechanisms and object-centric representations in unstructured video models. Performing this analysis jointly over a diverse set of supervised and self-supervised representations, we discover that some of these mechanisms are universal in video transformers. Finally, we show that VTCD can be used for fine-grained action recognition and video object segmentation.

Authors (6)
  1. Matthew Kowal (15 papers)
  2. Achal Dave (31 papers)
  3. Rares Ambrus (53 papers)
  4. Adrien Gaidon (84 papers)
  5. Konstantinos G. Derpanis (48 papers)
  6. Pavel Tokmakov (32 papers)
Citations (3)

Summary

Interpretability in Video Transformers

Overview of Video Transformer Concept Discovery

Transformers have become the dominant architecture across machine learning, including video understanding. Their complexity, however, makes them opaque: it is often unclear how their internal representations lead to a given prediction. To address this gap, the authors introduce the Video Transformer Concept Discovery (VTCD) algorithm, the first method for explaining the inner workings of video transformers. VTCD decomposes a transformer's intermediate representations into high-level, human-interpretable 'concepts' without requiring a predefined label set, and ranks each concept's importance to the model's output.
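
To make the idea concrete, the following is a minimal sketch of the discovery step: cluster the token features from one transformer layer into spatio-temporal groups that serve as candidate concepts. It is an illustrative stand-in rather than the paper's exact pipeline; the random features and the `discover_concepts` helper are assumptions made for the example.

```python
# Minimal sketch: group one layer's video-transformer token features into
# spatio-temporal "concepts" by clustering. Not the authors' exact VTCD pipeline;
# it only illustrates the idea of turning activations into interpretable units.
import numpy as np
from sklearn.cluster import KMeans

def discover_concepts(features: np.ndarray, n_concepts: int = 10, seed: int = 0):
    """Cluster per-token features into concept assignments.

    features: array of shape (T, H, W, C) -- one layer's token features for a
              single video (T frames, an H x W grid of tokens per frame).
    Returns (labels, centroids): labels has shape (T, H, W), centroids (n_concepts, C).
    """
    t, h, w, c = features.shape
    tokens = features.reshape(-1, c)                     # flatten the spatio-temporal grid
    km = KMeans(n_clusters=n_concepts, n_init=10, random_state=seed).fit(tokens)
    labels = km.labels_.reshape(t, h, w)                 # each token -> a concept id
    return labels, km.cluster_centers_

if __name__ == "__main__":
    # Random features stand in for real activations from a video transformer layer.
    feats = np.random.randn(8, 14, 14, 768).astype(np.float32)
    labels, centroids = discover_concepts(feats, n_concepts=6)
    print(labels.shape, centroids.shape)   # (8, 14, 14) (6, 768)
```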

The Importance of Understanding AI Decisions

Transparency in AI models matters for several reasons: it supports regulatory compliance, reduces risks during deployment, and can inform better model design. Interpretability is especially pressing for video models, where the temporal dimension adds complexity beyond image-level tasks. Prior concept-based interpretability research has concentrated almost entirely on images and largely overlooked video. VTCD fills this gap by exposing a video transformer's reasoning: it identifies the significant spatio-temporal concepts the model relies on and quantifies their contribution to its predictions.
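
One simple way to quantify a concept's contribution is to ablate the tokens assigned to it and measure the drop in the target class score, as sketched below. The `score_fn` hook and the toy linear scorer are hypothetical stand-ins for re-running the remaining transformer layers; the paper's own ranking procedure differs in its details.

```python
# Hedged sketch of concept importance ranking: mask out each concept's tokens
# and record how much the model's target score drops (bigger drop = more important).
import numpy as np

def concept_importance(score_fn, features, labels, n_concepts):
    base = score_fn(features)
    importances = []
    for k in range(n_concepts):
        ablated = features.copy()
        ablated[labels == k] = 0.0                       # zero out this concept's tokens
        importances.append(base - score_fn(ablated))     # score drop for concept k
    return np.array(importances)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(8, 14, 14, 768)).astype(np.float32)
    labels = rng.integers(0, 6, size=(8, 14, 14))        # fake concept assignments
    w = rng.normal(size=768).astype(np.float32)
    toy_score = lambda f: float(f.mean(axis=(0, 1, 2)) @ w)   # stand-in for the model head
    print(concept_importance(toy_score, feats, labels, 6))
```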

Unveiling the Universal Mechanisms

Applying VTCD jointly to a diverse set of video transformers trained with different objectives, the authors uncover mechanisms that appear to be universal. Regardless of the training objective, early layers build a common spatio-temporal foundation, while deeper layers develop object-centric representations. These findings suggest that video transformers learn to organize temporal information and capture object dynamics even in the absence of supervised training.
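
A rough way to test whether two models discover similar concepts, assuming both are probed on the same video and the same token grid, is to compare the spatio-temporal support of their concepts directly, for instance with IoU between concept masks. The sketch below does exactly that, with random label grids standing in for real concept assignments; it illustrates the comparison, not the paper's precise universality metric.

```python
# Sketch: compare concepts from two models on the same video by the IoU of their
# spatio-temporal support masks. Assumes both label grids cover the same (T, H, W) tokens.
import numpy as np

def concept_overlap(labels_a, labels_b, n_a, n_b):
    """Return an (n_a, n_b) IoU matrix between concept masks from two models."""
    iou = np.zeros((n_a, n_b))
    for i in range(n_a):
        mask_a = labels_a == i
        for j in range(n_b):
            mask_b = labels_b == j
            inter = np.logical_and(mask_a, mask_b).sum()
            union = np.logical_or(mask_a, mask_b).sum()
            iou[i, j] = inter / union if union else 0.0
    return iou

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    la = rng.integers(0, 6, size=(8, 14, 14))            # fake concepts from model A
    lb = rng.integers(0, 6, size=(8, 14, 14))            # fake concepts from model B
    iou = concept_overlap(la, lb, 6, 6)
    print(iou.max(axis=1))   # best cross-model match for each of model A's concepts
```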

Practical Applications and Performances

Beyond its analytical value, VTCD has practical uses. Its importance scores can guide the refinement of pre-trained transformers by pruning the least significant components, improving both accuracy and efficiency. For instance, applied to an action classification model, VTCD-guided pruning improved accuracy by approximately 4.3% while reducing computation by a third. This demonstrates VTCD's potential to make transformers for video analysis both more accurate and more cost-effective.
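
As an illustration of the pruning idea, the sketch below turns per-head importance scores into a keep/drop mask that retains roughly two thirds of the attention heads. The `head_pruning_mask` helper and the random score matrix are assumptions for the example; the actual pruning would be applied inside the model itself.

```python
# Minimal sketch of pruning the least important attention heads given per-head
# importance scores (e.g., aggregated concept importances). Here we only compute
# a keep/drop mask; applying it requires modifying the model.
import numpy as np

def head_pruning_mask(head_scores: np.ndarray, keep_fraction: float = 0.67):
    """head_scores: (n_layers, n_heads) importance matrix.
    Returns a boolean mask of the same shape: True = keep the head."""
    flat = head_scores.ravel()
    n_keep = int(np.ceil(keep_fraction * flat.size))
    threshold = np.sort(flat)[::-1][n_keep - 1]          # score of the last head kept
    return head_scores >= threshold

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    scores = rng.random((12, 12))                        # e.g., a 12-layer, 12-head ViT
    mask = head_pruning_mask(scores, keep_fraction=2 / 3)  # drop roughly a third of heads
    print(mask.sum(), "of", mask.size, "heads kept")
```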

In essence, VTCD stands as an important tool not only for demystifying the decision processes of video transformers but also for enhancing their performance for specialized tasks. As artificial intelligence continues to evolve and integrate into more domains, such tools will be increasingly valuable for making these powerful systems transparent and trustworthy.