Understanding Video Transformers via Universal Concept Discovery (2401.10831v3)
Abstract: This paper studies the problem of concept-based interpretability of transformer representations for videos. Concretely, we seek to explain the decision-making process of video transformers based on high-level, spatiotemporal concepts that are automatically discovered. Prior research on concept-based interpretability has concentrated solely on image-level tasks. In comparison, video models must also handle the temporal dimension, which increases complexity and poses challenges in identifying dynamic concepts over time. In this work, we systematically address these challenges by introducing the first Video Transformer Concept Discovery (VTCD) algorithm. To this end, we propose an efficient approach for unsupervised identification of units of video transformer representations (concepts) and for ranking their importance to the output of a model. The resulting concepts are highly interpretable, revealing spatiotemporal reasoning mechanisms and object-centric representations in unstructured video models. Performing this analysis jointly over a diverse set of supervised and self-supervised representations, we discover that some of these mechanisms are universal in video transformers. Finally, we show that VTCD can be used for fine-grained action recognition and video object segmentation.
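The abstract only outlines the two stages of VTCD: unsupervised identification of concepts in video transformer representations, and ranking their importance to the model's output. As a rough illustration of that pipeline (not the authors' implementation), the sketch below groups spatiotemporal token features into candidate concepts with k-means and scores each concept by the drop in the target logit when its tokens are masked, an occlusion-style importance estimate. All names here (`discover_concepts`, `rank_concepts`, the assumed model interface) are hypothetical placeholders.

```python
# Hedged sketch of the two stages named in the abstract, under the assumption
# that a classifier head maps a (1, N_tokens, D) tensor of token features to
# class logits. Not the paper's actual VTCD algorithm.
import numpy as np
import torch
from sklearn.cluster import KMeans


def discover_concepts(features: np.ndarray, num_concepts: int = 10) -> np.ndarray:
    """Cluster spatiotemporal token features (N_tokens x D) into concepts.

    Each cluster of tokens is treated as one candidate concept; k-means is a
    stand-in for whatever unsupervised grouping the paper actually uses.
    """
    kmeans = KMeans(n_clusters=num_concepts, n_init=10, random_state=0)
    return kmeans.fit_predict(features)  # one concept label per token


@torch.no_grad()
def rank_concepts(model, tokens: torch.Tensor, labels: np.ndarray,
                  target_class: int) -> list[tuple[int, float]]:
    """Rank concepts by the drop in the target logit when their tokens are
    zeroed out (a simple masking-based importance score)."""
    base = model(tokens.unsqueeze(0))[0, target_class].item()
    scores = []
    for c in np.unique(labels):
        masked = tokens.clone()
        masked[torch.from_numpy(labels == c)] = 0.0  # remove concept c's tokens
        drop = base - model(masked.unsqueeze(0))[0, target_class].item()
        scores.append((int(c), drop))
    return sorted(scores, key=lambda s: s[1], reverse=True)
```

In practice the features would come from intermediate layers of the video transformer (per layer, or per attention head), so the same procedure can be repeated across layers and across models to compare which concepts recur; that cross-model comparison is what the abstract refers to as discovering universal mechanisms.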
Authors: Matthew Kowal, Achal Dave, Rares Ambrus, Adrien Gaidon, Konstantinos G. Derpanis, Pavel Tokmakov