Understanding Video Transformers for Segmentation: A Survey of Application and Interpretability (2310.12296v1)
Abstract: Video segmentation encompasses a wide range of problem formulations, e.g., object, scene, actor-action and multimodal video segmentation, for delineating task-specific scene components with pixel-level masks. Recently, approaches in this research area have shifted from ConvNet-based to transformer-based models. In addition, various interpretability approaches have appeared for transformer models and video temporal dynamics, motivated by the growing interest in basic scientific understanding, model diagnostics and societal implications of real-world deployment. Previous surveys mainly focused on ConvNet models on a subset of video segmentation tasks or transformers for classification tasks. Moreover, component-wise discussion of transformer-based video segmentation models has not yet received due focus. In addition, previous reviews of interpretability methods focused on transformers for classification, while analysis of the temporal dynamics modelling capabilities of video models received less attention. In this survey, we address the above with a thorough discussion of various categories of video segmentation, a component-wise discussion of the state-of-the-art transformer-based models, and a review of related interpretability methods. We first present an introduction to the different video segmentation task categories, their objectives, specific challenges and benchmark datasets. Next, we provide a component-wise review of recent transformer-based models and document the state of the art on different video segmentation tasks. Subsequently, we discuss post-hoc and ante-hoc interpretability methods for transformer models and interpretability methods for understanding the role of the temporal dimension in video models. Finally, we conclude our discussion with future research directions.
- A generalist agent. Transactions on Machine Learning Research, 11:1–42, 2022.
- Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the Conference on Advances in Neural Information Processing Systems, 2015.
- Reciprocal transformations for unsupervised video object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15455–15464, 2021.
- Statistical background modeling for non-stationary camera. Pattern Recognition Letters, 24(1-3):183–196, 2003.
- “Why should I trust you?” Explaining the predictions of any classifier. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144, 2016.
- Playing for benchmarks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2213–2222, 2017.
- Playing for data: Ground truth from computer games. In Proceedings of the European Conference on Computer Vision, pages 102–118, 2016.
- Attention-based interpretability with concept transformers. In International Conference on Learning Representations, 2021.
- Hierarchical Object Representations in the Visual Cortex and Computer Vision, volume 9 of Frontiers in Computational Neuroscience. 2015.
- ERFNet: Efficient residual factorized convnet for real-time semantic segmentation. IEEE Transactions on Intelligent Transportation Systems, 19(1):263–272, 2017.
- Cynthia Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5):206–215, 2019.
- ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
- Evaluating the visualization of what a deep neural network has learned. IEEE Transactions on Neural Networks and Learning Systems, 28(11):2660–2673, 2016.
- DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.
- A framework for learning ante-hoc explainable models via concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10286–10295, 2022.
- Video transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023. to appear.
- Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 618–626, 2017.
- URVOS: Unified referring video object segmentation network with a large-scale benchmark. In Proceedings of the European Conference on Computer Vision, pages 208–223, 2020.
- Kernelized memory network for video object segmentation. In Proceedings of the European Conference on Computer Vision, pages 629–645, 2020.
- Hierarchical memory matching network for video object segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12889–12898, 2021.
- Only time can tell: Discovering temporal data for temporal modeling. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 535–544, 2021.
- Motion segmentation and tracking using normalized cuts. In Proceedings of the IEEE International Conference on Computer Vision, pages 1154–1160, 1998.
- Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
- Jaccard index compensation for object segmentation evaluation. In IEEE International Conference on Image Processing, pages 4457–4461, 2014.
- Learning important features through propagating activation differences. In Proceedings of the International Conference on Machine Learning, pages 3145–3153, 2017.
- Video object segmentation using teacher-student adaptation in a human robot interaction (HRI) setting. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 50–56, 2019.
- Deep inside convolutional networks: Visualising image classification models and saliency maps. In International Conference on Learning Representations, 2014.
- Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, pages 1–14, 2015.
- Feature selection via dependence maximization. Journal of Machine Learning Research, 13(5):1393–1434, 2012.
- UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
- Learning patterns of activity using real-time tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):747–757, 2000.
- Coarse-to-fine feature mining for video semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3126–3137, 2022.
- Axiomatic attribution for deep networks. In Proceedings of the International Conference on Machine Learning, pages 3319–3328, 2017.
- Going deeper with convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
- Rethinking the inception architecture for computer vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
- LXMERT: Learning cross-modality encoder representations from transformers. Proceedings of the Conference on Empirical Methods in Natural Language Processing and the International Joint Conference on Natural Language Processing, 2019.
- EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10781–10790, 2020.
- Learning global additive explanations for neural nets using model distillation. 2018.
- Temporal collection and distribution for referring video object segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15466–15476, 2023.
- Efficient transformers: A survey. ACM Computing Surveys, 55(6):1–28, 2022.
- Video instance segmentation via multi-scale spatio-temporal split attention transformer. In Proceedings of the European Conference on Computer Vision, pages 666–681, 2022.
- Video instance segmentation via multi-scale spatio-temporal split attention transformer. In In Proceedings of the European Conference on Computer Vision, 2022.
- A survey on video segmentation. In Proceedings of the International Conference on Advanced Computing, Networking, and Informatics, pages 903–912, 2014.
- Robust and efficient foreground analysis for real-time video surveillance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1182–1187, 2005.
- Conditional convolutions for instance segmentation. In Proceedings of the European Conference on Computer Vision, pages 282–298, 2020.
- Breaking the” object” in video object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22836–22845, 2023.
- Concerning Bayesian motion segmentation, model averaging, matching and the trifocal tensor. In Proceedings of the European Conference on Computer Vision, pages 511–527, 1998.
- Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, pages 10347–10357, 2021.
- Semantic co-segmentation in videos. In Proceedings of the European Conference on Computer Vision, pages 760–775, 2016.
- Vision transformers for action recognition: A survey. arXiv preprint arXiv:2209.05700, 2022.
- Tracking through containers and occluders in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13802–13812, 2023.
- Transfer learning improves supervised image segmentation across imaging protocols. IEEE Transactions on Medical Imaging, 34(5):1018–1030, 2014.
- Multi-task learning for dense prediction tasks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(7):3614–3633, 2021.
- Attention is all you need. In Proceedings of the Conference on Advances in Neural Information Processing Systems, 2017.
- Jesse Vig. Visualizing attention in transformer-based language representation models. arXiv preprint arXiv:1904.02679, 2019.
- FEELVOS: Fast end-to-end embedding learning for video object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9481–9490, 2019.
- MOTS: Multi-object tracking and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7942–7951, 2019.
- Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, page 5797–5808, 2019.
- Predicting actions from static scenes. In Proceedings of the European Conference on Computer Vision, pages 421–436, 2014.
- Temporal memory attention for video semantic segmentation. In Proceedings of the IEEE International Conference on Image Processing, pages 2254–2258, 2021.
- MaX-DeepLab: End-to-end panoptic segmentation with mask transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5463–5474, 2021.
- Deformable video transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14053–14062, 2022.
- Unidentified video objects: A benchmark for dense, open-world segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10776–10785, 2021.
- Zero-shot video object segmentation via attentive graph neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9236–9245, 2019.
- Saliency-aware geodesic video object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3395–3402, 2015.
- Selective video object cutout. IEEE Transactions on Image Processing, 26(12):5645–5655, 2017.
- Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 568–578, 2021.
- FreeSOLO: Learning to segment objects without annotations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14176–14186, 2022.
- Dense contrastive learning for self-supervised visual pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3024–3033, 2021.
- Cut and learn for unsupervised object detection and instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3124–3134, 2023.
- End-to-end video instance segmentation with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8741–8750, 2021.
- STEP: Segmenting and tracking every pixel. In Proceedings of the Conference on Neural Information Processing Systems, 2021.
- Caltech-UCSD birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
- Learning to associate every segment for video panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2705–2714, 2021.
- Pfinder: Real-time tracking of the human body. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):780–785, 1997.
- MeMViT: Memory-augmented multiscale vision transformer for efficient long-term video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13587–13597, 2022.
- Efficient video instance segmentation via tracklet query and proposal. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 959–968, 2022.
- Language as queries for referring video object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4974–4984, 2022.
- SeqFormer: Sequential transformer for video instance segmentation. In Proceedings of the European Conference on Computer Vision, pages 553–569, 2022.
- In defense of online models for video instance segmentation. In In Proceedings of the European Conference on Computer Vision, pages 588–605, 2022.
- Beyond sparsity: Tree regularization of deep models for interpretability. In Proceedings of the AAAI conference on artificial intelligence, 2018.
- Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
- Gradient-free activation maximization for identifying effective stimuli. arXiv preprint arXiv:1905.00378, 2019.
- SegFormer: Simple and efficient design for semantic segmentation with transformers. In Proceedings of the Conference on Advances in Neural Information Processing Systems, pages 12077–12090, 2021.
- Actor-action semantic segmentation with grouping process models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3083–3092, 2016.
- Can humans fly? Action understanding with multiple classes of actors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2264–2273, 2015.
- Auto-FPN: Automatic network architecture adaptation for object detection beyond classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6649–6658, 2019.
- Deep interactive object selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 373–381, 2016.
- YouTube-VOS: Sequence-to-sequence video object segmentation. In Proceedings of the European Conference on Computer Vision, pages 585–601, 2018.
- Video scene parsing: An overview of deep learning methods and datasets. Computer Vision and Image Understanding, 201, 2020.
- The 3rd large-scale video object segmentation challenge - video instance segmentation track, 2021.
- Video instance segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5188–5197, 2019.
- Crossover learning for fast online video instance segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8043–8052, 2021.
- Temporally efficient vision transformer for video instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2885–2895, 2022.
- Collaborative video object segmentation by foreground-background integration. In Proceedings of the European Conference on Computer Vision, pages 332–348, 2020.
- Associating objects with transformers for video object segmentation. In Proceedings of the Conference on Advances in Neural Information Processing Systems, pages 2491–2502, 2021.
- Collaborative video object segmentation by multi-scale foreground-background integration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9):4701–4712, 2021.
- Video object segmentation and tracking: A survey. ACM Transactions on Intelligent Systems and Technology (TIST), 11(4):1–47, 2020.
- CTVIS: Consistent training for online video instance segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 899–908, 2023.
- PolyphonicFormer: Unified query learning for depth-aware video panoptic segmentation. In Proceedings of the European Conference on Computer Vision, pages 582–599, 2022.
- Isomer: Isomerous transformer for zero-shot video object segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 966–976, 2023.
- Wide residual networks. In Procedings of the British Machine Vision Conference, 2016.
- Visualizing and understanding convolutional networks. In Proceedings of the European Conference on Computer Vision, pages 818–833, 2014.
- Video object segmentation through spatially accurate and temporally dense extraction of primary object regions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 628–635, 2013.
- Interpretable convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8827–8836, 2018.
- Interpreting CNNs via decision trees. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6261–6270, 2019.
- DVIS: Decoupled video instance segmentation framework. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023.
- A survey on neural network interpretability. IEEE Transactions on Emerging Topics in Computational Intelligence, 5(5):726–742, 2021.
- Yu Zhang and Qiang Yang. A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering, 34(12):5586–5609, 2021.
- Yu-Jin Zhang. An overview of image and video segmentation in the last 40 years. Advances in Image and Video Segmentation, pages 1–16, 2006.
- He Zhao and Richard P Wildes. Interpretable deep feature propagation for early action recognition. arXiv preprint arXiv:2107.05122, 2021.
- Trajectory convolution for action recognition. In Proceedings of the Conference on Advances in Neural Information Processing Systems, 2018.
- Discontinuity-aware video object cutout. ACM Transactions on Graphics, 31(6):1–10, 2012.
- Squeeze-and-attention networks for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13065–13074, 2020.
- Object detectors emerge in deep scene CNNs. In International Conference on Learning Representations, 2015.
- Learning deep features for discriminative localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2921–2929, 2016.
- Audio-visual segmentation. In Proceedings of the European Conference on Computer Vision, pages 386–403, 2022.
- A survey on deep learning technique for video segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(6):7099–7122, 2022.
- Motion-attentive transition for zero-shot video object segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 13066–13073, 2020.
- Cascaded human-object interaction recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4263–4272, 2020.
- Slot-VPS: Object-centric representation learning for video panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3093–3103, 2022.
- Deformable DETR: Deformable transformers for end-to-end object detection. In International Conference on Learning Representations, 2021.
Authors: Rezaul Karim, Richard P. Wildes