Disentangling spatio-temporal knowledge for weakly supervised object detection and segmentation in surgical video (2407.15794v4)
Abstract: Weakly supervised video object segmentation (WSVOS) enables the identification of segmentation maps without requiring an extensive training dataset of object masks, relying instead on coarse video labels that indicate object presence. Current state-of-the-art methods either require multiple independent processing stages that employ motion cues or, in the case of end-to-end trainable networks, fall short in segmentation accuracy, in part because it is difficult to learn segmentation maps from videos in which objects are only transiently present. This limits the application of WSVOS to semantic annotation of surgical videos, where multiple surgical tools frequently move in and out of the field of view, a setting more difficult than those typically encountered in WSVOS. This paper introduces Video Spatio-Temporal Disentanglement Networks (VDST-Net), a framework that disentangles spatio-temporal information using semi-decoupled knowledge distillation to predict high-quality class activation maps (CAMs). A teacher network, designed to resolve temporal conflicts when the location and timing of objects in the video are not provided, works with a student network that integrates information over time by leveraging temporal dependencies. We demonstrate the efficacy of our framework on a public reference dataset and on a more challenging surgical video dataset where objects are, on average, present in less than 60% of annotated frames. Our method outperforms state-of-the-art techniques and generates superior segmentation masks under video-level weak supervision.
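To make the notion of a class activation map under video-level labels concrete, the following is a minimal sketch of the standard CAM formulation (a weighted sum of feature channels, whose spatial average equals the classification score). It is not the VDST-Net architecture or its semi-decoupled distillation scheme: the function names, the toy feature values, and the max-over-frames aggregation in `video_logit` are all assumptions made for illustration.

```python
from typing import List

Grid = List[List[float]]  # one H x W map


def class_activation_map(features: List[Grid], class_weights: List[float]) -> Grid:
    """Standard CAM for one class and one frame: CAM(x, y) = sum_k w_k * f_k(x, y)."""
    height, width = len(features[0]), len(features[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for weight, channel in zip(class_weights, features):
        for y in range(height):
            for x in range(width):
                cam[y][x] += weight * channel[y][x]
    return cam


def frame_logit(cam: Grid) -> float:
    """Global average pooling of the CAM gives the frame-level class score."""
    values = [v for row in cam for v in row]
    return sum(values) / len(values)


def video_logit(frame_cams: List[Grid]) -> float:
    """Hypothetical temporal aggregation (an assumption, not the paper's rule):
    take the max over frames, so a class counts as present if any frame shows it."""
    return max(frame_logit(cam) for cam in frame_cams)


# Toy video: 2 frames, 2 feature channels, 2 x 2 spatial grid (made-up values).
frame_with_tool = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 2.0]]]
frame_without_tool = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
weights = [1.0, 0.5]  # classifier weights for a single hypothetical tool class

cams = [class_activation_map(f, weights) for f in (frame_with_tool, frame_without_tool)]
score = video_logit(cams)
```

In this toy case only the first frame contributes to the video-level score, which illustrates why transient object presence is hard: a single video label gives no indication of which frames, or which pixels, actually contain the object.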