Pixel-Wise Recognition for Holistic Surgical Scene Understanding (2401.11174v3)
Abstract: This paper presents the Holistic and Multi-Granular Surgical Scene Understanding of Prostatectomies (GraSP) dataset, a curated benchmark that models surgical scene understanding as a hierarchy of complementary tasks with varying levels of granularity. Our approach encompasses long-term tasks, such as surgical phase and step recognition, and short-term tasks, including surgical instrument segmentation and atomic visual action detection. To exploit our proposed benchmark, we introduce the Transformers for Actions, Phases, Steps, and Instrument Segmentation (TAPIS) model, a general architecture that combines a global video feature extractor with localized region proposals from an instrument segmentation model to tackle the multiple levels of granularity in our benchmark. Through extensive experimentation on our benchmark and alternative benchmarks, we demonstrate TAPIS's versatility and state-of-the-art performance across different tasks. This work represents a foundational step forward in Endoscopic Vision, offering a novel framework for future research towards holistic surgical scene understanding.
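The abstract describes TAPIS only at a high level, so the following PyTorch sketch illustrates one way such an architecture could be wired: a global video feature extractor provides a clip-level embedding for the long-term tasks (phases, steps), while per-instrument region embeddings from a segmentation model attend to that global context and feed the short-term heads (instruments, atomic actions). This is a minimal sketch under our own assumptions; the module names, dimensions, class counts, and the attention-based fusion are illustrative stand-ins, not the authors' implementation.

```python
# Minimal TAPIS-style sketch (illustrative, not the authors' code): fuse a
# global clip embedding with per-instrument region embeddings and route them
# to frame-level and region-level task heads.
import torch
import torch.nn as nn


class TAPISSketch(nn.Module):
    def __init__(self, video_dim=768, region_dim=256, hidden_dim=256,
                 num_phases=10, num_steps=20, num_instruments=7, num_actions=14):
        # Class counts are placeholders, not the dataset's actual label sets.
        super().__init__()
        # Stand-in for a video transformer (e.g., an MViT-style backbone) that
        # yields one global embedding per clip; a real model ingests raw frames.
        self.video_encoder = nn.Sequential(nn.Linear(video_dim, hidden_dim),
                                           nn.GELU())
        # Projection for region embeddings coming from an instrument
        # segmentation model (e.g., mask-classification queries).
        self.region_proj = nn.Linear(region_dim, hidden_dim)
        # Regions attend to the global clip context (assumed fusion scheme).
        self.fusion = nn.MultiheadAttention(hidden_dim, num_heads=8,
                                            batch_first=True)
        # Clip-level heads for the long-term tasks.
        self.phase_head = nn.Linear(hidden_dim, num_phases)
        self.step_head = nn.Linear(hidden_dim, num_steps)
        # Region-level heads for the short-term tasks.
        self.instrument_head = nn.Linear(hidden_dim, num_instruments)
        self.action_head = nn.Linear(hidden_dim, num_actions)

    def forward(self, clip_feat, region_feats):
        # clip_feat: (B, video_dim) global clip embedding.
        # region_feats: (B, R, region_dim) per-instrument region embeddings.
        g = self.video_encoder(clip_feat)                  # (B, H)
        r = self.region_proj(region_feats)                 # (B, R, H)
        fused, _ = self.fusion(query=r, key=g.unsqueeze(1),
                               value=g.unsqueeze(1))       # (B, R, H)
        return {
            "phases": self.phase_head(g),                  # (B, num_phases)
            "steps": self.step_head(g),                    # (B, num_steps)
            "instruments": self.instrument_head(fused),    # (B, R, num_instruments)
            "actions": self.action_head(fused),            # (B, R, num_actions)
        }


if __name__ == "__main__":
    model = TAPISSketch()
    out = model(torch.randn(2, 768), torch.randn(2, 5, 256))
    print({k: tuple(v.shape) for k, v in out.items()})
```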