SAR-RARP50: Segmentation of surgical instrumentation and Action Recognition on Robot-Assisted Radical Prostatectomy Challenge (2401.00496v2)
Abstract: Surgical tool segmentation and action recognition are fundamental building blocks in many computer-assisted intervention applications, ranging from surgical skills assessment to decision support systems. Learning-based action recognition and segmentation approaches now outperform classical methods, but they rely on large annotated datasets. Furthermore, action recognition and tool segmentation algorithms are often trained, and make predictions, in isolation from each other, without exploiting potential cross-task relationships. With the EndoVis 2022 SAR-RARP50 challenge, we release the first multimodal, publicly available, in-vivo dataset for surgical action recognition and semantic instrumentation segmentation, containing 50 suturing video segments of Robot-Assisted Radical Prostatectomy (RARP). The aim of the challenge is twofold: first, to enable researchers to leverage the scale of the provided dataset and develop robust and highly accurate single-task action recognition and tool segmentation approaches in the surgical domain; second, to further explore the potential of multitask-based learning approaches and determine their comparative advantage over their single-task counterparts. A total of 12 teams participated in the challenge, contributing 7 action recognition methods, 9 instrument segmentation techniques, and 4 multitask approaches that integrated both action recognition and instrument segmentation. The complete SAR-RARP50 dataset is available at: https://rdr.ucl.ac.uk/projects/SARRARP50_Segmentation_of_surgical_instrumentation_and_Action_Recognition_on_Robot-Assisted_Radical_Prostatectomy_Challenge/191091