Instrument-tissue Interaction Detection Framework for Surgical Video Understanding (2404.00322v1)

Published 30 Mar 2024 in cs.CV

Abstract: The instrument-tissue interaction detection task, which helps in understanding surgical activities, is vital for constructing computer-assisted surgery systems, yet it poses many challenges. First, most models represent instrument-tissue interaction in a coarse-grained way that focuses only on classification and cannot automatically detect instruments and tissues. Second, existing works do not fully consider the intra- and inter-frame relations between instruments and tissues. In this paper, we propose to represent an instrument-tissue interaction as an <instrument class, instrument bounding box, tissue class, tissue bounding box, action class> quintuple and present an Instrument-Tissue Interaction Detection Network (ITIDNet) to detect such quintuples for surgical video understanding. Specifically, we propose a Snippet Consecutive Feature (SCF) Layer that enhances features by modeling relationships among proposals in the current frame using global context information from the video snippet. We also propose a Spatial Corresponding Attention (SCA) Layer to incorporate features of proposals between adjacent frames through spatial encoding. To reason about relationships between instruments and tissues, a Temporal Graph (TG) Layer is proposed, with intra-frame connections that exploit relationships between instruments and tissues in the same frame and inter-frame connections that model temporal information for the same instance. For evaluation, we build a cataract surgery video dataset (PhacoQ) and a cholecystectomy surgery video dataset (CholecQ). Experimental results demonstrate the promising performance of our model, which outperforms other state-of-the-art models on both datasets.
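
To make the representation concrete, below is a minimal Python sketch of the quintuple output and of the TG Layer's two connection rules (intra-frame instrument-tissue edges, inter-frame same-instance edges). All names, field types, the (x1, y1, x2, y2) box convention, and the toy edge-building function are illustrative assumptions, not the authors' implementation; instance tracking across frames is assumed to be given.

```python
from dataclasses import dataclass
from typing import List, Tuple

# --- Quintuple representation (illustrative schema, not the paper's) ----
BBox = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) in pixels

@dataclass(frozen=True)
class InteractionQuintuple:
    instrument_class: str
    instrument_bbox: BBox
    tissue_class: str
    tissue_bbox: BBox
    action_class: str

# --- TG Layer connectivity, in miniature --------------------------------
# Nodes stand for detected instrument/tissue instances per frame. The two
# edge rules follow the abstract: intra-frame edges link instruments and
# tissues within the same frame; inter-frame edges link the same instance
# across adjacent frames. Identifying "the same instance" is assumed given.

@dataclass(frozen=True)
class Node:
    frame: int
    instance_id: int      # stable across frames for one instrument/tissue
    is_instrument: bool

def build_tg_edges(nodes: List[Node]) -> List[Tuple[Node, Node]]:
    edges = []
    for a in nodes:
        for b in nodes:
            if a is b:
                continue
            same_frame = a.frame == b.frame
            adjacent = abs(a.frame - b.frame) == 1
            # Intra-frame: connect instrument nodes to tissue nodes.
            if same_frame and a.is_instrument != b.is_instrument:
                edges.append((a, b))
            # Inter-frame: connect the same instance in adjacent frames.
            elif adjacent and a.instance_id == b.instance_id:
                edges.append((a, b))
    return edges

# Two frames, each with one instrument (id 0) and one tissue (id 1):
nodes = [Node(f, i, i == 0) for f in range(2) for i in range(2)]
print(len(build_tg_edges(nodes)))  # 8 directed edges: 4 intra-, 4 inter-frame
```

In the actual model these edges would carry learned messages between proposal features; the sketch only shows which pairs the abstract says are connected.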
