Multimodal Transformers for Real-Time Surgical Activity Prediction (2403.06705v1)
Abstract: Real-time recognition and prediction of surgical activities are fundamental to advancing safety and autonomy in robot-assisted surgery. This paper presents a multimodal transformer architecture for real-time recognition and prediction of surgical gestures and trajectories based on short segments of kinematic and video data. We conduct an ablation study to evaluate the impact of fusing different input modalities and their representations on gesture recognition and prediction performance. We perform an end-to-end assessment of the proposed architecture using the JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS) dataset. Our model outperforms the state of the art (SOTA), achieving 89.5% accuracy for gesture prediction through effective fusion of kinematic features with spatial and contextual video features. It achieves real-time performance, processing a 1-second input window in 1.1–1.3 ms, thanks to a computationally efficient model design.
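As a rough illustration of the fusion the abstract describes, the PyTorch sketch below encodes a 1-second window of kinematic and per-frame video features with a shared transformer encoder and decodes it into gesture and trajectory predictions. The 76-D kinematics, 30 Hz sampling rate, and 15 gesture classes follow JIGSAWS; the 512-D video features, additive fusion, and head designs are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch of a multimodal fusion transformer for surgical gesture and
# trajectory prediction. Dimensions and the fusion strategy are assumptions.
import torch
import torch.nn as nn

class MultimodalGestureTransformer(nn.Module):
    def __init__(self, kin_dim=76, vid_dim=512, d_model=128,
                 n_heads=4, n_layers=2, n_gestures=15):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.kin_proj = nn.Linear(kin_dim, d_model)
        self.vid_proj = nn.Linear(vid_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.gesture_head = nn.Linear(d_model, n_gestures)  # gesture class logits
        self.traj_head = nn.Linear(d_model, kin_dim)        # per-step future kinematics

    def forward(self, kin, vid):
        # kin: (batch, T, kin_dim) kinematic window; vid: (batch, T, vid_dim)
        # per-frame video features, e.g. embeddings from a pretrained CNN.
        fused = self.kin_proj(kin) + self.vid_proj(vid)    # simple additive fusion
        h = self.encoder(fused)                            # (batch, T, d_model)
        gesture_logits = self.gesture_head(h.mean(dim=1))  # pool over time
        trajectory = self.traj_head(h)                     # one prediction per step
        return gesture_logits, trajectory

# Usage on a 1-second window: JIGSAWS kinematics are 76-D at 30 Hz.
model = MultimodalGestureTransformer()
kin = torch.randn(2, 30, 76)
vid = torch.randn(2, 30, 512)  # assumed per-frame feature size
logits, traj = model(kin, vid)
print(logits.shape, traj.shape)  # torch.Size([2, 15]) torch.Size([2, 30, 76])
```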
- A. Kumar, N. Yadav, S. Singh, and N. Chauhan, “Minimally invasive (endoscopic-computer assisted) surgery: Technique and review,” Annals of Maxillofacial Surgery, vol. 6, no. 2, p. 159, 2016.
- J. Finkelstein, E. Eckersberger, H. Sadri, S. S. Taneja, H. Lepor, and B. Djavan, “Open versus laparoscopic versus robot-assisted laparoscopic prostatectomy: The European and US experience,” Reviews in Urology, vol. 12, no. 1, p. 35, 2010.
- S. Bonne, W. Panitch, K. Dharmarajan, K. Srinivas, J.-L. Kincade, T. Low, B. Knoth, C. Cowan, D. Fer, B. Thananjeyan, J. Kerr, J. Ichnowski, and K. Goldberg, “A digital twin framework for telesurgery in the presence of varying network quality of service,” in 2022 IEEE 18th International Conference on Automation Science and Engineering (CASE). IEEE, 2022, pp. 1325–1332.
- G. Gonzalez, M. Balakuntala, M. Agarwal, T. Low, B. Knoth, A. W. Kirkpatrick, J. McKee, G. Hager, V. Aggarwal, Y. Xue et al., “Asap: A semi-autonomous precise system for telesurgery during communication delays,” IEEE Transactions on Medical Robotics and Bionics, vol. 5, no. 1, pp. 66–78, 2023.
- J. Han, J. Davids, H. Ashrafian, A. Darzi, D. S. Elson, and M. Sodergren, “A systematic review of robotic surgery: From supervised paradigms to fully autonomous robotic approaches,” The International Journal of Medical Robotics and Computer Assisted Surgery, vol. 18, no. 2, p. e2358, 2022.
- L. Maier-Hein, S. S. Vedula, S. Speidel, N. Navab, R. Kikinis, A. Park, M. Eisenmann, H. Feussner, G. Forestier, S. Giannarou, M. Hashizume, D. Katic, H. Kenngott, M. Kranzfelder, A. Malpani, K. März, T. Neumuth, N. Padoy, C. Pugh, N. Schoch, D. Stoyanov, R. Taylor, M. Wagner, G. D. Hager, and P. Jannin, “Surgical data science for next-generation interventions,” Nature Biomedical Engineering, vol. 1, no. 9, pp. 691–696, Sep. 2017. [Online]. Available: https://doi.org/10.1038/s41551-017-0132-7
- M. J. Fard, S. Ameri, R. Darin Ellis, R. B. Chinnam, A. K. Pandya, and M. D. Klein, “Automated robot-assisted surgical skill evaluation: Predictive analytics approach,” The International Journal of Medical Robotics and Computer Assisted Surgery, vol. 14, no. 1, p. e1850, 2018.
- C. E. Reiley, E. Plaku, and G. D. Hager, “Motion generation of robotic surgical tasks: Learning from expert demonstrations,” in 2010 Annual International Conference of the IEEE Engineering in Medicine and Biology. IEEE, 2010, pp. 967–970.
- A. Zia and I. Essa, “Automated surgical skill assessment in RMIS training,” International Journal of Computer Assisted Radiology and Surgery, vol. 13, pp. 731–739, 2018.
- M. S. Yasar, D. Evans, and H. Alemzadeh, “Context-aware monitoring in robotic surgery,” in 2019 International Symposium on Medical Robotics (ISMR). IEEE, 2019, pp. 1–7.
- M. S. Yasar and H. Alemzadeh, “Real-time context-aware detection of unsafe events in robot-assisted surgery,” in 2020 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). IEEE, 2020, pp. 385–397.
- Z. Li, K. Hutchinson, and H. Alemzadeh, “Runtime detection of executional errors in robot-assisted surgery,” in 2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022, pp. 3850–3856.
- B. van Amsterdam, M. J. Clarkson, and D. Stoyanov, “Gesture recognition in robotic surgery: a review,” IEEE Transactions on Biomedical Engineering, vol. 68, no. 6, 2021.
- K. Hutchinson, I. Reyes, Z. Li, and H. Alemzadeh, “Evaluating the task generalization of temporal convolutional networks for surgical gesture and motion recognition using kinematic data,” IEEE Robotics and Automation Letters, 2023.
- C. Shi, Y. Zheng, and A. M. Fey, “Recognition and prediction of surgical gestures and trajectories using transformer models in robot-assisted surgery,” in 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2022, pp. 8017–8024.
- Y. A. Farha and J. Gall, “MS-TCN: Multi-stage temporal convolutional network for action segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 3575–3584.
- R. DiPietro, N. Ahmidi, A. Malpani, M. Waldram, G. I. Lee, M. R. Lee, S. S. Vedula, and G. D. Hager, “Segmenting and classifying activities in robot-assisted surgery with recurrent neural networks,” International Journal of Computer Assisted Radiology and Surgery, vol. 14, no. 11, pp. 2005–2020, 2019.
- G. Menegozzo, D. Dall’Alba, C. Zandona, and P. Fiorini, “Surgical gesture recognition with time delay neural network based on kinematic data,” in 2019 International Symposium on Medical Robotics (ISMR). IEEE, 2019, pp. 1–7.
- B. van Amsterdam, M. J. Clarkson, and D. Stoyanov, “Multi-task recurrent neural network for surgical gesture recognition and progress prediction,” in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 1380–1386.
- C. Lea, R. Vidal, A. Reiter, and G. D. Hager, “Temporal convolutional networks: A unified approach to action segmentation,” in Computer Vision–ECCV 2016 Workshops, Amsterdam, The Netherlands, October 8-10 and 15-16, 2016, Proceedings, Part III. Springer, 2016, pp. 47–54.
- J. Zhang, Y. Nie, Y. Lyu, H. Li, J. Chang, X. Yang, and J. J. Zhang, “Symmetric dilated convolution for surgical gesture recognition,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2020: 23rd International Conference, Lima, Peru, October 4–8, 2020, Proceedings, Part III. Springer, 2020, pp. 409–418.
- I. Funke, S. Bodenstedt, F. Oehme, F. von Bechtolsheim, J. Weitz, and S. Speidel, “Using 3D convolutional neural networks to learn spatiotemporal features for automatic surgical gesture recognition in video,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2019, pp. 467–475.
- D. Sarikaya and P. Jannin, “Surgical gesture recognition with optical flow only,” arXiv preprint arXiv:1904.01143, 2019.
- Y. Qin, S. A. Pedram, S. Feyzabadi, M. Allan, A. J. McLeod, J. W. Burdick, and M. Azizian, “Temporal segmentation of surgical sub-tasks through deep learning with multiple data sources,” in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 371–377.
- B. Van Amsterdam, I. Funke, E. Edwards, S. Speidel, J. Collins, A. Sridhar, J. Kelly, M. J. Clarkson, and D. Stoyanov, “Gesture recognition in robotic surgery with multimodal attention,” IEEE Transactions on Medical Imaging, vol. 41, no. 7, pp. 1677–1687, 2022.
- F. Despinoy, D. Bouget, G. Forestier, C. Penet, N. Zemiti, P. Poignet, and P. Jannin, “Unsupervised trajectory segmentation for surgical gesture recognition in robotic training,” IEEE Transactions on Biomedical Engineering, vol. 63, no. 6, pp. 1280–1291, 2015.
- M. Hwang, J. Ichnowski, B. Thananjeyan, D. Seita, S. Paradis, D. Fer, T. Low, and K. Goldberg, “Automating surgical peg transfer: Calibration with deep learning can exceed speed, accuracy, and consistency of humans,” IEEE Transactions on Automation Science and Engineering, vol. 20, no. 2, pp. 909–922, Apr. 2023. [Online]. Available: https://doi.org/10.1109/tase.2022.3171795
- M. Ginesi, D. Meli, A. Roberti, N. Sansonetto, and P. Fiorini, “Autonomous task planning and situation awareness in robotic surgery,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 3144–3150.
- K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
- C. Lea, A. Reiter, R. Vidal, and G. D. Hager, “Segmental spatiotemporal CNNs for fine-grained action segmentation,” in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III. Springer, 2016, pp. 36–52.
- K. Hutchinson, Z. Li, I. Reyes, and H. Alemzadeh, “Towards surgical context inference and translation to gestures,” in 2023 IEEE International Conference on Robotics and Automation (ICRA), 2023, pp. 6802–6809.
- D. Neumuth, F. Loebe, H. Herre, and T. Neumuth, “Modeling surgical processes: A four-level translational approach,” Artificial Intelligence in Medicine, vol. 51, no. 3, pp. 147–161, 2011.
- K. Hutchinson, I. Reyes, Z. Li, and H. Alemzadeh, “COMPASS: A formal framework and aggregate dataset for generalized surgical procedure modeling,” International Journal of Computer Assisted Radiology and Surgery, pp. 1–12, 2023.
- Z. Li, I. Reyes, and H. Alemzadeh, “Robotic scene segmentation with memory network for runtime surgical context inference,” arXiv preprint arXiv:2308.12789, 2023.
- Y. Gao, S. S. Vedula, C. E. Reiley, N. Ahmidi, B. Varadarajan, H. C. Lin, L. Tao, L. Zappella, B. Béjar, D. D. Yuh et al., “JHU-ISI gesture and skill assessment working set (JIGSAWS): A surgical activity dataset for human motion modeling,” in MICCAI Workshop: M2CAI, vol. 3, no. 3, 2014.
- S. DiMaio, M. Hanuschik, and U. Kreaden, “The da Vinci surgical system,” Surgical Robotics: Systems Applications and Visions, pp. 199–217, 2011.
- Y. Qin, S. F. Feyzabadi, M. Allan, J. W. Burdick, and M. Azizian, “daVinciNet: Joint prediction of motion and surgical state in robot-assisted surgery,” 2020.
- T. E. Murphy, “Towards objective surgical skill evaluation with hidden Markov model-based motion recognition,” 2004.
- I. Gurcan and H. Van Nguyen, “Surgical activities recognition using multi-scale recurrent networks,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 2887–2891.
- C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager, “Temporal convolutional networks for action segmentation and detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 156–165.
- Z. Wang, B. Zi, H. Ding, W. You, and L. Yu, “Hybrid grey prediction model-based autotracking algorithm for the laparoscopic visual window of surgical robot,” Mechanism and Machine Theory, vol. 123, pp. 107–123, 2018. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0094114X18300107
- Y. Sun, B. Pan, Y. Fu, and G. Niu, “Visual-based autonomous field of view control of laparoscope with safety-RCM constraints for semi-autonomous surgery,” The International Journal of Medical Robotics and Computer Assisted Surgery, vol. 16, no. 2, Feb. 2020. [Online]. Available: https://doi.org/10.1002/rcs.2079
- C. Staub, C. Lenz, G. Panin, A. Knoll, and R. Bauernschmitt, “Contour-based surgical instrument tracking supported by kinematic prediction,” in 2010 3rd IEEE RAS & EMBS International Conference on Biomedical Robotics and Biomechatronics, 2010, pp. 746–752.
- M. M. Rahman, M. V. Balakuntala, G. Gonzalez, M. Agarwal, U. Kaur, V. L. N. Venkatesh, N. Sanchez-Tamayo, Y. Xue, R. M. Voyles, V. Aggarwal, and J. Wachs, “SARTRES: A semi-autonomous robot teleoperation environment for surgery,” Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, vol. 9, no. 4, pp. 376–383, 2021.
- S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
- Q. Wen, T. Zhou, C. Zhang, W. Chen, Z. Ma, J. Yan, and L. Sun, “Transformers in time series: A survey,” arXiv preprint arXiv:2202.07125, 2022.
- G. Zerveas, S. Jayaraman, D. Patel, A. Bhamidipaty, and C. Eickhoff, “A transformer-based framework for multivariate time series representation learning,” in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021, pp. 2114–2124.
- M. Liu, S. Ren, S. Ma, J. Jiao, Y. Chen, Z. Wang, and W. Song, “Gated transformer networks for multivariate time series classification,” arXiv preprint arXiv:2103.14438, 2021.
- J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
- H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, and W. Zhang, “Informer: Beyond efficient transformer for long sequence time-series forecasting,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 12, 2021, pp. 11106–11115.
- A. Radford, K. Narasimhan, T. Salimans, I. Sutskever et al., “Improving language understanding by generative pre-training,” 2018.
- K. Pearson, “LIII. On lines and planes of closest fit to systems of points in space,” The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, vol. 2, no. 11, pp. 559–572, 1901.
- K. Patroumpas and T. Sellis, “Window specification over data streams,” in International Conference on Extending Database Technology. Springer, 2006, pp. 445–464.
- N. Ahmidi, L. Tao, S. Sefati, Y. Gao, C. Lea, B. B. Haro, L. Zappella, S. Khudanpur, R. Vidal, and G. D. Hager, “A dataset and benchmarks for segmentation and recognition of gestures in robotic surgery,” IEEE Transactions on Biomedical Engineering, vol. 64, no. 9, pp. 2025–2041, 2017.
- D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
- L. Yujian and L. Bo, “A normalized Levenshtein distance metric,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 6, pp. 1091–1095, 2007.
- G. Quellec, K. Charrière, M. Lamard, Z. Droueche, C. Roux, B. Cochener, and G. Cazuguel, “Real-time recognition of surgical tasks in eye surgery videos,” Medical Image Analysis, vol. 18, no. 3, pp. 579–590, 2014.
- SAGES FLS Committee. (2019) Fundamentals of laparoscopic surgery. [Online]. Available: https://www.flsprogram.org/technical-skills-training-curriculum/
Authors: Keshara Weerasinghe, Seyed Hamid Reza Roodabeh, Kay Hutchinson, Homa Alemzadeh