LoViT: Long Video Transformer for Surgical Phase Recognition (2305.08989v3)
Abstract: Online surgical phase recognition plays a significant role in building contextual tools that quantify performance and oversee the execution of surgical workflows. Current approaches are limited in two ways: they train spatial feature extractors with frame-level supervision, which can lead to incorrect predictions because similar frames appear in different phases, and they fuse local and global features poorly owing to computational constraints, which hampers the analysis of the long videos commonly encountered in surgical interventions. In this paper, we present a two-stage method, called Long Video Transformer (LoViT), for fusing short- and long-term temporal information. LoViT combines a temporally rich spatial feature extractor with a multi-scale temporal aggregator consisting of two cascaded L-Trans modules based on self-attention, followed by a G-Informer module based on ProbSparse self-attention for processing global temporal information. A multi-scale temporal head then combines local and global features and classifies surgical phases using phase transition-aware supervision. Our approach consistently outperforms state-of-the-art methods on the Cholec80 and AutoLaparo datasets. Compared to Trans-SVNet, LoViT improves video-level accuracy by 2.4 pp (percentage points) on Cholec80 and by 3.1 pp on AutoLaparo, and improves phase-level Jaccard by 5.3 pp on AutoLaparo and by 1.55 pp on Cholec80. These results demonstrate the effectiveness of our approach in achieving state-of-the-art surgical phase recognition on two datasets with different surgical procedures and temporal sequencing characteristics, whilst introducing mechanisms that cope with long videos.
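The G-Informer module relies on ProbSparse self-attention, which lets only the most "active" queries attend over the full sequence, reducing the cost of attending over long videos. The sketch below is a minimal NumPy illustration of that idea, not the paper's implementation: the function name `probsparse_attention` is hypothetical, and for clarity it computes the full score matrix to rank queries, whereas Informer estimates the sparsity measurement from a sampled subset of keys to reach O(L log L) complexity.

```python
import numpy as np

def probsparse_attention(Q, K, V, u):
    """ProbSparse self-attention sketch (hypothetical helper):
    only the top-u most active queries attend over all keys;
    the remaining "lazy" queries fall back to the mean of V,
    as in Informer's default self-attention distilling setup."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (n_q, n_k) scaled dot products
    # Sparsity measurement M(q, K): max score minus mean score per query.
    # A near-uniform attention row (lazy query) has M close to zero.
    M = scores.max(axis=1) - scores.mean(axis=1)
    top = np.argsort(M)[-u:]                         # indices of the u most active queries
    # Lazy queries: output is simply the mean of the values.
    out = np.tile(V.mean(axis=0), (Q.shape[0], 1))
    # Active queries: ordinary softmax attention over all keys.
    s = scores[top]
    w = np.exp(s - s.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    out[top] = w @ V
    return out
```

In the actual Informer formulation, M is approximated from randomly sampled key dot products so the full score matrix is never materialized; the exact computation above is only to keep the example short and self-contained.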
- L. Maier-Hein, M. Eisenmann, and S. Speidel, “Surgical data science - from concepts to clinical translation,” CoRR, vol. abs/2011.02284, 2020.
- T. Vercauteren, M. Unberath, N. Padoy, and N. Navab, “CAI4CAI: the rise of contextual artificial intelligence in computer-assisted interventions,” Proc. IEEE, vol. 108, no. 1, pp. 198–214, 2020.
- C. R. Garrow, K.-F. Kowalewski, L. Li, M. Wagner, M. W. Schmidt, S. Engelhardt, D. A. Hashimoto, H. G. Kenngott, S. Bodenstedt, S. Speidel et al., “Machine learning for surgical phase recognition: a systematic review,” Annals of surgery, vol. 273, no. 4, pp. 684–693, 2021.
- G. Quellec, M. Lamard, B. Cochener, and G. Cazuguel, “Real-time task recognition in cataract surgery videos using adaptive spatiotemporal polynomials,” IEEE Trans. Medical Imaging, vol. 34, no. 4, pp. 877–887, 2015.
- O. Dergachyova, D. Bouget, A. Huaulmé, X. Morandi, and P. Jannin, “Automatic data-driven real-time segmentation and recognition of surgical workflow,” Int. J. Comput. Assist. Radiol. Surg., vol. 11, no. 6, pp. 1081–1089, 2016.
- T. Blum, H. Feußner, and N. Navab, “Modeling and segmentation of surgical workflow from laparoscopic video,” in Medical Image Computing and Computer-Assisted Intervention - MICCAI 2010, 13th International Conference, Beijing, China, September 20-24, 2010, Proceedings, Part III, 2010, pp. 400–407.
- N. Padoy, T. Blum, S. Ahmadi, H. Feußner, M. Berger, and N. Navab, “Statistical modeling and recognition of surgical workflow,” Medical Image Anal., vol. 16, no. 3, pp. 632–641, 2012.
- H. Sakoe and S. Chiba, “Dynamic programming algorithm optimization for spoken word recognition,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 26, no. 1, pp. 43–49, 1978.
- L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proc. IEEE, vol. 77, no. 2, pp. 257–286, 1989.
- J. E. Bardram, A. Doryab, R. M. Jensen, P. M. Lange, K. L. G. Nielsen, and S. T. Petersen, “Phase recognition during surgical procedures using embedded and body-worn sensors,” in Ninth Annual IEEE International Conference on Pervasive Computing and Communications, PerCom 2011, 21-25 March 2011, Seattle, WA, USA, Proceedings, 2011, pp. 45–53.
- M. S. Holden et al., “Feasibility of real-time workflow segmentation for tracked needle interventions,” IEEE Trans. Biomed. Eng., vol. 61, no. 6, pp. 1720–1728, 2014.
- Y. Jin et al., “SV-RCNet: Workflow recognition from surgical videos using recurrent convolutional network,” IEEE Trans. Medical Imaging, vol. 37, no. 5, pp. 1114–1126, 2018.
- X. Gao, Y. Jin, Y. Long, Q. Dou, and P. Heng, “Trans-SVNet: Accurate phase recognition from surgical videos via hybrid embedding aggregation transformer,” in Medical Image Computing and Computer Assisted Intervention - MICCAI 2021 - 24th International Conference, Strasbourg, France, September 27 - October 1, 2021, Proceedings, Part IV, 2021, pp. 593–603.
- A. P. Twinanda, S. Shehata, D. Mutter, J. Marescaux, M. de Mathelin, and N. Padoy, “EndoNet: A deep architecture for recognition tasks on laparoscopic videos,” IEEE Trans. Medical Imaging, vol. 36, no. 1, pp. 86–97, 2017.
- Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
- A. P. Twinanda, “Vision-based approaches for surgical activity recognition using laparoscopic and multi-view RGBD videos,” Ph.D. dissertation, University of Strasbourg, France, 2017.
- S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
- Y. Jin et al., “Multi-task recurrent convolutional network with correlation loss for surgical video analysis,” Medical Image Anal., vol. 59, 2020.
- K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016, pp. 770–778.
- F. Yi and T. Jiang, “Hard frame detection and online mapping for surgical phase recognition,” in Medical Image Computing and Computer Assisted Intervention - MICCAI 2019 - 22nd International Conference, Shenzhen, China, October 13-17, 2019, Proceedings, Part V, 2019, pp. 449–457.
- X. Gao, Y. Jin, Q. Dou, and P. Heng, “Automatic gesture recognition in robot-assisted surgery with reinforcement learning and tree search,” in 2020 IEEE International Conference on Robotics and Automation, ICRA 2020, Paris, France, May 31 - August 31, 2020, 2020, pp. 8440–8446.
- T. Czempiel et al., “TeCNO: Surgical phase recognition with multi-stage temporal convolutional networks,” in Medical Image Computing and Computer Assisted Intervention - MICCAI 2020 - 23rd International Conference, Lima, Peru, October 4-8, 2020, Proceedings, Part III, 2020, pp. 343–352.
- Y. Jin, Y. Long, C. Chen, Z. Zhao, Q. Dou, and P. Heng, “Temporal memory relation network for workflow recognition from surgical video,” IEEE Trans. Medical Imaging, vol. 40, no. 7, pp. 1911–1923, 2021.
- C. Lea, R. Vidal, A. Reiter, and G. D. Hager, “Temporal convolutional networks: A unified approach to action segmentation,” in Computer Vision - ECCV 2016 Workshops - Amsterdam, The Netherlands, October 8-10 and 15-16, 2016, Proceedings, Part III, 2016, pp. 47–54.
- Y. A. Farha and J. Gall, “MS-TCN: multi-stage temporal convolutional network for action segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, 2019, pp. 3575–3584.
- A. Vaswani et al., “Attention is all you need,” in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, 2017, pp. 5998–6008.
- A. Dosovitskiy et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021, 2021.
- G. Bertasius, H. Wang, and L. Torresani, “Is space-time attention all you need for video understanding?” in Proceedings of the International Conference on Machine Learning (ICML), July 2021.
- A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lucic, and C. Schmid, “ViViT: A video vision transformer,” in 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, 2021, pp. 6816–6826.
- R. Girdhar and K. Grauman, “Anticipative video transformer,” in 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, 2021, pp. 13485–13495.
- D. Damen, H. Doughty, G. M. Farinella, A. Furnari, J. Ma, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, and M. Wray, “Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-KITCHENS-100,” International Journal of Computer Vision (IJCV), vol. 130, pp. 33–55, 2022.
- T. Czempiel, M. Paschali, D. Ostler, S. T. Kim, B. Busam, and N. Navab, “OperA: Attention-regularized transformers for surgical phase recognition,” in Medical Image Computing and Computer Assisted Intervention - MICCAI 2021 - 24th International Conference, Strasbourg, France, September 27 - October 1, 2021, Proceedings, Part IV, 2021, pp. 604–614.
- H. Zhou et al., “Informer: Beyond efficient transformer for long sequence time-series forecasting,” in Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, 2021, pp. 11106–11115.
- Z. Wang, B. Lu, Y. Long, F. Zhong, T.-H. Cheung, Q. Dou, and Y. Liu, “AutoLaparo: A new dataset of integrated multi-tasks for image-guided surgical automation in laparoscopic hysterectomy,” in Medical Image Computing and Computer Assisted Intervention - MICCAI 2022, 2022.
- A. Paszke et al., “PyTorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, 2019, pp. 8024–8035.
- J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20-25 June 2009, Miami, Florida, USA, 2009, pp. 248–255.
- P. Goyal et al., “Accurate, large minibatch SGD: training ImageNet in 1 hour,” CoRR, vol. abs/1706.02677, 2017.
- A. P. Twinanda, D. Mutter, J. Marescaux, M. de Mathelin, and N. Padoy, “Single- and multi-task architectures for surgical workflow challenge at M2CAI 2016,” CoRR, vol. abs/1610.08844, 2016.
- H. Hotelling, “Analysis of a complex of statistical variables into principal components.” Journal of educational psychology, vol. 24, no. 6, p. 417, 1933.