
Friends Across Time: Multi-Scale Action Segmentation Transformer for Surgical Phase Recognition (2401.11644v1)

Published 22 Jan 2024 in cs.CV and cs.RO

Abstract: Automatic surgical phase recognition is a core technology for modern operating rooms and online surgical video assessment platforms. Current state-of-the-art methods use both spatial and temporal information to tackle the surgical phase recognition task. Building on this idea, we propose the Multi-Scale Action Segmentation Transformer (MS-AST) for offline surgical phase recognition and the Multi-Scale Action Segmentation Causal Transformer (MS-ASCT) for online surgical phase recognition. We use ResNet50 or EfficientNetV2-M for spatial feature extraction. MS-AST and MS-ASCT model temporal information at different scales with multi-scale temporal self-attention and multi-scale temporal cross-attention, which enhance the capture of temporal relationships between frames and segments. Our method achieves 95.26% and 96.15% accuracy on the Cholec80 dataset for online and offline surgical phase recognition, respectively, setting new state-of-the-art results. It also achieves state-of-the-art results on non-medical datasets in the video action segmentation domain.
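The abstract names the core mechanism but the page carries no implementation details, so below is a minimal PyTorch sketch of one plausible reading of multi-scale temporal self-attention: per-frame backbone features are pooled to several temporal scales, attended at each scale, upsampled, and fused. The class name, the `scales` hyperparameter, and the pool-attend-upsample fusion are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch of multi-scale temporal self-attention; names and
# hyperparameters are assumptions, not the MS-AST reference implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleTemporalSelfAttention(nn.Module):
    """Self-attention over frame features at several temporal scales.

    For each scale s, the (T, D) sequence is average-pooled to ~T/s steps,
    attended, upsampled back to T, and the per-scale outputs are fused.
    """

    def __init__(self, dim: int, num_heads: int = 4, scales=(1, 2, 4, 8)):
        super().__init__()
        self.scales = scales
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in scales
        )
        self.fuse = nn.Linear(dim * len(scales), dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, dim) per-frame features, e.g. from a ResNet50 backbone.
        B, T, D = x.shape
        outs = []
        for s, attn in zip(self.scales, self.attn):
            if s > 1:
                # Coarsen the sequence: average-pool time down to ~T/s steps.
                xs = F.avg_pool1d(x.transpose(1, 2), kernel_size=s,
                                  stride=s, ceil_mode=True).transpose(1, 2)
            else:
                xs = x
            ys, _ = attn(xs, xs, xs)  # self-attention at this temporal scale
            if s > 1:
                # Restore the original temporal length T.
                ys = F.interpolate(ys.transpose(1, 2), size=T, mode="linear",
                                   align_corners=False).transpose(1, 2)
            outs.append(ys)
        # Fuse all scales and add a residual connection.
        return self.norm(x + self.fuse(torch.cat(outs, dim=-1)))

# Usage: a clip of 64 frames with 2048-d ResNet50 features.
feats = torch.randn(2, 64, 2048)
block = MultiScaleTemporalSelfAttention(dim=2048)
print(block(feats).shape)  # torch.Size([2, 64, 2048])
```

For the online MS-ASCT variant, the same block would presumably pass a causal `attn_mask` to `nn.MultiheadAttention` so each frame attends only to earlier frames, and the paper's multi-scale temporal cross-attention would swap in a separate query sequence against the pooled keys and values.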

