Efficient Surgical Tool Recognition via HMM-Stabilized Deep Learning
Abstract: Recognizing various surgical tools, actions and phases from surgery videos is an important problem in computer vision with exciting clinical applications. Existing deep-learning-based methods for this problem either process each surgical video as a series of independent images, ignoring the dependence between frames, or rely on complicated deep learning models to account for that dependence. In this study, exploratory data analysis reveals that surgical videos have a relatively simple semantic structure, in which the presence of surgical phases and tools can be well modeled by a compact hidden Markov model (HMM). Based on this observation, we propose an HMM-stabilized deep learning method for tool presence detection. A wide range of experiments confirm that the proposed approaches achieve better performance with lower training and running costs, and support more flexible ways to construct and use training data in scenarios where not all surgery videos of interest are extensively labelled. These results suggest that popular deep learning approaches with over-complicated model structures may use data inefficiently, and that wisely combining ingredients of deep learning and statistical learning may lead to more powerful algorithms that offer competitive performance, transparent interpretation and convenient model training at the same time.
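To make the idea of HMM stabilization concrete, the following is a minimal sketch, not the authors' implementation. It assumes a single tool, a two-state HMM (absent/present), and per-frame presence probabilities from some image classifier, which it treats as stand-ins for emission likelihoods. The function name `viterbi_smooth` and the parameters `stay_prob` and `prior_present` are hypothetical and chosen only for illustration. The sketch uses Viterbi decoding to replace frame-by-frame thresholding with the most likely presence sequence.

```python
import math


def viterbi_smooth(frame_probs, stay_prob=0.95, prior_present=0.5):
    """Return the most likely 0/1 presence sequence under a 2-state HMM.

    frame_probs:   per-frame P(tool present | image) from any classifier
                   (assumed input, used here as a proxy for emission likelihoods).
    stay_prob:     assumed probability that the state is unchanged between frames.
    prior_present: assumed probability that the tool is present in the first frame.
    """
    if not frame_probs:
        return []
    eps = 1e-9  # keeps log() finite for probabilities of exactly 0 or 1

    # log_trans[r][s] = log P(state s at frame t | state r at frame t-1)
    log_trans = [
        [math.log(stay_prob), math.log(1.0 - stay_prob)],
        [math.log(1.0 - stay_prob), math.log(stay_prob)],
    ]

    def log_emit(p):
        p = min(max(p, eps), 1.0 - eps)
        return [math.log(1.0 - p), math.log(p)]  # [absent, present]

    # Initialization with the first frame.
    e = log_emit(frame_probs[0])
    score = [math.log(1.0 - prior_present) + e[0], math.log(prior_present) + e[1]]
    backptrs = []

    # Forward pass: keep the best predecessor for each state at each frame.
    for p in frame_probs[1:]:
        e = log_emit(p)
        new_score, ptr = [], []
        for s in (0, 1):
            cands = [score[r] + log_trans[r][s] for r in (0, 1)]
            best = 0 if cands[0] >= cands[1] else 1
            new_score.append(cands[best] + e[s])
            ptr.append(best)
        score = new_score
        backptrs.append(ptr)

    # Backtrack from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    noisy = [0.9, 0.9, 0.4, 0.9, 0.9]
    print([int(p > 0.5) for p in noisy])  # independent thresholding
    print(viterbi_smooth(noisy))          # HMM-smoothed sequence
```

In this toy input, independent thresholding flips the third frame to "absent", whereas the smoothed path keeps the tool present throughout because a brief dip does not outweigh the cost of two state changes. A sustained change in the probabilities still moves the decoded state, so the transition probability acts as a tunable trade-off between stability and responsiveness.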