Self-Supervised Learning for Interventional Image Analytics: Towards Robust Device Trackers (2405.01156v1)
Abstract: Accurate detection and tracking of devices such as guiding catheters in live X-ray image acquisitions is an essential prerequisite for endovascular cardiac interventions. This information is leveraged for procedural guidance, e.g., directing stent placements. To ensure procedural safety and efficacy, tracking must be highly robust, with no failures. Achieving this requires efficiently tackling challenges such as device obscuration by contrast agent or by other external devices or wires, changes in field-of-view or acquisition angle, and continuous movement due to cardiac and respiratory motion. To overcome these challenges, we propose a novel approach that learns spatio-temporal features from a very large data cohort of over 16 million interventional X-ray frames using self-supervision on image sequence data. Our approach is based on a masked image modeling technique that leverages frame-interpolation-based reconstruction to learn fine inter-frame temporal correspondences. The features encoded in the resulting model are then fine-tuned downstream. Our approach achieves state-of-the-art performance and, in particular, robustness compared to highly optimized reference solutions (that use multi-stage feature fusion, multi-task and flow regularization). The experiments show that our method achieves a 66.31% reduction in maximum tracking error against reference solutions (23.20% when flow regularization is used), reaching a success score of 97.95% at a 3x faster inference speed of 42 frames per second (on GPU). The results encourage the use of our approach in various other tasks within interventional image analytics that require effective understanding of spatio-temporal semantics.
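To make the pretraining objective concrete, below is a minimal PyTorch sketch of a masked image modeling objective driven by frame interpolation: patches of an intermediate frame are masked and reconstructed from the two neighbouring frames, which forces the encoder to learn inter-frame temporal correspondences. All module names, dimensions, and the masking scheme (`MaskedInterpolationModel`, `PatchEmbed`, the 75% mask ratio, patch size 16) are illustrative assumptions, not the authors' implementation; positional and temporal embeddings are omitted for brevity.

```python
# Minimal sketch of masked image modeling via frame-interpolation reconstruction.
# All names, sizes, and the masking scheme are illustrative assumptions, not the
# paper's implementation. Positional/temporal embeddings are omitted for brevity.
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    """Split a single-channel frame into non-overlapping patches and project to tokens."""
    def __init__(self, img_size=224, patch_size=16, dim=256):
        super().__init__()
        self.proj = nn.Conv2d(1, dim, kernel_size=patch_size, stride=patch_size)
        self.num_patches = (img_size // patch_size) ** 2

    def forward(self, x):                                   # x: (B, 1, H, W)
        return self.proj(x).flatten(2).transpose(1, 2)      # (B, N, dim)


class MaskedInterpolationModel(nn.Module):
    """Reconstruct masked patches of an intermediate frame from its two neighbours."""
    def __init__(self, dim=256, depth=4, heads=8, patch_size=16):
        super().__init__()
        self.embed = PatchEmbed(patch_size=patch_size, dim=dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, patch_size * patch_size)  # predict pixels per patch

    def forward(self, prev_frame, next_frame, mid_frame, mask):
        # Tokenise the two context frames and the (partially masked) middle frame.
        ctx = torch.cat([self.embed(prev_frame), self.embed(next_frame)], dim=1)
        mid = self.embed(mid_frame)
        # Replace masked middle-frame tokens with a learnable mask token.
        mid = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(mid), mid)
        tokens = self.encoder(torch.cat([ctx, mid], dim=1))
        mid_tokens = tokens[:, ctx.shape[1]:]                # keep only middle-frame tokens
        return self.head(mid_tokens)                         # (B, N, patch_size**2)


def pretraining_loss(model, prev_frame, next_frame, mid_frame, mask_ratio=0.75):
    """Reconstruction loss on masked patches of the intermediate frame."""
    B = mid_frame.shape[0]
    N = model.embed.num_patches
    mask = torch.rand(B, N, device=mid_frame.device) < mask_ratio
    pred = model(prev_frame, next_frame, mid_frame, mask)
    # Ground-truth pixel patches of the middle frame, one row per patch.
    target = nn.functional.unfold(mid_frame, kernel_size=16, stride=16).transpose(1, 2)
    # As in masked image modeling, the loss is computed only on masked patches.
    return ((pred - target) ** 2).mean(-1)[mask].mean()
```

Under these assumptions, pretraining would minimize `pretraining_loss` over triplets of consecutive fluoroscopy frames, and the pretrained encoder would then be fine-tuned for the downstream device-tracking task.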