
Self-Supervised Learning for Interventional Image Analytics: Towards Robust Device Trackers (2405.01156v1)

Published 2 May 2024 in cs.CV and cs.AI

Abstract: Accurate detection and tracking of devices such as guiding catheters in live X-ray image acquisitions is an essential prerequisite for endovascular cardiac interventions. This information is leveraged for procedural guidance, e.g., directing stent placements. To ensure procedural safety and efficacy, there is a need for high robustness, i.e., no failures during tracking. Achieving this requires efficiently tackling challenges such as: device obscuration by contrast agent or other external devices or wires, changes in field-of-view or acquisition angle, as well as continuous movement due to cardiac and respiratory motion. To overcome these challenges, we propose a novel approach to learn spatio-temporal features from a very large data cohort of over 16 million interventional X-ray frames using self-supervision for image sequence data. Our approach is based on a masked image modeling technique that leverages frame-interpolation-based reconstruction to learn fine inter-frame temporal correspondences. The features encoded in the resulting model are fine-tuned downstream. Our approach achieves state-of-the-art performance, and in particular robustness, compared to highly optimized reference solutions (which use multi-stage feature fusion as well as multi-task and flow regularization). The experiments show that our method achieves a 66.31% reduction in maximum tracking error against reference solutions (23.20% when flow regularization is used), while achieving a success score of 97.95% at a 3x faster inference speed of 42 frames per second (on GPU). The results encourage the use of our approach in various other tasks within interventional image analytics that require effective understanding of spatio-temporal semantics.
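
The core idea of the pretraining objective, masking patches of an intermediate frame and reconstructing them from neighboring frames, can be illustrated with a minimal sketch. The PyTorch snippet below is an assumption-laden illustration rather than the authors' implementation: the encoder size, patch size, masking ratio, and all module and parameter names are hypothetical, and the real model is trained on interventional X-ray sequences, not random tensors.

```python
# Minimal sketch (not the authors' code) of masked-frame-interpolation pretraining:
# patches of a middle frame are masked and reconstructed from the two neighboring
# frames, encouraging the encoder to learn inter-frame temporal correspondences.
import torch
import torch.nn as nn


def patchify(x, p=16):
    """Split (B, 1, H, W) frames into (B, N, p*p) patches."""
    B, C, H, W = x.shape
    x = x.unfold(2, p, p).unfold(3, p, p)            # (B, C, H/p, W/p, p, p)
    return x.contiguous().view(B, -1, C * p * p)     # (B, N, p*p)


class MaskedFrameInterpolator(nn.Module):
    def __init__(self, patch_dim=256, dim=384, depth=4, heads=6, num_patches=196):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        # one positional embedding per patch, plus a learned frame (time) embedding
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.time = nn.Parameter(torch.zeros(3, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, patch_dim)         # reconstruct raw patch pixels

    def forward(self, prev_f, mid_f, next_f, mask):
        # mask: (B, N) boolean, True where mid-frame patches are hidden
        toks = [self.embed(patchify(f)) + self.pos + self.time[i]
                for i, f in enumerate((prev_f, mid_f, next_f))]
        toks[1] = torch.where(mask.unsqueeze(-1),
                              self.mask_token.expand_as(toks[1]), toks[1])
        z = self.encoder(torch.cat(toks, dim=1))      # joint space-time attention
        B, N, _ = toks[1].shape
        pred = self.head(z[:, N:2 * N])               # predictions for the middle frame
        target = patchify(mid_f)
        return ((pred - target) ** 2)[mask].mean()    # loss only on masked patches


# toy usage: three consecutive 224x224 frames, 75% of mid-frame patches masked
model = MaskedFrameInterpolator()
frames = [torch.randn(2, 1, 224, 224) for _ in range(3)]
mask = torch.rand(2, 196) < 0.75
loss = model(*frames, mask)
loss.backward()
```

Restricting the reconstruction loss to the masked mid-frame patches forces the encoder to pull the missing content from the neighboring frames, which is what yields the inter-frame temporal correspondences described in the abstract.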

