Video-SwinUNet: Spatio-temporal Deep Learning Framework for VFSS Instance Segmentation (2302.11325v2)

Published 22 Feb 2023 in cs.CV and cs.AI

Abstract: This paper presents a deep learning framework for medical video segmentation. Convolutional neural network (CNN) and transformer-based methods have achieved strong results in medical image segmentation tasks due to their semantic feature encoding and global information comprehension abilities. However, most existing approaches ignore a salient aspect of medical video data: the temporal dimension. Our proposed framework explicitly extracts features from neighbouring frames across the temporal dimension and incorporates them with a temporal feature blender, which then tokenises the high-level spatio-temporal feature to form a strong global feature encoded via a Swin Transformer. The final segmentation results are produced via a UNet-like encoder-decoder architecture. Our model outperforms other approaches by a significant margin and improves the segmentation benchmarks on the VFSS2022 dataset, achieving Dice coefficients of 0.8986 and 0.8186 on the two datasets tested. Our studies also show the efficacy of the temporal feature blending scheme and cross-dataset transferability of learned capabilities. Code and models are fully available at https://github.com/SimonZeng7108/Video-SwinUNet.
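The temporal feature blending step described in the abstract can be illustrated with a short, self-contained sketch. The snippet below is not the authors' released implementation (see the linked repository for that); it is a minimal, assumption-laden example of attention-weighted blending of per-frame CNN feature maps into a single spatio-temporal map before transformer encoding. The module name, the pooled-linear scoring head, and all tensor shapes are illustrative choices, not details taken from the paper.

```python
# Minimal sketch (assumed design, not the paper's exact blender) of
# attention-weighted temporal feature blending: per-frame feature maps
# from neighbouring frames are fused into one map for the centre frame.
import torch
import torch.nn as nn


class TemporalFeatureBlender(nn.Module):
    """Blend per-frame feature maps [B, T, C, H, W] into one map [B, C, H, W]."""

    def __init__(self, channels: int):
        super().__init__()
        # Hypothetical scoring head: one scalar score per frame from pooled features.
        self.score = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),   # [B*T, C, H, W] -> [B*T, C, 1, 1]
            nn.Flatten(),              # -> [B*T, C]
            nn.Linear(channels, 1),    # -> [B*T, 1]
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = feats.shape
        scores = self.score(feats.reshape(b * t, c, h, w)).reshape(b, t)
        weights = torch.softmax(scores, dim=1)                  # attention over frames
        blended = (weights.view(b, t, 1, 1, 1) * feats).sum(dim=1)
        return blended                                          # [B, C, H, W]


if __name__ == "__main__":
    blender = TemporalFeatureBlender(channels=256)
    frame_feats = torch.randn(2, 5, 256, 14, 14)                # 5 neighbouring frames
    fused = blender(frame_feats)
    print(fused.shape)                                          # torch.Size([2, 256, 14, 14])
```

In the full pipeline described in the abstract, a fused map like this would then be tokenised and passed to a Swin Transformer encoder, with a UNet-like decoder producing the final segmentation masks.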
