MAST: Video Polyp Segmentation with a Mixture-Attention Siamese Transformer (2401.12439v1)
Abstract: Accurate segmentation of polyps from colonoscopy videos is of great significance to polyp treatment and early prevention of colorectal cancer. However, it is challenging due to the difficulties associated with modelling long-range spatio-temporal relationships within a colonoscopy video. In this paper, we address this challenging task with a novel Mixture-Attention Siamese Transformer (MAST), which explicitly models the long-range spatio-temporal relationships with a mixture-attention mechanism for accurate polyp segmentation. Specifically, we first construct a Siamese transformer architecture to jointly encode paired video frames for their feature representations. We then design a mixture-attention module to exploit the intra-frame and inter-frame correlations, enhancing the features with rich spatio-temporal relationships. Finally, the enhanced features are fed to two parallel decoders for predicting the segmentation maps. To the best of our knowledge, our MAST is the first transformer model dedicated to video polyp segmentation. Extensive experiments on the large-scale SUN-SEG benchmark demonstrate the superior performance of MAST in comparison with the cutting-edge competitors. Our code is publicly available at https://github.com/Junqing-Yang/MAST.
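The three stages named in the abstract (shared-weight Siamese encoding of a frame pair, mixture attention over intra-frame and inter-frame correlations, and two parallel decoders) can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the abstract does not say how the two attention terms are combined, so the residual sum used here is an assumption, and the linear encoder, the per-token sigmoid decoders, and all names (`segment_pair`, `mixture_attention`, `encoder_w`, `head_a`, `head_b`) are hypothetical stand-ins for the transformer backbone and decoders in the released code.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax_rows(a):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(query, context):
    """Scaled dot-product attention; inputs are (tokens x dim) matrices."""
    scale = 1.0 / math.sqrt(len(query[0]))
    scores = [[v * scale for v in row] for row in matmul(query, transpose(context))]
    return matmul(softmax_rows(scores), context)

def add(*mats):
    """Element-wise sum of equally shaped matrices."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*mats)]

def mixture_attention(feat_a, feat_b):
    # Assumption: residual + intra-frame (self) + inter-frame (cross) attention.
    enh_a = add(feat_a, attention(feat_a, feat_a), attention(feat_a, feat_b))
    enh_b = add(feat_b, attention(feat_b, feat_b), attention(feat_b, feat_a))
    return enh_a, enh_b

def decode(features, head):
    """Hypothetical decoder: one foreground probability per token."""
    return [1.0 / (1.0 + math.exp(-sum(f * w for f, w in zip(tok, head))))
            for tok in features]

def segment_pair(frame_a, frame_b, encoder_w, head_a, head_b):
    # Siamese encoding: the same weights encode both frames.
    feat_a = matmul(frame_a, encoder_w)
    feat_b = matmul(frame_b, encoder_w)
    enh_a, enh_b = mixture_attention(feat_a, feat_b)
    # Two parallel decoders, one per frame.
    return decode(enh_a, head_a), decode(enh_b, head_b)

# Toy data: 6 tokens per frame, 4 input channels, 8 feature channels.
rng = random.Random(0)
tokens, in_dim, dim = 6, 4, 8
frame_a = [[rng.gauss(0, 1) for _ in range(in_dim)] for _ in range(tokens)]
frame_b = [[rng.gauss(0, 1) for _ in range(in_dim)] for _ in range(tokens)]
encoder_w = [[rng.gauss(0, 0.5) for _ in range(dim)] for _ in range(in_dim)]
head_a = [rng.gauss(0, 0.1) for _ in range(dim)]
head_b = [rng.gauss(0, 0.1) for _ in range(dim)]

mask_a, mask_b = segment_pair(frame_a, frame_b, encoder_w, head_a, head_b)
print(len(mask_a), len(mask_b))
```

In this sketch each frame's features are enhanced by attending both to themselves and to the paired frame, which is one simple way to inject spatio-temporal context before decoding. A real model would operate on multi-scale feature maps and produce dense masks rather than per-token scores.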