Motion-Corrected Moving Average: Including Post-Hoc Temporal Information for Improved Video Segmentation
Abstract: Real-time computational speed and a high degree of precision are requirements for computer-assisted interventions. Applying a segmentation network to a medical video processing task can introduce significant inter-frame prediction noise. Existing approaches can reduce inconsistencies by including temporal information but often impose requirements on the architecture or dataset. This paper proposes a method to include temporal information in any segmentation model and, thus, a technique to improve video segmentation performance without alterations during training or additional labeling. With Motion-Corrected Moving Average, we refine the exponential moving average between the current and previous predictions. Using optical flow to estimate the movement between consecutive frames, we can shift the prior term in the moving-average calculation to align with the geometry of the current frame. The optical flow calculation does not require the output of the model and can therefore be performed in parallel, leading to no significant runtime penalty for our approach. We evaluate our approach on two publicly available segmentation datasets and two proprietary endoscopic datasets and show improvements over a baseline approach.
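The update described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `warp_with_flow`, `mcma_step`, and `alpha` are hypothetical, the flow is assumed to be a backward flow (for each current-frame pixel, the offset to its position in the previous frame), and nearest-neighbour sampling is used for simplicity. The abstract does not specify the flow estimator or the smoothing weight, so any dense optical flow method could supply the `flow` array here.

```python
import numpy as np


def warp_with_flow(prev_state, flow):
    """Align the previous smoothed prediction with the current frame.

    prev_state: (H, W) or (H, W, C) array of per-pixel scores.
    flow:       (H, W, 2) array; flow[y, x] = (dx, dy) is ASSUMED to point
                from the current-frame pixel to its previous-frame location.
    """
    h, w = prev_state.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Nearest-neighbour lookup, clamped to the image border (an assumption).
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return prev_state[src_y, src_x]


def mcma_step(current_pred, prev_state, flow, alpha=0.5):
    """One motion-corrected moving-average update (hypothetical signature).

    The prior term is shifted into the current frame's geometry before it
    is blended with the new prediction; alpha weights the new prediction.
    """
    aligned_prior = warp_with_flow(prev_state, flow)
    return alpha * current_pred + (1.0 - alpha) * aligned_prior


# Toy example: a vertical bar moves one pixel to the right between frames.
prev_state = np.zeros((4, 4))
prev_state[:, 1] = 1.0
current_pred = np.zeros((4, 4))
current_pred[:, 2] = 1.0
flow = np.zeros((4, 4, 2))
flow[..., 0] = -1.0  # each current pixel came from one column to the left

smoothed = mcma_step(current_pred, prev_state, flow)
plain_ema = 0.5 * current_pred + 0.5 * prev_state
```

In this toy case the plain moving average smears the bar across two columns, while the motion-corrected version keeps it at full confidence in its new position. Because `flow` depends only on the two input frames and not on the model output, it could be computed while the segmentation network is still running, which is the parallelism the abstract refers to.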