Hybrid Tracker with Pixel and Instance for Video Panoptic Segmentation (2203.01217v2)
Abstract: Video Panoptic Segmentation (VPS) aims to generate coherent panoptic segmentation and track the identities of all pixels across video frames. Existing methods predominantly rely on learned instance embeddings to keep the panoptic segmentation consistent over time. However, they inevitably struggle with small objects, instances that look similar but have different identities, occlusion, and strong deformation of instance contours. To address these problems, we present HybridTracker, a lightweight joint tracking model that aims to overcome the limitations of any single tracker. HybridTracker runs a pixel tracker and an instance tracker in parallel to obtain two association matrices, which are fused into a single matching matrix. In the instance tracker, we design a differentiable matching layer that ensures stable inter-frame matching. In the pixel tracker, we compute the Dice coefficient between the same instance in different frames, given the estimated optical flow, to form an Intersection over Union (IoU) matrix. We additionally propose mutual check and temporal consistency constraints during inference to handle occlusion and contour deformation. Comprehensive experiments show that HybridTracker outperforms state-of-the-art methods on the Cityscapes-VPS and VIPER datasets.
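Since only the abstract is available here, the Python sketch below illustrates one plausible reading of the pipeline: a pixel tracker builds a Dice/IoU matrix from flow-warped masks, an instance tracker builds an embedding-similarity matrix, the two are fused into a matching matrix, and a mutual check filters matches at inference. All function names, the cosine-similarity choice, the weighted-sum fusion, and the Hungarian assignment step are illustrative assumptions, not the paper's exact formulation (in particular, the paper's differentiable matching layer is not reproduced here).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def warp_mask(mask, flow):
    """Warp a binary mask from frame t to frame t+1 using dense optical flow.

    mask: (H, W) bool array for one instance at frame t.
    flow: (H, W, 2) forward flow (dx, dy) from frame t to t+1.
    Simple forward scatter; an illustrative stand-in for a proper warp.
    """
    H, W = mask.shape
    ys, xs = np.nonzero(mask)
    xs2 = np.clip(np.round(xs + flow[ys, xs, 0]).astype(int), 0, W - 1)
    ys2 = np.clip(np.round(ys + flow[ys, xs, 1]).astype(int), 0, H - 1)
    warped = np.zeros_like(mask)
    warped[ys2, xs2] = True
    return warped


def dice(a, b):
    """Dice coefficient between two binary masks."""
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum() + 1e-6)


def pixel_tracker_matrix(masks_t, masks_t1, flow):
    """Pairwise Dice overlap between flow-warped masks at t and masks at t+1."""
    M = np.zeros((len(masks_t), len(masks_t1)))
    for i, m in enumerate(masks_t):
        w = warp_mask(m, flow)
        for j, n in enumerate(masks_t1):
            M[i, j] = dice(w, n)
    return M


def instance_tracker_matrix(emb_t, emb_t1):
    """Cosine similarity between L2-normalized instance embeddings (assumed)."""
    a = emb_t / (np.linalg.norm(emb_t, axis=1, keepdims=True) + 1e-6)
    b = emb_t1 / (np.linalg.norm(emb_t1, axis=1, keepdims=True) + 1e-6)
    return a @ b.T


def hybrid_match(masks_t, masks_t1, flow, emb_t, emb_t1, alpha=0.5):
    """Fuse the two association matrices and solve the assignment.

    The weighted-sum fusion and the mutual (row/column argmax agreement)
    check are illustrative choices, not necessarily the paper's.
    """
    S = alpha * pixel_tracker_matrix(masks_t, masks_t1, flow) \
        + (1 - alpha) * instance_tracker_matrix(emb_t, emb_t1)
    rows, cols = linear_sum_assignment(-S)  # maximize total similarity
    # Mutual check: keep a pair only if each side is the other's best match.
    keep = [(i, j) for i, j in zip(rows, cols)
            if S[i].argmax() == j and S[:, j].argmax() == i]
    return keep
```

In this reading, a temporal consistency constraint could additionally reject matches whose identity disagrees with the track history over a short window; that step is omitted above because the abstract does not specify how it is enforced.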