DVIS++: Improved Decoupled Framework for Universal Video Segmentation (2312.13305v1)

Published 20 Dec 2023 in cs.CV

Abstract: We present the Decoupled VIdeo Segmentation (DVIS) framework, a novel approach for the challenging task of universal video segmentation, including video instance segmentation (VIS), video semantic segmentation (VSS), and video panoptic segmentation (VPS). Unlike previous methods that model video segmentation in an end-to-end manner, our approach decouples video segmentation into three cascaded sub-tasks: segmentation, tracking, and refinement. This decoupling design allows for simpler and more effective modeling of the spatio-temporal representations of objects, especially in complex scenes and long videos. Accordingly, we introduce two novel components: the referring tracker and the temporal refiner. These components track objects frame by frame and model spatio-temporal representations based on pre-aligned features. To improve the tracking capability of DVIS, we propose a denoising training strategy and introduce contrastive learning, resulting in a more robust framework named DVIS++. Furthermore, we evaluate DVIS++ in various settings, including open vocabulary and using a frozen pre-trained backbone. By integrating CLIP with DVIS++, we present OV-DVIS++, the first open-vocabulary universal video segmentation framework. We conduct extensive experiments on six mainstream benchmarks, including the VIS, VSS, and VPS datasets. Using a unified architecture, DVIS++ significantly outperforms state-of-the-art specialized methods on these benchmarks in both close- and open-vocabulary settings. Code: https://github.com/zhang-tao-whu/DVIS_Plus

Analysis of "Bare Advanced Demo of IEEEtran.cls for IEEE Computer Society Journals"

The paper "Bare Advanced Demo of IEEEtran.cls for IEEE Computer Society Journals" authored by Michael Shell et al., serves as a demonstrative example for preparing submissions in IEEE Computer Society journals using the IEEEtran LaTeX class file. The paper is essentially an indicative template, crafted to guide authors in structuring and formatting their manuscripts in compliance with the IEEE's publication standards.

Overview

IEEEtran.cls is a LaTeX class file that facilitates the typesetting of IEEE-style documents. This paper provides a skeletal example of a manuscript prepared with IEEEtran.cls version 1.8b or later for IEEE Computer Society journals. Its purpose is primarily pedagogical: it helps authors adhere to the formatting requirements the IEEE prescribes for a consistent and professional presentation of scientific content.
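
As a rough sketch of what such a skeleton looks like (the placeholder text and exact command ordering in the official demo file may differ), a Computer Society journal manuscript built on this class might begin:

% Minimal sketch of a Computer Society journal manuscript using
% IEEEtran.cls in its "compsoc" journal mode (v1.8b or later).
\documentclass[10pt,journal,compsoc]{IEEEtran}

\begin{document}

\title{Bare Demo of IEEEtran.cls for IEEE Computer Society Journals}
\author{Michael~Shell,~\IEEEmembership{Member,~IEEE}}

% In compsoc journal mode, the abstract and index terms are set
% within the title area via this command.
\IEEEtitleabstractindextext{%
\begin{abstract}
The abstract goes here.
\end{abstract}
\begin{IEEEkeywords}
Computer Society, IEEE, IEEEtran, journal, \LaTeX, template.
\end{IEEEkeywords}}

\maketitle
\IEEEdisplaynontitleabstractindextext

\section{Introduction}
\IEEEPARstart{T}{his} demo file is intended to serve as a starter file.

\end{document}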

Composition

The document encompasses the standard sections that typically constitute a research paper: the title, author list, journal affiliations, and an abstract followed by keywords. The actual content of these sections is nominal, serving as placeholders that authors can replace with their own material.

Key structural components such as the introduction, subsections, and conclusion are delineated to demonstrate the organizational flow required of academic manuscripts. An appendices section is reserved for supplementary information, while an acknowledgments section provides space to credit contributions and funding.
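
Concretely, that organizational flow can be sketched as follows; the section titles here are illustrative placeholders rather than the demo's verbatim text:

\section{Introduction}
Introduction text goes here.

\subsection{Subsection Heading Here}
Subsection text goes here.

\subsubsection{Subsubsection Heading Here}
Subsubsection text goes here.

\section{Conclusion}
The conclusion goes here.

% Appendices follow the conclusion.
\appendices
\section{Proof of the First Theorem}
Appendix text goes here.

% The Computer Society spells it "Acknowledgments";
% other IEEE societies use "Acknowledgment".
\ifCLASSOPTIONcompsoc
  \section*{Acknowledgments}
\else
  \section*{Acknowledgment}
\fi
The authors would like to thank their funding agencies.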

IEEEtran.cls offers robust functionality, including control over bibliography styles and IEEE-compliant author biographies via the IEEEbiography and IEEEbiographynophoto environments, which typeset biographies consistently whether or not a photo is included.
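
A brief sketch of that machinery, in which mybib (a BibTeX database), photo (a graphic file), and the second author name are placeholder assumptions rather than part of the template:

% Bibliography: the IEEEtran.bst style ships alongside the class.
\bibliographystyle{IEEEtran}
\bibliography{IEEEabrv,mybib}  % mybib.bib is a placeholder database name

% Biography with a photo (the graphic name is a placeholder).
\begin{IEEEbiography}[{\includegraphics[width=1in,height=1.25in,clip,keepaspectratio]{photo}}]{Michael Shell}
Biography text goes here.
\end{IEEEbiography}

% Biography without a photo.
\begin{IEEEbiographynophoto}{Jane Doe}
Biography text goes here.
\end{IEEEbiographynophoto}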

Implications for Research and Publication

While the paper itself does not present new findings or advances in computer science, it underscores the importance of adhering to systematic formats for scholarly articles. Consistency in presentation facilitates readability and accessibility, crucial aspects of scientific communication. By standardizing article structure, IEEEtran.cls aids in maintaining clarity and conformity in the dissemination of research, which is invaluable given the volume and diversity of output in contemporary scholarly publishing.

For practitioners and researchers accustomed to TeX-based typesetting, this document reaffirms the IEEE's commitment to providing flexible and powerful tools for document preparation. The availability of such templates can reduce the technical overhead of manuscript preparation, allowing authors to focus on the substance of their research rather than the complexities of formatting.

Future Prospects

As LaTeX continues to evolve alongside digital publishing technologies, future iterations of IEEEtran.cls may incorporate additional functionality, such as automated metadata tagging or compatibility with evolving preprint repositories and publishing protocols. The influence of open-access movements and the growing demand for interdisciplinary collaboration may also guide future enhancements to the class file.

In conclusion, "Bare Advanced Demo of IEEEtran.cls for IEEE Computer Society Journals" stands as a pertinent instructional resource for authors aiming to align with IEEE's publication norms, ensuring their work is presented in the most efficient and academically acceptable format.

Authors (10)
  1. Tao Zhang (481 papers)
  2. Xingye Tian (6 papers)
  3. Yikang Zhou (7 papers)
  4. Shunping Ji (23 papers)
  5. Xuebo Wang (6 papers)
  6. Xin Tao (50 papers)
  7. Yuan Zhang (331 papers)
  8. Pengfei Wan (86 papers)
  9. Zhongyuan Wang (105 papers)
  10. Yu Wu (196 papers)