
Scalable Video Object Segmentation with Identification Mechanism (2203.11442v8)

Published 22 Mar 2022 in cs.CV

Abstract: This paper delves into the challenges of achieving scalable and effective multi-object modeling for semi-supervised Video Object Segmentation (VOS). Previous VOS methods decode features with a single positive object, limiting the learning of multi-object representation as they must match and segment each target separately under multi-object scenarios. Additionally, earlier techniques catered to specific application objectives and lacked the flexibility to fulfill different speed-accuracy requirements. To address these problems, we present two innovative approaches, Associating Objects with Transformers (AOT) and Associating Objects with Scalable Transformers (AOST). In pursuing effective multi-object modeling, AOT introduces the IDentification (ID) mechanism to allocate each object a unique identity. This approach enables the network to model the associations among all objects simultaneously, thus facilitating the tracking and segmentation of objects in a single network pass. To address the challenge of inflexible deployment, AOST further integrates scalable long short-term transformers that incorporate scalable supervision and layer-wise ID-based attention. This enables online architecture scalability in VOS for the first time and overcomes ID embeddings' representation limitations. Given the absence of a benchmark for VOS involving densely multi-object annotations, we propose a challenging Video Object Segmentation in the Wild (VOSW) benchmark to validate our approaches. We evaluated various AOT and AOST variants using extensive experiments across VOSW and five commonly used VOS benchmarks, including YouTube-VOS 2018 & 2019 Val, DAVIS-2017 Val & Test, and DAVIS-2016. Our approaches surpass the state-of-the-art competitors and display exceptional efficiency and scalability consistently across all six benchmarks. Project page: https://github.com/yoxu515/aot-benchmark.

Insights into Scalable Video Object Segmentation with Identification Mechanism

The paper "Scalable Video Object Segmentation with Identification Mechanism" focuses on enhancing the efficiency and flexibility of Video Object Segmentation (VOS) through innovative modeling approaches. Traditional VOS techniques often struggle with the effective segmentation of multiple objects simultaneously, as each object must be processed independently. This introduces considerable computational inefficiencies and hampers scalability across different application domains. To address these limitations, the authors propose two novel methodologies: Associating Objects with Transformers (AOT) and Associating Objects with Scalable Transformers (AOST).

Key Contributions

The paper introduces the IDentification (ID) mechanism, a pivotal component that assigns unique identities to objects within video frames. This allows for simultaneous multi-object modeling, enhancing both the representation and efficiency of segmentation tasks. The AOT model implements this mechanism, enabling end-to-end processing of objects in a single network pass, thus reducing computational demands and improving context understanding.
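
To make the idea concrete, here is a minimal PyTorch sketch of identity embedding: each object slot owns a learnable identity vector that is painted onto the pixels its mask covers, yielding a single feature map that encodes all objects at once. `IDBank`, `max_objects`, and `embed_dim` are illustrative names and defaults, not the authors' actual API.

```python
# Minimal sketch of the IDentification (ID) mechanism (assumed names, PyTorch).
import torch
import torch.nn as nn


class IDBank(nn.Module):
    """A bank of learnable identity embeddings, one per object slot."""

    def __init__(self, max_objects: int = 10, embed_dim: int = 256):
        super().__init__()
        # Each of the max_objects slots owns a unique, learnable identity vector.
        self.id_embeddings = nn.Embedding(max_objects, embed_dim)

    def forward(self, masks: torch.Tensor) -> torch.Tensor:
        """Encode multi-object masks into one dense identity feature map.

        masks: (B, N, H, W) one-hot masks for N objects in a reference frame.
        returns: (B, embed_dim, H, W); each pixel carries the identity vector
        of the object covering it, so one map represents every object.
        """
        num_objects = masks.shape[1]
        ids = self.id_embeddings.weight[:num_objects]      # (N, C)
        return torch.einsum('bnhw,nc->bchw', masks, ids)


# Toy usage: three objects collapse into a single identity map, which can be
# added to value features so one attention pass propagates all objects jointly.
bank = IDBank()
masks = torch.zeros(1, 3, 64, 64)
masks[0, 0, :32] = 1          # object 1: top half
masks[0, 1, 32:, :32] = 1     # object 2: bottom-left quadrant
masks[0, 2, 32:, 32:] = 1     # object 3: bottom-right quadrant
id_features = bank(masks)     # (1, 256, 64, 64)
```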

Further advancing deployment flexibility, AOST integrates scalable long short-term transformers that combine scalable supervision with layer-wise ID-based attention. This design allows the architecture to be adjusted at run time, addressing varying speed-accuracy trade-offs and making the same trained model applicable to devices with differing capabilities, from mobile phones to high-performance servers.
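
A hedged sketch of what such run-time depth scaling can look like is shown below. `ScalableLSTT`, `num_active_layers`, and the per-layer linear heads are illustrative stand-ins; the actual AOST layers couple long-term and short-term attention with ID-based attention and feed a mask decoder rather than a linear head.

```python
# Hedged sketch of online depth scalability (assumed names, PyTorch).
from typing import Optional

import torch
import torch.nn as nn


class ScalableLSTT(nn.Module):
    """A transformer stack that can be truncated to any depth at run time."""

    def __init__(self, num_layers: int = 3, dim: int = 256, heads: int = 8):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
            for _ in range(num_layers)
        )
        # One head per layer: scalable supervision trains every truncated
        # sub-network to produce a usable prediction on its own.
        self.heads = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))

    def forward(self, x: torch.Tensor,
                num_active_layers: Optional[int] = None) -> list:
        depth = num_active_layers or len(self.layers)
        outputs = []
        for layer, head in zip(self.layers[:depth], self.heads[:depth]):
            x = layer(x)
            outputs.append(head(x))   # supervised at every depth in training
        return outputs                # at inference, take outputs[-1]


# One set of trained weights serves both ends of the speed-accuracy trade-off.
model = ScalableLSTT()
tokens = torch.randn(2, 1024, 256)             # (batch, H*W, dim) frame features
fast = model(tokens, num_active_layers=1)[-1]  # shallow variant for mobile
accurate = model(tokens)[-1]                   # full depth for servers
```

Because every truncated sub-network is supervised during training, the deployed depth can be chosen per device without retraining.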

Empirical Evaluation

To substantiate the efficacy of their approaches, the authors introduce the Video Object Segmentation in the Wild (VOSW) benchmark, featuring densely annotated multi-object scenarios. Experiments across VOSW and five established VOS benchmarks (YouTube-VOS 2018 & 2019 Val, DAVIS-2017 Val & Test, and DAVIS-2016) demonstrate that AOT and AOST consistently outperform state-of-the-art methods. Notably, the approach ranked first in the third Large-scale Video Object Segmentation Challenge, underscoring its gains in scalability and efficiency.

Implications and Future Work

The implications of this paper are significant both for theoretical advancements in multi-object VOS and practical deployments in real-time applications. The introduction of the identification mechanism and scalable transformers holds promise for broader adoption in areas such as autonomous driving, augmented reality, and video editing, where multi-object tracking is imperative.

Future work could extend these methodologies to related domains such as video instance segmentation or interactive VOS, where similar scalability and efficiency challenges persist. Additionally, exploring the integration of these methods with more advanced backbone architectures or novel attention mechanisms could further elevate VOS capabilities.

Conclusions

Overall, the paper contributes effectively to the field of VOS by addressing multi-object modeling limitations through novel identification and scalability approaches. The proposed frameworks not only enhance computational efficiency but also provide practical solutions for diverse application requirements. The comprehensive benchmarks and strong empirical results establish a foundational path for the evolution and deployment of scalable VOS systems.

Authors (6)
  1. Zongxin Yang
  2. Jiaxu Miao
  3. Yunchao Wei
  4. Wenguan Wang
  5. Xiaohan Wang
  6. Yi Yang