1st Place Solution for 5th LSVOS Challenge: Referring Video Object Segmentation (2401.00663v1)
Abstract: Recent transformer-based models have dominated the Referring Video Object Segmentation (RVOS) task due to their superior performance. Most prior works adopt a unified DETR framework to generate segmentation masks in a query-to-instance manner. In this work, we integrate the strengths of leading RVOS models to build an effective paradigm. We first obtain binary mask sequences from the RVOS models. To improve the consistency and quality of the masks, we propose a Two-Stage Multi-Model Fusion strategy. Each stage ensembles RVOS models according to their framework design and training strategy, and leverages different video object segmentation (VOS) models to enhance mask coherence through an object propagation mechanism. Our method achieves 75.7% J&F on the Ref-Youtube-VOS validation set and 70% J&F on the test set, ranking 1st on Track 3 of the 5th Large-scale Video Object Segmentation Challenge (ICCV 2023). Code is available at https://github.com/RobertLuo1/iccv2023_RVOS_Challenge.
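To make the idea of ensembling binary mask sequences concrete, here is a minimal sketch in plain Python. It rests on assumptions: the abstract does not state how the masks are fused, so pixel-wise majority voting is used here only as a common stand-in, and the names `majority_vote` and `jaccard` are hypothetical rather than taken from the released code. The `jaccard` function computes the region similarity J (intersection over union); the reported J&F score is the mean of J and a contour accuracy F, which is omitted here.

```python
from typing import List

Mask = List[List[int]]      # one frame: H x W grid of 0/1 values
MaskSequence = List[Mask]   # one video: T frames


def majority_vote(sequences: List[MaskSequence]) -> MaskSequence:
    """Fuse per-model binary mask sequences by pixel-wise majority vote.

    Assumption: this voting rule is illustrative only; the abstract does
    not specify how the ensemble is computed.
    """
    n_models = len(sequences)
    fused = []
    for frames in zip(*sequences):        # the same frame from every model
        fused_frame = []
        for rows in zip(*frames):         # the same row from every model
            # A pixel is foreground only with a strict majority of votes,
            # so a tie between an even number of models yields background.
            fused_frame.append(
                [1 if 2 * sum(pixels) > n_models else 0 for pixels in zip(*rows)]
            )
        fused.append(fused_frame)
    return fused


def jaccard(pred: Mask, gt: Mask) -> float:
    """Region similarity J: intersection over union of two binary masks."""
    inter = sum(p & g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    union = sum(p | g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    # Two empty masks agree perfectly by convention.
    return 1.0 if union == 0 else inter / union


# Toy example: three hypothetical models, one 2x2 frame each.
model_a = [[[1, 1], [0, 0]]]
model_b = [[[1, 0], [0, 0]]]
model_c = [[[1, 1], [1, 0]]]
fused = majority_vote([model_a, model_b, model_c])
score = jaccard(fused[0], [[1, 1], [1, 0]])
```

In this toy case the fused frame keeps only the pixels that at least two of the three models mark as foreground. A propagation step with a VOS model, as described in the abstract, would follow the fusion and is not sketched here.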
- Zhuoyan Luo
- Yicheng Xiao
- Yong Liu
- Yitong Wang
- Yansong Tang
- Xiu Li
- Yujiu Yang