
LoSh: Long-Short Text Joint Prediction Network for Referring Video Object Segmentation (2306.08736v3)

Published 14 Jun 2023 in cs.CV

Abstract: Referring video object segmentation (RVOS) aims to segment the target instance referred to by a given text expression in a video clip. The text expression normally contains a sophisticated description of the instance's appearance, action, and relations with other instances. It is therefore rather difficult for an RVOS model to capture all of these attributes in the video; in practice, the model often favours the action- and relation-related visual attributes of the instance, which can result in partial or even incorrect mask predictions of the target instance. We tackle this problem by extracting a subject-centric short text expression from the original long text expression. The short expression retains only the appearance-related information of the target instance, so we can use it to focus the model's attention on the instance's appearance. We let the model make joint predictions using both the long and short text expressions, inserting a long-short cross-attention module to let the joint features interact and a long-short predictions intersection loss to regulate the joint predictions. Beyond these improvements on the linguistic side, we also introduce a forward-backward visual consistency loss, which uses optical flow to warp visual features between the annotated frames and their temporal neighbors and enforces consistency between them. We build our method on top of two state-of-the-art pipelines. Extensive experiments on A2D-Sentences, Refer-YouTube-VOS, JHMDB-Sentences, and Refer-DAVIS17 show impressive improvements from our method. Code is available at https://github.com/LinfengYuan1997/Losh.
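The abstract names two auxiliary objectives: a long-short predictions intersection loss that keeps the masks predicted from the long and short expressions consistent, and a forward-backward visual consistency loss that warps features between frames with optical flow. The sketch below is a minimal, assumed PyTorch rendering of those two ideas, not the authors' implementation (see the linked repository for that): the soft-IoU-style surrogate for the intersection loss, the tensor shapes, and the function names (`intersection_loss`, `warp_with_flow`, `consistency_loss`) are illustrative assumptions.

```python
# Minimal sketch (assumed, not the authors' code) of the two auxiliary losses
# described in the abstract. Shapes and exact formulations are illustrative.
import torch
import torch.nn.functional as F


def intersection_loss(long_logits: torch.Tensor, short_logits: torch.Tensor) -> torch.Tensor:
    """Encourage the masks predicted from the long and short expressions to agree.

    Both inputs are raw mask logits of shape (B, 1, H, W); we penalise the area
    where the two sigmoid masks disagree via a soft-IoU-style surrogate.
    """
    long_prob = torch.sigmoid(long_logits)
    short_prob = torch.sigmoid(short_logits)
    inter = (long_prob * short_prob).sum(dim=(1, 2, 3))
    union = (long_prob + short_prob - long_prob * short_prob).sum(dim=(1, 2, 3))
    return (1.0 - inter / union.clamp(min=1e-6)).mean()


def warp_with_flow(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp a feature map (B, C, H, W) with a dense flow field (B, 2, H, W) in pixels.

    The pixel offsets are converted to the normalised sampling grid expected by
    torch.nn.functional.grid_sample.
    """
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=feat.device, dtype=feat.dtype),
        torch.arange(w, device=feat.device, dtype=feat.dtype),
        indexing="ij",
    )
    grid_x = xs.unsqueeze(0) + flow[:, 0]           # source x coordinate per pixel
    grid_y = ys.unsqueeze(0) + flow[:, 1]           # source y coordinate per pixel
    grid_x = 2.0 * grid_x / max(w - 1, 1) - 1.0     # normalise to [-1, 1]
    grid_y = 2.0 * grid_y / max(h - 1, 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)    # (B, H, W, 2)
    return F.grid_sample(feat, grid, align_corners=True)


def consistency_loss(feat_anno: torch.Tensor, feat_neighbor: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """L1 distance between the annotated frame's features and the neighbour's features warped back to it."""
    return F.l1_loss(feat_anno, warp_with_flow(feat_neighbor, flow))
```

Under this reading, both terms would simply be added (with tunable weights) to the base segmentation loss of whichever RVOS pipeline the method is built on.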
