
Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation (2305.16318v2)

Published 25 May 2023 in cs.CV, cs.AI, and cs.MM

Abstract: Recently, video object segmentation (VOS) referred by multi-modal signals, e.g., language and audio, has attracted increasing attention in both industry and academia. The task is challenging in two respects: exploring the semantic alignment between modalities and the visual correspondence across frames. However, existing methods adopt separate network architectures for different modalities and neglect the inter-frame temporal interaction with references. In this paper, we propose MUTR, a Multi-modal Unified Temporal transformer for Referring video object segmentation. For the first time in a unified framework, MUTR adopts a DETR-style transformer and is capable of segmenting video objects designated by either text or audio references. Specifically, we introduce two strategies to fully explore the temporal relations between videos and multi-modal signals. First, for low-level temporal aggregation before the transformer, we enable the multi-modal references to capture multi-scale visual cues from consecutive video frames. This effectively endows the text or audio signals with temporal knowledge and boosts the semantic alignment between modalities. Second, for high-level temporal interaction after the transformer, we conduct inter-frame feature communication among different object embeddings, contributing to better object-wise correspondence for tracking along the video. On the Ref-YouTube-VOS and AVSBench datasets with respective text and audio references, MUTR achieves +4.2% and +8.7% J&F improvements over state-of-the-art methods, demonstrating the significance of our unified framework for multi-modal VOS. Code is released at https://github.com/OpenGVLab/MUTR.
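To make the two temporal strategies concrete, below is a minimal PyTorch sketch of how they could be realized: reference tokens cross-attending to features from consecutive frames (low-level aggregation before the transformer), and per-query object embeddings self-attending across frames (high-level interaction after the transformer). All module names, tensor shapes, and dimensions here are illustrative assumptions, not the authors' implementation; the actual modules are in the released code at https://github.com/OpenGVLab/MUTR.

```python
import torch
import torch.nn as nn


class TemporalAggregation(nn.Module):
    """Low-level temporal aggregation (sketch): text/audio reference
    tokens cross-attend to visual features flattened from consecutive
    frames, injecting temporal visual cues into the reference before
    it enters the transformer."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, ref_tokens: torch.Tensor, frame_feats: torch.Tensor) -> torch.Tensor:
        # ref_tokens:  (B, L, C)      tokens from the text or audio encoder
        # frame_feats: (B, T*HW, C)   flattened visual features of T frames
        attended, _ = self.cross_attn(ref_tokens, frame_feats, frame_feats)
        return self.norm(ref_tokens + attended)  # residual + norm


class TemporalInteraction(nn.Module):
    """High-level temporal interaction (sketch): object embeddings
    produced per frame by a DETR-style decoder exchange information
    along the temporal axis, improving object-wise correspondence
    for tracking across the video."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, obj_embeds: torch.Tensor) -> torch.Tensor:
        # obj_embeds: (B*Q, T, C)  one object query's embedding across T frames
        attended, _ = self.self_attn(obj_embeds, obj_embeds, obj_embeds)
        return self.norm(obj_embeds + attended)


if __name__ == "__main__":
    # Hypothetical sizes for a quick shape check.
    B, L, T, HW, Q, C = 2, 10, 5, 64, 4, 256
    ref = TemporalAggregation(C)(torch.randn(B, L, C), torch.randn(B, T * HW, C))
    obj = TemporalInteraction(C)(torch.randn(B * Q, T, C))
    print(ref.shape, obj.shape)  # torch.Size([2, 10, 256]) torch.Size([8, 5, 256])
```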

Authors (10)
  1. Shilin Yan (20 papers)
  2. Renrui Zhang (100 papers)
  3. Ziyu Guo (49 papers)
  4. Wenchao Chen (17 papers)
  5. Wei Zhang (1489 papers)
  6. Hongyang Li (99 papers)
  7. Yu Qiao (563 papers)
  8. Hao Dong (175 papers)
  9. Zhongjiang He (11 papers)
  10. Peng Gao (401 papers)
Citations (19)