
UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces (2312.15715v1)

Published 25 Dec 2023 in cs.CV

Abstract: The reference-based object segmentation tasks, namely referring image segmentation (RIS), few-shot image segmentation (FSS), referring video object segmentation (RVOS), and video object segmentation (VOS), aim to segment a specific object by utilizing either language or annotated masks as references. Despite significant progress in each respective field, current methods are task-specifically designed and developed in different directions, which hinders the activation of multi-task capabilities for these tasks. In this work, we end the current fragmented situation and propose UniRef++ to unify the four reference-based object segmentation tasks with a single architecture. At the heart of our approach is the proposed UniFusion module which performs multiway-fusion for handling different tasks with respect to their specified references. And a unified Transformer architecture is then adopted for achieving instance-level segmentation. With the unified designs, UniRef++ can be jointly trained on a broad range of benchmarks and can flexibly complete multiple tasks at run-time by specifying the corresponding references. We evaluate our unified models on various benchmarks. Extensive experimental results indicate that our proposed UniRef++ achieves state-of-the-art performance on RIS and RVOS, and performs competitively on FSS and VOS with a parameter-shared network. Moreover, we showcase that the proposed UniFusion module could be easily incorporated into the current advanced foundation model SAM and obtain satisfactory results with parameter-efficient finetuning. Codes and models are available at \url{https://github.com/FoundationVision/UniRef}.

Authors (6)
  1. Jiannan Wu (12 papers)
  2. Yi Jiang (171 papers)
  3. Bin Yan (138 papers)
  4. Huchuan Lu (199 papers)
  5. Zehuan Yuan (65 papers)
  6. Ping Luo (340 papers)
Citations (13)

Summary

UniRef++: A Unified Architecture for Referring Object Segmentation Across Images and Videos

Object segmentation is a crucial task in computer vision, enabling computers to delineate the shapes of specific objects within images or videos. Traditionally, researchers have approached reference-based segmentation with specialized, task-specific models, leading to fragmented development across closely related subfields. The work "UniRef++" proposes a different path: a single, unified architecture that masters four reference-based segmentation tasks, namely referring image segmentation (RIS), few-shot image segmentation (FSS), referring video object segmentation (RVOS), and video object segmentation (VOS).

Unified Approach to Segmentation

At the heart of UniRef++ is the UniFusion module, which performs multiway fusion: it injects different types of reference information, such as language descriptions or annotated masks, into the visual features. A unified Transformer architecture then processes the fused features to carry out instance-level segmentation.
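
To make this concrete, below is a minimal PyTorch sketch of a multiway-fusion step, assuming a shared cross-attention layer that injects reference tokens (from language or from masks) into flattened image features. The names, shapes, and design choices are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
# Minimal multiway-fusion sketch (illustrative; not the authors' code).
import torch
import torch.nn as nn

class MultiwayFusion(nn.Module):
    """Injects reference tokens (language or mask features) into visual
    features via cross-attention. Shapes and names are assumptions."""
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual: torch.Tensor, reference: torch.Tensor) -> torch.Tensor:
        # visual:    (B, H*W, C) flattened image features, used as queries
        # reference: (B, N, C)   reference tokens, used as keys and values;
        #            N = language tokens (RIS/RVOS) or mask tokens (FSS/VOS)
        fused, _ = self.cross_attn(query=visual, key=reference, value=reference)
        return self.norm(visual + fused)  # residual keeps the visual content

fusion = MultiwayFusion(dim=256)
visual = torch.randn(2, 64 * 64, 256)     # backbone features for 2 images
lang_ref = torch.randn(2, 20, 256)        # e.g. text-encoder token embeddings
mask_ref = torch.randn(2, 100, 256)       # e.g. mask-pooled memory features
fused_by_text = fusion(visual, lang_ref)  # same module, language reference
fused_by_mask = fusion(visual, mask_ref)  # same module, mask reference
```

The key property is that a single set of fusion weights serves every reference type, which is what lets the downstream Transformer stay task-agnostic.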

One particularly exciting aspect of UniRef++ is its versatility. By specifying the appropriate reference at run time, the same model can switch flexibly between tasks, effectively serving as a Swiss Army knife for object segmentation.
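
At run time, that versatility could look like a single entry point whose behavior is decided entirely by the reference it receives. The sketch below is hypothetical pseudocode for that dispatch; the method names (encode_language, encode_mask, fuse, decode) are invented for illustration and are not the repository's API.

```python
# Hypothetical run-time dispatch: one network, four tasks. The method names
# here are invented for illustration and are not the repository's API.
def segment(model, frames, reference):
    if isinstance(reference, str):
        ref_tokens = model.encode_language(reference)  # RIS (image) / RVOS (video)
    else:
        ref_tokens = model.encode_mask(reference)      # FSS (support mask) / VOS (first-frame mask)
    return model.decode(model.fuse(frames, ref_tokens))

# masks = segment(model, video, "the dog on the left")  # RVOS
# masks = segment(model, video, first_frame_mask)       # VOS
```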

Performance Benchmarking

UniRef++ does not just excel in theory. Evaluated across a broad range of benchmarks with a single parameter-shared network, it achieves state-of-the-art performance on RIS and RVOS and remains competitive on FSS and VOS.

Incorporating into Existing Models

Another striking feature is the adaptability of the UniFusion module itself. It can be incorporated into existing segmentation foundation models such as SAM, where parameter-efficient finetuning of the added module yields satisfactory results without retraining the full model.
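
As a rough illustration of the parameter-efficient recipe, the sketch below freezes a pretrained SAM and trains only an injected fusion module (reusing the MultiwayFusion sketch from above). The checkpoint path and the exact insertion point are assumptions; consult the paper and repository for the actual setup.

```python
# Parameter-efficient finetuning sketch (assumptions noted inline).
import torch
from segment_anything import sam_model_registry  # pip install segment-anything

# Load a pretrained SAM and freeze all of its weights.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # path is illustrative
for p in sam.parameters():
    p.requires_grad = False

# Only the injected fusion module (the MultiwayFusion sketch above) receives
# gradients, so finetuning touches a small fraction of the total parameters.
fusion = MultiwayFusion(dim=256)
optimizer = torch.optim.AdamW(fusion.parameters(), lr=1e-4)

trainable = sum(p.numel() for p in fusion.parameters())
frozen = sum(p.numel() for p in sam.parameters())
print(f"training {trainable:,} of {trainable + frozen:,} parameters")
```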

In Conclusion

UniRef++ signifies a substantial leap toward unified model architectures in vision-based AI. By collapsing multiple tasks into a single model, it not only fosters synergies across tasks but also reduces the cost and complexity of maintaining separate task-specific systems. Its adaptability and strong performance across the four reference-based segmentation tasks point toward a more holistic, less fragmented future for AI vision.
