General Object Foundation Model for Images and Videos at Scale (2312.09158v1)

Published 14 Dec 2023 in cs.CV

Abstract: We present GLEE in this work, an object-level foundation model for locating and identifying objects in images and videos. Through a unified framework, GLEE accomplishes detection, segmentation, tracking, grounding, and identification of arbitrary objects in the open world scenario for various object perception tasks. Adopting a cohesive learning strategy, GLEE acquires knowledge from diverse data sources with varying supervision levels to formulate general object representations, excelling in zero-shot transfer to new data and tasks. Specifically, we employ an image encoder, text encoder, and visual prompter to handle multi-modal inputs, enabling it to simultaneously solve various object-centric downstream tasks while maintaining state-of-the-art performance. Demonstrated through extensive training on over five million images from diverse benchmarks, GLEE exhibits remarkable versatility and improved generalization performance, efficiently tackling downstream tasks without the need for task-specific adaptation. By integrating large volumes of automatically labeled data, we further enhance its zero-shot generalization capabilities. Additionally, GLEE is capable of being integrated into LLMs, serving as a foundational model to provide universal object-level information for multi-modal tasks. We hope that the versatility and universality of our method will mark a significant step in the development of efficient visual foundation models for AGI systems. The model and code will be released at https://glee-vision.github.io .


Summary

  • The paper introduces GLEE, a unified foundation model that handles object detection, segmentation, and tracking without task-specific adaptations.
  • It employs an integrated architecture combining image and text encoders with visual prompting to address diverse object-centric tasks uniformly.
  • Trained on over five million images, including automatically labeled data, GLEE demonstrates impressive zero-shot generalization, especially in open-world scenarios.

Introduction to General Object Foundation Models

In recent years, the artificial intelligence field has witnessed a surge in the development of foundation models, that is, models that can be applied to a broad spectrum of tasks. While such models have revolutionized NLP, the visual domain presents distinct challenges owing to the variety of task types and the lack of a unified task formulation. Current visual foundation models remain fragmented, tending to specialize in subdomains such as multimodal interaction or image-style representations. Aiming to bridge this gap, a new paradigm has emerged in the form of a general object foundation model dubbed GLEE, which stands for General Language-Enabled Encoder.

Unified Approach for Object Perception

GLEE's architecture unifies multi-task learning through an image encoder, a text encoder, and a visual prompter. It leverages transformers to extract objects from images conditioned on textual and visual input. The framework treats several object-centric tasks, including object detection, instance segmentation, and object tracking, as variants of the same underlying problem, allowing GLEE to dispense with task-specific designs and adaptations while maximizing efficiency.
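
To make the unification concrete, below is a minimal PyTorch-style sketch of how a single query-based decoder can consume image features alongside optional text and visual prompts. Every module, size, and name here is an illustrative assumption, not GLEE's actual implementation:

    import torch
    import torch.nn as nn

    class UnifiedObjectModel(nn.Module):
        def __init__(self, dim=256, num_queries=300, vocab=30522):
            super().__init__()
            # Stand-ins for the real backbone and text encoder
            self.image_encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)
            self.text_encoder = nn.Embedding(vocab, dim)
            self.visual_prompter = nn.Linear(4, dim)  # encodes a box prompt (x, y, w, h)
            layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
            self.decoder = nn.TransformerDecoder(layer, num_layers=6)
            self.queries = nn.Embedding(num_queries, dim)
            self.box_head = nn.Linear(dim, 4)     # shared box head for all tasks
            self.mask_head = nn.Linear(dim, dim)  # mask embeddings, dotted with pixel features

        def forward(self, images, text_tokens=None, visual_prompts=None):
            feats = self.image_encoder(images).flatten(2).transpose(1, 2)  # (B, HW, dim)
            memory = feats
            if text_tokens is not None:     # category names or referring expressions
                memory = torch.cat([memory, self.text_encoder(text_tokens)], dim=1)
            if visual_prompts is not None:  # interactive boxes for segmentation or tracking
                memory = torch.cat([memory, self.visual_prompter(visual_prompts)], dim=1)
            q = self.queries.weight.unsqueeze(0).expand(images.size(0), -1, -1)
            obj = self.decoder(q, memory)   # one object embedding per query, for every task
            boxes = self.box_head(obj).sigmoid()                              # normalized boxes
            mask_logits = torch.einsum("bqd,bpd->bqp", self.mask_head(obj), feats)
            return obj, boxes, mask_logits

Because detection, segmentation, grounding, and tracking all reduce to "produce object embeddings, then read off boxes, masks, or associations," no per-task head swap is needed.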

Learning Strategy and Large-Scale Training

GLEE benefits from a cohesive learning strategy, acquiring knowledge from vast and varied data sources ranging from large-scale detection datasets to richly annotated benchmarks such as Visual Genome. What sets GLEE apart is its ability to scale up training data cheaply by employing automatically labeled data, which markedly enhances the model's zero-shot capabilities and enables transfer to new data and tasks without prior fine-tuning. The extensive training regimen covered over five million images, equipping the model to generalize and perform robustly across diverse benchmarks.
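
One plausible way to handle the varying supervision levels is to mask out losses a given source does not annotate. The hypothetical training step below illustrates that scheme; the stand-in losses (L1 for boxes, BCE for masks, cosine alignment for text) are simplifications of the GIoU, dice/focal, and contrastive losses such models typically use, and the batch keys are assumptions:

    import torch.nn.functional as F

    def training_step(model, batch):
        # Assumes targets are already assigned to queries; real DETR-style
        # pipelines first run Hungarian matching between predictions and GT.
        obj, boxes, mask_logits = model(
            batch["images"],
            text_tokens=batch.get("text_tokens"),
            visual_prompts=batch.get("visual_prompts"),
        )
        loss = F.l1_loss(boxes, batch["gt_boxes"])     # every source provides boxes
        if batch.get("gt_masks") is not None:          # e.g. COCO, LVIS
            loss = loss + F.binary_cross_entropy_with_logits(mask_logits, batch["gt_masks"])
        if batch.get("text_tokens") is not None:       # e.g. RefCOCO, Visual Genome
            text_emb = model.text_encoder(batch["text_tokens"])
            # pull pooled object embeddings toward the pooled text embedding
            loss = loss + (1 - F.cosine_similarity(obj.mean(1), text_emb.mean(1), dim=-1)).mean()
        return loss

Cheap auto-labeled sources (say, boxes only) then contribute to the shared representation through whichever loss terms their annotations support.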

Versatile Performance Across Diverse Tasks

In testing, GLEE demonstrates superior performance, often outshining specialized models on detection and segmentation tasks. Its robustness is especially evident in open-world detection, where it identifies objects from classes unseen during training. GLEE's generalization also extends to video tasks, where it performs exceptionally well without task-specific video training. As added value, it integrates seamlessly into LLMs, contributing object-level visual information that bolsters multi-modal tasks.
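
As a usage sketch, continuing from the hypothetical model above, switching tasks becomes a matter of switching prompts rather than switching models:

    model = UnifiedObjectModel()
    images = torch.randn(1, 3, 256, 256)

    # Open-vocabulary detection: prompt with (tokenized) category names
    obj, boxes, masks = model(images, text_tokens=torch.randint(0, 30522, (1, 8)))

    # Interactive segmentation or tracking: prompt with a box instead of text
    obj, boxes, masks = model(images, visual_prompts=torch.rand(1, 1, 4))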

In sum, GLEE represents a significant stride toward the development of versatile foundation models for visual perception, establishing a robust framework for future AI systems that require comprehensive understanding across modalities and tasks.
