General Object Foundation Model for Images and Videos at Scale (2312.09158v1)

Published 14 Dec 2023 in cs.CV

Abstract: We present GLEE in this work, an object-level foundation model for locating and identifying objects in images and videos. Through a unified framework, GLEE accomplishes detection, segmentation, tracking, grounding, and identification of arbitrary objects in the open world scenario for various object perception tasks. Adopting a cohesive learning strategy, GLEE acquires knowledge from diverse data sources with varying supervision levels to formulate general object representations, excelling in zero-shot transfer to new data and tasks. Specifically, we employ an image encoder, text encoder, and visual prompter to handle multi-modal inputs, enabling it to simultaneously solve various object-centric downstream tasks while maintaining state-of-the-art performance. Demonstrated through extensive training on over five million images from diverse benchmarks, GLEE exhibits remarkable versatility and improved generalization performance, efficiently tackling downstream tasks without the need for task-specific adaptation. By integrating large volumes of automatically labeled data, we further enhance its zero-shot generalization capabilities. Additionally, GLEE is capable of being integrated into LLMs, serving as a foundational model to provide universal object-level information for multi-modal tasks. We hope that the versatility and universality of our method will mark a significant step in the development of efficient visual foundation models for AGI systems. The model and code will be released at https://glee-vision.github.io .


Summary

  • The paper introduces GLEE, a unified foundation model that handles object detection, segmentation, and tracking without task-specific adaptations.
  • It employs an integrated architecture combining image and text encoders with visual prompting to address diverse object-centric tasks uniformly.
  • Trained on over five million images, including automatically labeled data, GLEE demonstrates impressive zero-shot generalization, especially in open-world scenarios.

Introduction to General Object Foundation Models

In recent years, the artificial intelligence field has witnessed a surge in the development of foundation models, that is, models that can be applied to a broad spectrum of tasks. While such models have revolutionized NLP, the visual domain presents distinct challenges owing to the variety of task types and the lack of a unified task formulation. Current visual foundation models remain fragmented, tending to specialize in subdomains such as multimodal interaction or image-style representations. Aiming to bridge this gap, a new paradigm has emerged in the form of a general object foundation model dubbed GLEE, which stands for General Language-Enabled Encoder.

Unified Approach for Object Perception

GLEE's architecture unifies multi-task learning through an image encoder, a text encoder, and a visual prompter. It leverages transformers to extract objects from images conditioned on textual and visual input. The framework treats several object-centric tasks, including object detection, instance segmentation, and object tracking, as variants of the same underlying problem, allowing GLEE to dispense with task-specific designs and adaptations while maximizing efficiency.
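
To make the unification concrete, below is a minimal PyTorch-style sketch of how a single query-based decoder can consume image features alongside optional text and visual prompts. Every module, size, and name here is an illustrative assumption, not GLEE's actual implementation:

    import torch
    import torch.nn as nn

    class UnifiedObjectModel(nn.Module):
        def __init__(self, dim=256, num_queries=300, vocab=30522):
            super().__init__()
            # Stand-ins for the real backbone and text encoder
            self.image_encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)
            self.text_encoder = nn.Embedding(vocab, dim)
            self.visual_prompter = nn.Linear(4, dim)  # encodes a box prompt (x, y, w, h)
            layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
            self.decoder = nn.TransformerDecoder(layer, num_layers=6)
            self.queries = nn.Embedding(num_queries, dim)
            self.box_head = nn.Linear(dim, 4)     # shared box head for all tasks
            self.mask_head = nn.Linear(dim, dim)  # mask embeddings, dotted with pixel features

        def forward(self, images, text_tokens=None, visual_prompts=None):
            feats = self.image_encoder(images).flatten(2).transpose(1, 2)  # (B, HW, dim)
            memory = feats
            if text_tokens is not None:     # category names or referring expressions
                memory = torch.cat([memory, self.text_encoder(text_tokens)], dim=1)
            if visual_prompts is not None:  # interactive boxes for segmentation or tracking
                memory = torch.cat([memory, self.visual_prompter(visual_prompts)], dim=1)
            q = self.queries.weight.unsqueeze(0).expand(images.size(0), -1, -1)
            obj = self.decoder(q, memory)   # one object embedding per query, for every task
            boxes = self.box_head(obj).sigmoid()                              # normalized boxes
            mask_logits = torch.einsum("bqd,bpd->bqp", self.mask_head(obj), feats)
            return obj, boxes, mask_logits

Because detection, segmentation, grounding, and tracking all reduce to "produce object embeddings, then read off boxes, masks, or associations," no per-task head swap is needed.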

Learning Strategy and Large-Scale Training

GLEE benefits from a cohesive learning strategy, acquiring knowledge from vast and varied data sources ranging from large-scale detection datasets to richly annotated benchmarks such as Visual Genome. What sets GLEE apart is its ability to scale up training data cheaply by employing automatically labeled data, which markedly enhances the model's zero-shot capabilities and enables transfer to new data and tasks without prior fine-tuning. The extensive training regimen covered over five million images, equipping the model to generalize and perform robustly across diverse benchmarks.
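
One plausible way to handle the varying supervision levels is to mask out losses a given source does not annotate. The hypothetical training step below illustrates that scheme; the stand-in losses (L1 for boxes, BCE for masks, cosine alignment for text) are simplifications of the GIoU, dice/focal, and contrastive losses such models typically use, and the batch keys are assumptions:

    import torch.nn.functional as F

    def training_step(model, batch):
        # Assumes targets are already assigned to queries; real DETR-style
        # pipelines first run Hungarian matching between predictions and GT.
        obj, boxes, mask_logits = model(
            batch["images"],
            text_tokens=batch.get("text_tokens"),
            visual_prompts=batch.get("visual_prompts"),
        )
        loss = F.l1_loss(boxes, batch["gt_boxes"])     # every source provides boxes
        if batch.get("gt_masks") is not None:          # e.g. COCO, LVIS
            loss = loss + F.binary_cross_entropy_with_logits(mask_logits, batch["gt_masks"])
        if batch.get("text_tokens") is not None:       # e.g. RefCOCO, Visual Genome
            text_emb = model.text_encoder(batch["text_tokens"])
            # pull pooled object embeddings toward the pooled text embedding
            loss = loss + (1 - F.cosine_similarity(obj.mean(1), text_emb.mean(1), dim=-1)).mean()
        return loss

Cheap auto-labeled sources (say, boxes only) then contribute to the shared representation through whichever loss terms their annotations support.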

Versatile Performance Across Diverse Tasks

In testing, GLEE demonstrates superior performance, often outshining specialized models on detection and segmentation tasks. Its robustness is especially evident in open-world detection, where it identifies objects from classes unseen during training. GLEE's generalization also extends to video tasks, where it performs exceptionally well without task-specific video training. As added value, it integrates seamlessly into LLMs, contributing object-level visual information that bolsters multi-modal tasks.
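
As a usage sketch, continuing from the hypothetical model above, switching tasks becomes a matter of switching prompts rather than switching models:

    model = UnifiedObjectModel()
    images = torch.randn(1, 3, 256, 256)

    # Open-vocabulary detection: prompt with (tokenized) category names
    obj, boxes, masks = model(images, text_tokens=torch.randint(0, 30522, (1, 8)))

    # Interactive segmentation or tracking: prompt with a box instead of text
    obj, boxes, masks = model(images, visual_prompts=torch.rand(1, 1, 4))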

In sum, GLEE represents a significant stride toward the development of versatile foundation models for visual perception, establishing a robust framework for future AI systems that require comprehensive understanding across modalities and tasks.
