GiT: Towards Generalist Vision Transformer through Universal Language Interface
The paper "GiT: Towards Generalist Vision Transformer through Universal Language Interface" introduces a novel framework, referred to as GiT. The proposed framework leverages a vision transformer (ViT) structure enhanced by a universal language interface to seamlessly perform an ensemble of visual perception tasks within a unified architecture. This approach aims to bridge the architectural divide between vision and LLMs, emulating the successful multi-task capabilities demonstrated by LLMs such as GPT and BERT.
Framework and Novel Contributions
The core idea behind GiT is its use of a plain ViT, devoid of task-specific modules such as bounding box heads or pixel decoders, which are typically required for object detection and segmentation. Instead, GiT introduces a universal language interface that casts various visual tasks as auto-regressive decoding, a strategy widely successful in LLMs.
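To make the interface concrete, the following is a minimal sketch, in plain Python, of the core idea: every task reduces to conditioning on a task prompt and auto-regressively emitting a target sequence over one shared vocabulary. The token names and ids here are illustrative assumptions, not the paper's actual vocabulary.

```python
# Minimal sketch of "every task is prompt -> auto-regressive target".
# Token names/ids (<caption>, <detection>, <eos>) are illustrative assumptions.

VOCAB = {"<caption>": 0, "<detection>": 1, "<eos>": 2}  # task identifiers + specials

def build_sequence(task, answer_tokens):
    """Return (prompt, target): the model conditions on the prompt tokens and
    is trained to emit the target tokens one by one, whatever the task is."""
    prompt = [VOCAB[f"<{task}>"]]              # task identifier token
    target = answer_tokens + [VOCAB["<eos>"]]  # same stop convention everywhere
    return prompt, target

# Captioning: the answer is ordinary text tokens.
print(build_sequence("caption", [101, 57, 998]))
# Detection: the answer is quantized box coordinates plus a class token
# (see the coordinate-tokenization sketch further below).
print(build_sequence("detection", [503, 488, 742, 691, 42]))
```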
Key Design Elements:
- Unified Input and Output Representation: GiT standardizes diverse input and output modalities (e.g., images, text, coordinates) as token sequences over a single shared vocabulary, so that even structured outputs such as bounding box coordinates become ordinary tokens, with multi-piece concepts compressed into single tokens where possible (see the first sketch after this list).
- Parallel Decoding Multi-Task Template: The framework uses grid sampling to partition visual tasks into many local subproblems, each processed with local image tokens and task-specific identifiers and decoded in parallel (see the second sketch after this list). This parallel decoding improves efficiency and extends the model to dense prediction tasks such as object detection and semantic segmentation.
- Multi-Layer Transformer Model: GiT employs a multi-layer transformer architecture akin to those in LLMs, so that nearly all shared computation is devoted to task-agnostic processing, with only a small fraction dedicated to embedding the diverse input modalities.
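A minimal sketch of the unified representation for coordinates, assuming GiT follows the common Pix2Seq-style practice of quantizing continuous values into a fixed number of bins so that a bounding box becomes four ordinary vocabulary tokens. The bin count and vocabulary offset below are illustrative assumptions, not values from the paper.

```python
# Coordinate-tokenization sketch: continuous coordinates -> discrete tokens.
# NUM_BINS and COORD_OFFSET are illustrative assumptions, not the paper's values.

NUM_BINS = 1000          # resolution of the coordinate grid (assumption)
COORD_OFFSET = 32_000    # where coordinate tokens start in the shared vocab (assumption)

def coord_to_token(x, image_size):
    """Map a continuous coordinate in [0, image_size] to a discrete token id."""
    bin_idx = min(int(x / image_size * NUM_BINS), NUM_BINS - 1)
    return COORD_OFFSET + bin_idx

def token_to_coord(token, image_size):
    """Invert the mapping (up to quantization error)."""
    bin_idx = token - COORD_OFFSET
    return (bin_idx + 0.5) / NUM_BINS * image_size

# A bounding box (x1, y1, x2, y2) on a 640px image becomes four tokens:
box = (120.0, 64.5, 480.0, 300.0)
tokens = [coord_to_token(v, 640) for v in box]
print(tokens)                                          # [32187, 32100, 32750, 32468]
print([round(token_to_coord(t, 640), 1) for t in tokens])
```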
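And a minimal sketch of grid-sampled parallel decoding, assuming a dense task is split into per-grid-point subproblems whose short token sequences are decoded together as one batch. The toy `decode_step` stands in for the shared transformer; shapes and sampling are illustrative, not the paper's implementation.

```python
# Grid-sampled parallel decoding sketch: every grid point is its own
# subproblem, and all points decode their short sequences as one batch.
import torch

def parallel_grid_decode(image_feats, grid, steps, decode_step):
    """image_feats: (H, W, C) patch features. Each of the grid*grid points
    decodes its own short token sequence; all points run in parallel."""
    H, W, C = image_feats.shape
    ys = torch.linspace(0, H - 1, grid).long()
    xs = torch.linspace(0, W - 1, grid).long()
    # Gather one local feature per grid point -> a batch of subproblems.
    points = image_feats[ys][:, xs].reshape(-1, C)          # (grid*grid, C)
    tokens = torch.zeros(points.shape[0], 0, dtype=torch.long)
    for _ in range(steps):                                  # short per-point sequences
        next_tok = decode_step(points, tokens)              # (grid*grid,) greedy step
        tokens = torch.cat([tokens, next_tok[:, None]], dim=1)
    return tokens.reshape(grid, grid, steps)                # a token map over the grid

# Toy stand-in for the shared transformer's one-step greedy prediction:
toy_step = lambda feats, toks: feats.sum(-1).long() % 10
out = parallel_grid_decode(torch.randn(14, 14, 256), grid=4, steps=3,
                           decode_step=toy_step)
print(out.shape)  # torch.Size([4, 4, 3])
```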
Training and Evaluation
Multi-task Learning:
GiT is trained jointly on five representative benchmarks covering image captioning, object detection, instance segmentation, semantic segmentation, and visual grounding. The framework simplifies the training pipeline by consolidating these tasks into a unified model with a single objective, without task-specific fine-tuning (a toy version of this loop is sketched below). The results suggest that GiT gains significantly from sharing parameters and representations across tasks.
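A toy version of such a unified training loop might look as follows. The stand-in model, fake loaders, and uniform task sampling are assumptions for illustration (image tokens are omitted for brevity); the key point, one model and one next-token cross-entropy loss for every task, reflects the paper's setup.

```python
# Toy unified multi-task training loop: every task yields (prompt, target)
# token batches, so one model and one loss cover all of them.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 100  # toy vocabulary size (assumption)

class ToyGeneralist(nn.Module):
    """Stand-in for the shared multi-layer transformer (illustrative only)."""
    def __init__(self, dim=32):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, dim)
        self.head = nn.Linear(dim, VOCAB)
    def forward(self, prompt, target_in):
        x = self.embed(torch.cat([prompt, target_in], dim=1))
        # Keep only the positions whose next token is a target token.
        return self.head(x)[:, prompt.shape[1] - 1:]

model = ToyGeneralist()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

def fake_loader(task_id):
    """Yields (prompt, target) batches already in the shared token format."""
    while True:
        yield torch.full((2, 1), task_id), torch.randint(0, VOCAB, (2, 6))

loaders = {name: fake_loader(i) for i, name in
           enumerate(["caption", "detect", "insseg", "semseg", "ground"])}

for step in range(3):
    task = random.choice(list(loaders))        # uniform task sampling (assumption)
    prompt, target = next(loaders[task])
    logits = model(prompt, target[:, :-1])     # teacher forcing
    loss = F.cross_entropy(logits.reshape(-1, VOCAB), target.reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
    print(step, task, round(loss.item(), 3))
```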
Scaling and Universality:
To improve generalization, GiT is further trained on 27 datasets from varied sources, demonstrating strong zero-shot and few-shot performance across multiple tasks and domains. Evaluations against established single-task models and other generalist models show competitive or superior performance, highlighting its broad applicability and the effectiveness of its simple multi-layer transformer design.
Implications and Future Directions
Practical Implications:
GiT's architecture reduces the complexity of deploying vision models by removing the need to maintain task-specific modules and training pipelines. This makes it easier to scale the model and adapt it to new tasks with minimal architectural changes. The approach also underscores the potential of universal sequence models to tackle a wide array of vision tasks, simplifying model deployment in practical AI applications.
Theoretical Implications:
The successful application of a universal language interface and multi-layer transformer architecture to vision tasks advances the understanding of architectural unification in AI models. It demonstrates that the architectural principles underlying successful LLMs can be effectively transferred to visual models, aiding in the development of more versatile and flexible AI systems.
Future Research:
Future developments may extend GiT's training to even more comprehensive datasets, potentially spanning both vision and language, to enhance task-level zero-shot capabilities. Additionally, investigating optimized inference techniques for auto-regressive models, such as key-value caching (sketched below), could mitigate latency and make such models more practical for real-time applications.
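For instance, key-value caching is a standard way to cut per-token decoding cost: each new token attends to cached keys and values instead of recomputing attention over the whole prefix. The single-head attention below is a simplified illustration of the general technique, not GiT's inference code.

```python
# KV-cache greedy decoding sketch (single head, no layers/MLP, for clarity).
import torch
import torch.nn.functional as F

def cached_greedy_decode(embed, wq, wk, wv, head, prompt, steps):
    """embed: (vocab, dim) embeddings; wq/wk/wv: (dim, dim); head: (dim, vocab)."""
    k_cache, v_cache = [], []
    tokens = list(prompt)
    x = embed[torch.tensor(tokens)]                 # (P, dim) prefill
    k_cache.append(x @ wk); v_cache.append(x @ wv)
    for _ in range(steps):
        q = (embed[tokens[-1]] @ wq)[None]          # only the newest token's query
        K = torch.cat(k_cache); V = torch.cat(v_cache)
        attn = F.softmax(q @ K.T / K.shape[-1] ** 0.5, dim=-1)
        logits = (attn @ V) @ head                  # (1, vocab)
        nxt = int(logits.argmax())
        tokens.append(nxt)
        x_new = embed[torch.tensor([nxt])]          # cache grows by one entry
        k_cache.append(x_new @ wk); v_cache.append(x_new @ wv)
    return tokens

dim, vocab = 16, 50
g = torch.Generator().manual_seed(0)
out = cached_greedy_decode(torch.randn(vocab, dim, generator=g),
                           torch.randn(dim, dim, generator=g),
                           torch.randn(dim, dim, generator=g),
                           torch.randn(dim, dim, generator=g),
                           torch.randn(dim, vocab, generator=g),
                           prompt=[1, 2, 3], steps=5)
print(out)
```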
In sum, GiT sets a precedent for versatile visual modeling through a simplified yet powerful architecture, reinforcing the growing convergence of methodologies in the vision and language fields. This research opens avenues for continued innovation in creating generalized AI models capable of mastering diverse tasks with streamlined architectures.