Insights into a Multi-Modal Multi-Task Learning System for Building Generalist Models
The paper presents OFASys, a well-defined system that addresses pressing needs in multi-modal multi-task learning. It systematically tackles the challenges of scaling across diverse modalities and tasks, which are critical to advancing generalist models, a step toward artificial general intelligence (AGI).
Core Contributions
OFASys is designed to facilitate rapid experimental setup and to scale across modalities such as text, image, video, audio, and motion. Its architectural novelty lies in separating task representation from model implementation. This decoupling is achieved through a declarative task interface expressed in natural language, which defines a task using slots that map data of various modalities into model-ready representations. This design lets researchers rapidly compose and redefine tasks without changing the underlying model structures, providing a high degree of flexibility and adaptability.
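To make the slot-based interface concrete, the following is a minimal sketch of how such a declarative instruction might be parsed. The instruction syntax (`[MODALITY:name]` slots, `->` separating input from target) is an illustrative assumption, not a verbatim reproduction of the system's grammar:

```python
import re

# Hypothetical OFASys-style declarative instruction: bracketed slots such as
# [IMAGE:img] or [TEXT:cap] mark where data of a given modality is bound,
# and "->" separates the input side from the target side.
SLOT_PATTERN = re.compile(r"\[([A-Z]+):(\w+)\]")

def parse_instruction(instruction: str) -> dict:
    """Split a declarative task instruction into input/target slot lists."""
    input_part, _, target_part = instruction.partition("->")
    def slots(part: str) -> list:
        return [{"modality": m, "name": n} for m, n in SLOT_PATTERN.findall(part)]
    return {"inputs": slots(input_part), "targets": slots(target_part)}

# An image-captioning task, composed as a single string:
caption_task = "[IMAGE:img] what does the image describe? -> [TEXT:cap]"
plan = parse_instruction(caption_task)
```

Because the task lives entirely in this string, redefining it (say, swapping `IMAGE` for `VIDEO`) requires no change to the model code, which is the flexibility the decoupling is meant to buy.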
System Design
The paper's central focus is the modularity of OFASys, which allows tasks to be composed from existing or custom data-processing pipelines. It achieves this through a hierarchy of components: modality-specific preprocessors and adapters transform raw data into a unified format compatible with the universal model.
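A minimal sketch of that hierarchy, assuming a registry that maps each modality to a preprocessor producing a common intermediate form (the function names and the toy tokenization/patching are illustrative, not the system's actual implementation):

```python
# Hypothetical registry: one preprocessor per modality, each emitting a
# unified {"modality", "units"} record the universal model can consume.
PREPROCESSORS = {}

def register(modality: str):
    def deco(fn):
        PREPROCESSORS[modality] = fn
        return fn
    return deco

@register("TEXT")
def preprocess_text(raw: str) -> dict:
    # Toy whitespace tokenizer; the real system would use a subword vocabulary.
    return {"modality": "TEXT", "units": raw.lower().split()}

@register("IMAGE")
def preprocess_image(raw: str) -> dict:
    # Toy fixed-size chunking standing in for patch embeddings.
    return {"modality": "IMAGE", "units": [raw[i:i + 2] for i in range(0, len(raw), 2)]}

def to_unified(modality: str, raw) -> dict:
    """Dispatch raw data to its modality's preprocessor."""
    return PREPROCESSORS[modality](raw)
```

Adding a new modality then means registering one more preprocessor, leaving the universal model and all existing tasks untouched.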
The inclusion of a universal computing engine is a critical element of the system's design. The engine can be configured to be modality-agnostic and to handle various training objectives and modes of learning, including sequence-to-sequence generation and diffusion models. Moreover, OFASys supports both supervised and multi-modal multi-task training, broadening its application scope.
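One way to picture such an engine is as a per-task configuration that selects the learning mode and its objective. The mode names and criterion strings below are assumptions for illustration, not the paper's API:

```python
from dataclasses import dataclass

@dataclass
class TaskConfig:
    """Hypothetical per-task configuration consumed by a universal engine."""
    name: str
    mode: str   # assumed mode labels: "seq2seq" or "diffusion"

def build_criterion(cfg: TaskConfig) -> str:
    # Dispatch on learning mode so seq2seq generation and diffusion-based
    # generation can coexist in one training run over the same model.
    if cfg.mode == "seq2seq":
        return f"cross_entropy over target tokens for {cfg.name}"
    if cfg.mode == "diffusion":
        return f"denoising objective for {cfg.name}"
    raise ValueError(f"unknown mode: {cfg.mode}")
```

The point of the sketch is the dispatch itself: tasks with different generative paradigms share one engine, differing only in configuration.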
Empirical Validation
To validate the effectiveness of OFASys, the authors train a series of models collectively termed OFA+. Notably, the generalist model, OFA+ (Generalist MoE), which incorporates a sparsely activated MoE design within the universal model, shows promising results across an extensive set of tasks spanning seven modalities. Its performance relative to specialized models is noteworthy: it achieves 95% of their performance while using only 16% of the parameters, suggesting efficient parameter utilization and task-independent learning capabilities. The empirical results, as detailed in the paper, highlight the adaptability and scalability of OFASys, making it an inspiring foundation for future generalist AI systems.
Implications and Future Directions
Practically, OFASys could significantly lower the barriers in multi-task learning research, promoting more agile and comprehensive exploration into generalist models. Theoretically, the separation between task formulation and computation holds significant promise for enhancing the versatility of large-scale AI models.
The authors' intent to contribute to the open-source ecosystem is commendable, and they effectively underscore the versatility of OFASys by providing presets across modalities and tasks. This strategy paves the way for rapid prototyping and reduces entry barriers for new research in multi-modal domains. Future avenues could explore integrating emerging architectures like Transformer-based diffusion models and expanding the system's coverage to more complex multi-step reasoning tasks.
In conclusion, the paper successfully delineates a comprehensive framework with OFASys, contributing to critical advancements in multi-modal multi-task learning systems. It underscores the potential for building more inclusive and generalist AI models, pushing the boundaries of what current AI systems can achieve.