UnIVAL: Unified Model for Image, Video, Audio and Language Tasks (2307.16184v2)

Published 30 Jul 2023 in cs.CV, cs.LG, cs.MM, cs.SD, and eess.AS

Abstract: LLMs have made the ambitious quest for generalist agents significantly far from being a fantasy. A key hurdle for building such general models is the diversity and heterogeneity of tasks and modalities. A promising solution is unification, allowing the support of a myriad of tasks and modalities within one unified framework. While a few large models (e.g., Flamingo (Alayrac et al., 2022)), trained on massive datasets, can support more than two modalities, current small- to mid-scale unified models are still limited to two modalities, usually image-text or video-text. The question that we ask is: is it possible to efficiently build a unified model that can support all modalities? To answer this, we propose UnIVAL, a step further towards this ambitious goal. Without relying on fancy dataset sizes or models with billions of parameters, the ~0.25B parameter UnIVAL model goes beyond two modalities and unifies text, images, video, and audio into a single model. Our model is efficiently pretrained on many tasks, based on task balancing and multimodal curriculum learning. UnIVAL shows competitive performance to existing state-of-the-art approaches across image and video-text tasks. The feature representations learned from image and video-text modalities allow the model to achieve competitive performance when finetuned on audio-text tasks, despite not being pretrained on audio. Thanks to the unified model, we propose a novel study on multimodal model merging via weight interpolation of models trained on different multimodal tasks, showing their benefits in particular for out-of-distribution generalization. Finally, we motivate unification by showing the synergy between tasks. The model weights and code are released here: https://github.com/mshukor/UnIVAL.

A Unified Model for Diverse Multimodal Tasks

This paper proposes a unified model designed to address tasks across multiple modalities, including image, video, audio, and text. The aim is to tame the diversity and heterogeneity of multimodal tasks, which are typically handled by separate, task-specific models. The work contributes to the burgeoning field of unified models, with a particular focus on scalability and efficiency.

Model Architecture and Methodology

The model employs a sequence-to-sequence neural architecture specifically designed to handle the representation and transformation of different modalities into a unified token-based input format. Notably, it leverages a relatively moderate model size of approximately 0.25 billion parameters, significantly smaller than many existing multi-billion parameter models in the field. This reduction in size is achieved without sacrificing the model’s ability to handle multiple modalities, an important consideration for resource-constrained environments.
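
To make this unification concrete, every task is expressed as conditional text generation over a shared token vocabulary. The sketch below shows how different tasks could be framed as (prompt, target) pairs in such a format; the prompt wording and the discretized location tokens are illustrative assumptions, not the paper's exact templates.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UnifiedExample:
    """One training instance in a shared sequence-to-sequence format.
    Prompt phrasing and the <loc_i> bin tokens are illustrative assumptions."""
    modality_input: Optional[str]   # path / id of the image, video, or audio clip
    prompt: str                     # task instruction, expressed as plain text
    target: str                     # answer, caption, or discretized box tokens

examples = [
    UnifiedExample("img_001.jpg", "What does the image describe?", "A dog chasing a ball."),
    UnifiedExample("img_002.jpg", "Which region does 'the red car' describe?",
                   "<loc_12> <loc_48> <loc_301> <loc_407>"),
    UnifiedExample("vid_007.mp4", "What does the video describe?", "A person slicing vegetables."),
    UnifiedExample("aud_003.wav", "What does the audio describe?", "Rain falling on a metal roof."),
]
```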

Key to the architecture is a linear connection layer that tokenizes non-textual data, such as images and audio, by mapping encoder features into the shared input space of a pretrained LLM, which forms the core of the system's processing. The model is trained with a next-token prediction objective, allowing it to both understand and generate coherent language outputs across diverse tasks.
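
As a rough illustration of this design, the sketch below projects features from a modality encoder into the embedding space of a text decoder through a single linear layer; the module names, dimensions, and feature shapes are assumptions made for the example, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class LinearModalityConnector(nn.Module):
    """Map features from a modality encoder into the token-embedding space of a
    sequence-to-sequence language model. Dimensions are illustrative assumptions."""

    def __init__(self, feature_dim: int = 1024, lm_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(feature_dim, lm_dim)

    def forward(self, modality_features: torch.Tensor) -> torch.Tensor:
        # modality_features: (batch, num_patches_or_frames, feature_dim)
        # returns pseudo-token embeddings: (batch, seq_len, lm_dim)
        return self.proj(modality_features)

# Example: image patch features become "visual tokens" prepended to text embeddings.
connector = LinearModalityConnector()
image_feats = torch.randn(2, 49, 1024)   # e.g. a 7x7 grid of patch features
text_embeds = torch.randn(2, 20, 768)    # embedded text prompt tokens
visual_tokens = connector(image_feats)
lm_input = torch.cat([visual_tokens, text_embeds], dim=1)  # unified token sequence
```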

Performance Evaluation

The paper reports competitive performance across a suite of standard benchmarks. On visual grounding, the model achieves state-of-the-art results on the RefCOCO, RefCOCO+, and RefCOCOg datasets. On vision-language evaluations such as VQAv2 (visual question answering) and image captioning on MSCOCO, it rivals, and often outperforms, existing approaches trained on larger datasets.

Innovative Contributions

Among the novel contributions of this paper is the demonstration of multimodal curriculum learning, a systematic way to incorporate multiple modalities into training efficiently. Modalities are introduced incrementally, so the model never has to process all data types at once; this curtails computational cost and, as the results show, can improve generalization to new or less familiar modalities. A minimal sketch of such a staged schedule follows.
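
The following sketch illustrates the idea of staged multimodal training under assumed dataset names and a generic `train_one_stage` helper; the actual schedule, data mixture, and task-balancing weights used by UnIVAL differ.

```python
# Illustrative curriculum: start from text/image tasks, then add video,
# reusing the weights from the previous stage each time.
# Dataset names and the train_one_stage() helper are hypothetical.

from typing import Dict, List

CURRICULUM: List[Dict] = [
    {"stage": 1, "modalities": ["text", "image"],
     "datasets": ["image_caption", "vqa"]},
    {"stage": 2, "modalities": ["text", "image", "video"],
     "datasets": ["image_caption", "vqa", "video_caption", "video_qa"]},
]

def run_curriculum(model, curriculum, train_one_stage):
    for stage in curriculum:
        print(f"Stage {stage['stage']}: training on {stage['modalities']}")
        # Each stage initializes from the previous stage's weights and only
        # adds the new modality's data, instead of training on everything at once.
        model = train_one_stage(model, datasets=stage["datasets"])
    return model
```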

Moreover, the paper explores downstream task adaptation via weight interpolation, merging the expertise of models fine-tuned on different multimodal tasks by averaging their parameters. Because the finetuned models share a single unified architecture and pretrained initialization, their weights can be interpolated directly, which proves particularly useful for out-of-distribution generalization; a sketch of this merging operation appears below.
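
As a minimal sketch (assuming two checkpoints with identical architectures, finetuned from the same pretrained weights), uniform linear interpolation of parameters looks like the following; the interpolation coefficient and the checkpoint names in the commented usage are placeholders.

```python
import torch

def interpolate_weights(state_dict_a: dict, state_dict_b: dict, alpha: float = 0.5) -> dict:
    """Linearly interpolate two compatible checkpoints: (1 - alpha) * A + alpha * B.
    Both models must share the same architecture and a common pretrained
    initialization for the merge to be meaningful."""
    merged = {}
    for name, param_a in state_dict_a.items():
        param_b = state_dict_b[name]
        if param_a.dtype.is_floating_point:
            merged[name] = (1.0 - alpha) * param_a + alpha * param_b
        else:
            merged[name] = param_a  # keep integer buffers (e.g. position ids) from A
    return merged

# Hypothetical usage: merge a captioning-finetuned and a VQA-finetuned checkpoint.
# caption_ckpt = torch.load("unival_caption.pt")
# vqa_ckpt = torch.load("unival_vqa.pt")
# model.load_state_dict(interpolate_weights(caption_ckpt, vqa_ckpt, alpha=0.5))
```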

Implications and Future Directions

The implications of this research extend both practically and theoretically. Practically, it informs the development of generalist agents capable of performing diverse multimodal tasks—an encouraging step towards more sophisticated and versatile AI applications. Theoretically, it sparks discourse on the trade-offs between model size, efficiency, and performance, emphasizing the viability of streamlined models for robust task execution.

Future research directions suggested by the paper include scaling the model while extending the unification strategy to more complex data interactions and tasks. Reducing hallucinations and improving the handling of complex instructions also remain important challenges. Exploring additional training and curriculum strategies may further improve generalization, especially to new or unseen modalities.

In sum, this paper advances the quest for a unified multimodal model, offering insights into efficient training, effective integration of diverse data types, and flexible task adaptation that are relevant to both academic inquiry and real-world application development.

References (163)
  1. nocaps: novel object captioning at scale. In ICCV, 2019.
  2. Git re-basin: Merging models modulo permutation symmetries. In ICLR, 2022.
  3. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
  4. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  1728–1738, 2021.
  5. Let there be a clock on the beach: Reducing object hallucination in image captioning. In Winter Conference on Applications of Computer Vision, 2022.
  6. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  7. Unsupervised learning of visual features by contrasting cluster assignments. Advances in neural information processing systems, 33:9912–9924, 2020.
  8. Rich Caruana. Multitask learning. Machine Learning, 28:41–75, 1997.
  9. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  3558–3568, 2021.
  10. Audio captioning based on transformer and pre-trained cnn. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2020 Workshop (DCASE2020), pp.  21–25, Tokyo, Japan, November 2020a.
  11. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp. 1597–1607. PMLR, 2020b.
  12. A unified sequence interface for vision tasks. Advances in Neural Information Processing Systems, 35:31333–31346, 2022a.
  13. Pali: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794, 2022b.
  14. Uniter: Universal image-text representation learning. In European conference on computer vision, pp.  104–120. Springer, 2020c.
  15. Vindlu: A recipe for effective video-and-language pretraining. arXiv preprint arXiv:2212.05051, 2022.
  16. Unifying vision-and-language tasks via text generation. In International Conference on Machine Learning, pp. 1931–1942. PMLR, 2021.
  17. Fusing finetuned models for better pretraining. arXiv preprint arXiv:2204.03044, 2022.
  18. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
  19. Seasoning model soups for robustness to adversarial and natural distribution shifts. In CVPR, 2023.
  20. Elastic weight removal for faithful and abstractive dialogue generation. arXiv preprint arXiv:2303.17574, 2023.
  21. Plausible may not be faithful: Probing object hallucination in vision-language pre-training. arXiv preprint arXiv:2210.07688, 2022.
  22. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023a.
  23. Plausible may not be faithful: Probing object hallucination in vision-language pre-training. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pp.  2136–2148, Dubrovnik, Croatia, May 2023b. Association for Computational Linguistics. URL https://aclanthology.org/2023.eacl-main.156.
  24. Improving selective visual question answering by learning from your peers. In CVPR, 2023.
  25. Scaling vision transformers to 22 billion parameters. arXiv preprint arXiv:2302.05442, 2023.
  26. Cogview: Mastering text-to-image generation via transformers. NeurIPS, 2021.
  27. ColD fusion: Collaborative descent for distributed multitask finetuning. In ACL, 2023.
  28. A survey for in-context learning. arXiv preprint arXiv:2301.00234, 2022.
  29. An empirical study of training end-to-end vision-and-language transformers. arXiv preprint arXiv:2111.02387, 2021.
  30. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023.
  31. Clotho: An audio captioning dataset. In ICASSP, 2020.
  32. Magma–multimodal augmentation of generative models through adapter-based finetuning. arXiv preprint arXiv:2112.05253, 2021.
  33. The role of permutation invariance in linear mode connectivity of neural networks. In ICLR, 2022.
  34. Audio captioning based on combined audio and semantic embeddings. In 2020 IEEE International Symposium on Multimedia (ISM), pp. 41–48, 2020. doi: 10.1109/ISM.2020.00014.
  35. Linear mode connectivity and the lottery ticket hypothesis. In ICML, 2020.
  36. Violet: End-to-end video-language transformers with masked visual-token modeling. arXiv preprint arXiv:2111.12681, 2021.
  37. Large-scale adversarial training for vision-and-language representation learning. Advances in Neural Information Processing Systems, 33:6616–6628, 2020.
  38. Audio set: An ontology and human-labeled dataset for audio events. In Proc. IEEE ICASSP 2017, New Orleans, LA, 2017.
  39. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  6904–6913, 2017.
  40. Vizwiz grand challenge: Answering visual questions from blind people. In CVPR, 2018.
  41. Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet? In CVPR, 2018a.
  42. Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet? In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp.  6546–6555, 2018b.
  43. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  16000–16009, 2022.
  44. Gans trained by a two time-scale update rule converge to a local nash equilibrium. NeurIPS, 2017.
  45. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
  46. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  47. Scaling up vision-language pre-training for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  17980–17989, 2022.
  48. Language is not all you need: Aligning perception with language models. arXiv preprint arXiv:2302.14045, 2023.
  49. Unifying multimodal transformer for bi-directional image and text generation. In ICM, 2021.
  50. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  6700–6709, 2019.
  51. A better use of audio-visual cues: Dense video captioning with bi-modal transformer. arXiv preprint arXiv:2005.08271, 2020a.
  52. Multi-modal dense video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp.  958–959, 2020b.
  53. Patching open-vocabulary models by interpolating weights. In NeurIPS, 2022.
  54. Editing models with task arithmetic. In ICLR, 2023.
  55. Averaging weights leads to wider optima and better generalization. In UAI, 2018.
  56. Exploring the benefits of training expert language models over instruction tuning. arXiv preprint arXiv:2302.03202, 2023.
  57. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pp. 4904–4916. PMLR, 2021.
  58. REPAIR: Renormalizing permuted activations for interpolation repair. In ICLR, 2023.
  59. Mdetr-modulated detection for end-to-end multi-modal understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  1780–1790, 2021.
  60. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
  61. Audiocaps: Generating captions for audios in the wild. In NAACL-HLT, 2019a.
  62. AudioCaps: Generating captions for audios in the wild. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp.  119–132, Minneapolis, Minnesota, June 2019b. Association for Computational Linguistics. doi: 10.18653/v1/N19-1011. URL https://aclanthology.org/N19-1011.
  63. Vilt: Vision-and-language transformer without convolution or region supervision. In International Conference on Machine Learning, pp. 5583–5594. PMLR, 2021.
  64. Automated audio captioning using transfer learning and reconstruction latent space similarity regularization. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.  7722–7726. IEEE, 2022.
  65. Grounding language models to images for multimodal generation. arXiv preprint arXiv:2301.13823, 2023.
  66. A transformer-based audio captioning model with keyword estimation. Proc. Interspeech 2020, pp.  1977–1981, 2020.
  67. Panns: Large-scale pretrained audio neural networks for audio pattern recognition. ACM, 2020.
  68. Dense-captioning events in videos. In Proceedings of the IEEE international conference on computer vision, pp.  706–715, 2017a.
  69. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123(1):32–73, 2017b.
  70. Simple and scalable predictive uncertainty estimation using deep ensembles. In NeurIPS, 2017.
  71. Less is more: Clipbert for video-and-language learning via sparse sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  7331–7341, 2021.
  72. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.  3045–3059, 2021.
  73. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In ACL, 2020.
  74. Align and prompt: Video-and-language pre-training with entity prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  4953–4963, 2022a.
  75. Align before fuse: Vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems, 34, 2021a.
  76. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. arXiv preprint arXiv:2201.12086, 2022b.
  77. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
  78. Lavender: Unifying video-language understanding as masked language modeling. arXiv preprint arXiv:2206.07160, 2022c.
  79. Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019.
  80. Unimo: Towards unified-modal understanding and generation via cross-modal contrastive learning. arXiv preprint arXiv:2012.15409, 2020a.
  81. Oscar: Object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision, pp.  121–137. Springer, 2020b.
  82. Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm. arXiv preprint arXiv:2110.05208, 2021b.
  83. Jointly localizing and describing events for dense video captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  7492–7500, 2018.
  84. Modular and parameter-efficient multimodal fusion with prompting. In Findings of the Association for Computational Linguistics: ACL 2022, pp.  2976–2985, 2022.
  85. Swinbert: End-to-end transformers with sparse attention for video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  17949–17958, 2022.
  86. Microsoft coco: Common objects in context. In European conference on computer vision, pp.  740–755. Springer, 2014.
  87. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023a.
  88. Leveraging pre-trained bert for audio captioning. In 2022 30th European Signal Processing Conference (EUSIPCO), pp.  1145–1149. IEEE, 2022.
  89. Summary of chatgpt/gpt-4 research and perspective towards the future of large language models. arXiv preprint arXiv:2304.01852, 2023b.
  90. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems, 32, 2019.
  91. Unified-io: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916, 2022a.
  92. Learn to explain: Multimodal reasoning via thought chains for science question answering. In NeurIPS, 2022b.
  93. Univl: A unified video and language pre-training model for multimodal understanding and generation. arXiv preprint arXiv:2002.06353, 2020.
  94. Mapl: Parameter-efficient adaptation of unimodal pre-trained models for vision-language few-shot prompting. arXiv preprint arXiv:2210.07179, 2022.
  95. Ok-vqa: A visual question answering benchmark requiring external knowledge. In CVPR, 2019.
  96. Merging models with Fisher-weighted averaging. In NeurIPS, 2022.
  97. Audio captioning transformer. In Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2021), 2021.
  98. Linearly mapping from image to text space. arXiv preprint arXiv:2209.15162, 2022.
  99. Model cards for model reporting. In Proceedings of the conference on fairness, accountability, and transparency, 2019.
  100. What is being transferred in transfer learning? NeurIPS, 2020.
  101. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. In ICML, 2022.
  102. Im2text: Describing images using 1 million captioned photographs. In Proceedings of the 24th International Conference on Neural Information Processing Systems, NIPS’11, pp.  1143–1151, Red Hook, NY, USA, 2011. Curran Associates Inc. ISBN 9781618395993.
  103. Task arithmetic in the tangent space: Improved editing of pre-trained models. arXiv preprint arXiv:2305.12827, 2023.
  104. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763. PMLR, 2021.
  105. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446, 2021.
  106. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
  107. Watch, listen and tell: Multi-modal weakly supervised dense event captioning. In Proceedings of the IEEE/CVF international conference on computer vision, pp.  8908–8917, 2019.
  108. Diverse weight averaging for out-of-distribution generalization. In NeurIPS, 2022.
  109. Model Ratatouille: Recycling diverse models for out-of-distribution generalization. In ICML, 2023a.
  110. Rewarded soups: towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards. arXiv preprint arXiv:2306.04488, 2023b.
  111. Zero-shot text-to-image generation. In ICML, 2021.
  112. A generalist agent. arXiv preprint arXiv:2205.06175, 2022.
  113. Object hallucination in image captioning. In EMNLP, 2018.
  114. Improved techniques for training gans. NeurIPS, 2016.
  115. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
  116. End-to-end generative pretraining for multimodal video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  17959–17968, 2022.
  117. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, 2018.
  118. Efficient vision-language pretraining with visual concepts and hierarchical alignment. In 33rd British Machine Vision Conference (BMVC), 2022.
  119. ep-alm: Efficient perceptual augmentation of language models. arXiv preprint arXiv:2303.11403, 2023a.
  120. Beyond task performance: Evaluating and reducing the flaws of large multimodal models with in-context learning. arXiv preprint arXiv:2310.00647, 2023b.
  121. Flava: A foundational language and vision alignment model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  15638–15650, 2022.
  122. Curriculum learning for data-efficient vision-language alignment. arXiv preprint arXiv:2207.14525, 2022.
  123. Aligning large multimodal models with factually augmented rlhf. arXiv preprint arXiv:2309.14525, 2023.
  124. An empirical study of multimodal model merging. arXiv preprint arXiv:2304.14933, 2023.
  125. Effects of word-frequency based pre-and post-processings for audio captioning. arXiv preprint arXiv:2009.11436, 2020.
  126. Clip4caption: Clip for video caption. In Proceedings of the 29th ACM International Conference on Multimedia, pp.  4858–4862, 2021.
  127. Unifying language learning paradigms. arXiv preprint arXiv:2205.05131, 2022.
  128. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pp. 10347–10357. PMLR, 2021.
  129. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  130. Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008, 2017.
  131. GIT: A generative image-to-text transformer for vision and language. Transactions on Machine Learning Research, 2022a. ISSN 2835-8856. URL https://openreview.net/forum?id=b4tMhpN0JC.
  132. Bidirectional attentive fusion with context gating for dense video captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  7190–7198, 2018a.
  133. All in one: Exploring unified video-language pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  6598–6608, 2023a.
  134. Omnivl: One foundation model for image-language and video-language tasks. arXiv preprint arXiv:2209.07526, 2022b.
  135. A comprehensive survey of continual learning: Theory, method and application. arXiv preprint arXiv:2302.00487, 2023b.
  136. Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. arXiv preprint arXiv:2202.03052, 2022c.
  137. What language model architecture and pretraining objective works best for zero-shot generalization? In ICML, 2022d.
  138. Image as a foreign language: Beit pretraining for all vision and vision-language tasks. arXiv preprint arXiv:2208.10442, 2022e.
  139. Watch, listen, and describe: Globally and locally aligned cross-modal attentions for video captioning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp.  795–801, New Orleans, Louisiana, June 2018b. Association for Computational Linguistics. doi: 10.18653/v1/N18-2125. URL https://aclanthology.org/N18-2125.
  140. Simvlm: Simple visual language model pretraining with weak supervision. arXiv preprint arXiv:2108.10904, 2021.
  141. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In ICML, 2022.
  142. Nüwa: Visual synthesis pre-training for neural visual world creation. In ECCV, 2022.
  143. Visual entailment: A novel task for fine-grained image understanding. arXiv preprint arXiv:1901.06706, 2019.
  144. Video question answering via gradually refined attention over appearance and motion. In Proceedings of the 25th ACM International Conference on Multimedia, MM ’17, pp.  1645–1653, New York, NY, USA, 2017. Association for Computing Machinery. ISBN 9781450349062. doi: 10.1145/3123266.3123427. URL https://doi.org/10.1145/3123266.3123427.
  145. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  146. A crnn-gru based reinforcement learning approach to audio captioning. 2020.
  147. Investigating local and global information for automated audio captioning with transfer learning. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.  905–909. IEEE, 2021.
  148. Multiinstruct: Improving multi-modal zero-shot learning via instruction tuning. arXiv preprint arXiv:2212.10773, 2022.
  149. Resolving interference when merging models. arXiv preprint arXiv:2306.01708, 2023.
  150. Just ask: Learning to answer questions from millions of narrated videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  1686–1697, 2021a.
  151. Zero-shot video question answering via frozen bidirectional language models. In NeurIPS 2022-36th Conference on Neural Information Processing Systems, 2022.
  152. Crossing the format boundary of text and boxes: Towards unified vision-language modeling. arXiv preprint arXiv:2111.12085, 2021b.
  153. Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.
  154. Modeling context in referring expressions. In European Conference on Computer Vision, pp.  69–85. Springer, 2016.
  155. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021.
  156. Barlow twins: Self-supervised learning via redundancy reduction. In International Conference on Machine Learning, pp. 12310–12320. PMLR, 2021.
  157. Merlot: Multimodal neural script knowledge models. Advances in Neural Information Processing Systems, 34:23634–23651, 2021.
  158. Merlot reserve: Neural script knowledge through vision and language and sound. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  16375–16387, 2022.
  159. Vinvl: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  5579–5588, 2021.
  160. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
  161. Siren’s song in the ai ocean: A survey on hallucination in large language models. arXiv preprint arXiv:2309.01219, 2023.
  162. Well-classified examples are underestimated in classification with deep neural networks. In AAAI, 2022.
  163. Generalized decoding for pixel, image, and language. In CVPR, 2023.
Authors (4)
  1. Mustafa Shukor (27 papers)
  2. Corentin Dancette (14 papers)
  3. Matthieu Cord (129 papers)
  4. Alexandre Rame (8 papers)
Citations (32)