
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training (2401.00849v1)

Published 1 Jan 2024 in cs.CV

Abstract: In the evolution of Vision-Language Pre-training, shifting from short-text comprehension to encompassing extended textual contexts is pivotal. Recent autoregressive vision-language models such as Flamingo and PaLM-E, leveraging the long-context capability of LLMs, have excelled in few-shot text generation tasks but face challenges in alignment tasks. Addressing this gap, we introduce the contrastive loss into text generation models, presenting the COntrastive-Streamlined MultimOdal framework (COSMO), strategically partitioning the LLM into dedicated unimodal text processing and adept multimodal data handling components. COSMO, our unified framework, merges unimodal and multimodal elements, enhancing model performance for tasks involving textual and visual data while notably reducing learnable parameters. However, these models demand extensive long-text datasets, yet the availability of high-quality long-text video datasets remains limited. To bridge this gap, this work introduces Howto-Interlink7M, an inaugural interleaved video-text dataset featuring comprehensive captions, marking a significant step forward. Demonstrating its impact, we illustrate how Howto-Interlink7M enhances model performance in image-text tasks. With 34% learnable parameters and utilizing 72% of the available data, our model demonstrates significant superiority over OpenFlamingo. For instance, in the 4-shot Flickr captioning task, performance notably improves from 57.2% to 65%. The contributions of COSMO and Howto-Interlink7M are underscored by notable performance gains across 14 diverse downstream datasets encompassing both image-text and video-text tasks.

Introduction

AI research has been progressively advancing towards models that can understand and process not only text but also multimodal information—data that combines text with other forms such as images or videos. This development has come with its own set of challenges, especially in aligning and processing such multimodal data effectively.

COSMO Framework

Addressing these challenges, the paper introduces COSMO, a COntrastive-Streamlined MultimOdal model that integrates a contrastive loss into an autoregressive text generation model. COSMO stands out by dividing the LLM into two segments: one focused on processing text alone and the other on fusing multimodal information. This split enables COSMO to handle both unimodal and multimodal tasks efficiently, with a marked reduction in learnable parameters and performance improvements across 14 downstream datasets spanning image, text, and video data.
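To make the design concrete, here is a minimal PyTorch sketch of the general idea, not the authors' implementation: the lower layers of the model process text alone, the upper layers also see projected visual features, and training combines a CLIP-style contrastive loss with a language-modeling loss. The class names, the simple concatenation-based fusion, the mean pooling, and the loss weighting are all illustrative assumptions.

```python
# Minimal sketch (illustrative, not the authors' implementation): the lower
# half of a Transformer stack handles text on its own, the upper half also
# sees projected visual features, and training combines a CLIP-style
# contrastive loss with a language-modeling loss. Causal masking and token
# shifting are omitted to keep the sketch short.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosmoStyleModel(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, n_layers=8, n_heads=8, vis_dim=768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layers = [nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
                  for _ in range(n_layers)]
        half = n_layers // 2
        self.text_layers = nn.ModuleList(layers[:half])    # unimodal text processing
        self.fusion_layers = nn.ModuleList(layers[half:])  # multimodal fusion
        self.vis_proj = nn.Linear(vis_dim, d_model)        # project frozen vision features
        self.lm_head = nn.Linear(d_model, vocab_size)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~log(1/0.07), CLIP-style

    def forward(self, tokens, vis_feats):
        # tokens: (B, T) token ids; vis_feats: (B, N, vis_dim) visual tokens.
        x = self.embed(tokens)
        for layer in self.text_layers:                     # text-only half
            x = layer(x)
        txt_emb = F.normalize(x.mean(dim=1), dim=-1)       # pooled text embedding
        v = self.vis_proj(vis_feats)
        img_emb = F.normalize(v.mean(dim=1), dim=-1)       # pooled visual embedding
        h = torch.cat([v, x], dim=1)                       # naive fusion: prepend visual tokens
        for layer in self.fusion_layers:                   # multimodal half
            h = layer(h)
        logits = self.lm_head(h[:, v.size(1):])            # predictions for the text positions
        return logits, txt_emb, img_emb, self.logit_scale.exp()

def cosmo_style_loss(logits, targets, txt_emb, img_emb, scale, alpha=1.0):
    # Language-modeling loss over the text positions.
    lm = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    # Symmetric InfoNCE contrastive loss between pooled visual/text embeddings.
    sim = scale * img_emb @ txt_emb.t()
    labels = torch.arange(sim.size(0), device=sim.device)
    con = 0.5 * (F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels))
    return lm + alpha * con  # alpha is an assumed weighting, not the paper's value
```

In this sketch the contrastive embeddings come from the text-only half while only the upper layers see visual tokens, which is one simple way to realize the unimodal/multimodal split described above; the paper's actual fusion mechanism and pooling may differ.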

Howto-Interlink7M Dataset

A key hurdle for training such models is the scarcity of high-quality long-text multimodal datasets. The paper addresses this by introducing Howto-Interlink7M, a novel interleaved video-text dataset. Built by annotating segments of instructional videos, it stands out for detailed, high-quality captions that preserve narrative coherence across video clips. Training on this dataset further improves COSMO's performance on a range of image-text and video-text tasks.
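To make the interleaved format concrete, below is a hedged sketch of how a Howto-Interlink7M-style sample could be represented and flattened into a training sequence. The field names, the <visual> placeholder token, and the example content are assumptions for illustration, not the released schema.

```python
# Illustrative sketch of an interleaved video-text sample; the field names,
# the <visual> placeholder, and the example content are assumptions, not the
# released Howto-Interlink7M schema.
from dataclasses import dataclass
from typing import List

@dataclass
class Clip:
    start: float   # clip start time (seconds) within the source video
    end: float     # clip end time (seconds)
    caption: str   # detailed caption written to preserve the video's narrative

@dataclass
class InterleavedVideoDoc:
    video_id: str
    clips: List[Clip]  # kept in temporal order so narrative coherence is preserved

def to_interleaved_text(doc: InterleavedVideoDoc, visual_token: str = "<visual>") -> str:
    """Flatten a video document into an interleaved training sequence:
    one visual placeholder per clip followed by its caption, in order."""
    pieces = []
    for clip in doc.clips:
        pieces.append(visual_token)          # stands in for the clip's visual features
        pieces.append(clip.caption.strip())
    return " ".join(pieces)

# Example usage with made-up content:
doc = InterleavedVideoDoc(
    video_id="demo_0001",
    clips=[
        Clip(0.0, 12.5, "A person gathers flour, eggs, and sugar on the counter."),
        Clip(12.5, 30.0, "They whisk the ingredients into a smooth batter."),
    ],
)
print(to_interleaved_text(doc))
```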

Performance and Evaluation

Compared with OpenFlamingo, a similar autoregressive vision-language model, COSMO shows a pronounced improvement in performance despite using fewer learnable parameters and a smaller sample of the training data. The advantage is particularly noticeable on challenging tasks such as 4-shot Flickr captioning, where COSMO outperforms OpenFlamingo by a significant margin.
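For context, few-shot comparisons of this kind are typically run by building an interleaved prompt from a handful of demonstration pairs and letting the model continue for the query image. The sketch below shows one way to assemble such a 4-shot captioning prompt; the template, the <image> placeholder, and the example captions are illustrative assumptions rather than the paper's exact evaluation protocol.

```python
# Hedged sketch of assembling a 4-shot captioning prompt in the interleaved
# style used by Flamingo-like models; the template and the <image> placeholder
# are illustrative assumptions, not the paper's exact evaluation setup.
from typing import List, Tuple

def build_few_shot_prompt(demos: List[Tuple[str, str]], image_token: str = "<image>") -> str:
    """demos: (image_id, reference_caption) pairs used as in-context examples.
    The image ids stand in for visual features fed to the model alongside the
    text; only the text side of the prompt is built here."""
    parts = [f"{image_token} Caption: {caption.strip()}" for _img_id, caption in demos]
    parts.append(f"{image_token} Caption:")  # the model continues here for the query image
    return "\n".join(parts)

# Example with four made-up demonstrations:
demos = [
    ("img_1", "A dog leaps to catch a frisbee in a park."),
    ("img_2", "Two children build a sandcastle on the beach."),
    ("img_3", "A cyclist rides past a row of market stalls."),
    ("img_4", "A man in a red jacket climbs a snowy ridge."),
]
print(build_few_shot_prompt(demos))
```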

Conclusion

The integration of contrastive loss into multimodal learning frameworks, along with the development of high-quality, long-text datasets, represents a promising direction for AI research. The advancements made by COSMO and the Howto-Interlink7M dataset not only set new standards for multimodal tasks but also open up extensive opportunities for future research, especially in the field of long-text data applications. The release of the trained models and datasets is eagerly anticipated, with the potential to catalyze further research in the field.

References (63)
  1. Cm3: A causal masked multimodal model of the internet. arXiv preprint arXiv:2201.07520, 2022.
  2. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
  3. Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390, 2023.
  4. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728–1738, 2021.
  5. Natural language processing with Python: analyzing text with the natural language toolkit. O'Reilly Media, Inc., 2009.
  6. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  7. Scaling transformer to 1m tokens and beyond with rmt. arXiv preprint arXiv:2304.11062, 2023.
  8. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3558–3568, 2021.
  9. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, pages 190–200, 2011.
  10. Uniter: Universal image-text representation learning. In European conference on computer vision, pages 104–120. Springer, 2020.
  11. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  12. A survey for in-context learning. arXiv preprint arXiv:2301.00234, 2022.
  13. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  14. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023.
  15. Magma–multimodal augmentation of generative models through adapter-based finetuning. arXiv preprint arXiv:2112.05253, 2021.
  16. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017.
  17. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3608–3617, 2018.
  18. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  19. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  20. OpenCLIP, 2021.
  21. Opt-iml: Scaling language model instruction meta learning through the lens of generalization. arXiv preprint arXiv:2212.12017, 2022.
  22. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
  23. Vilt: Vision-and-language transformer without convolution or region supervision. In International Conference on Machine Learning, pages 5583–5594. PMLR, 2021.
  24. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123:32–73, 2017.
  25. Obelics: An open web-scale filtered dataset of interleaved image-text documents. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.
  26. Tvr: A large-scale dataset for video-subtitle moment retrieval. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI 16, pages 447–463. Springer, 2020.
  27. Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems, 34:9694–9705, 2021.
  28. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
  29. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16, pages 121–137. Springer, 2020.
  30. Tgif: A new dataset and benchmark on animated gif description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4641–4650, 2016.
  31. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
  32. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  33. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/cvf conference on computer vision and pattern recognition, pages 3195–3204, 2019.
  34. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2630–2640, 2019.
  35. OpenAI. Gpt-4 technical report. 2023.
  36. Im2text: Describing images using 1 million captioned photographs. Advances in neural information processing systems, 24, 2011.
  37. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision, pages 2641–2649, 2015.
  38. Category-specific video summarization. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI 13, pages 540–555. Springer, 2014.
  39. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  40. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3505–3506, 2020.
  41. A dataset for movie description. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3202–3212, 2015.
  42. Datacomp: In search of the next generation of multimodal datasets. arXiv preprint arXiv:2304.14108, 2023.
  43. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
  44. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018.
  45. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019.
  46. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
  47. Together.xyz. Releasing 3b and 7b redpajama incite family of models including base, instruction-tuned and chat models. https://www.together.xyz/blog/redpajama-models-v1, 2023.
  48. Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems, 34:200–212, 2021.
  49. Cédric Villani. Topics in optimal transportation. American Mathematical Soc., 2021.
  50. Too large; data reduction for vision-language pre-training. ICCV, 2023a.
  51. Git: A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100, 2022.
  52. Vatex: A large-scale, high-quality multilingual dataset for video-and-language research. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4581–4591, 2019.
  53. Internvid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942, 2023b.
  54. Grit: A generative region-to-text transformer for object understanding. arXiv preprint arXiv:2212.00280, 2022a.
  55. Memorizing transformers. arXiv preprint arXiv:2203.08913, 2022b.
  56. Memorizing transformers. arXiv preprint arXiv:2203.08913, 2022c.
  57. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5288–5296, 2016.
  58. Vidchapters-7m: Video chapters at scale. arXiv preprint arXiv:2309.13952, 2023.
  59. Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.
  60. Merlot: Multimodal neural script knowledge models. Advances in Neural Information Processing Systems, 34:23634–23651, 2021.
  61. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
  62. Towards automatic learning of procedures from web instructional videos. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
  63. Multimodal c4: An open, billion-scale corpus of images interleaved with text. arXiv preprint arXiv:2304.06939, 2023.
Authors (8)
  1. Alex Jinpeng Wang (20 papers)
  2. Linjie Li (89 papers)
  3. Kevin Qinghong Lin (28 papers)
  4. Jianfeng Wang (149 papers)
  5. Kevin Lin (98 papers)
  6. Zhengyuan Yang (86 papers)
  7. Lijuan Wang (133 papers)
  8. Mike Zheng Shou (165 papers)