Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning
This paper presents CM3Leon, a retrieval-augmented, token-based, decoder-only multi-modal language model capable of generating and infilling both text and images. CM3Leon builds upon and extends the autoregressive CM3 architecture, delivering substantial gains in efficiency and performance through scaling and instruction-style fine-tuning data. Its key innovation is a two-stage training recipe: retrieval-augmented pretraining followed by multi-task supervised fine-tuning (SFT).
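To make the token-based, decoder-only setup concrete, the following is a minimal sketch of how text and image tokens can be folded into a single autoregressive stream. The vocabulary sizes and the helper function are illustrative assumptions, not CM3Leon's actual configuration.

```python
# Minimal sketch of a shared text/image token stream for a decoder-only model.
# Vocabulary sizes below are hypothetical, not taken from the paper.

TEXT_VOCAB_SIZE = 56_320    # hypothetical text vocabulary size
IMAGE_VOCAB_SIZE = 8_192    # hypothetical image-tokenizer codebook size

def to_shared_ids(text_tokens: list[int], image_tokens: list[int]) -> list[int]:
    """Offset image-codebook indices past the text vocabulary so both
    modalities share one vocabulary, then concatenate into one sequence."""
    offset_image = [TEXT_VOCAB_SIZE + t for t in image_tokens]
    return text_tokens + offset_image
```

Once both modalities live in one vocabulary, a single next-token objective covers text generation, image generation, and mixed infilling.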
CM3Leon's pretraining uses a retrieval-augmented objective over a large Shutterstock dataset composed exclusively of licensed image and text data, sidestepping the ethical issues around image ownership and attribution that commonly affect data sourcing. The fine-tuning stage applies multi-task instruction tuning, allowing arbitrary mixtures of image and text tokens in both the model's inputs and outputs. This design supports generalization across a broad range of tasks, including language-guided image editing and image-controlled generation; a sketch of how retrieved documents can be folded into a training sequence follows.
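The sketch below illustrates retrieval-augmented sequence construction, assuming a retriever with a hypothetical top_k() API that returns related, already-tokenized multi-modal documents; the separator id is likewise illustrative, not the paper's exact format.

```python
BREAK_ID = 3  # hypothetical id of a document-separator token

def build_pretraining_sequence(target_doc, retriever, k: int = 2) -> list[int]:
    """Prepend k retrieved documents so the decoder-only model can condition
    on them while predicting the target document's tokens."""
    sequence: list[int] = []
    for doc in retriever.top_k(target_doc, k):  # hypothetical retriever API
        sequence += doc.text_tokens + doc.image_tokens + [BREAK_ID]
    sequence += target_doc.text_tokens + target_doc.image_tokens
    return sequence
```

Because the retrieved documents sit earlier in the sequence, the standard causal attention mask lets every target token attend to them without any architectural change.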
In evaluation, CM3Leon achieves state-of-the-art text-to-image generation, reaching a zero-shot MS-COCO FID of 4.88 while using 5x less training compute than comparable methods. This result underscores the efficiency of CM3Leon's architecture: high-quality image generation at a reduced computational cost. Its strong performance in both text-to-image and image-to-text generation further highlights its versatility across diverse tasks.
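For context, FID (Fréchet inception distance; lower is better) measures the distance between Gaussian fits of Inception features computed on real and generated images. The standard definition, stated here for reference rather than taken from the paper, is

$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)$$

where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the feature mean and covariance of the real and generated image sets, respectively.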
The paper also reports a contrastive decoding method that further improves output quality through a form of self-guidance. This approach complements classifier-free guidance (CFG), broadening the set of decoding strategies available at inference time.
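As a point of reference, the sketch below shows the standard CFG logit blend together with a contrastive-decoding-style scoring rule in which the unconditional model plays the contrast role. The function names, guidance scale, and plausibility threshold alpha are illustrative assumptions, not the paper's exact formulation.

```python
import math
import torch

def cfg_logits(cond_logits: torch.Tensor,
               uncond_logits: torch.Tensor,
               guidance_scale: float = 3.0) -> torch.Tensor:
    """Standard classifier-free guidance: push the conditional distribution
    away from the unconditional one, logits = uncond + s * (cond - uncond)."""
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)

def contrastive_scores(cond_logits: torch.Tensor,
                       uncond_logits: torch.Tensor,
                       alpha: float = 0.1) -> torch.Tensor:
    """Contrastive-decoding-style scoring: keep only tokens whose conditional
    probability is within a factor alpha of the best token, then rank by the
    log-ratio of conditional to unconditional probability."""
    cond_logp = torch.log_softmax(cond_logits, dim=-1)
    uncond_logp = torch.log_softmax(uncond_logits, dim=-1)
    cutoff = cond_logp.max(dim=-1, keepdim=True).values + math.log(alpha)
    scores = cond_logp - uncond_logp
    return scores.masked_fill(cond_logp < cutoff, float("-inf"))
```

Both functions operate on final-position logit vectors (or batches of them); sampling or argmax then proceeds over the adjusted scores.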
Theoretically, the results indicate that retrieval-augmented autoregressive models such as CM3Leon are a promising direction for multi-modal generative tasks, offering an alternative to the diffusion models that have dominated the recent landscape. Practically, the methodology behind CM3Leon suggests pathways for scaling similar architectures across multi-modal AI applications, particularly those requiring efficient, scalable generation of complex image-text pairings.
Overall, CM3Leon exemplifies the progress of multi-modal language models, using its pretraining and fine-tuning recipe to balance efficiency with performance. Future research could explore combining retrieval-augmented autoregressive models with other forms of conditional guidance and study their implications across a growing range of AI domains. Integrating CM3Leon-like frameworks into real-world applications could yield innovative solutions in content creation, human-computer interaction, and adaptive learning environments.