Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning
This paper presents CM3Leon, a retrieval-augmented, token-based, decoder-only multi-modal language model capable of generating and infilling both text and images. CM3Leon builds upon and extends the autoregressive CM3 architecture, delivering substantial gains in efficiency and performance through scaling and instruction-style fine-tuning data. Its key innovation is a two-stage training recipe: retrieval-augmented pretraining followed by multi-task supervised fine-tuning (SFT).
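To make the token-based, decoder-only setup concrete, the following is a minimal sketch of how text and image tokens can be folded into a single autoregressive stream. The vocabulary sizes and the helper function are illustrative assumptions, not CM3Leon's actual configuration.

```python
# Minimal sketch of a shared text/image token stream for a decoder-only model.
# Vocabulary sizes below are hypothetical, not taken from the paper.

TEXT_VOCAB_SIZE = 56_320    # hypothetical text vocabulary size
IMAGE_VOCAB_SIZE = 8_192    # hypothetical image-tokenizer codebook size

def to_shared_ids(text_tokens: list[int], image_tokens: list[int]) -> list[int]:
    """Offset image-codebook indices past the text vocabulary so both
    modalities share one vocabulary, then concatenate into one sequence."""
    offset_image = [TEXT_VOCAB_SIZE + t for t in image_tokens]
    return text_tokens + offset_image
```

Once both modalities live in one vocabulary, a single next-token objective covers text generation, image generation, and mixed infilling.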
CM3Leon's pretraining uses a retrieval-augmented objective over a large Shutterstock dataset composed exclusively of licensed image and text data, sidestepping the ethical issues around image ownership and attribution that commonly affect data sourcing. The fine-tuning stage applies multi-task instruction tuning, allowing arbitrary mixtures of image and text tokens in both the model's inputs and outputs. This design supports generalization across a broad range of tasks, including language-guided image editing and image-controlled generation; a sketch of how retrieved documents can be folded into a training sequence follows.
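The sketch below illustrates retrieval-augmented sequence construction, assuming a retriever with a hypothetical top_k() API that returns related, already-tokenized multi-modal documents; the separator id is likewise illustrative, not the paper's exact format.

```python
BREAK_ID = 3  # hypothetical id of a document-separator token

def build_pretraining_sequence(target_doc, retriever, k: int = 2) -> list[int]:
    """Prepend k retrieved documents so the decoder-only model can condition
    on them while predicting the target document's tokens."""
    sequence: list[int] = []
    for doc in retriever.top_k(target_doc, k):  # hypothetical retriever API
        sequence += doc.text_tokens + doc.image_tokens + [BREAK_ID]
    sequence += target_doc.text_tokens + target_doc.image_tokens
    return sequence
```

Because the retrieved documents sit earlier in the sequence, the standard causal attention mask lets every target token attend to them without any architectural change.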
In evaluation, CM3Leon achieves state-of-the-art text-to-image generation, reaching a zero-shot MS-COCO FID of 4.88 while using 5x less training compute than comparable methods. This result underscores the efficiency of CM3Leon's architecture: high-quality image generation at a reduced computational cost. Its strong performance in both text-to-image and image-to-text generation further highlights its versatility across diverse tasks.
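For context, FID (Fréchet inception distance; lower is better) measures the distance between Gaussian fits of Inception features computed on real and generated images. The standard definition, stated here for reference rather than taken from the paper, is

$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)$$

where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the feature mean and covariance of the real and generated image sets, respectively.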
The paper also reports a contrastive decoding method that further improves output quality through a form of self-guidance. This approach complements classifier-free guidance (CFG), broadening the set of decoding strategies available at inference time.
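As a point of reference, the sketch below shows the standard CFG logit blend together with a contrastive-decoding-style scoring rule in which the unconditional model plays the contrast role. The function names, guidance scale, and plausibility threshold alpha are illustrative assumptions, not the paper's exact formulation.

```python
import math
import torch

def cfg_logits(cond_logits: torch.Tensor,
               uncond_logits: torch.Tensor,
               guidance_scale: float = 3.0) -> torch.Tensor:
    """Standard classifier-free guidance: push the conditional distribution
    away from the unconditional one, logits = uncond + s * (cond - uncond)."""
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)

def contrastive_scores(cond_logits: torch.Tensor,
                       uncond_logits: torch.Tensor,
                       alpha: float = 0.1) -> torch.Tensor:
    """Contrastive-decoding-style scoring: keep only tokens whose conditional
    probability is within a factor alpha of the best token, then rank by the
    log-ratio of conditional to unconditional probability."""
    cond_logp = torch.log_softmax(cond_logits, dim=-1)
    uncond_logp = torch.log_softmax(uncond_logits, dim=-1)
    cutoff = cond_logp.max(dim=-1, keepdim=True).values + math.log(alpha)
    scores = cond_logp - uncond_logp
    return scores.masked_fill(cond_logp < cutoff, float("-inf"))
```

Both functions operate on final-position logit vectors (or batches of them); sampling or argmax then proceeds over the adjusted scores.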
Theoretically, the results indicate that retrieval-augmented autoregressive models such as CM3Leon are a promising direction for multi-modal generative tasks, offering an alternative to the diffusion models that have dominated the recent landscape. Practically, the methodology behind CM3Leon suggests pathways for scaling similar architectures across multi-modal AI applications, particularly those requiring efficient, scalable generation of complex image-text pairings.
Overall, CM3Leon exemplifies the progress of multi-modal language models, using its pretraining and fine-tuning recipe to balance efficiency with performance. Future research could explore combining retrieval-augmented autoregressive models with other forms of conditional guidance and study their implications across a growing range of AI domains. Integrating CM3Leon-like frameworks into real-world applications could yield innovative solutions in content creation, human-computer interaction, and adaptive learning environments.