Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning (2309.02591v1)

Published 5 Sep 2023 in cs.LG, cs.CL, and cs.CV

Abstract: We present CM3Leon (pronounced "Chameleon"), a retrieval-augmented, token-based, decoder-only multi-modal LLM capable of generating and infilling both text and images. CM3Leon uses the CM3 multi-modal architecture but additionally shows the extreme benefits of scaling up and tuning on more diverse instruction-style data. It is the first multi-modal model trained with a recipe adapted from text-only LLMs, including a large-scale retrieval-augmented pre-training stage and a second multi-task supervised fine-tuning (SFT) stage. It is also a general-purpose model that can do both text-to-image and image-to-text generation, allowing us to introduce self-contained contrastive decoding methods that produce high-quality outputs. Extensive experiments demonstrate that this recipe is highly effective for multi-modal models. CM3Leon achieves state-of-the-art performance in text-to-image generation with 5x less training compute than comparable methods (zero-shot MS-COCO FID of 4.88). After SFT, CM3Leon can also demonstrate unprecedented levels of controllability in tasks ranging from language-guided image editing to image-controlled generation and segmentation.

This paper presents CM3Leon, a retrieval-augmented, token-based, decoder-only multi-modal LLM capable of generating and infilling both text and images. CM3Leon builds upon and expands the autoregressive multi-modal CM3 architecture, providing substantial improvements in efficiency and performance by leveraging scaling and instruction-style data tuning. Key innovations include a dual-stage training regime comprising retrieval-augmented pretraining and multi-task supervised fine-tuning (SFT).

CM3Leon's pretraining strategy involves a retrieval-augmented approach using a large Shutterstock dataset composed exclusively of licensed image and text data, thereby addressing ethical concerns around image ownership and attribution that commonly arise in data sourcing. The fine-tuning stage implements multi-task instruction tuning, allowing arbitrary mixtures of image and text tokens in both the model's inputs and outputs. This design supports broad task generalization, encompassing tasks such as language-guided image editing and image-controlled generation.
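To make the "arbitrary mixtures of image and text tokens" concrete, the sketch below shows one plausible way an instruction-tuning example could be flattened into a single decoder-only token stream. The ID ranges, the `<break>` sentinel, and the helper name are illustrative assumptions, not CM3Leon's actual vocabulary or preprocessing code.

```python
# Hypothetical layout: text tokens and image-codebook tokens share one
# vocabulary by living in disjoint ID ranges, with a sentinel token
# separating modality spans. All constants here are assumptions.
TEXT_OFFSET = 0          # assumed ID range for text tokens
IMAGE_OFFSET = 50_000    # assumed ID range for image codebook tokens
BREAK_ID = 99_999        # assumed sentinel between modality spans

def flatten_example(spans):
    """spans: list of ("text" | "image", [token_ids]) pairs.
    Returns one flat ID sequence a decoder-only model could train on."""
    sequence = []
    for modality, ids in spans:
        offset = TEXT_OFFSET if modality == "text" else IMAGE_OFFSET
        sequence.extend(offset + i for i in ids)
        sequence.append(BREAK_ID)
    return sequence

# An instruction-style example mixing a text prompt with image tokens:
seq = flatten_example([("text", [12, 7, 3]), ("image", [5, 5, 9])])
# seq == [12, 7, 3, 99999, 50005, 50005, 50009, 99999]
```

Because every modality is reduced to tokens in one sequence, the same next-token objective covers text-to-image, image-to-text, and infilling tasks.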

In evaluating CM3Leon, the authors demonstrate state-of-the-art performance on text-to-image generation. The model achieves a zero-shot MS-COCO FID of 4.88 while using 5x less training compute than comparable methods. This result underscores the efficacy of CM3Leon's architecture in achieving high-quality image generation at reduced computational cost. CM3Leon's strength in both text-to-image and image-to-text generation also highlights its versatility across diverse tasks.
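For readers unfamiliar with the FID metric cited above: it is the Fréchet distance between two Gaussians fitted to real and generated image features (in practice, Inception features). The sketch below implements the simplified diagonal-covariance case, where the matrix square root reduces to an elementwise one; the full metric uses full covariance matrices and a matrix square root, and the function name here is our own.

```python
import numpy as np

def fid_diagonal(mu1, var1, mu2, var2):
    """Fréchet distance between two Gaussians with diagonal covariances
    (a simplification of the full FID computation):
    ||mu1 - mu2||^2 + sum(var1 + var2 - 2*sqrt(var1*var2))."""
    mu1, var1 = np.asarray(mu1, float), np.asarray(var1, float)
    mu2, var2 = np.asarray(mu2, float), np.asarray(var2, float)
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.sum(var1 + var2 - 2.0 * np.sqrt(var1 * var2)))

# Identical distributions score 0; lower FID means generated features
# are statistically closer to real ones.
print(fid_diagonal([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0]))  # 0.0
```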

The paper reports that CM3Leon's self-contained contrastive decoding method further enhances the quality of generated outputs by incorporating self-guidance. This approach complements Classifier-Free Guidance (CFG), expanding the model's decoding strategies and improving output quality across tasks.

Theoretical implications indicate that retrieval-augmented autoregressive models, such as CM3Leon, hold promise for extending capabilities in multi-modal generative tasks beyond conventional diffusion models, which have dominated the recent landscape. Practically, the methodologies adopted for CM3Leon suggest pathways for scaling similar architectures in various multi-modal AI applications, enabling advancements in fields that require efficient and scalable generation of complex image-text pairings.

Overall, CM3Leon exemplifies advancements in multi-modal LLMs, leveraging pretraining and fine-tuning techniques to establish a new paradigm in balancing efficiency with performance. Future research directions could explore combinations of retrieval-augmented autoregressive models with other forms of conditional guidance and their implications across expanding domains in artificial intelligence. The potential integration of CM3Leon-like frameworks in real-world applications could yield innovative solutions in content creation, enhanced human-computer interaction, and adaptive learning environments.

Authors (27)
  1. Lili Yu
  2. Bowen Shi
  3. Ramakanth Pasunuru
  4. Benjamin Muller
  5. Olga Golovneva
  6. Tianlu Wang
  7. Arun Babu
  8. Binh Tang
  9. Brian Karrer
  10. Shelly Sheynin
  11. Candace Ross
  12. Adam Polyak
  13. Russell Howes
  14. Vasu Sharma
  15. Puxin Xu
  16. Hovhannes Tamoyan
  17. Oron Ashual
  18. Uriel Singer
  19. Shang-Wen Li
  20. Susan Zhang
Citations (119)