CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation (2311.18775v1)

Published 30 Nov 2023 in cs.CV, cs.AI, cs.CL, cs.LG, cs.SD, and eess.AS

Abstract: We present CoDi-2, a versatile and interactive Multimodal Large Language Model (MLLM) that can follow complex multimodal interleaved instructions, conduct in-context learning (ICL), reason, chat, edit, etc., in an any-to-any input-output modality paradigm. By aligning modalities with language for both encoding and generation, CoDi-2 empowers LLMs to not only understand complex modality-interleaved instructions and in-context examples, but also autoregressively generate grounded and coherent multimodal outputs in the continuous feature space. To train CoDi-2, we build a large-scale generation dataset encompassing in-context multimodal instructions across text, vision, and audio. CoDi-2 demonstrates a wide range of zero-shot capabilities for multimodal generation, such as in-context learning, reasoning, and compositionality of any-to-any modality generation through multi-round interactive conversation. CoDi-2 surpasses previous domain-specific models on tasks such as subject-driven image generation, vision transformation, and audio editing. CoDi-2 signifies a substantial breakthrough in developing a comprehensive multimodal foundation model adept at interpreting in-context language-vision-audio interleaved instructions and producing multimodal outputs.
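As a rough illustration of the encode/generate pattern the abstract describes (projecting non-text modality features into the LLM's token space, then regressing continuous output features from its hidden states for a downstream decoder), here is a minimal PyTorch sketch. All module names, dimensions, and the toy Transformer standing in for the LLaMA backbone are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CoDi2Sketch(nn.Module):
    """Minimal sketch of the pattern in the abstract: modality encoders are
    projected into the LLM's embedding space ("aligning modalities with
    language"), and the LLM autoregressively predicts continuous feature
    vectors that a frozen diffusion decoder would turn back into an image
    or audio clip. Sizes and modules here are illustrative stand-ins."""

    def __init__(self, llm_dim=4096, feat_dim=768):
        super().__init__()
        # Stand-ins for aligned modality encoders feeding the LLM space.
        self.image_proj = nn.Linear(feat_dim, llm_dim)
        self.audio_proj = nn.Linear(feat_dim, llm_dim)
        # Placeholder for the LLM backbone (the paper uses a LLaMA-class model).
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Maps LLM hidden states back to continuous generation features.
        self.out_proj = nn.Linear(llm_dim, feat_dim)

    def forward(self, text_emb, image_feat, audio_feat):
        # Interleave text embeddings with projected modality features,
        # mirroring "modality-interleaved instructions".
        seq = torch.cat(
            [text_emb, self.image_proj(image_feat), self.audio_proj(audio_feat)],
            dim=1,
        )
        hidden = self.llm(seq)
        # Regress the final hidden state to a continuous feature vector
        # that a generation decoder would consume.
        return self.out_proj(hidden[:, -1:, :])

# Usage with random stand-in features (batch size 1).
model = CoDi2Sketch()
text = torch.randn(1, 5, 4096)   # 5 text token embeddings
img = torch.randn(1, 3, 768)     # 3 image patch features
aud = torch.randn(1, 2, 768)     # 2 audio frame features
pred_feat = model(text, img, aud)
print(pred_feat.shape)           # torch.Size([1, 1, 768])
```

The design point this sketch captures is that generation targets are continuous features rather than discrete tokens, which is what lets a single autoregressive backbone drive image and audio decoders in an any-to-any setup.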

Authors (6)
  1. Zineng Tang (13 papers)
  2. Ziyi Yang (77 papers)
  3. Mahmoud Khademi (17 papers)
  4. Yang Liu (2253 papers)
  5. Chenguang Zhu (100 papers)
  6. Mohit Bansal (304 papers)
Citations (30)