
Emu: Generative Pretraining in Multimodality (2307.05222v2)

Published 11 Jul 2023 in cs.CV

Abstract: We present Emu, a Transformer-based multimodal foundation model, which can seamlessly generate images and texts in multimodal context. This omnivore model can take in any single-modality or multimodal data input indiscriminately (e.g., interleaved image, text and video) through a one-model-for-all autoregressive training process. First, visual signals are encoded into embeddings, and together with text tokens form an interleaved input sequence. Emu is then end-to-end trained with a unified objective of classifying the next text token or regressing the next visual embedding in the multimodal sequence. This versatile multimodality empowers the exploration of diverse pretraining data sources at scale, such as videos with interleaved frames and text, webpages with interleaved images and text, as well as web-scale image-text pairs and video-text pairs. Emu can serve as a generalist multimodal interface for both image-to-text and text-to-image tasks, and supports in-context image and text generation. Across a broad range of zero-shot/few-shot tasks including image captioning, visual question answering, video question answering and text-to-image generation, Emu demonstrates superb performance compared to state-of-the-art large multimodal models. Extended capabilities such as multimodal assistants via instruction tuning are also demonstrated with impressive performance.

Analysis of "Generative Pretraining in Multimodality"

The paper introduces Emu, a Transformer-based multimodal foundation model designed to generate both images and text from a multimodal context. Emu is notable for accepting single-modality or interleaved inputs of text, images, and video within a single model. Through a one-model-for-all autoregressive training process, Emu learns to predict the next text token or to regress the next visual embedding in a sequence, an approach that allows diverse pretraining data sources to be integrated seamlessly at scale.

Model Architecture and Training

Emu's architecture comprises four components: a Visual Encoder based on EVA-CLIP, a Causal Transformer that maps spatial visual signals into a 1D sequence of latent embeddings, a Multimodal Modeling backbone initialized from LLaMA, and a Visual Decoder initialized from Stable Diffusion. Training uses a unified autoregressive objective of predicting the next element in the multimodal sequence, applying cross-entropy classification loss for text tokens and L2 regression loss for visual embeddings. The Causal Transformer is key to this design: by producing visual embeddings in latent space, Emu avoids autoregressive image generation in pixel space.
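
To make the unified objective concrete, the PyTorch-style sketch below computes the two loss terms over one interleaved sequence. The backbone, the two heads (a hypothetical lm_head for token logits and vis_head for embedding regression), and the equal weighting of the terms are illustrative assumptions, not the authors' released code.

import torch
import torch.nn.functional as F

def unified_loss(hidden, text_targets, vis_targets, is_text, lm_head, vis_head):
    # hidden:       (B, T, D) hidden states, each predicting the element at position t+1
    # text_targets: (B, T) next-token ids, valid where is_text is True
    # vis_targets:  (B, T, D_v) next visual embeddings, valid where is_text is False
    # is_text:      (B, T) boolean mask marking positions whose next element is a text token
    logits = lm_head(hidden)       # (B, T, vocab_size)
    vis_pred = vis_head(hidden)    # (B, T, D_v)

    # Classification loss on positions whose next element is a text token.
    text_loss = F.cross_entropy(logits[is_text], text_targets[is_text])
    # L2 regression loss on positions whose next element is a visual embedding.
    vis_loss = F.mse_loss(vis_pred[~is_text], vis_targets[~is_text])

    # Equal weighting here for simplicity; the paper's exact loss weighting is not reproduced.
    return text_loss + vis_loss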

Emu is pretrained on expansive datasets including LAION-2B, LAION-COCO, MMC4, WebVid-10M, and the newly introduced YT-Storyboard-1B. Pretraining runs on large-scale infrastructure, with batch sizes set separately for each data source to balance the different modalities.
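
As a rough illustration of how interleaved documents and videos become training sequences, the sketch below concatenates text-token embeddings with per-image visual embeddings in document order. The helper names (embed_tokens, encode_image) and the learned image-boundary embeddings are hypothetical placeholders, not the actual data pipeline.

import torch

def build_interleaved_sequence(segments, embed_tokens, encode_image, img_begin, img_end):
    # segments:  list of ("text", token_ids) or ("image", image) entries in document order
    # img_begin / img_end: assumed (1, D) boundary embeddings marking the start/end of an image
    parts, is_text = [], []
    for kind, payload in segments:
        if kind == "text":
            emb = embed_tokens(payload)     # (num_tokens, D) token embeddings
            parts.append(emb)
            is_text.extend([True] * emb.shape[0])
        else:
            vis = encode_image(payload)     # (N, D) causal visual embeddings
            parts.extend([img_begin, vis, img_end])
            is_text.extend([True] + [False] * vis.shape[0] + [True])
    return torch.cat(parts, dim=0), torch.tensor(is_text)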

Evaluation and Results

Emu's performance is evaluated across a variety of tasks: image captioning, visual question answering, video question answering, and text-to-image generation. In zero-shot evaluation, Emu surpasses state-of-the-art models on multiple benchmarks, and few-shot prompting improves task-specific performance further. Emu also exhibits in-context learning, handling new tasks effectively from only a handful of examples.
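
A minimal sketch of how such few-shot prompting can be set up for captioning is shown below. It assumes a generate interface that accepts a list of interleaved images and strings; the prompt template is illustrative rather than the paper's exact wording.

def few_shot_caption(model, support, query_image, k=2):
    # support: list of (image, caption) pairs used as in-context examples
    segments = []
    for image, caption in support[:k]:
        segments += [image, f"a photo of {caption}. "]
    segments += [query_image, "a photo of"]  # the model completes the final caption
    return model.generate(segments)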

Notably, Emu achieves a zero-shot CIDEr score of 112.4 on COCO image captioning, a substantial improvement over contemporaneous models. The instruction-tuned variant, Emu-I, aligns the model with human intent and outperforms several larger models on the reported metrics.

Implications and Future Directions

Emu's contributions are multifaceted. The model's ability to perform diverse tasks such as image captioning and text-to-image generation positions it as a generalist multimodal interface. Emu's framework underlines the potential benefits of large-scale, diverse data integration, particularly when video-text datasets are incorporated into training.

The implications of this research span theoretical advances in multimodal Transformer architectures and practical deployment of large multimodal models (LMMs) in real-world use cases. Future work could refine the model's text-to-image generation, improving the fidelity and relevance of generated visuals through more extensive fine-tuning or alternative architectures. Emu's use of video-derived data also opens avenues for richer, more dynamic applications in video content understanding and generation.

Overall, the paper provides a comprehensive evaluation of a robust and versatile multimodal model, setting a new benchmark in the field of multimodal AI research. The inclusion of diverse multimodal data and the unified training approach presents compelling directions for further exploration in multimodal AI systems.

References (76)
  1. Laion-aesthetics. https://laion.ai/blog/laion-aesthetics/.
  2. Laion coco: 600m synthetic captions from laion2b-en. https://laion.ai/blog/laion-coco/.
  3. Flamingo: a visual language model for few-shot learning. In NeurIPS, 2022.
  4. Openflamingo, 2023.
  5. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728–1738, 2021.
  6. Beit: BERT pre-training of image transformers. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.
  7. Vlmo: Unified vision-language pre-training with mixture-of-modality-experts. Advances in Neural Information Processing Systems, 35:32897–32912, 2022.
  8. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  9. Pali: A jointly-scaled multilingual language-image model. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023.
  10. UNITER: universal image-text representation learning. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XXX, volume 12375 of Lecture Notes in Computer Science, pages 104–120. Springer, 2020.
  11. Unifying vision-and-language tasks via text generation. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 1931–1942. PMLR, 2021.
  12. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
  13. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.
  14. Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR, abs/2305.06500, 2023.
  15. Visual dialog. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 326–335, 2017.
  16. BERT: pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics, 2019.
  17. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
  18. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023.
  19. Make-a-scene: Scene-based text-to-image generation with human priors. In European Conference on Computer Vision, pages 89–106. Springer, 2022.
  20. Vision-language pre-training: Basics, recent advances, and future trends. Found. Trends Comput. Graph. Vis., 14(3-4):163–352, 2022.
  21. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017.
  22. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3608–3617, 2018.
  23. Language models are general-purpose interfaces. arXiv preprint arXiv:2206.06336, 2022.
  24. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
  25. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
  26. Lora: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.
  27. Language is not all you need: Aligning perception with language models. arXiv preprint arXiv:2302.14045, 2023.
  28. Scaling up visual and vision-language representation learning with noisy text supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 4904–4916. PMLR, 2021.
  29. Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3128–3137, 2015.
  30. Vilt: Vision-and-language transformer without convolution or region supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 5583–5594. PMLR, 2021.
  31. Generating images with multimodal language models. arXiv preprint arXiv:2305.17216, 2023.
  32. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023.
  33. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
  34. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.
  35. Align before fuse: Vision and language representation learning with momentum distillation. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 9694–9705, 2021.
  36. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023.
  37. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
  38. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023.
  39. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.
  40. Pseudo numerical methods for diffusion models on manifolds. arXiv preprint arXiv:2202.09778, 2022.
  41. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  42. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023.
  43. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/cvf conference on computer vision and pattern recognition, pages 3195–3204, 2019.
  44. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
  45. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  46. Beit v2: Masked image modeling with vector-quantized visual tokenizers. arXiv preprint arXiv:2208.06366, 2022.
  47. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR, 2021.
  48. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
  49. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  50. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021.
  51. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, June 2022.
  52. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
  53. Laion-5b: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022.
  54. Chatgpt: Optimizing language models for dialogue. OpenAI blog, 2022.
  55. Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389, 2023.
  56. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
  57. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  58. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  59. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015.
  60. Git: A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100, 2022.
  61. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International Conference on Machine Learning, pages 23318–23340. PMLR, 2022.
  62. Image as a foreign language: Beit pretraining for vision and vision-language tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19175–19186, 2023.
  63. Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560, 2022.
  64. Simvlm: Simple visual language model pretraining with weak supervision. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.
  65. Finetuned language models are zero-shot learners. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.
  66. Next-qa: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9777–9786, June 2021.
  67. Video question answering via gradually refined attention over appearance and motion. In Proceedings of the 25th ACM international conference on Multimedia, pages 1645–1653, 2017.
  68. An empirical study of gpt-3 for few-shot knowledge-based vqa. In Proceedings of the AAAI Conference on Artificial Intelligence, 2022.
  69. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.
  70. Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.
  71. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2022.
  72. Merlot reserve: Neural script knowledge through vision and language and sound. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16375–16387, 2022.
  73. GLM-130B: an open bilingual pre-trained model. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023.
  74. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023.
  75. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
  76. Multimodal c4: An open, billion-scale corpus of images interleaved with text. arXiv preprint arXiv:2304.06939, 2023.
Authors (10)
  1. Quan Sun (31 papers)
  2. Qiying Yu (13 papers)
  3. Yufeng Cui (12 papers)
  4. Fan Zhang (685 papers)
  5. Xiaosong Zhang (29 papers)
  6. Yueze Wang (14 papers)
  7. Hongcheng Gao (28 papers)
  8. Jingjing Liu (139 papers)
  9. Tiejun Huang (130 papers)
  10. Xinlong Wang (56 papers)
Citations (110)