
Generative Multimodal Models are In-Context Learners

(arXiv:2312.13286)
Published Dec 20, 2023 in cs.CV

Abstract

The human ability to easily solve multimodal tasks in context (i.e., with only a few demonstrations or simple instructions) is what current multimodal systems have largely struggled to imitate. In this work, we demonstrate that the task-agnostic in-context learning capabilities of large multimodal models can be significantly enhanced by effective scaling-up. We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences with a unified autoregressive objective. Emu2 exhibits strong multimodal in-context learning abilities, even emerging to solve tasks that require on-the-fly reasoning, such as visual prompting and object-grounded generation. The model sets a new record on multiple multimodal understanding tasks in few-shot settings. When instruction-tuned to follow specific instructions, Emu2 further achieves new state-of-the-art on challenging tasks such as question answering benchmarks for large multimodal models and open-ended subject-driven generation. These achievements demonstrate that Emu2 can serve as a base model and general-purpose interface for a wide range of multimodal tasks. Code and models are publicly available to facilitate future research.

Overview

  • Generative multimodal models integrate AI with multiple data forms like text, images, and video for human-like content interaction.

  • Emu2 is a groundbreaking generative multimodal model with 37 billion parameters, adept at context-driven multimodal tasks.

  • The model showcases strong in-context learning, setting new records on multiple multimodal understanding tasks in few-shot settings.

  • Instruction tuning enhances Emu2's capabilities, achieving top-tier results in complex multimodal challenges.

  • While providing advancements in AI, the paper also addresses the ethical considerations and the necessity for responsible model deployment.

Introduction to Generative Multimodal Models

Generative multimodal models aim to bring AI closer to human-like understanding and creation involving multiple forms of data, such as text, images, and video. Their goal is to interpret and generate content in ways that combine these different modalities, much like how humans engage with the world using multiple senses.

Emu2: A Leap in Multimodal Learning

The newly introduced Emu2 is a state-of-the-art generative multimodal model containing 37 billion parameters. Trained on large-scale multimodal sequences with a unified autoregressive objective, Emu2 demonstrates remarkable capability on context-driven tasks, both understanding and generating multimodal content. The research shows how scaling up model size and training data can significantly enhance a model's in-context learning ability, pushing it toward tasks that require on-the-fly reasoning, such as visual prompt understanding and object-grounded generation.
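
To make the training setup more concrete, below is a minimal sketch of what a unified autoregressive ("predict the next element") objective over an interleaved text-and-visual sequence could look like. The specific heads, the mean-squared-error regression on visual embeddings, and the equal loss weighting are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedAutoregressiveLoss(nn.Module):
    """Toy next-element prediction loss over an interleaved multimodal
    sequence: text positions get a classification loss on the next token,
    visual positions get a regression loss on the next visual embedding."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.text_head = nn.Linear(d_model, vocab_size)  # next-token classifier
        self.visual_head = nn.Linear(d_model, d_model)   # next-embedding regressor

    def forward(self, hidden, next_tokens, next_visual, is_text):
        # hidden:      (seq, d_model) transformer outputs at each position
        # next_tokens: (seq,) target token ids (used where is_text is True)
        # next_visual: (seq, d_model) target visual embeddings (used elsewhere)
        # is_text:     (seq,) bool mask marking which targets are text tokens
        text_loss = F.cross_entropy(self.text_head(hidden[is_text]), next_tokens[is_text])
        visual_loss = F.mse_loss(self.visual_head(hidden[~is_text]), next_visual[~is_text])
        return text_loss + visual_loss

# Tiny usage with random tensors standing in for real model outputs.
seq, d_model, vocab = 8, 16, 100
loss_fn = UnifiedAutoregressiveLoss(vocab, d_model)
loss = loss_fn(
    torch.randn(seq, d_model),
    torch.randint(0, vocab, (seq,)),
    torch.randn(seq, d_model),
    torch.tensor([1, 1, 0, 0, 1, 1, 0, 1], dtype=torch.bool),
)
print(loss.item())
```

Keeping a single objective over both modalities is what lets the same model be prompted with arbitrary interleavings of images and text at inference time.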

Evaluating Emu2's Capabilities

Emu2 has been rigorously evaluated under different scenarios. In few-shot settings, where the model learns from only a handful of in-context examples, it exhibits strong in-context learning and sets new records on multiple multimodal understanding tasks. Instruction tuning further refines Emu2's performance, allowing it to reach new state-of-the-art results on challenging tasks such as question-answering benchmarks for large multimodal models and open-ended subject-driven generation.
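
As a rough illustration of the few-shot setup, the snippet below assembles an interleaved prompt of (image, question, answer) demonstrations followed by a query. The ImageRef placeholder and the "Question/Answer" template are assumptions made for illustration; they are not Emu2's actual prompt format or API.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

@dataclass
class ImageRef:
    path: str  # placeholder for an image that a vision encoder would embed

Segment = Union[str, ImageRef]

def build_few_shot_prompt(
    demos: List[Tuple[str, str, str]], query_image: str, query_question: str
) -> List[Segment]:
    """Interleave k (image, question, answer) demonstrations with the query,
    leaving the final answer slot open for the model to complete."""
    prompt: List[Segment] = []
    for image_path, question, answer in demos:
        prompt += [ImageRef(image_path), f"Question: {question} Answer: {answer}\n"]
    prompt += [ImageRef(query_image), f"Question: {query_question} Answer:"]
    return prompt

demos = [
    ("cat.jpg", "What animal is shown?", "a cat"),
    ("bus.jpg", "What color is the bus?", "red"),
]
for segment in build_few_shot_prompt(demos, "dog.jpg", "What animal is shown?"):
    print(segment)
```

The point of the evaluation is that no task-specific fine-tuning happens here: the demonstrations alone are enough to steer the model toward the target task.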

Advanced Applications and Future Implications

Emu2 is designed as a base model that can serve as a versatile, general-purpose interface across a broad range of tasks involving text and visuals. As a controllable visual generator, Emu2 can synthesize high-quality images from a flexible mix of conditions, such as text prompts, reference images, and object locations, demonstrating strong in-context generation ability. The study also discusses the societal implications of such powerful models, including their potential for misuse, and underscores the need for further improvements and responsible deployment of such technologies.
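
The sketch below shows one way a caption, reference subject images, and target locations might be gathered into a single interleaved conditioning sequence for such a controllable generator. The SubjectImage structure and the <box> coordinate encoding are hypothetical; the paper's grounding format and generation interface may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

@dataclass
class SubjectImage:
    path: str                               # reference image of the subject to preserve
    box: Tuple[float, float, float, float]  # normalized (x1, y1, x2, y2) target placement

Condition = Union[str, SubjectImage]

def build_generation_conditions(caption: str, subjects: List[SubjectImage]) -> List[Condition]:
    """Interleave a text caption with subject images and their target
    locations, so the generator sees both what to draw and where."""
    conditions: List[Condition] = [caption]
    for subject in subjects:
        x1, y1, x2, y2 = subject.box
        conditions += [f"<box>{x1:.2f},{y1:.2f},{x2:.2f},{y2:.2f}</box>", subject]
    return conditions

conds = build_generation_conditions(
    "A corgi and a teapot on a picnic blanket",
    [SubjectImage("corgi.jpg", (0.05, 0.40, 0.45, 0.95)),
     SubjectImage("teapot.jpg", (0.55, 0.50, 0.90, 0.90))],
)
print(conds)
```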
