
Otter: A Multi-Modal Model with In-Context Instruction Tuning (2305.03726v1)

Published 5 May 2023 in cs.CV and cs.CL

Abstract: LLMs have demonstrated significant universal capabilities as few/zero-shot learners in various tasks due to their pre-training on vast amounts of text data, as exemplified by GPT-3, which evolved into InstructGPT and ChatGPT, effectively following natural language instructions to accomplish real-world tasks. In this paper, we propose to introduce instruction tuning into multi-modal models, motivated by the Flamingo model's upstream interleaved format pretraining dataset. We adopt a similar approach to construct our MultI-Modal In-Context Instruction Tuning (MIMIC-IT) dataset. We then introduce Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following ability and in-context learning. We also optimize OpenFlamingo's implementation for researchers, democratizing the required training resources from 1× A100 GPU to 4× RTX-3090 GPUs, and integrate both OpenFlamingo and Otter into Huggingface Transformers for more researchers to incorporate the models into their customized training and inference pipelines.

Otter: A Multi-Modal Model with In-Context Instruction Tuning

The paper "Otter: A Multi-Modal Model with In-Context Instruction Tuning" presents a refined approach to enhancing instruction-following capabilities within multi-modal models. Leveraging the principles of instruction tuning in LLMs, this paper centers around the integration of instruction tuning into multi-modal learning contexts, using the model Otter as a case paper.

Model Design and Objectives

Otter builds upon OpenFlamingo, an open-sourced version of DeepMind's Flamingo, adopting a fine-tuning methodology that capitalizes on the MultI-Modal In-Context Instruction Tuning (MIMIC-IT) dataset. The principal aim is to improve the model's capability to follow instructions and perform in-context learning, thereby enabling more effective execution of real-world tasks across different modalities.

The researchers also aim to democratize training by significantly reducing its computational requirements. Moving from a dependence on a single A100 GPU to 4× RTX-3090 GPUs is a notable practical advancement, facilitating broader access for researchers.

Dataset and Methodology

The MIMIC-IT dataset, central to Otter's design, comprises image-instruction-answer triplets and contextual examples, offering a rich platform for the model to learn instruction-following. This approach aligns visual and linguistic data more naturally, contrasting with conventional task-dependent alignment practices.
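
To make this structure concrete, a MIMIC-IT-style training example can be pictured as a query triplet accompanied by a small set of in-context demonstrations. The sketch below is illustrative only; the field names (image_path, instruction, answer, in_context) are assumptions and not the dataset's actual schema.

```python
# Illustrative sketch of a MIMIC-IT-style example: a query
# image-instruction-answer triplet plus in-context demonstrations.
# Field names are hypothetical, not the dataset's real schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Triplet:
    image_path: str      # path or URL of the image
    instruction: str     # natural-language instruction
    answer: str          # target response

@dataclass
class MimicItExample:
    query: Triplet                                            # example to be answered
    in_context: List[Triplet] = field(default_factory=list)   # demonstrations

example = MimicItExample(
    query=Triplet("imgs/query.jpg", "What is unusual about this image?", "..."),
    in_context=[
        Triplet("imgs/demo1.jpg", "Describe the scene.", "A man ironing on the back of a taxi."),
    ],
)
```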

The model employs cross-attention layers and a Perceiver resampler module for effective multi-modal integration, with training conducted via the AdamW optimizer. Otter's training paradigm is designed to preserve OpenFlamingo’s original in-context learning strengths while refining its instruction comprehension capabilities.
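
A minimal PyTorch-style sketch of such a selective fine-tuning setup is given below, assuming the Flamingo-style recipe of freezing the pretrained backbones and updating only the newly inserted multimodal modules with AdamW. The module-name filters ("perceiver", "gated_cross_attn") and the hyperparameter values are assumptions, not values taken from the paper.

```python
# Sketch of selective fine-tuning: freeze the backbones, train only the
# Perceiver resampler and cross-attention layers with AdamW.
# Module names and hyperparameters below are assumptions.
import torch

def configure_optimizer(model, lr=1e-5, weight_decay=0.1):
    # Freeze every parameter first.
    for p in model.parameters():
        p.requires_grad = False
    # Unfreeze the multimodal adapter modules by (hypothetical) name.
    trainable = []
    for name, p in model.named_parameters():
        if "perceiver" in name or "gated_cross_attn" in name:
            p.requires_grad = True
            trainable.append(p)
    # AdamW over only the trainable subset.
    return torch.optim.AdamW(trainable, lr=lr, weight_decay=weight_decay)
```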

Results and Implications

Otter shows marked improvements in instruction-following abilities compared to OpenFlamingo, the base model it fine-tunes. It provides more detailed and relevant responses, demonstrating clear benefits in visual question answering and complex situational understanding tasks.

Otter also retains robust in-context learning abilities, engaging effectively with user-provided examples to enhance understanding and inference. This characteristic positions Otter as a highly adaptable multi-modal platform.
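
As an illustration of this workflow, user-provided demonstrations can be serialized into an interleaved prompt before being fed to the model alongside the corresponding images. The sketch below assumes Flamingo/Otter-style special tokens (<image>, <answer>, <endofchunk>); the exact template should be checked against the released code.

```python
# Sketch: assemble an interleaved in-context prompt from demonstration
# instruction-answer pairs plus a new query. The special tokens are
# assumptions; consult the released code for the exact template.
def build_prompt(demos, instruction):
    chunks = []
    for demo_instruction, demo_answer in demos:
        chunks.append(
            f"<image>User: {demo_instruction} GPT:<answer> {demo_answer}<endofchunk>"
        )
    # The final chunk carries the query and leaves the answer open.
    chunks.append(f"<image>User: {instruction} GPT:<answer>")
    return "".join(chunks)

prompt = build_prompt(
    demos=[("What is in the image?", "A dog playing with a ball.")],
    instruction="What is the dog likely to do next?",
)
```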

Discussion and Future Directions

The practical implications of Otter are profound, particularly in applications requiring seamless multi-modal interaction and understanding. The integration with Hugging Face Transformers further broadens its accessibility, simplifying incorporation into diverse research and development workflows.
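
A hypothetical loading snippet is shown below; the checkpoint identifier is a placeholder, and the generic Auto* classes with trust_remote_code are assumptions about the integration, so the authors' released code should be consulted for the actual entry points.

```python
# Hypothetical loading sketch via Hugging Face Transformers.
# The checkpoint string is a placeholder, not a real model ID.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "path-or-hub-id-of-an-otter-checkpoint"  # placeholder
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
```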

Looking ahead, the researchers suggest improvements in addressing language hallucination issues and the potential exploration of parameter-efficient fine-tuning strategies like LoRA. Expanding Otter’s capabilities across additional modalities, including 3D vision, offers promising avenues for future exploration.
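
As a sketch of what a LoRA-based variant could look like, the snippet below applies the Hugging Face peft library on top of a loaded model; the rank, scaling factor, and target module names are assumptions rather than choices made in the paper.

```python
# Minimal LoRA sketch using the `peft` library. Target module names are
# assumptions and would need to match Otter's actual attention projections.
from peft import LoraConfig, get_peft_model

# `model` stands for the loaded Otter/OpenFlamingo model (see above).
lora_config = LoraConfig(
    r=16,                                  # low-rank dimension (assumed)
    lora_alpha=32,                         # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # hypothetical projection-layer names
)
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
```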

Conclusion

The development of Otter underscores a significant milestone in the evolutionary trajectory of multi-modal models. By effectively harnessing in-context instruction tuning, Otter exemplifies progress in bridging the gap between visual and textual learning modalities. This paper provides a robust foundation for further advancements in multi-modal AI research, offering enhanced capabilities and accessibility for the research community.

Authors (6)
  1. Bo Li (1107 papers)
  2. Yuanhan Zhang (29 papers)
  3. Liangyu Chen (50 papers)
  4. Jinghao Wang (6 papers)
  5. Jingkang Yang (36 papers)
  6. Ziwei Liu (368 papers)
Citations (453)