Many-Shot In-Context Learning in Multimodal Foundation Models
Introduction
As large language models (LLMs) continue to evolve, their ability to perform new tasks from only a handful of in-context examples has been a significant highlight. Recent expansions of the context window have opened up new opportunities, particularly for "many-shot" in-context learning (ICL). This paper investigates the performance improvements brought by many-shot ICL using two state-of-the-art multimodal models: GPT-4o and Gemini 1.5 Pro.
Methods and Datasets
Models Used
- GPT-4o: Known for its robust performance across various NLP tasks.
- Gemini 1.5 Pro: A newer entrant with a context window of up to one million tokens, allowing significantly more in-context examples per prompt.
Datasets
The paper benchmarks performance across 10 datasets spanning several domains, including natural, medical, remote-sensing, and molecular imagery. Here's a breakdown:
- Natural Imagery: TerraIncognita, Oxford Pets, DTD
- Medical Imagery: HAM10000, FIVES, CheXpert, Camelyon17
- Remote Sensing: UCMerced, EuroSAT
- Molecular Imagery: DrugOOD Assay
Each dataset is evaluated using performance metrics like accuracy and F1 score, and models are tested on multi-class, multi-label, and fine-grained classification tasks.
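To make the setup concrete, here is a minimal sketch of how a many-shot multimodal classification prompt might be assembled: a shared instruction, followed by N demonstration image/label pairs, followed by the query image. The chat-message structure and helper names are illustrative assumptions, not the paper's actual evaluation harness.

```python
import base64

def encode_image(image_bytes: bytes) -> str:
    """Base64-encode raw image bytes for inclusion in an API payload."""
    return base64.b64encode(image_bytes).decode("utf-8")

def build_many_shot_prompt(demos, query_image: bytes, class_names):
    """Assemble a chat-style payload: one instruction, then each
    demonstration (image, label) pair, then a single query image.
    `demos` is a list of (image_bytes, label) tuples."""
    content = [{"type": "text",
                "text": "Classify each image into one of: " + ", ".join(class_names)}]
    for image_bytes, label in demos:
        content.append({"type": "image", "data": encode_image(image_bytes)})
        content.append({"type": "text", "text": f"Label: {label}"})
    # The query image is appended last, with an open-ended label slot.
    content.append({"type": "image", "data": encode_image(query_image)})
    content.append({"type": "text", "text": "Label:"})
    return [{"role": "user", "content": content}]

# Example: 2 demonstrations plus 1 query (placeholder bytes)
demos = [(b"demo-image-1", "cat"), (b"demo-image-2", "dog")]
prompt = build_many_shot_prompt(demos, b"query-image", ["cat", "dog"])
```

Scaling from few-shot to many-shot is then just a matter of extending `demos` to hundreds or thousands of pairs, which is what the expanded context windows make possible.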
Key Findings
Performance Improvements
- Many-Shot Effectiveness:
- Gemini 1.5 Pro consistently exhibited log-linear performance improvements as more demonstration examples were added, with notable gains on datasets like HAM10000 (+23%), FIVES (+29%), and EuroSAT (+38%).
- GPT-4o also improved with many-shot ICL but was less stable than Gemini 1.5 Pro, showing a V-shaped performance trend as examples were added.
- ICL Data Efficiency:
- Gemini 1.5 Pro outperformed GPT-4o in ICL data efficiency on most datasets, with the highest efficiency observed on EuroSAT.
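The data-efficiency comparison can be approximated by fitting accuracy against the logarithm of the number of demonstrations: a steeper slope means a model extracts more benefit per additional example. A minimal sketch with made-up accuracy numbers (not the paper's measurements):

```python
import math

def log_linear_slope(shots, accuracies):
    """Least-squares slope of accuracy vs. log(number of shots) --
    a simple proxy for ICL data efficiency."""
    xs = [math.log(s) for s in shots]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(accuracies) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Illustrative (fabricated) accuracies at 1, 10, 100, and 1000 shots
shots = [1, 10, 100, 1000]
model_a = [0.40, 0.52, 0.63, 0.75]   # roughly log-linear gains
model_b = [0.45, 0.50, 0.54, 0.58]   # shallower improvement
print(log_linear_slope(shots, model_a) > log_linear_slope(shots, model_b))  # True
```

A higher slope for one model on a given dataset corresponds to the kind of per-dataset efficiency comparison the paper reports.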
Impact of Batch Querying
Submitting a single query per request proved suboptimal for many datasets. Batching (grouping multiple queries into one request) showed:
- Minimal to no performance degradation for Gemini 1.5 Pro, even at large batch sizes.
- Substantial latency and cost savings.
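Mechanically, batching just means splitting the query set into groups that each share one copy of the demonstration context. A minimal sketch of that grouping step:

```python
def batch_queries(queries, batch_size):
    """Split a list of query items into batches of at most `batch_size`.
    Each batch is sent in a single request, sharing one copy of the
    (potentially very long) demonstration context."""
    return [queries[i:i + batch_size] for i in range(0, len(queries), batch_size)]

batches = batch_queries(list(range(10)), 4)
print([len(b) for b in batches])  # [4, 4, 2]
```

Because the demonstrations dominate the prompt length in many-shot ICL, amortizing them over a batch is where the latency and cost savings come from.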
Cost and Latency Analysis
Many-shot ICL can be computationally expensive due to long input contexts. With batching:
- A nearly 35x reduction in latency and a 10x reduction in cost for HAM10000.
- For TerraIncognita, latency fell by 20x and cost by 45x.
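These savings follow from simple token accounting: with a batch of size B, the demonstration tokens are billed once per batch rather than once per query, so per-query input tokens fall from `demo + query` to `demo/B + query`. The token counts below are illustrative assumptions, not the paper's billing figures:

```python
def per_query_tokens(demo_tokens, query_tokens, batch_size):
    """Input tokens billed per query when `batch_size` queries share
    one copy of the demonstration context."""
    return demo_tokens / batch_size + query_tokens

# Illustrative token counts (assumptions, not measured values)
demo_tokens = 90_000    # e.g. hundreds of demonstration images
query_tokens = 1_000    # one query image plus instructions
unbatched = per_query_tokens(demo_tokens, query_tokens, 1)    # 91,000
batched = per_query_tokens(demo_tokens, query_tokens, 50)     # 2,800
print(round(unbatched / batched, 1))  # 32.5x fewer tokens per query
```

The larger the demonstration set relative to each query, the closer the savings approach the batch size itself, which is consistent with the order-of-magnitude reductions reported above.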
Implications and Future Directions
Practical Implications:
- Operational Efficiency: Many-shot ICL can significantly enhance the adaptability of multimodal models, allowing for quick adaptation to new tasks without the need for extensive fine-tuning.
- Cost Reduction: The ability to batch queries effectively reduces computational costs and inference latency, making the deployment of these models more feasible in real-world applications.
Theoretical Implications:
- Model Robustness: The findings suggest that with larger context windows, models can leverage more in-context examples, leading to improved robustness and performance consistency.
- Understanding Model Behavior: Investigating the reasons behind the improvements seen with batch querying, such as domain and class calibration, provides deeper insights into how models can be further optimized.
Looking Ahead
Many-shot ICL represents a substantial stride for multimodal foundation models. As context windows continue to expand, the ability to leverage large numbers of demonstration examples will likely improve further. Ongoing research should explore:
- Comparative studies with traditional fine-tuning to evaluate performance and efficiency trade-offs.
- Detailed investigation of biases and hallucinations in the context of many-shot ICL.
- Extension to other tasks and open-source multimodal models.
In summary, this paper underscores the capability of multimodal foundation models to benefit significantly from many-shot ICL, paving the way for more efficient and adaptable AI applications.