
Many-Shot In-Context Learning in Multimodal Foundation Models (2405.09798v2)

Published 16 May 2024 in cs.LG, cs.AI, cs.CL, and cs.CV

Abstract: LLMs are effective at few-shot in-context learning (ICL). Recent advancements in multimodal foundation models have enabled unprecedentedly long context windows, presenting an opportunity to explore their capability to perform ICL with many more demonstrating examples. In this work, we evaluate the performance of multimodal foundation models scaling from few-shot to many-shot ICL. We benchmark GPT-4o and Gemini 1.5 Pro across 14 datasets spanning multiple domains (natural imagery, medical imagery, remote sensing, and molecular imagery) and tasks (image classification, visual QA, and object localization). We observe that many-shot ICL, including up to almost 2,000 demonstrating examples, leads to substantial improvements compared to few-shot (<100 examples) ICL across all of the datasets. Further, Gemini 1.5 Pro performance continues to improve log-linearly up to the maximum number of tested examples on many datasets. We also find open-weights multimodal foundation models like Llama 3.2-Vision do not benefit from the demonstrating examples, highlighting an important gap between open and closed multimodal foundation models. Given the high inference costs required for many-shot ICL, we also explore the impact of batching multiple queries in a single API call. We show that batching up to 50 queries can lead to performance improvements under zero-shot and many-shot ICL, with substantial gains in the zero-shot setting on multiple datasets, while drastically reducing per-query cost and latency. Finally, while GPT-4o and Gemini 1.5 Pro achieve similar zero-shot performance across the datasets, Gemini 1.5 Pro learns more quickly than GPT-4o on most datasets. Our results suggest that many-shot ICL could enable users to efficiently adapt multimodal foundation models to new applications and domains. Our codebase is publicly available at https://github.com/stanfordmlgroup/ManyICL .

Many-Shot In-Context Learning in Multimodal Foundation Models

Introduction

As LLMs continue to evolve, their ability to perform new tasks from a handful of in-context examples has been a significant highlight. Recent advances in context window size have opened up new opportunities, particularly in the "many-shot" in-context learning (ICL) regime. This paper investigates the performance improvements brought by many-shot ICL using two state-of-the-art multimodal models: GPT-4o and Gemini 1.5 Pro.

Methods and Datasets

Models Used

  1. GPT-4o: Known for its robust performance across a wide range of language and vision tasks.
  2. Gemini 1.5 Pro: A newer entrant supporting context windows of up to one million tokens, enabling substantially longer prompts.
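
Concretely, a many-shot multimodal prompt interleaves demonstration images with their labels before the query. Below is a minimal sketch assuming an OpenAI-style chat-vision payload; the helper names, prompt wording, and message layout are illustrative assumptions, not the authors' exact implementation:

```python
import base64

def encode_image(path: str) -> str:
    """Base64-encode an image file for inline transmission."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def build_many_shot_prompt(demos, query_path, class_names):
    """Interleave (image, label) demonstrations, then append the query image."""
    content = [{"type": "text",
                "text": "Classify each image as one of: " + ", ".join(class_names) + "."}]
    for path, label in demos:  # demonstration pairs
        content.append({"type": "image_url",
                        "image_url": {"url": "data:image/jpeg;base64," + encode_image(path)}})
        content.append({"type": "text", "text": f"Answer: {label}"})
    # The unlabeled query image comes last.
    content.append({"type": "image_url",
                    "image_url": {"url": "data:image/jpeg;base64," + encode_image(query_path)}})
    content.append({"type": "text", "text": "Answer:"})
    return [{"role": "user", "content": content}]
```

The resulting messages list can be passed to a chat-completions endpoint; with long-context models such as Gemini 1.5 Pro, the demonstration list can grow to hundreds or thousands of examples before the context window is exhausted.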

Datasets

The paper benchmarks performance across 14 datasets spanning multiple domains (natural imagery, medical imagery, remote sensing, and molecular imagery) and tasks (image classification, visual QA, and object localization). The image classification benchmarks break down by domain as follows:

  • Natural Imagery: Oxford Pets, DTD, TerraIncognita
  • Medical Imagery: HAM10000, FIVES, CheXpert, Camelyon17
  • Remote Sensing: UCMerced, EuroSAT
  • Molecular Imagery: DrugOOD Assay

Each dataset is evaluated using performance metrics like accuracy and F1 score, and models are tested on multi-class, multi-label, and fine-grained classification tasks.
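
As a toy illustration of these metrics (the labels below are invented, and the paper's exact per-dataset metric choices may differ):

```python
from sklearn.metrics import accuracy_score, f1_score

# Invented predictions for a 3-class skin-lesion task.
y_true = ["melanoma", "nevus", "nevus", "keratosis", "melanoma"]
y_pred = ["melanoma", "nevus", "melanoma", "keratosis", "melanoma"]

# Accuracy treats every example equally; macro F1 averages per-class F1,
# so rare classes in imbalanced medical datasets count just as much.
print("accuracy:", accuracy_score(y_true, y_pred))
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
```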

Key Findings

Performance Improvements

  1. Many-Shot Effectiveness:
    • Gemini 1.5 Pro consistently exhibited log-linear performance improvements as more demonstrating examples were added, with noticeable gains on datasets like HAM10000 (+23%), FIVES (+29%), and EuroSAT (+38%).
    • GPT-4o also improved with many-shot ICL but was less stable than Gemini 1.5 Pro, showing a V-shaped performance trend on several datasets.
  2. ICL Data Efficiency:
    • Gemini 1.5 Pro outperformed GPT-4o in ICL data efficiency on most datasets, with the highest efficiency observed on EuroSAT; one way to estimate this efficiency is sketched below.
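
Consistent with the log-linear trends above, ICL data efficiency can be summarized by regressing performance on the logarithm of the shot count and reading off the slope. A minimal sketch with invented accuracy numbers (not the paper's measurements):

```python
import numpy as np

# Hypothetical accuracy at increasing numbers of demonstrating examples.
shots = np.array([1, 5, 10, 50, 100, 500, 1000])
accuracy = np.array([0.42, 0.48, 0.51, 0.58, 0.61, 0.67, 0.70])

# Fit accuracy against log(shots); the slope summarizes how much accuracy
# improves per log-scale increase in examples (higher = more data-efficient).
slope, intercept = np.polyfit(np.log(shots), accuracy, deg=1)
print(f"gain per log-shot: {slope:.3f}")
```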

Impact of Batch Querying

Submitting one query per API call proved suboptimal for many datasets, since the long demonstration context must be re-sent (and re-billed) with every request. Batching (grouping multiple queries into a single call) showed:

  • Minimal to no performance degradation for Gemini 1.5 Pro across large batch sizes.
  • Substantial latency and cost savings, as sketched below.
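
A batched prompt reuses the demonstration context across several queries so its tokens are paid for once per call rather than once per query. A minimal sketch, extending the illustrative build_many_shot_prompt helper above (prompt wording and layout are again assumptions):

```python
def build_batched_prompt(demos, query_paths, class_names):
    """One many-shot context followed by several numbered, unlabeled queries."""
    content = [{"type": "text",
                "text": "Classify each image as one of: " + ", ".join(class_names)
                        + ". Answer each numbered query on its own line."}]
    for path, label in demos:  # shared demonstration context
        content.append({"type": "image_url",
                        "image_url": {"url": "data:image/jpeg;base64," + encode_image(path)}})
        content.append({"type": "text", "text": f"Answer: {label}"})
    for i, path in enumerate(query_paths, start=1):  # batched queries
        content.append({"type": "image_url",
                        "image_url": {"url": "data:image/jpeg;base64," + encode_image(path)}})
        content.append({"type": "text", "text": f"Query {i} answer:"})
    return [{"role": "user", "content": content}]
```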

Cost and Latency Analysis

Many-shot ICL can be computationally expensive due to long input contexts. With batching:

  • A nearly 35x reduction in latency and a 10x reduction in cost were observed for HAM10000.
  • For TerraIncognita, latency was reduced by 20x and cost by 45x.
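
These savings follow directly from amortizing the fixed demonstration context over the batch. A back-of-envelope calculation with invented token counts:

```python
demo_tokens = 50_000   # shared many-shot context (illustrative)
query_tokens = 500     # per-query image plus question (illustrative)
batch_size = 50

unbatched = demo_tokens + query_tokens                          # tokens billed per query
batched = (demo_tokens + batch_size * query_tokens) / batch_size
print(f"per-query cost reduction: {unbatched / batched:.1f}x")  # ~33.7x here
```

Actual reductions depend on the ratio of demonstration tokens to per-query tokens, which is why different datasets (HAM10000 versus TerraIncognita) show different savings.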

Implications and Future Directions

Practical Implications:

  1. Operational Efficiency: Many-shot ICL can significantly enhance the adaptability of multimodal models, allowing for quick adaptation to new tasks without the need for extensive fine-tuning.
  2. Cost Reduction: The ability to batch queries effectively reduces computational costs and inference latency, making the deployment of these models more feasible in real-world applications.

Theoretical Implications:

  1. Model Robustness: The findings suggest that with larger context windows, models can leverage more in-context examples, leading to improved robustness and performance consistency.
  2. Understanding Model Behavior: Investigating the reasons behind the improvements seen with batch querying, such as domain and class calibration, provides deeper insights into how models can be further optimized.

Looking Ahead

Many-shot ICL represents a substantial stride for multimodal foundation models. As context window sizes continue to expand, the ability to leverage a large number of demonstrating examples will likely improve further. Ongoing research should explore:

  • Comparative studies with traditional fine-tuning to evaluate performance and efficiency trade-offs.
  • Detailed investigation of biases and hallucinations in the context of many-shot ICL.
  • Extension to other tasks and to additional open-weights multimodal models.

In summary, this paper underscores the capability of multimodal foundation models to benefit significantly from many-shot ICL, paving the way for more efficient and adaptable AI applications.

Authors (6)
  1. Yixing Jiang
  2. Jeremy Irvin
  3. Ji Hun Wang
  4. Muhammad Ahmed Chaudhry
  5. Jonathan H. Chen
  6. Andrew Y. Ng