- The paper introduces LMAct, a benchmark for evaluating large language models' in-context imitation learning from long multimodal demonstrations.
- Experiments revealed that state-of-the-art models rarely achieve full expert performance, highlighting a significant knowing-doing gap despite extensive context.
- The open-source LMAct benchmark enables future research to enhance AI decision-making by providing a standard platform for evaluating and comparing methods.
Evaluating Multimodal In-Context Imitation Learning with LMAct
The paper "LMAct: A Benchmark for In-Context Imitation Learning with Long Multimodal Demonstrations" presented by Ruoss et al. from Google DeepMind addresses critical challenges in assessing the decision-making capabilities of contemporary large-scale LLMs (LMs) when functioning as agents. These models, while proficient in generating text outputs, encounter marked difficulties in decision-making tasks, despite possessing comprehensive knowledge about the tasks. The focus of the paper is to evaluate if these models can learn from substantial expert demonstrations embedded in their context, especially in a multimodal and extended context scenario, stretching up to one million tokens.
Core Research Methodology and Contributions
The authors introduce LMAct, a comprehensive benchmark designed to pressure-test LMs' ability to learn from multimodal in-context demonstrations. Evaluations span varied interactive decision-making tasks, such as playing tic-tac-toe, chess, and Atari games, and vary the amount of in-context evidence through a spectrum of demonstration episodes ranging from zero to 512.
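To make the evaluation setup concrete, the following is a minimal sketch, under assumed interfaces and prompt formats (not the paper's actual code), of how a many-shot in-context imitation prompt could be assembled from expert demonstration episodes:

```python
# Hypothetical sketch of many-shot prompt construction for in-context imitation:
# N expert episodes of (observation, action) pairs are concatenated into the
# context, followed by the current observation. All names and formats here are
# illustrative assumptions, not the benchmark's actual API.

def build_prompt(demonstrations, current_observation, num_shots):
    """Assemble an in-context imitation prompt from expert demonstrations.

    demonstrations: list of episodes, each a list of (observation, action) tuples,
                    where observations are already serialized (text or image tokens).
    num_shots:      how many episodes to include (0 for the zero-shot regime,
                    up to 512 in the many-shot regime).
    """
    parts = ["You are an expert agent. Imitate the demonstrated behavior."]
    for episode in demonstrations[:num_shots]:
        parts.append("--- Demonstration episode ---")
        for observation, action in episode:
            parts.append(f"Observation: {observation}")
            parts.append(f"Action: {action}")
    parts.append("--- Current episode ---")
    parts.append(f"Observation: {current_observation}")
    parts.append("Action:")  # the model is asked to continue with the expert's action
    return "\n".join(parts)
```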
Key contributions of this research include:
- Empirical Evaluation: The paper provides an extensive empirical assessment of current state-of-the-art LMs, such as Claude 3.5 Sonnet, Gemini 1.5 variants, GPT-4o, and o1 models, on tasks that can extend over long horizons.
- Unified Evaluation Regime: It explores zero-, few-, and many-shot regimes, providing insights into models' adaptability to varying context sizes (see the rollout sketch after this list).
- Prompt Optimization: Continuous optimization of prompt formats across tasks and models highlighted the challenges faced by these models in achieving expert-level scores.
- Open-Source Benchmark: By releasing the benchmark, along with the associated expert demonstrations and evaluation code, the authors provide a platform for future research and development toward tackling the identified challenges.
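Building on the prompt sketch above, a hypothetical rollout loop for scoring a model across shot regimes might look as follows; the environment and model interfaces (env.reset, env.step, model.act) are assumptions for illustration, while the released benchmark defines its own:

```python
# Hypothetical evaluation loop: roll out episodes with the LM acting from an
# in-context imitation prompt, sweeping the number of demonstrations, and report
# the mean episode score per shot count (to be compared against the expert).

def evaluate(model, env, demonstrations, shot_counts=(0, 1, 16, 512), num_episodes=10):
    """Return the mean episode score for each number of in-context demonstrations."""
    results = {}
    for num_shots in shot_counts:
        total = 0.0
        for _ in range(num_episodes):
            observation, done, score = env.reset(), False, 0.0
            while not done:
                prompt = build_prompt(demonstrations, observation, num_shots)
                action = model.act(prompt)            # query the LM for the next action
                observation, reward, done = env.step(action)
                score += reward
            total += score
        results[num_shots] = total / num_episodes     # compare against the expert's score
    return results
```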
Observations and Findings
The authors report that despite the proficiency of modern LMs, full expert performance is rarely achieved across tasks. The experiments show that providing more demonstrations often yields little improvement, indicating difficulties in exploiting long multimodal contexts and highlighting a "knowing-doing gap" in current LMs. Of particular note are the performance variations observed across input representations: models demonstrated different levels of competence with text-based versus image-based inputs, for instance.
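To illustrate what "input representation" means here, the snippet below (an assumption-laden sketch, not the benchmark's renderer) encodes the same tic-tac-toe position both as a text grid and as a rendered image using Pillow:

```python
# Minimal illustration of two observation formats for the same tic-tac-toe position:
# a plain-text grid versus a rendered image. The rendering details are assumptions.

from PIL import Image, ImageDraw

BOARD = [["X", "O", "."],
         [".", "X", "."],
         [".", ".", "O"]]

def board_as_text(board):
    """Serialize the board as a text observation, one row per line."""
    return "\n".join(" ".join(row) for row in board)

def board_as_image(board, cell=32):
    """Render the board as a small grayscale image observation."""
    img = Image.new("L", (3 * cell, 3 * cell), color=255)
    draw = ImageDraw.Draw(img)
    for r, row in enumerate(board):
        for c, mark in enumerate(row):
            if mark != ".":
                draw.text((c * cell + cell // 3, r * cell + cell // 4), mark, fill=0)
    return img

print(board_as_text(BOARD))             # text observation
board_as_image(BOARD).save("obs.png")   # image observation (requires Pillow)
```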
Task performance varied significantly, with models showing competency in simple tasks like grid world navigation but struggling in others, such as chess against even the weakest Stockfish version. For some tasks, such as tic-tac-toe with many demonstrations, actions could be inferred or "read off" from the context, yet many models couldn't fully capitalize on the potential cues.
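The "read off" observation can be made concrete with a trivial retrieval baseline: if the current state appears verbatim in a demonstration, the expert action is already sitting in the context. A minimal, hypothetical sketch:

```python
# Hypothetical illustration of "reading off" an action: with enough demonstrations,
# the current board state may appear verbatim in the context, so an exact-match
# lookup would already recover the expert move. Models often fail to exploit this cue.

def read_off_action(demonstrations, current_observation):
    """Return the expert action for an observation seen in the demonstrations, if any."""
    for episode in demonstrations:
        for observation, action in episode:
            if observation == current_observation:
                return action
    return None  # state not covered; the model must generalize instead
```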
Implications and Future Directions
The results underscore a substantial gap in translating declarative task knowledge into decision-making competence, a discrepancy that mere scale appears insufficient to close. Bridging it may require incorporating interaction data into training or employing more advanced sampling and inference schemes.
Furthermore, the benchmark fills a critical gap in evaluating in-context learning for interactive decision-making environments, paving the way for comparing diverse approaches such as fine-tuning, retrieval-based methods, or reward-conditioning techniques. The findings also point to room for refining models specifically for interactive tasks, particularly regarding computational efficiency and the consistency of generated actions.
In summary, "LMAct" establishes a foundational evaluation regimen for comprehending and enhancing the in-context imitation learning capabilities of large-scale foundation models, steering future exploration and advancement in AI decision-making paradigms. The open-source nature of this work significantly contributes to the broader AI research community’s ability to innovate and refine AI behaviors in structured environments.