
Towards Evaluating Generalist Agents: An Automated Benchmark in Open World (2310.08367v2)

Published 12 Oct 2023 in cs.AI, cs.CL, cs.CV, and cs.LG

Abstract: Evaluating generalist agents presents significant challenges due to their wide-ranging abilities and the limitations of current benchmarks in assessing true generalization. We introduce the Minecraft Universe (MCU), a fully automated benchmarking framework set within the open-world game Minecraft. MCU dynamically generates and evaluates a broad spectrum of tasks, offering three core components: 1) a task generation mechanism that provides high degrees of freedom and variability, 2) an ever-expanding set of over 3K composable atomic tasks, and 3) a general evaluation framework that supports open-ended task assessment. By integrating LLMs, MCU dynamically creates diverse environments for each evaluation, fostering agent generalization. The framework uses a vision-language model (VLM) to automatically generate evaluation criteria, achieving over 90% agreement with human ratings across multi-dimensional assessments, which demonstrates that MCU is a scalable and explainable solution for evaluating generalist agents. Additionally, we show that while state-of-the-art foundation models perform well on specific tasks, they often struggle with increased task diversity and difficulty.
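
The abstract describes a three-stage pipeline: compose a task from an atomic-task pool, run the agent, and have a VLM generate criteria and score the rollout. The sketch below illustrates that pipeline shape only; all names (`compose_task`, `vlm_judge`, the stub scoring) are hypothetical and not the paper's actual API, and the VLM call is replaced by a placeholder.

```python
# Hypothetical sketch of an MCU-style evaluation loop. Names and structure are
# illustrative assumptions, not the paper's implementation: an LLM would compose
# a task configuration, an agent produces a rollout, and a VLM judge generates
# criteria and rates the rollout on each dimension.
import random
from dataclasses import dataclass, field


@dataclass
class Task:
    description: str                # e.g. "mine a log then build a shelter"
    atomic_components: list[str]    # drawn from the composable atomic-task pool
    criteria: list[str] = field(default_factory=list)  # filled by the judge


def compose_task(atomic_pool: list[str], k: int = 2) -> Task:
    """Compose k atomic tasks into one task. MCU does this with an LLM for
    variability; here we simply sample and concatenate for illustration."""
    parts = random.sample(atomic_pool, k)
    return Task(description=" then ".join(parts), atomic_components=parts)


def vlm_judge(task: Task, rollout_frames: list[bytes]) -> dict[str, float]:
    """Stand-in for the VLM judge: it would derive multi-dimensional criteria
    from the task and score the rollout frames (the paper reports >90%
    agreement with human ratings). Placeholder scores are returned here."""
    task.criteria = [f"completed: {c}" for c in task.atomic_components]
    return {criterion: 0.0 for criterion in task.criteria}


if __name__ == "__main__":
    pool = ["mine a log", "craft a crafting table", "build a shelter"]
    task = compose_task(pool)
    scores = vlm_judge(task, rollout_frames=[])  # frames would come from the agent
    print(task.description, scores)
```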

Authors (6)
  1. Haowei Lin (21 papers)
  2. Zihao Wang (216 papers)
  3. Yitao Liang (53 papers)
  4. Xinyue Zheng (5 papers)
  5. Kaichen He (3 papers)
  6. Zilong Zheng (63 papers)
Citations (13)