MLGym: A New Framework and Benchmark for Advancing AI Research Agents (2502.14499v1)

Published 20 Feb 2025 in cs.CL, cs.AI, and cs.LG

Abstract: We introduce Meta MLGym and MLGym-Bench, a new framework and benchmark for evaluating and developing LLM agents on AI research tasks. This is the first Gym environment for ML tasks, enabling research on reinforcement learning (RL) algorithms for training such agents. MLGym-bench consists of 13 diverse and open-ended AI research tasks from diverse domains such as computer vision, natural language processing, reinforcement learning, and game theory. Solving these tasks requires real-world AI research skills such as generating new ideas and hypotheses, creating and processing data, implementing ML methods, training models, running experiments, analyzing the results, and iterating through this process to improve on a given task. We evaluate a number of frontier LLMs on our benchmarks such as Claude-3.5-Sonnet, Llama-3.1 405B, GPT-4o, o1-preview, and Gemini-1.5 Pro. Our MLGym framework makes it easy to add new tasks, integrate and evaluate models or agents, generate synthetic data at scale, as well as develop new learning algorithms for training agents on AI research tasks. We find that current frontier models can improve on the given baselines, usually by finding better hyperparameters, but do not generate novel hypotheses, algorithms, architectures, or substantial improvements. We open-source our framework and benchmark to facilitate future research in advancing the AI research capabilities of LLM agents.

The paper introduces MLGym and MLGym-Bench, a novel framework and benchmark designed for the development and evaluation of LLM agents in the context of AI research tasks. It addresses a gap in existing benchmarks by providing a Gym environment tailored for machine learning tasks, enabling the use of reinforcement learning algorithms for training agents. The authors evaluate frontier LLMs, finding that while these models can enhance existing baselines through hyperparameter optimization, they fall short of generating genuinely novel hypotheses, algorithms, architectures, or substantial improvements.

The paper's contributions can be summarized as follows:

  • Introduction of MLGym, the first Gym environment specifically designed for evaluating and developing AI Research Agents.
  • Release of MLGym-Bench, a diverse suite of open-ended AI research tasks intended for assessing LLM agents.
  • Proposal of a new evaluation metric that facilitates the comparison of multiple agents across a variety of tasks.
  • Comprehensive evaluation of frontier LLMs using MLGym-Bench.
  • Open-sourcing the framework and benchmark to encourage future research in the advancement of AI research capabilities of LLM agents.

Capability Levels for AI Research Agents

The authors propose a six-level hierarchy for categorizing LLM agent capabilities:

  • Level 0: Reproduction: Agents can reproduce existing research papers.
  • Level 1: Baseline Improvement: Agents can improve performance on a benchmark given baseline code.
  • Level 2: SOTA Achievement: Agents can achieve state-of-the-art (SOTA) performance given a task description and pre-existing literature.
  • Level 3: Novel Scientific Contribution: Agents can develop a new method worthy of publication at a top ML conference.
  • Level 4: Groundbreaking Scientific Contribution: Agents can identify key research questions and directions, meriting awards at prestigious ML conferences.
  • Level 5: Long-Term Research Agenda: Agents can pursue long-term research, achieving paradigm-shifting breakthroughs worthy of prizes like the Nobel or Turing.

MLGym-Bench focuses primarily on Level 1.

Related Work

The authors compare MLGym and MLGym-Bench with other related LLM agent frameworks and benchmarks, including:

  • MLE-Bench
  • SWE-Bench/Agent
  • MLAgentBench
  • RE-Bench
  • ScienceAgentBench

The comparison highlights MLGym's unique features, such as its Gym interface, inclusion of algorithmic tasks, focus on open-ended research, flexible artifact evaluation, and agentic harness.

The MLGym Framework

The MLGym framework facilitates ML research and development through agent interaction with a shell environment: the agent uses the task description, starter code, and its action/observation history to generate shell commands.

The framework comprises Agents, Environments, Datasets, and Tasks:

  • Agents: Agents are wrappers around base LLMs, integrating models, history processors, and cost management.
  • Environment: Environments are designed as Gymnasium (gym) environments, initializing a shell environment within a local Docker container with tools and task-specific dependencies (see the interaction sketch after this list).
  • Datasets: Datasets are defined via configuration files and support both local and Hugging Face datasets.
  • Tasks: Tasks define ML research tasks using configuration files, incorporating datasets, evaluation scripts, and training timeouts.
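
To make the environment design concrete, here is a minimal sketch of an agent loop against a Gymnasium-style MLGym environment. The `mlgym` package name, environment id, and configuration keyword are illustrative assumptions, not the framework's actual API.

```python
# Minimal sketch of an agent loop against a Gymnasium-style shell environment.
# The `mlgym` package, environment id, and config keyword below are assumptions
# for illustration, not the framework's actual API.
import gymnasium as gym
import mlgym  # hypothetical package that registers MLGym environments

env = gym.make(
    "MLGym/cifar10-classification",    # hypothetical task id
    task_config="tasks/cifar10.yaml",  # dataset, evaluation script, training timeout
)

observation, info = env.reset()  # task description plus starter-code listing
done = False
while not done:
    # A real agent (an LLM wrapper with a history processor) would condition on
    # the action/observation history and emit the next shell command; here we
    # simply run the provided baseline and evaluate it.
    action = "python train.py && python evaluate.py"
    observation, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated

env.close()
```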

Tools and ACI

The framework extends the Agent-Computer Interface (ACI) from SWE-Agent, incorporating literature search and a memory module.

The available tools include:

  • SWE-Agent Tools: Includes tools for Search, File Viewing, and File Editing.
  • Extended Tools: Encompass literature search, PDF parsing, and a memory module for storing and retrieving experimental findings.

The memory module is a tool for improving performance on long-horizon AI research tasks. It has two core functions: memory_write and memory_read. The memory_write function lets the agent store key insights and effective configurations, while memory_read retrieves the top-k most relevant stored entries based on cosine similarity with a given query. A minimal sketch of this interface is shown below.
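
The following sketch assumes a generic text-embedding function; the class layout and storage format are illustrative, not the paper's implementation.

```python
# Illustrative sketch of a memory module with memory_write / memory_read
# semantics as described above. The embedding function and in-memory storage
# are assumptions, not the framework's actual implementation.
import numpy as np

class Memory:
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn          # maps text -> 1-D numpy vector
        self.entries, self.vectors = [], []

    def memory_write(self, text: str) -> None:
        """Store an insight or configuration together with its embedding."""
        self.entries.append(text)
        self.vectors.append(self.embed_fn(text))

    def memory_read(self, query: str, k: int = 3) -> list[str]:
        """Return the top-k stored entries by cosine similarity to the query."""
        if not self.entries:
            return []
        q = self.embed_fn(query)
        mat = np.stack(self.vectors)
        sims = mat @ q / (np.linalg.norm(mat, axis=1) * np.linalg.norm(q) + 1e-8)
        top = np.argsort(-sims)[:k]
        return [self.entries[i] for i in top]
```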

MLGym-Bench Tasks

MLGym-Bench includes a variety of tasks:

  • Data Science: House Price Prediction using the Kaggle House Price dataset.
  • 3-SAT: Optimizing a variable selection heuristic for a DPLL solver.
  • Game Theory: Developing strategies for iterated Prisoner's Dilemma, Battle of the Sexes, and Colonel Blotto games.
  • Computer Vision: Image classification tasks using CIFAR-10 and Fashion MNIST datasets, and image captioning using the MS-COCO dataset.
  • Natural Language Processing: Natural Language Inference using MNLI and language modeling with the FineWeb dataset.
  • Reinforcement Learning: MetaMaze Navigation, Mountain Car Continuous, and Breakout MinAtar environments.

Experimental Setup

Experiments were conducted using a SWE-Agent-based agentic harness with frontier models, including OpenAI o1-preview, Gemini 1.5 Pro, Claude-3.5-Sonnet, Llama-3.1-405B-Instruct, and GPT-4o. The environment was configured with a window size of 1000 lines, a context processor maintaining a rolling window, and specialized commands for file navigation, editing, and evaluation.
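
As an illustration of rolling-window context management, here is a hedged sketch of a history processor that keeps only the most recent environment observations in full; the class and method names are hypothetical, not the harness's actual API.

```python
# Illustrative sketch of a rolling-window history processor: older environment
# observations are collapsed to a placeholder to keep the LLM context short.
# Names and structure are assumptions for illustration only.
class RollingWindowHistory:
    def __init__(self, max_observations: int = 5, placeholder: str = "[truncated]"):
        self.max_observations = max_observations
        self.placeholder = placeholder
        self.history = []  # list of (role, content) turns

    def append(self, role: str, content: str) -> None:
        self.history.append((role, content))

    def render(self) -> list[tuple[str, str]]:
        """Keep full text only for the most recent environment observations;
        replace older observations with a short placeholder."""
        obs_indices = [i for i, (role, _) in enumerate(self.history) if role == "environment"]
        keep = set(obs_indices[-self.max_observations:])
        return [
            (role, content if role != "environment" or i in keep else self.placeholder)
            for i, (role, content) in enumerate(self.history)
        ]
```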

Evaluation Metrics

The authors use performance profile curves and Area Under the Performance Profile (AUP) scores to compare agent performance across tasks. The AUP score is calculated as:

$\text{AUP}_m = \int_{1}^{\tau_\text{max}} \rho_{m}(\tau) \, d\tau$

Where:

  • $\text{AUP}_m$ is the area under the performance profile curve for method $m$.
  • $\tau$ is a threshold on the distance between method $m$ and the best-scoring method on each task.
  • $\tau_\text{max}$ is the minimum $\tau$ for which $\rho_{m}(\tau) = 1$ for all $m \in M$.
  • $\rho_{m}(\tau)$ is the performance profile curve for method $m$.
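
As a concrete illustration, the sketch below computes performance profiles and AUP scores numerically from a matrix of per-task scores. It assumes strictly positive, higher-is-better scores and uses the standard ratio-to-best profile definition; the score matrix in the example is made up.

```python
# Sketch of performance profiles and AUP scores following the formula above.
# The ratio-to-best profile definition and the example scores are assumptions
# for illustration, not the paper's exact computation.
import numpy as np

def performance_profile(scores, taus):
    """scores: (n_methods, n_tasks) array of positive, higher-is-better scores.
    Returns rho with shape (n_methods, len(taus))."""
    best = scores.max(axis=0)      # best score per task
    ratios = best / scores         # >= 1; smaller is better
    return np.stack([(ratios <= t).mean(axis=1) for t in taus], axis=1)

def aup(scores, num_points=200):
    """Area under the performance profile, integrated from tau = 1 to tau_max."""
    best = scores.max(axis=0)
    ratios = best / scores
    tau_max = ratios.max()         # smallest tau at which rho = 1 for all methods
    taus = np.linspace(1.0, tau_max, num_points)
    rho = performance_profile(scores, taus)
    return np.trapz(rho, taus, axis=1)  # one AUP score per method

# Example: 3 methods evaluated on 4 tasks
scores = np.array([[0.9, 0.7, 0.8, 0.6],
                   [0.8, 0.9, 0.7, 0.5],
                   [0.6, 0.6, 0.9, 0.7]])
print(aup(scores))
```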

Additional metrics include Best Submission and Best Attempt scores across four independent runs.

Results and Analysis

The results indicate that OpenAI's o1-preview generally achieves the best aggregate performance across tasks, followed by Gemini 1.5 Pro and Claude-3.5-Sonnet. Cost analysis reveals that Gemini 1.5 Pro offers a better balance between performance and computational cost. Action analysis indicates that agents spend significant time in iterative development cycles of editing and viewing files. GPT-4o takes the fewest actions overall, suggesting that the model either errors out or submits too early without reaching an optimal solution. Analysis of termination errors shows that evaluation errors occur most frequently, usually due to missing or incorrectly formatted submission artifacts.

Discussion and Limitations

The authors highlight opportunities and challenges in using LLMs for scientific workflows. Scaling the evaluation framework to accommodate larger datasets and more complex tasks is essential. Additionally, there is a need to address scientific novelty and promote data openness to drive scientific progress.

Ethical Considerations

The authors acknowledge that AI agents proficient in tackling open research challenges could catalyze scientific advancement, while also recognizing that this progress must be carefully controlled and responsibly deployed. They also point out that agents able to enhance their own training code could augment the capabilities of cutting-edge models at a pace that outstrips human researchers.

The paper concludes by emphasizing the need for improvements in long-context reasoning, better agent architectures, and richer evaluation methodologies to fully harness LLMs’ potential for scientific discovery.

Authors (17)
  1. Deepak Nathani
  2. Lovish Madaan
  3. Nicholas Roberts
  4. Nikolay Bashlykov
  5. Ajay Menon
  6. Vincent Moens
  7. Amar Budhiraja
  8. Despoina Magka
  9. Vladislav Vorotilov
  10. Gaurav Chaurasia
  11. Dieuwke Hupkes
  12. Ricardo Silveira Cabral
  13. Tatiana Shavrina
  14. Jakob Foerster
  15. Yoram Bachrach
  16. William Yang Wang
  17. Roberta Raileanu