MLGym-Bench: LLM Benchmark Suite
- MLGym-Bench is a benchmark suite that evaluates LLM agents on diverse AI research tasks using a formally defined Gym interface.
- It comprises 13 open-ended tasks across domains like computer vision, NLP, reinforcement learning, data science, and game theory.
- The framework emphasizes reproducibility, modularity, and cost-efficient performance metrics to drive rapid prototyping of novel agent strategies.
MLGym-Bench is a benchmark suite and environment specifically designed for evaluating the research capabilities of LLM agents on real-world AI research tasks. Introduced as part of the MLGym framework, it establishes a formally defined Gym interface for machine learning problem-solving, supporting research on reinforcement learning (RL) algorithms for agent training. MLGym-Bench presents 13 diverse, open-ended tasks covering computer vision, NLP, reinforcement learning, data science, and game theory, each demanding skills in hypothesis generation, data manipulation, algorithm implementation, model training, experimental analysis, and iterative solution refinement. The benchmark is open-sourced to promote reproducibility and extensibility within the AI research community (Nathani et al., 20 Feb 2025).
1. Formalization and Environment Structure
MLGym-Bench recasts AI research tasks as Markov Decision Processes (MDPs), defined as
$$\mathcal{M} = (\mathcal{S}, \mathcal{A}, T, R, \gamma)$$
where:
- $\mathcal{S}$ is the state space, representing the agent's workspace, including filesystem snapshots, action/observation history, and any memory modules.
- $\mathcal{A}$ is the action space, defined by a set of tokenized shell-like commands. These include generic file and code manipulation tools (e.g., `open`, `edit`, `search_dir`), as well as high-level tools (e.g., `validate`, `submit`, `literature_search`).
- $T$ is the transition function. Actions deterministically transition the environment and yield observation strings such as command output or errors.
- $R$ is the reward function. Nonzero rewards are only offered for the `validate` (intermediate test-set metric) and `submit` (final test metric) commands:
$$R(s, a) = \begin{cases} \text{metric}(s) & \text{if } a \in \{\texttt{validate}, \texttt{submit}\} \\ 0 & \text{otherwise} \end{cases}$$
- $\gamma$ is the discount factor, typically set to 1.0 because tasks are finite-horizon.
Observations consist of recent command output, a sliding window into the currently accessed file (up to 1000 lines), remaining step budget, elapsed time, and optional memory summaries. Actions are structured as bash-like commands, and one is generated per step.
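The interaction loop described above can be sketched in a few lines of Python. This is only an illustrative toy, not the MLGym API: the class name `ToyResearchEnv`, the fixed `metric` value, and the step budget are assumptions made for the example. It shows the one property the formalization stresses, namely that only `validate` and `submit` produce a nonzero reward.

```python
from dataclasses import dataclass, field

# Commands that yield a nonzero reward, per the reward definition in the text.
REWARDED = {"validate", "submit"}


@dataclass
class ToyResearchEnv:
    """Hypothetical stand-in for an MLGym-style environment (not the real API)."""
    max_steps: int = 5            # assumed step budget
    metric: float = 0.75          # assumed test-set metric of the current solution
    history: list = field(default_factory=list)
    done: bool = False

    def step(self, command: str):
        """Apply one shell-like command and return (observation, reward, done)."""
        if self.done:
            raise RuntimeError("episode already finished")
        tool = command.split()[0] if command.strip() else ""
        self.history.append(command)
        reward = self.metric if tool in REWARDED else 0.0
        # The episode ends on submit or when the step budget is exhausted.
        self.done = tool == "submit" or len(self.history) >= self.max_steps
        remaining = self.max_steps - len(self.history)
        observation = f"ran {tool!r}; {remaining} steps remaining"
        return observation, reward, self.done


def run_episode(env, commands):
    """Feed commands one per step and collect the per-step rewards."""
    rewards = []
    for command in commands:
        _, reward, done = env.step(command)
        rewards.append(reward)
        if done:
            break
    return rewards
```

With a discount factor of 1.0, the return of an episode is simply the sum of the collected rewards.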
2. Task Suite Composition and Challenges
MLGym-Bench comprises 13 tasks across several domains; in each, the agent must surpass a baseline solution within a fixed time or step budget.
| Domain | Example Task | Metric |
|---|---|---|
| Data Science | House Price Prediction | $R^2$ (higher is better) |
| Algorithmic Reasoning/SAT | 3-SAT Heuristic Optimization | Average solve time (lower) |
| Game Theory (Repeated NFG) | Iterated Prisoner's Dilemma | Avg. round payoff (higher) |
| Computer Vision | CIFAR-10 Classification | Accuracy (higher) |
| NLP | MNLI/Natural Language Inference | Accuracy (higher) |
| Reinforcement Learning | MetaMaze Navigation | Avg. return (higher) |
All tasks are initialized with baseline code or scripts and datasets. Challenges are tailored to domain-specific activities:
- Data Science: Emphasis on feature engineering and hyperparameter tuning.
- SAT/Algorithmic Reasoning: Writing custom heuristics for faster problem-solving.
- Game Theory: Coding strategy functions that best-respond to known opponents.
- Computer Vision/NLP: Architectural changes, training regime adjustments, and data augmentation.
- Reinforcement Learning: Designing training loops or tuning PPO-based learning.
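To make the game-theory item concrete, here is a minimal sketch of what a strategy function for the Iterated Prisoner's Dilemma might look like, scored by average round payoff. The payoff values are the textbook ones and are an assumption; the benchmark's actual matrix, function signature, and opponents may differ.

```python
# Textbook Prisoner's Dilemma payoffs for the row player (assumed values).
# "C" = cooperate, "D" = defect; key is (my_move, opponent_move).
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}


def tit_for_tat(history):
    """Cooperate first, then copy the opponent's previous move.

    history is a list of (my_move, opponent_move) pairs.
    """
    return "C" if not history else history[-1][1]


def always_defect(history):
    """A simple known opponent that defects every round."""
    return "D"


def average_payoff(strategy, opponent, rounds=10):
    """Average per-round payoff of `strategy` when playing `opponent`."""
    hist_a, hist_b = [], []
    total = 0
    for _ in range(rounds):
        a = strategy(hist_a)
        b = opponent(hist_b)
        total += PAYOFF[(a, b)]
        hist_a.append((a, b))   # each player sees the game from its own side
        hist_b.append((b, a))
    return total / rounds
```

An agent improving on a baseline would edit the strategy function and re-evaluate the average payoff against the fixed opponents.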
Several tasks generate synthetic data on-the-fly (e.g., 3-SAT via random instance generators), eliminating the need for dataset downloads and encouraging generalization.
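As an illustration of on-the-fly data generation, the sketch below draws random 3-SAT instances from a seed. It is a generic uniform generator written for this example; the function names and the clause encoding (signed integers, DIMACS-style) are assumptions, not the benchmark's actual generator.

```python
import random


def random_3sat(num_vars, num_clauses, seed=None):
    """Return a random 3-SAT instance as a list of 3-literal clauses.

    A literal is a nonzero int: +v means variable v, -v means its negation.
    """
    rng = random.Random(seed)
    clauses = []
    for _ in range(num_clauses):
        # Pick three distinct variables, then negate each with probability 0.5.
        variables = rng.sample(range(1, num_vars + 1), 3)
        clauses.append(tuple(v if rng.random() < 0.5 else -v for v in variables))
    return clauses


def satisfies(assignment, clauses):
    """Check whether a {variable: bool} assignment satisfies every clause."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )
```

Fixing the seed keeps instance sets reproducible across runs while still avoiding any dataset download.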
3. Modularity, Integration, and Extension
The MLGym framework exhibits modular design, ensuring straightforward integration and extensibility:
- Agents: Any base LLM can be wrapped using a standard protocol (interaction history in, next action out). The default implementation, "SWE-Agent," uses a ReAct-style loop and cost tracking.
- Environment: Each task is bootstrapped in a Docker-based Gymnasium™ container with a dedicated "agent" user, deploying task-specific Conda environments, enacting file permissions, and loading code/data workspaces.
- Datasets and Tasks: YAML or JSON configuration files stipulate datasets, initial code, evaluation scripts, resource constraints (timeout, memory), admissible tools, and submission formats.
- Task Registration is streamlined.
- Training Invocation uses simple, reproducible scripts.
A plausible implication is that the design supports rapid prototyping of both tasks and agent strategies as new algorithms emerge.
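The configuration-driven setup described in this section might look roughly like the following. This is a hypothetical sketch, not the MLGym interface: `TaskConfig`, `register_task`, `TASK_REGISTRY`, and every field name and default are invented for illustration, mirroring only the kinds of settings the text lists (dataset, initial code, evaluation script, resource limits, admissible tools).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskConfig:
    """Hypothetical task description; field names are illustrative only."""
    task_id: str
    dataset: str                  # path or identifier of the dataset
    starter_code: str             # baseline script handed to the agent
    eval_script: str              # script that computes the task metric
    timeout_s: int = 3600         # assumed resource limit
    memory_gb: int = 16           # assumed resource limit
    tools: tuple = ("open", "edit", "search_dir", "validate", "submit")


# In-memory registry mapping task ids to their configurations.
TASK_REGISTRY = {}


def register_task(cfg):
    """Validate a configuration and add it to the registry."""
    if cfg.task_id in TASK_REGISTRY:
        raise ValueError(f"task {cfg.task_id!r} is already registered")
    if cfg.timeout_s <= 0 or cfg.memory_gb <= 0:
        raise ValueError("resource limits must be positive")
    TASK_REGISTRY[cfg.task_id] = cfg
    return cfg
```

Keeping the task description declarative is what lets new tasks be added without touching agent or environment code.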
4. Baseline Agent Performance and Evaluation Metrics
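MLGym-Bench assesses state-of-the-art LLM agents, including Claude-3.5-Sonnet, Llama-3.1-405B-Instruct, GPT-4o, OpenAI O1-preview, and Gemini-1.5-Pro. Performance is evaluated using Dolan & Moré-style performance profiles. For each method $m$ on task set $T$, in the standard Dolan & Moré form:
$$\rho_m(\tau) = \frac{1}{|T|}\,\bigl|\{\, t \in T : r_{t,m} \le \tau \,\}\bigr|, \qquad r_{t,m} = \frac{\ell_{t,m}}{\min_{m'} \ell_{t,m'}}$$
where $\ell_{t,m}$ is the (lower-is-better) score of method $m$ on task $t$. The AUP score integrates this profile:
$$\mathrm{AUP}_m = \int_{1}^{\tau_{\max}} \rho_m(\tau)\, d\tau$$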
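A small worked sketch may help make the Dolan & Moré-style performance profile and its area concrete. It uses the standard textbook definition (ratio of each method's cost to the best cost on that task) and assumes all scores are positive and lower-is-better; metrics where higher is better would need to be inverted first. The function names and the example numbers are invented for illustration and are not taken from the benchmark's code or results.

```python
def performance_ratios(costs):
    """costs: {method: {task: cost}} with positive, lower-is-better costs.

    Returns {method: {task: cost / best cost on that task}}.
    """
    tasks = list(next(iter(costs.values())).keys())
    best = {t: min(per_task[t] for per_task in costs.values()) for t in tasks}
    return {
        m: {t: per_task[t] / best[t] for t in tasks}
        for m, per_task in costs.items()
    }


def profile(ratios_m, tau):
    """Fraction of tasks on which the method is within a factor tau of the best."""
    return sum(r <= tau for r in ratios_m.values()) / len(ratios_m)


def aup(ratios_m, tau_max):
    """Exact area under the step-shaped profile on [1, tau_max].

    A task with ratio r contributes (tau_max - r) / n when r <= tau_max,
    and nothing otherwise.
    """
    n = len(ratios_m)
    return sum(max(0.0, tau_max - r) for r in ratios_m.values()) / n


# Made-up costs for two methods on two tasks.
example_costs = {
    "A": {"t1": 1.0, "t2": 6.0},
    "B": {"t1": 2.0, "t2": 2.0},
}
example_ratios = performance_ratios(example_costs)
```

In this toy example method A is best on `t1` but three times worse than B on `t2`, so B ends up with the larger area for `tau_max = 3`.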
Two variants are reported:
- Best Attempt@4: Best intermediate validation across 4 runs.
- Best Submission@4: Final submitted metric averaged over 4 runs.
Selected results (Best Attempt@4):
| Task | Baseline | O1-preview | GPT-4o |
|---|---|---|---|
| CIFAR-10 Accuracy | 0.497 | 0.857 | 0.733 |
| Prisoner's Dilemma | 2.372 | 2.629 | 2.600 |
| Language Modeling Loss | 4.673 | 3.966 | 4.361 |
| Breakout Return | 48.82 | 63.52 | ∞ (fail) |
| 3-SAT Wall-time (s) | 16.16 | 13.652 | 13.676 |
O1-preview achieves the highest AUP (area under the performance profile), with Gemini-1.5-Pro and Claude-3.5-Sonnet closely following. Gemini-1.5-Pro attains approximately 99% of O1's AUP at roughly one-ninth of the API cost, making it the most cost-efficient option.
5. Findings, Limitations, and Prospects
The principal findings indicate that all evaluated frontier LLMs surpass weak baselines, primarily via hyperparameter optimization or minor code edits. However, no agent generated genuinely novel algorithms, hypotheses, or architectures beyond the reach of a proficient software engineer. Open-ended or complex tasks (e.g., RL with Breakout or MetaMaze, language modeling for FineWeb) remain unsolved or are tackled suboptimally.
Key weaknesses of current agents include a lack of long-horizon reasoning (frequent loss of optimal progress without memory augmentation) and a failure to contribute novel scientific ideas or algorithmic innovations ("Levels 2+" per the framework's capability hierarchy). High computational cost is also a significant limitation for extensive benchmarking.
Future developments are expected to include:
- Expanding to larger, domain-specific datasets and more complex tasks, possibly outside traditional AI domains.
- Deepening investigations into automatable scientific novelty, which remains an open challenge.
- Consideration of reproducibility and ongoing availability of resources, since loss of access to training data could hinder future research advances.
6. Significance for AI Research
MLGym-Bench delivers a formally defined framework and extensible benchmark for longitudinally measuring the progress of LLM-based AI research agents. It rigorously evaluates agents not only on code manipulation and optimization but also on creative abilities critical to scientific research. The insights from its first experiments indicate the need for advancing memory, abstraction, and novelty-generation capabilities before LLM agents can participate meaningfully in scientific discovery processes (Nathani et al., 20 Feb 2025).