
MLE-Dojo: Interactive MLE Benchmark

Updated 11 October 2025
  • MLE-Dojo is an interactive environment that benchmarks autonomous LLM agents within realistic machine learning engineering workflows.
  • It utilizes 200+ Kaggle challenges to provide multi-turn feedback, iterative debugging, and standardized evaluation via a POMDP framework.
  • The platform supports both supervised fine-tuning and reinforcement learning, fostering community-driven innovation in autonomous MLE agent design.

MLE-Dojo is a Gym-style interactive environment and benchmark for systematically training, evaluating, and improving autonomous LLM agents in realistic machine learning engineering (MLE) workflows. Designed to go beyond static datasets or single-shot evaluations, MLE-Dojo is instantiated with 200+ real-world Kaggle challenges, providing LLM agents with the ability to experiment, debug, and iteratively refine their solutions using structured feedback. The environment encompasses diverse machine learning tasks (including data preprocessing, model architecture search, hyperparameter optimization, code debugging, and performance validation) and enables both supervised fine-tuning and reinforcement learning under a unified, reproducible, extensible interface (Qiang et al., 12 May 2025).

1. Concept and Framework

MLE-Dojo is conceived as an interactive, executable environment for end-to-end MLE workflows. Unlike static leaderboard-based or single-turn code competition frameworks, MLE-Dojo formalizes the MLE episode as a Partially Observable Markov Decision Process (POMDP):

  • States and Observations: Each agent receives a rich observation $o_t$ at each time step $t$, including task details, code execution outputs, error diagnostics, leaderboard position, and execution history.
  • Actions: The allowable actions are explicit, code-centric, and modular. Agents produce Python code through actions such as validate_code, execute_code, request_info, get_history, and reset, which are dispatched to the task-specific backend.
  • Rewards: A reward $r_t$ is computed after each step, based on a normalized "HumanRank" score $s = 1 - p/N$, where $p$ is the agent's current leaderboard rank and $N$ is the number of competitors in the given Kaggle competition.

This interactive structure enables agents to iteratively submit, test, debug, and adapt code, closely mirroring how modern data scientists approach open-ended MLE tasks.
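
To make the loop concrete, the following Python sketch mimics the observe/act/reward cycle described above. All names here (MockMLEEnv, the observation keys, the shape of the action dictionary) are illustrative assumptions for exposition, not the actual MLE-Dojo API.

```python
# Minimal sketch of the POMDP interaction loop, assuming a Gym-style interface.
# MockMLEEnv and its observation/action formats are hypothetical stand-ins.
import random

class MockMLEEnv:
    """Stand-in environment mimicking the observe -> act -> reward cycle."""
    def __init__(self, num_competitors=1000):
        self.N = num_competitors

    def reset(self):
        return {"task": "tabular regression", "history": [], "errors": None}

    def step(self, action):
        # A real environment would run action["code"] in a Docker sandbox,
        # score the submission, and look up the leaderboard rank p.
        p = random.randint(1, self.N)
        reward = 1 - p / self.N          # HumanRank: s = 1 - p/N
        obs = {"task": "tabular regression", "history": [action], "errors": None}
        return obs, reward, False, {"rank": p}

def run_episode(env, agent, max_steps=10):
    obs = env.reset()
    best = 0.0
    for _ in range(max_steps):
        action = agent(obs)              # e.g. {"action": "execute_code", "code": "..."}
        obs, reward, done, info = env.step(action)
        best = max(best, reward)
        if done:
            break
    return best

# Usage: run_episode(MockMLEEnv(), lambda obs: {"action": "execute_code", "code": "pass"})
```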

2. Interactive Environment and Task Structure

MLE-Dojo's environment features several distinguishing components:

  • Executable Sandbox: All code execution occurs in Docker-isolated containers, with preinstalled competition-specific dependencies and standardized environment setup. This ensures outcome reproducibility and safe agent experimentation; a minimal sketch of such sandboxed execution follows this list.
  • Rich Observation and Feedback: After each action, the agent receives direct feedback, including errors, exceptions, detailed output metrics, scoring summaries, and historical action context.
  • Iterative Loop: Agents can reason over prior attempts using both conversational and code-execution history, allowing long-horizon planning and progressive refinement.
  • Task Coverage: Tasks are algorithmically distilled from over 200 Kaggle competitions, spanning data modalities (tabular, NLP, vision) and MLE subtasks (data cleaning, feature engineering, model fitting, ensembling, deployment).
  • Plug-and-Play Extensibility: Tasks are modular; new competitions, data sources, or third-party tools may be added without system overhaul.
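
As a rough illustration of the sandboxed execution mentioned above, the snippet below writes agent-generated code to a temporary directory and runs it in a throwaway Docker container. The image name, mount layout, network policy, and timeout are assumptions for the sketch, not MLE-Dojo's actual configuration.

```python
# Hedged sketch of sandboxed code execution; image name and limits are assumed.
import pathlib
import subprocess
import tempfile

def run_in_sandbox(code: str, image: str = "mle-dojo/task:latest", timeout: int = 600):
    workdir = pathlib.Path(tempfile.mkdtemp())
    (workdir / "solution.py").write_text(code)
    cmd = [
        "docker", "run", "--rm",
        "--network", "none",              # no internet access inside the sandbox
        "-v", f"{workdir}:/workspace",    # expose only the agent's working directory
        "-w", "/workspace",
        image, "python", "solution.py",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    # stdout/stderr and the return code become part of the agent's next observation
    return result.returncode, result.stdout, result.stderr
```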

3. Real-World Benchmarking and Reward Design

Unlike prior benchmarks, MLE-Dojo leverages real competition data and scoring systems:

  • Dataset Realism: Each task mirrors a true Kaggle competition, including provided raw data, evaluation metrics, test/train splits, canonical evaluation scripts, and baseline solutions.
  • Continuous Evaluation: The environment supports both incremental and full-evaluation modes, running code on public/private splits and computing leaderboard positions to provide an immediate, normalized reward signal.
  • HumanRank Score: Agent performance is translated to a normalized score $s = 1 - p/N$, ensuring comparability across tasks while implicitly accounting for problem difficulty through the human competitor pool (a short worked example follows this list).
  • Open Sourcing of Artifacts: All environment scripts, scoring tools, and benchmarks are made public to facilitate cross-lab reproducibility and enable community contributions.
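
For illustration, a toy computation of the HumanRank reward from a public leaderboard might look as follows; the strict-rank convention and tie handling here are simplifying assumptions rather than the benchmark's exact scoring rules.

```python
# Toy HumanRank computation: s = 1 - p/N, where p is the agent's rank among N competitors.
def humanrank(agent_score: float, leaderboard: list[float], higher_is_better: bool = True) -> float:
    if higher_is_better:
        p = 1 + sum(s > agent_score for s in leaderboard)   # competitors strictly ahead
    else:
        p = 1 + sum(s < agent_score for s in leaderboard)
    return 1 - p / len(leaderboard)

# Example: humanrank(0.91, [0.95, 0.93, 0.90, 0.85]) == 1 - 3/4 == 0.25
```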

4. Agent Training Paradigms

MLE-Dojo enables two primary agent learning paradigms:

  • Supervised Fine-Tuning: Agents may be trained on a curated subset of 150 tasks via imitation learning, using community or human demonstration trajectories extracted from multi-turn solution histories.
  • Reinforcement Learning: The interactive POMDP formulation supports RL-based agent development, with stepwise feedback facilitating exploration strategies, credit assignment, and exploitation-refinement cycles.

This duality allows benchmarking both LLM instruction-learned baselines and fully autonomous RL-based research agents, broadening the scope of comparative analysis.
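
As a minimal sketch of the supervised path, the helper below converts a logged multi-turn trajectory into (observation, action) pairs for imitation learning. The field names (task_description, observation, action, feedback) are assumptions about what a logged trajectory might contain, not MLE-Dojo's actual schema.

```python
# Hypothetical conversion of a multi-turn trajectory into SFT training pairs.
from typing import Iterable

def trajectory_to_sft_pairs(trajectory: Iterable[dict]) -> list[dict]:
    """Each step contributes one (prompt, target) pair for imitation learning."""
    pairs, context = [], []
    for step in trajectory:
        prompt = {
            "task": step["task_description"],
            "history": list(context),        # prior actions and environment feedback
            "observation": step["observation"],
        }
        pairs.append({"prompt": prompt, "target": step["action"]})
        context.append({"action": step["action"], "feedback": step["feedback"]})
    return pairs
```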

5. Architecture, Modularity, and Extensibility

The system is architected for both research extensibility and industry-grade reproducibility:

  • Module Decoupling: Core functionalities (error handling, code execution, feedback, metric evaluation) are implemented as independent modules, making it straightforward to extend, modify, or replace system components.
  • Standardized Task Format: All tasks follow a uniform directory and metadata schema, with clear separation of agent-accessible (public) and environment-private (hidden test splits, solution keys) resources.
  • Dockerized Isolation: Docker containers ensure consistent execution, environmental parity, and strict dependency management, minimizing confounding variation in agent performance across runs.
  • Action Registration: The framework supports action space extension via a dynamic registry, facilitating integration of new primitives or advanced control actions for experimental agent architectures.
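
A dynamic action registry of the kind described in the last bullet could be sketched as follows; the decorator name, handler signature, and the example profile_data primitive are hypothetical illustrations rather than MLE-Dojo's actual extension mechanism.

```python
# Illustrative action registry: new agent-callable primitives are added by
# registering a handler, without modifying the core environment loop.
ACTION_REGISTRY = {}

def register_action(name: str):
    """Decorator that exposes a handler as a new agent-callable action."""
    def decorator(fn):
        ACTION_REGISTRY[name] = fn
        return fn
    return decorator

@register_action("profile_data")
def profile_data(env_state: dict, payload: dict):
    # Hypothetical primitive: summarize the training data so the agent can plan
    # preprocessing cheaply (assumes env_state["train_dataframe"] is a pandas DataFrame).
    df = env_state["train_dataframe"]
    return {"num_rows": len(df), "columns": list(df.columns), "null_counts": df.isna().sum().to_dict()}

def dispatch(action: dict, env_state: dict):
    handler = ACTION_REGISTRY[action["name"]]
    return handler(env_state, action.get("payload", {}))
```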

6. Benchmark Results and Community Impact

Extensive evaluations spanning eight frontier LLMs reveal substantive insights:

  • Agent Performance: State-of-the-art LLM agents demonstrate iterative improvement within the MLE-Dojo workflow, but display significant gaps in long-horizon planning and correcting complex errors.
  • Failure Modes: Current LLMs struggle with solution composition over multiple code submissions, have difficulty recovering from deep debugging branches, and are inefficient on tasks requiring multiple dependency-resolving actions.
  • Research Acceleration: Open sourcing the benchmark, environment, leaderboard, and training code is aimed at fostering rapid, reproducible, community-driven innovation in MLE agent design.
  • Reproducibility: Standardization of tasks, result normalization (HumanRank), and containerized environments position MLE-Dojo as a highly reliable platform for longitudinal and cross-model comparison at scale.

7. Significance and Future Directions

MLE-Dojo uniquely addresses key limitations in prior MLE benchmarks by introducing an interactive, multi-turn, code-executing, feedback-driven environment rooted in real-world MLE challenges. Its modular architecture, strong reproducibility guarantees, and integration of both RL and supervised training protocols make it a versatile research testbed. The observed limitations of even frontier LLMs in MLE-Dojo highlight the need for advances in multi-step planning, robust code synthesis, adaptive exploration, and error recovery, which the community is poised to address using this framework. Open-sourcing and continuous benchmarking will further catalyze the development of next-generation, truly autonomous machine learning engineering agents (Qiang et al., 12 May 2025).
