Evaluating Machine Learning Engineering Capabilities Through MLE-bench
The paper introduces MLE-bench, a benchmark built to evaluate AI agents on machine learning engineering tasks structured around Kaggle competitions. The benchmark is notable for its focus on autonomous ML engineering capability, reflecting real-world challenges in fields such as natural language processing, computer vision, and signal processing.
Benchmark Characteristics
MLE-bench comprises 75 diverse competitions sourced from Kaggle, curated to rigorously evaluate an agent's ability to perform tasks such as model training, dataset preparation, and experimental analysis. The challenges are representative of contemporary ML engineering work and are designed to allow a concrete comparison of AI capabilities against human performance: human baselines are drawn from publicly available Kaggle leaderboards, and submissions are judged against the medal thresholds those leaderboards imply, giving a relevant and fair assessment framework.
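To make the leaderboard-based grading concrete, the sketch below shows one way a medal check could work: place an agent's score on a competition's leaderboard and test whether its rank falls inside the bronze range. This is illustrative only; the team-count cutoffs follow Kaggle's published medal rules as I understand them, and the function names are hypothetical rather than taken from the benchmark's codebase.

```python
# Illustrative sketch, not the benchmark's actual grading code.
# Cutoffs approximate Kaggle's published medal rules (treat them as assumptions).

def bronze_cutoff_rank(num_teams: int) -> int:
    """Highest leaderboard rank (1-indexed) that still earns a bronze medal."""
    if num_teams < 250:
        return max(1, int(num_teams * 0.40))  # top 40% of teams
    if num_teams < 1000:
        return 100                            # top 100 teams
    return max(1, int(num_teams * 0.10))      # top 10% of teams

def would_earn_bronze(agent_score: float, leaderboard: list[float],
                      higher_is_better: bool = True) -> bool:
    """Rank the agent's score against the leaderboard and check the cutoff."""
    if higher_is_better:
        beaten_by = sum(s > agent_score for s in leaderboard)
    else:
        beaten_by = sum(s < agent_score for s in leaderboard)
    rank = beaten_by + 1  # 1 + number of teams that strictly beat the agent
    return rank <= bronze_cutoff_rank(len(leaderboard))

# Example: a hypothetical 500-team competition where higher scores are better.
lb = [0.90 - 0.001 * i for i in range(500)]
print(would_earn_bronze(0.85, lb))  # rank 51 <= 100, so True
```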
The paper reports that the best-performing setup, OpenAI's o1-preview with AIDE scaffolding, reaches at least the level of a Kaggle bronze medal in 16.9% of competitions. This result highlights both the capability and the remaining limitations of current systems when tasked with sophisticated ML engineering challenges.
Technical Evaluation
The evaluation covers several experimental setups that vary the scaffolding and the underlying AI model, and it examines how performance scales with the number of attempts and with different hardware configurations. Agent performance improves notably as the number of attempts increases, a crucial insight for understanding potential scaling behaviors in such systems.
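A standard way to quantify improvement with repeated attempts is the unbiased pass@k estimator; whether the paper computes its attempt-scaling numbers exactly this way is an assumption here, so treat the snippet below as a sketch of the general technique.

```python
# Minimal sketch of the unbiased pass@k estimator (Chen et al., 2021):
# the probability that at least one of k sampled attempts succeeds,
# estimated from n attempts of which c succeeded (e.g., earned a medal).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = attempts run, c = successful attempts, k = attempt budget."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example with hypothetical numbers: 8 runs per competition, 2 of which medaled.
for k in (1, 4, 8):
    print(k, round(pass_at_k(n=8, c=2, k=k), 3))  # 0.25, 0.786, 1.0
```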
Furthermore, among the tested scaffolds, AIDE performed best, suggesting that scaffolding purpose-built for Kaggle-style competitions offers tangible benefits over more general-purpose scaffolds.
Handling Resource Scaling and Contamination
The research also explores resource scaling, examining how different compute environments influence results. Surprisingly, varying GPU access did not significantly affect the tested agents' performance, pointing to future research on how agents can make effective use of additional computational resources.
Tests for dataset contamination showed that familiarity with Kaggle competition data did not systematically inflate performance scores. This is critical for verifying that the measured capabilities generalize beyond memorized solutions.
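One simple way to screen for memorization of this kind is to compare an agent's submitted code against publicly available top solutions and flag high-similarity pairs. The sketch below uses difflib and an arbitrary threshold; both choices are assumptions for illustration and are not the paper's actual contamination tooling.

```python
# Illustrative contamination screen, not the paper's methodology.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

def flag_suspicious(agent_code: str, public_solutions: dict[str, str],
                    threshold: float = 0.8) -> list[tuple[str, float]]:
    """Return public solutions whose similarity to the agent's code exceeds the threshold."""
    scores = [(name, similarity(agent_code, src))
              for name, src in public_solutions.items()]
    return [(name, s) for name, s in scores if s >= threshold]

# Example usage with hypothetical snippets.
print(flag_suspicious("import xgboost as xgb\nmodel = xgb.XGBClassifier()",
                      {"top_kernel_1": "import xgboost as xgb\nmodel = xgb.XGBClassifier()"}))
```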
Implications and Future Outlook
The introduction of MLE-bench serves a dual purpose: it offers a tool to facilitate future research and deepens understanding of the risks associated with accelerating AI R&D capabilities. The results imply that significant progress is still required before AI can autonomously conduct comprehensive ML R&D at human level.
Theoretically, advances in this area could accelerate scientific progress and economic growth. However, such capabilities should be deployed with caution, given the risk that they outpace corresponding safety and control measures.
Conclusion
In conclusion, MLE-bench represents a notable effort to rigorously quantify the capabilities of AI systems on complex machine learning engineering tasks. The benchmark not only clarifies the current limits and capabilities of AI but also helps anticipate future developments as systems approach parity with human researchers on intricate ML tasks. The availability of MLE-bench encourages further exploration and development in this domain, while advocating a mindful approach to societal impacts and ethical considerations.