
MLE-STAR: Machine Learning Engineering Agent via Search and Targeted Refinement (2506.15692v2)

Published 27 May 2025 in cs.LG

Abstract: Agents based on LLMs for machine learning engineering (MLE) can automatically implement ML models via code generation. However, existing approaches to build such agents often rely heavily on inherent LLM knowledge and employ coarse exploration strategies that modify the entire code structure at once. This limits their ability to select effective task-specific models and perform deep exploration within specific components, such as experimenting extensively with feature engineering options. To overcome these, we propose MLE-STAR, a novel approach to build MLE agents. MLE-STAR first leverages external knowledge by using a search engine to retrieve effective models from the web, forming an initial solution, then iteratively refines it by exploring various strategies targeting specific ML components. This exploration is guided by ablation studies analyzing the impact of individual code blocks. Furthermore, we introduce a novel ensembling method using an effective strategy suggested by MLE-STAR. Our experimental results show that MLE-STAR achieves medals in 64% of the Kaggle competitions on the MLE-bench Lite, significantly outperforming the best alternative.

Summary

  • The paper introduces MLE-STAR, a novel agent that automates ML model development by integrating web search with targeted code refinement, achieving significant performance gains in Kaggle competitions.
  • The paper employs a two-phase refinement methodology with outer-loop targeted code block selection and inner-loop iterative improvements guided by ablation studies and ensemble strategies.
  • The paper demonstrates that leveraging external model information and safety modules like debugging and data leakage checkers enhances model reliability and generalizability.

MLE-STAR: Automated Machine Learning Engineering

This paper introduces MLE-STAR, a novel machine learning engineering (MLE) agent that automates the development of ML models by integrating web search and targeted code refinement. MLE-STAR addresses limitations in existing MLE agents, which often rely on inherent LLM knowledge and employ coarse exploration strategies. By leveraging external knowledge and focusing on specific ML components, MLE-STAR achieves significant performance gains in Kaggle competitions.

Methodological Overview

MLE-STAR operates through a series of steps designed to optimize ML model development. The process begins with generating an initial solution by retrieving relevant models from the web using Google Search. This initial solution is then iteratively refined through nested loops. The outer loop selects a specific code block corresponding to an ML component, guided by an ablation study that evaluates the impact of each component. The inner loop refines the selected code block, using previous attempts as feedback. A novel ensemble method is also introduced, which leverages LLMs to propose and refine ensemble strategies.

Figure 1: Overview of MLE-STAR.

The formal problem setup involves finding an optimal solution $s^{*}=\arg\max_{s\in\mathcal{S}} h(s)$, where $\mathcal{S}$ is the space of possible solutions and $h$ is a score function. MLE-STAR uses a multi-agent framework $\mathcal{A}$ consisting of $n$ LLM agents $(\mathcal{A}_1, \cdots, \mathcal{A}_n)$, each with specific functionalities. The framework takes datasets $\mathcal{D}$ and a task description $\mathcal{T}_{\mathtt{task}}$ as input, working across any data modalities and task types.
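
As a rough sketch of this setup, the code below frames the search for $s^{*}$ as a loop over candidate solution scripts scored by $h$. The `Agents` container and all callables are hypothetical stand-ins for $(\mathcal{A}_1, \cdots, \mathcal{A}_n)$, not the paper's actual interfaces.

```python
from dataclasses import dataclass
from typing import Callable, List

# Minimal sketch: a solution is a complete training script (a string) scored by h,
# and a set of agent callables (stand-ins for A_1..A_n) transforms solutions.
# All names here are illustrative assumptions, not the paper's implementation.

Solution = str


@dataclass
class Agents:
    generate_initial: Callable[[str], List[Solution]]   # roughly A_init + A_merger
    refine: Callable[[Solution], Solution]               # roughly the targeted-refinement agents
    ensemble: Callable[[List[Solution]], Solution]       # roughly the ensembling agents


def search_best_solution(task: str, agents: Agents,
                         h: Callable[[Solution], float], T: int = 8) -> Solution:
    """Approximate s* = argmax_{s in S} h(s) by iterative, agent-driven refinement."""
    candidates = agents.generate_initial(task)
    best = max(candidates, key=h)
    explored = [best]
    for _ in range(T):                      # outer refinement loop
        proposal = agents.refine(best)
        if h(proposal) > h(best):           # keep a proposal only if it improves the score
            best = proposal
        explored.append(best)
    return agents.ensemble(explored)        # final ensembling over explored solutions
```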

Initial Solution Generation

The initial solution is generated by first retrieving $M$ effective models using web search. This mitigates the reliance on the LLM's internal knowledge, which can lead to suboptimal model choices. The search retrieves both a model description $\mathcal{T}_\mathtt{model}$ and corresponding example code $\mathcal{T}_\mathtt{code}$ to guide the LLM. An agent $\mathcal{A}_\mathtt{init}$ generates code $s_\mathtt{init}^i$ for each retrieved model, which is then evaluated using a task-specific metric $h$. The top-performing scripts are iteratively merged into a consolidated initial solution $s_0$ using an agent $\mathcal{A}_\mathtt{merger}$, which is guided to introduce a simple average ensemble.
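
A minimal sketch of this phase, assuming hypothetical `web_search` and `llm` helpers, might look as follows; it illustrates only the retrieve-generate-evaluate-merge flow, not the paper's prompts or interfaces.

```python
from typing import Callable, List, Tuple

# Sketch of initial solution generation: retrieve M candidate models from the web,
# have the LLM write one script per model, score each with the task metric h, and
# merge the top performers. `web_search` and `llm` are hypothetical placeholders.

def generate_initial_solution(task: str, M: int,
                              web_search: Callable[[str, int], List[Tuple[str, str]]],
                              llm: Callable[[str], str],
                              h: Callable[[str], float]) -> str:
    # Retrieve (model description T_model, example code T_code) pairs from the web.
    retrieved = web_search(f"effective models for: {task}", M)

    # A_init: generate one candidate training script per retrieved model.
    scripts = [
        llm(f"Task: {task}\nModel: {desc}\nExample code:\n{code}\n"
            "Write a complete training script for this task using this model.")
        for desc, code in retrieved
    ]

    # Keep the top-performing scripts according to the task metric h.
    top_scripts = sorted(scripts, key=h, reverse=True)[:2]

    # A_merger: consolidate the top scripts, guided toward a simple average ensemble.
    merged = llm("Merge these scripts into one solution that averages their "
                 "final predictions:\n\n" + "\n\n".join(top_scripts))
    return merged
```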

Code Block Refinement

The iterative refinement phase improves the initial solution $s_0$ over $T$ outer loop steps. At each step $t$, the goal is to improve the current solution $s_t$ by targeting specific code blocks within the ML pipeline. An ablation study, performed by agent $\mathcal{A}_\mathtt{abl}$, identifies the code block with the greatest impact on performance. This agent receives summaries of previous ablation studies $\{\mathcal{T}_\mathtt{abl}^i\}_{i=0}^{t-1}$ as input to encourage exploration of different pipeline parts. The ablation results $r_t$ are summarized by a module $\mathcal{A}_\mathtt{summarize}$ to generate a concise ablation summary $\mathcal{T}_\mathtt{abl}^t$. An extractor module $\mathcal{A}_\mathtt{extractor}$ identifies the code block $c_t$ whose modification had the most significant impact, considering previously refined blocks $\{c_i\}_{i=0}^{t-1}$ as context. An initial plan $p_0$ for refining the code block is generated at the same time.

Once the targeted code block $c_t$ is defined, MLE-STAR explores $K$ potential refinements using an inner loop. An agent $\mathcal{A}_\mathtt{coder}$ implements $p_0$, transforming $c_t$ into a refined block $c_t^0$. A candidate solution $s_t^0$ is formed by substituting $c_t^0$ into $s_t$, and its performance $h(s_t^0)$ is evaluated. Further plans $p_k$ are generated by a planning agent $\mathcal{A}_\mathtt{planner}$, which leverages previous attempts within the current outer step $t$ as feedback. For each plan $p_k$, the coding agent generates the corresponding refined block $c_t^k$, creates the candidate solution $s_t^k$, and evaluates its performance $h(s_t^k)$. After exploring $K$ refinement strategies, the best-performing candidate solution is identified, and the solution for the next outer step, $s_{t+1}$, is updated only if an improvement over $s_t$ is found. This iterative process continues until $t=T$.
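
The nested loops could be sketched roughly as below, with the ablation, extraction, planning, and coding agents reduced to hypothetical callables and block substitution simplified to a string replacement.

```python
from typing import Callable, List, Tuple

# Sketch of the nested refinement loops. Each agent (A_abl, A_summarize, A_extractor,
# A_planner, A_coder) is reduced to a hypothetical callable; h scores a full script.

def refine_solution(s0: str, T: int, K: int,
                    run_ablation: Callable[[str, List[str]], str],          # A_abl + A_summarize
                    extract_block: Callable[[str, str, List[str]], Tuple[str, str]],  # A_extractor -> (c_t, p_0)
                    plan_next: Callable[[str, List[Tuple[str, float]]], str],  # A_planner
                    code_plan: Callable[[str, str], str],                    # A_coder
                    h: Callable[[str], float]) -> str:
    s_t = s0
    ablation_summaries: List[str] = []
    refined_blocks: List[str] = []

    for t in range(T):                                     # outer loop over code blocks
        summary = run_ablation(s_t, ablation_summaries)    # identify the most impactful component
        ablation_summaries.append(summary)
        c_t, plan = extract_block(s_t, summary, refined_blocks)
        refined_blocks.append(c_t)

        attempts: List[Tuple[str, float]] = []             # (plan, score) pairs used as feedback
        best_candidate, best_score = s_t, h(s_t)
        for k in range(K):                                 # inner loop over K refinement plans
            c_tk = code_plan(c_t, plan)
            candidate = s_t.replace(c_t, c_tk)             # substitute the refined block
            score = h(candidate)
            attempts.append((plan, score))
            if score > best_score:
                best_candidate, best_score = candidate, score
            plan = plan_next(c_t, attempts)                # propose the next plan from feedback
        s_t = best_candidate                               # update only on improvement
    return s_t
```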

Ensemble Strategy Exploration

To further improve upon the best single solution, a novel ensembling procedure is introduced. Instead of simply selecting the solution with the highest score, MLE-STAR explores ensemble strategies to combine multiple solutions. Given a set of $L$ distinct solutions $\{s_l\}_{l=1}^L$, the goal is to find an effective ensemble plan $e$ that merges these solutions. The process starts with an initial ensemble plan $e_0$, such as averaging the final predictions. For a fixed number of iterations $R$, an agent $\mathcal{A}_\mathtt{ens\_planner}$ proposes subsequent ensemble plans $e_r$, using the history of previously attempted plans and their performance as feedback. Each plan $e_r$ is implemented via $\mathcal{A}_\mathtt{ensembler}$ to obtain $s_\mathtt{ens}^r$. The ensemble result that achieves the highest performance is selected as the final output.
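
A compact sketch of this ensemble search, with the planner ($\mathcal{A}_\mathtt{ens\_planner}$) and ensembler ($\mathcal{A}_\mathtt{ensembler}$) stubbed as hypothetical callables:

```python
from typing import Callable, List, Optional, Tuple

# Sketch of ensemble strategy exploration over L candidate solutions.
# propose_plan (A_ens_planner) and implement_plan (A_ensembler) are hypothetical.

def explore_ensembles(solutions: List[str], R: int,
                      propose_plan: Callable[[List[Tuple[str, float]]], str],
                      implement_plan: Callable[[List[str], str], str],
                      h: Callable[[str], float]) -> Optional[str]:
    plan = "average the final predictions of all solutions"   # initial plan e_0
    history: List[Tuple[str, float]] = []
    best_script, best_score = None, float("-inf")

    for r in range(R):
        script = implement_plan(solutions, plan)   # s_ens^r implementing plan e_r
        score = h(script)
        history.append((plan, score))
        if score > best_score:
            best_script, best_score = script, score
        plan = propose_plan(history)               # next plan from attempted-plan feedback
    return best_script
```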

Additional Modules for Robustness

To ensure robust performance, MLE-STAR includes additional modules such as a debugging agent $\mathcal{A}_\mathtt{debugger}$ to correct errors in the generated code. A data leakage checker $\mathcal{A}_\mathtt{leakage}$ analyzes the solution script to prevent improper access to test data during training. A data usage checker $\mathcal{A}_\mathtt{data}$ ensures that all relevant provided data sources are utilized.

Figure 2: MLE-STAR's data leakage checker introduces appropriate preprocessing.
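
As an illustration of the bug class the leakage checker targets (this example is not taken from the paper), the snippet below contrasts preprocessing statistics fitted on the full dataset with statistics fitted on the training split only:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative example of the bug class the data leakage checker addresses:
# fitting preprocessing statistics on data that includes the test split.

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X_train, X_test = X[:80], X[80:]

# Leaky: the scaler's mean/std are computed with the test rows included.
leaky_scaler = StandardScaler().fit(X)
X_test_leaky = leaky_scaler.transform(X_test)

# Correct: fit statistics on the training split only, then apply them to the test split.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```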

Experimental Evaluation

MLE-STAR's effectiveness was evaluated on 22 Kaggle competitions from MLE-bench Lite. The results demonstrate that MLE-STAR significantly outperforms baselines, including AIDE, in terms of medal achievement rates. For example, with Gemini-2.0-Flash, MLE-STAR achieved a medal rate of 43.9%, compared to 25.8% for AIDE. The proposed ensemble technique also provides a meaningful improvement.

Figure 3: Model usage (%) on image classification competitions.

The model usage analysis reveals that MLE-STAR leverages more recent and competitive models compared to baselines like AIDE, which tend to rely on older architectures such as ResNet. Manual intervention can further enhance MLE-STAR's performance by incorporating human expertise in model selection and code block targeting. The debugging and data leakage checker modules address potential issues with LLM-generated code, ensuring more reliable and generalizable solutions.

Discussion of Results

Qualitative observations indicate that MLE-STAR selects more up-to-date models. The data leakage checker prevents the model from exploiting information from the test set during training, and the data usage checker ensures that all available data sources are used.

Figure 4: Ensembling solutions.

Analysis of the solution refinement trajectory shows a consistent improvement in performance as MLE-STAR progresses through its refinement steps. The ablation study module effectively targets the most influential code blocks for modification, leading to significant improvements in the early stages of refinement.

Conclusion

MLE-STAR represents a significant advancement in automated machine learning engineering. By integrating web search, targeted code refinement, and ensemble strategy exploration, MLE-STAR achieves state-of-the-art performance in Kaggle competitions. The framework's modular design and additional safety checks contribute to its robustness and generalizability. One limitation is that, because the competitions are publicly available, the underlying LLMs may have been exposed to relevant discussions of them during training.

The development of MLE-STAR highlights the potential of LLMs to automate and improve the ML model development process. Future research could focus on extending the framework to handle more complex tasks, incorporating more sophisticated reasoning and planning capabilities, and further reducing the need for human intervention. The broader impacts of MLE-STAR include lowering the barrier to entry for individuals and organizations to leverage ML, and fostering innovation across various sectors.
