
MLE-STAR: Machine Learning Engineering Agent via Search and Targeted Refinement (2506.15692v2)

Published 27 May 2025 in cs.LG

Abstract: Agents based on LLMs for machine learning engineering (MLE) can automatically implement ML models via code generation. However, existing approaches to build such agents often rely heavily on inherent LLM knowledge and employ coarse exploration strategies that modify the entire code structure at once. This limits their ability to select effective task-specific models and perform deep exploration within specific components, such as experimenting extensively with feature engineering options. To overcome these, we propose MLE-STAR, a novel approach to build MLE agents. MLE-STAR first leverages external knowledge by using a search engine to retrieve effective models from the web, forming an initial solution, then iteratively refines it by exploring various strategies targeting specific ML components. This exploration is guided by ablation studies analyzing the impact of individual code blocks. Furthermore, we introduce a novel ensembling method using an effective strategy suggested by MLE-STAR. Our experimental results show that MLE-STAR achieves medals in 64% of the Kaggle competitions on the MLE-bench Lite, significantly outperforming the best alternative.

Summary

  • The paper introduces MLE-STAR, a novel agent that automates ML model development by integrating web search with targeted code refinement, achieving significant performance gains in Kaggle competitions.
  • The paper employs a two-phase refinement methodology with outer-loop targeted code block selection and inner-loop iterative improvements guided by ablation studies and ensemble strategies.
  • The paper demonstrates that leveraging external model information and safety modules like debugging and data leakage checkers enhances model reliability and generalizability.

MLE-STAR: Automated Machine Learning Engineering

This paper introduces MLE-STAR, a novel machine learning engineering (MLE) agent that automates the development of ML models by integrating web search and targeted code refinement. MLE-STAR addresses limitations in existing MLE agents, which often rely on inherent LLM knowledge and employ coarse exploration strategies. By leveraging external knowledge and focusing on specific ML components, MLE-STAR achieves significant performance gains in Kaggle competitions.

Methodological Overview

MLE-STAR operates through a series of steps designed to optimize ML model development. The process begins with generating an initial solution by retrieving relevant models from the web using Google Search. This initial solution is then iteratively refined through nested loops. The outer loop selects a specific code block corresponding to an ML component, guided by an ablation study that evaluates the impact of each component. The inner loop refines the selected code block, using previous attempts as feedback. A novel ensemble method is also introduced, which leverages LLMs to propose and refine ensemble strategies.

Figure 1: Overview of MLE-STAR.

The formal problem setup involves finding an optimal solution $s^{*}=\arg\max_{s\in\mathcal{S}} h(s)$, where $\mathcal{S}$ is the space of possible solutions and $h$ is a score function. MLE-STAR uses a multi-agent framework $\mathcal{A}$ consisting of $n$ LLM agents $(\mathcal{A}_1, \cdots, \mathcal{A}_n)$, each with specific functionalities. The framework takes datasets $\mathcal{D}$ and a task description $\mathcal{T}_{\mathtt{task}}$ as input, working across any data modalities and task types.
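
As a rough sketch of this setup, the code below frames the search for $s^{*}$ as a loop over candidate solution scripts scored by $h$. The `Agents` container and all callables are hypothetical stand-ins for $(\mathcal{A}_1, \cdots, \mathcal{A}_n)$, not the paper's actual interfaces.

```python
from dataclasses import dataclass
from typing import Callable, List

# Minimal sketch: a solution is a complete training script (a string) scored by h,
# and a set of agent callables (stand-ins for A_1..A_n) transforms solutions.
# All names here are illustrative assumptions, not the paper's implementation.

Solution = str


@dataclass
class Agents:
    generate_initial: Callable[[str], List[Solution]]   # roughly A_init + A_merger
    refine: Callable[[Solution], Solution]               # roughly the targeted-refinement agents
    ensemble: Callable[[List[Solution]], Solution]       # roughly the ensembling agents


def search_best_solution(task: str, agents: Agents,
                         h: Callable[[Solution], float], T: int = 8) -> Solution:
    """Approximate s* = argmax_{s in S} h(s) by iterative, agent-driven refinement."""
    candidates = agents.generate_initial(task)
    best = max(candidates, key=h)
    explored = [best]
    for _ in range(T):                      # outer refinement loop
        proposal = agents.refine(best)
        if h(proposal) > h(best):           # keep a proposal only if it improves the score
            best = proposal
        explored.append(best)
    return agents.ensemble(explored)        # final ensembling over explored solutions
```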

Initial Solution Generation

The initial solution is generated by first retrieving $M$ effective models using web search. This mitigates the reliance on the LLM's internal knowledge, which can lead to suboptimal model choices. The search retrieves both a model description $\mathcal{T}_\mathtt{model}$ and corresponding example code $\mathcal{T}_\mathtt{code}$ to guide the LLM. An agent $\mathcal{A}_\mathtt{init}$ generates code $s_\mathtt{init}^i$ for each retrieved model, which is then evaluated using a task-specific metric $h$. The top-performing scripts are iteratively merged into a consolidated initial solution $s_0$ using an agent $\mathcal{A}_\mathtt{merger}$, which is guided to introduce a simple average ensemble.
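
A minimal sketch of this phase, assuming hypothetical `web_search` and `llm` helpers, might look as follows; it illustrates only the retrieve-generate-evaluate-merge flow, not the paper's prompts or interfaces.

```python
from typing import Callable, List, Tuple

# Sketch of initial solution generation: retrieve M candidate models from the web,
# have the LLM write one script per model, score each with the task metric h, and
# merge the top performers. `web_search` and `llm` are hypothetical placeholders.

def generate_initial_solution(task: str, M: int,
                              web_search: Callable[[str, int], List[Tuple[str, str]]],
                              llm: Callable[[str], str],
                              h: Callable[[str], float]) -> str:
    # Retrieve (model description T_model, example code T_code) pairs from the web.
    retrieved = web_search(f"effective models for: {task}", M)

    # A_init: generate one candidate training script per retrieved model.
    scripts = [
        llm(f"Task: {task}\nModel: {desc}\nExample code:\n{code}\n"
            "Write a complete training script for this task using this model.")
        for desc, code in retrieved
    ]

    # Keep the top-performing scripts according to the task metric h.
    top_scripts = sorted(scripts, key=h, reverse=True)[:2]

    # A_merger: consolidate the top scripts, guided toward a simple average ensemble.
    merged = llm("Merge these scripts into one solution that averages their "
                 "final predictions:\n\n" + "\n\n".join(top_scripts))
    return merged
```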

Code Block Refinement

The iterative refinement phase improves the initial solution $s_0$ over $T$ outer loop steps. At each step $t$, the goal is to improve the current solution $s_t$ by targeting specific code blocks within the ML pipeline. An ablation study, performed by agent $\mathcal{A}_\mathtt{abl}$, identifies the code block with the greatest impact on performance. This agent receives summaries of previous ablation studies $\{\mathcal{T}_\mathtt{abl}^i\}_{i=0}^{t-1}$ as input to encourage exploration of different pipeline parts. The ablation results $r_t$ are summarized by a module $\mathcal{A}_\mathtt{summarize}$ to generate a concise ablation summary $\mathcal{T}_\mathtt{abl}^t$. An extractor module $\mathcal{A}_\mathtt{extractor}$ identifies the code block $c_t$ whose modification had the most significant impact, considering previously refined blocks $\{c_i\}_{i=0}^{t-1}$ as context. An initial plan $p_0$ for refining the code block is generated at the same time.

Once the targeted code block $c_t$ is defined, MLE-STAR explores $K$ potential refinements using an inner loop. An agent $\mathcal{A}_\mathtt{coder}$ implements $p_0$, transforming $c_t$ into a refined block $c_t^0$. A candidate solution $s_t^0$ is formed by substituting $c_t^0$ into $s_t$, and its performance $h(s_t^0)$ is evaluated. Further plans $p_k$ are generated by a planning agent $\mathcal{A}_\mathtt{planner}$, which leverages previous attempts within the current outer step $t$ as feedback. For each plan $p_k$, the coding agent generates the corresponding refined block $c_t^k$, creates the candidate solution $s_t^k$, and evaluates its performance $h(s_t^k)$. After exploring $K$ refinement strategies, the best-performing candidate solution is identified, and the solution for the next outer step, $s_{t+1}$, is updated only if an improvement over $s_t$ is found. This iterative process continues until $t=T$.
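
The nested loops could be sketched roughly as below, with the ablation, extraction, planning, and coding agents reduced to hypothetical callables and block substitution simplified to a string replacement.

```python
from typing import Callable, List, Tuple

# Sketch of the nested refinement loops. Each agent (A_abl, A_summarize, A_extractor,
# A_planner, A_coder) is reduced to a hypothetical callable; h scores a full script.

def refine_solution(s0: str, T: int, K: int,
                    run_ablation: Callable[[str, List[str]], str],          # A_abl + A_summarize
                    extract_block: Callable[[str, str, List[str]], Tuple[str, str]],  # A_extractor -> (c_t, p_0)
                    plan_next: Callable[[str, List[Tuple[str, float]]], str],  # A_planner
                    code_plan: Callable[[str, str], str],                    # A_coder
                    h: Callable[[str], float]) -> str:
    s_t = s0
    ablation_summaries: List[str] = []
    refined_blocks: List[str] = []

    for t in range(T):                                     # outer loop over code blocks
        summary = run_ablation(s_t, ablation_summaries)    # identify the most impactful component
        ablation_summaries.append(summary)
        c_t, plan = extract_block(s_t, summary, refined_blocks)
        refined_blocks.append(c_t)

        attempts: List[Tuple[str, float]] = []             # (plan, score) pairs used as feedback
        best_candidate, best_score = s_t, h(s_t)
        for k in range(K):                                 # inner loop over K refinement plans
            c_tk = code_plan(c_t, plan)
            candidate = s_t.replace(c_t, c_tk)             # substitute the refined block
            score = h(candidate)
            attempts.append((plan, score))
            if score > best_score:
                best_candidate, best_score = candidate, score
            plan = plan_next(c_t, attempts)                # propose the next plan from feedback
        s_t = best_candidate                               # update only on improvement
    return s_t
```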

Ensemble Strategy Exploration

To further improve upon the best single solution, a novel ensembling procedure is introduced. Instead of simply selecting the solution with the highest score, MLE-STAR explores ensemble strategies to combine multiple solutions. Given a set of $L$ distinct solutions $\{s_l\}_{l=1}^L$, the goal is to find an effective ensemble plan $e$ that merges these solutions. The process starts with an initial ensemble plan $e_0$, such as averaging the final predictions. For a fixed number of iterations $R$, an agent $\mathcal{A}_\mathtt{ens\_planner}$ proposes subsequent ensemble plans $e_r$, using the history of previously attempted plans and their performance as feedback. Each plan $e_r$ is implemented via $\mathcal{A}_\mathtt{ensembler}$ to obtain $s_\mathtt{ens}^r$. The ensemble result that achieves the highest performance is selected as the final output.
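
A compact sketch of this ensemble search, with the planner ($\mathcal{A}_\mathtt{ens\_planner}$) and ensembler ($\mathcal{A}_\mathtt{ensembler}$) stubbed as hypothetical callables:

```python
from typing import Callable, List, Optional, Tuple

# Sketch of ensemble strategy exploration over L candidate solutions.
# propose_plan (A_ens_planner) and implement_plan (A_ensembler) are hypothetical.

def explore_ensembles(solutions: List[str], R: int,
                      propose_plan: Callable[[List[Tuple[str, float]]], str],
                      implement_plan: Callable[[List[str], str], str],
                      h: Callable[[str], float]) -> Optional[str]:
    plan = "average the final predictions of all solutions"   # initial plan e_0
    history: List[Tuple[str, float]] = []
    best_script, best_score = None, float("-inf")

    for r in range(R):
        script = implement_plan(solutions, plan)   # s_ens^r implementing plan e_r
        score = h(script)
        history.append((plan, score))
        if score > best_score:
            best_script, best_score = script, score
        plan = propose_plan(history)               # next plan from attempted-plan feedback
    return best_script
```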

Additional Modules for Robustness

To ensure robust performance, MLE-STAR includes additional modules such as a debugging agent $\mathcal{A}_\mathtt{debugger}$ to correct errors in the generated code. A data leakage checker $\mathcal{A}_\mathtt{leakage}$ analyzes the solution script to prevent improper access to test data during training. A data usage checker $\mathcal{A}_\mathtt{data}$ ensures that all relevant provided data sources are utilized.

Figure 2: MLE-STAR's data leakage checker introduces appropriate preprocessing.
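
As an illustration of the bug class the leakage checker targets (this example is not taken from the paper), the snippet below contrasts preprocessing statistics fitted on the full dataset with statistics fitted on the training split only:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative example of the bug class the data leakage checker addresses:
# fitting preprocessing statistics on data that includes the test split.

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X_train, X_test = X[:80], X[80:]

# Leaky: the scaler's mean/std are computed with the test rows included.
leaky_scaler = StandardScaler().fit(X)
X_test_leaky = leaky_scaler.transform(X_test)

# Correct: fit statistics on the training split only, then apply them to the test split.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```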

Experimental Evaluation

MLE-STAR's effectiveness was evaluated on 22 Kaggle competitions from MLE-bench Lite. The results demonstrate that MLE-STAR significantly outperforms baselines, including AIDE, in terms of medal achievement rates. For example, with Gemini-2.0-Flash, MLE-STAR achieved a medal rate of 43.9%, compared to 25.8% for AIDE. The proposed ensemble technique also provides a meaningful improvement.

Figure 3: Model usage (%) on image classification competitions.

The model usage analysis reveals that MLE-STAR leverages more recent and competitive models compared to baselines like AIDE, which tend to rely on older architectures such as ResNet. Manual intervention can further enhance MLE-STAR's performance by incorporating human expertise in model selection and code block targeting. The debugging and data leakage checker modules address potential issues with LLM-generated code, ensuring more reliable and generalizable solutions.

Discussion of Results

Qualitative observations indicate that MLE-STAR selects more up-to-date models. The data leakage checker prevents the model from exploiting information from the test set during training, and the data usage checker ensures that all available data sources are used.

Figure 4: Ensembling solutions.

Analysis of the solution refinement trajectory shows a consistent improvement in performance as MLE-STAR progresses through its refinement steps. The ablation study module effectively targets the most influential code blocks for modification, leading to significant improvements in the early stages of refinement.

Conclusion

MLE-STAR represents a significant advancement in automated machine learning engineering. By integrating web search, targeted code refinement, and ensemble strategy exploration, MLE-STAR achieves state-of-the-art performance in Kaggle competitions. The framework's modular design and additional safety checks contribute to its robustness and generalizability. One limitation is that, because the competitions are publicly available, the underlying LLMs may have been exposed to relevant discussions of them during training.

The development of MLE-STAR highlights the potential of LLMs to automate and improve the ML model development process. Future research could focus on extending the framework to handle more complex tasks, incorporating more sophisticated reasoning and planning capabilities, and further reducing the need for human intervention. The broader impacts of MLE-STAR include lowering the barrier to entry for individuals and organizations to leverage ML, and fostering innovation across various sectors.
