MLE-STAR: Agentic AutoML System
- The paper introduces MLE-STAR as a novel agentic system that automates model engineering by integrating LLMs, web-based state-of-the-art retrieval, and block-level refinement.
- It uses iterative ablation studies and targeted code block exploration to significantly improve validation scores on competitive benchmarks like MLE-bench Lite.
- The system uniquely combines candidate merging and ensemble strategies to boost model robustness and adaptivity across diverse machine learning tasks.
MLE-STAR refers to a novel agentic system for automated machine learning engineering that leverages LLMs, external knowledge retrieval, and component-level targeted refinement strategies. It is designed to overcome key limitations of previous LLM-based agents by (i) retrieving state-of-the-art, task-specific model solutions from the web and (ii) iteratively refining candidate solutions via deep exploration at the code block level, guided by ablation studies and ensemble strategies. MLE-STAR’s operational paradigm enables superior task adaptation and code quality for competitive machine learning scenarios, as demonstrated by its performance on the MLE-bench Lite Kaggle benchmark suite (Nam et al., 27 May 2025).
1. System Architecture and Workflow
MLE-STAR comprises multiple specialized agent modules orchestrated in a sequential and iterative workflow:
- Retrieval phase: The retriever agent formulates queries to a web search engine, sourcing candidate model descriptions relevant to the input data and task specification.
- Candidate evaluation: Each retrieved solution is instantiated as executable code and its validation score is measured.
- Solution merging: The merger agent sequentially integrates candidate solutions by block-level replacement or ensembling, retaining only those code modifications that improve the validation score.
- Component-level ablation and refinement: The system's core innovation. After assembling an initial solution, MLE-STAR runs a bounded number of outer-loop iterations, each performing an ablation analysis to identify the critical code block whose modification yields maximal improvement.
- Iterative target block refinement: Within each targeted block, a bounded inner loop applies refinement plans proposed by a planning agent, using code generation and performance feedback to converge on a high-performing implementation (a schematic sketch of this workflow follows below).
This modular agent framework enables MLE-STAR to combine external knowledge, LLM generation, and code component selection with iterative ablation-based optimization—a departure from global solution modifications typical of prior LLM agents.
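The loop structure can be summarized with the following Python-style sketch. It is illustrative only: the `agents` object and its methods (`retrieve`, `to_code`, `score`, `merge`, `ablate_and_pick_block`, `refine_block`, `replan`) are hypothetical stand-ins for the LLM-driven modules described above, and the loop budgets are arbitrary placeholders rather than the paper's settings.
```python
# Schematic sketch of the MLE-STAR loop structure (an assumption-laden
# illustration; not the paper's actual interfaces or budgets).

def mle_star_sketch(task, data, agents, outer_budget=4, inner_budget=4):
    # Retrieval phase: source candidate model descriptions from the web.
    candidates = agents.retrieve(task)

    # Candidate evaluation: instantiate each description as code and score it.
    solutions = [agents.to_code(c, task) for c in candidates]
    best = max(solutions, key=lambda s: agents.score(s, data))

    # Solution merging: fold in the remaining candidates, keeping only improvements.
    for s in solutions:
        if s is best:
            continue
        merged = agents.merge(best, s)
        if agents.score(merged, data) > agents.score(best, data):
            best = merged

    # Outer loop: ablation analysis selects the most impactful code block.
    for _ in range(outer_budget):
        block, plan = agents.ablate_and_pick_block(best, data)
        # Inner loop: refine the targeted block using performance feedback.
        for _ in range(inner_budget):
            candidate = agents.refine_block(best, block, plan)
            if agents.score(candidate, data) > agents.score(best, data):
                best = candidate
            plan = agents.replan(plan, candidate)  # update plan from feedback
    return best
```
In the described system, each of these calls corresponds to a dedicated agent (retriever, coder, merger, ablation, planner), with the validation score serving as the acceptance criterion at every step.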
2. Targeted Ablation and Deep Component Exploration
The hallmark feature of MLE-STAR is its ablation-guided, targeted code block exploration:
- Ablation study phase: Using the ablation agent, the system creates variant scripts with specific blocks deactivated or altered, then records performance deltas to produce an ablation summary (a minimal sketch of this selection step appears below).
- Critical block extraction: An extractor agent identifies the code block with the largest impact and proposes an initial refinement plan.
- Nested refinement iteration: The planner and coder agents iterate over block proposals, each incorporating score feedback from previous changes and the exploration history to generate improved code variants.
- Block replacement and solution update: After refinements, the best-performing candidate is incorporated into the full solution, and the ablation-summary log is updated.
This architecture supports deep exploration in critical ML pipeline segments (especially feature engineering routines), providing fine-grained adjustment rather than undirected structural changes. It enables prioritization of model selection and preprocessing transformations that significantly affect downstream performance.
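As a concrete illustration of the ablation step, the sketch below deactivates one block at a time and selects the block whose removal most degrades the validation score. This is a simplified stand-in under stated assumptions: in MLE-STAR the ablation scripts, the summary of performance deltas, and the refinement plan are produced by LLM agents, not by a literal loop, and `run_variant` is a hypothetical helper.
```python
from typing import Callable, Dict

def pick_critical_block(
    blocks: Dict[str, str],                            # block name -> code text
    run_variant: Callable[[Dict[str, str]], float],    # returns validation score
) -> str:
    """Return the block whose deactivation degrades the score the most.

    Hypothetical helper: `run_variant` assembles the blocks into a script,
    executes it, and reports a validation score (higher is better).
    """
    base_score = run_variant(blocks)
    deltas = {}
    for name in blocks:
        ablated = dict(blocks)
        ablated[name] = ""                 # deactivate this block (simplification)
        deltas[name] = base_score - run_variant(ablated)
    # The block with the largest positive delta is the most impactful one.
    return max(deltas, key=deltas.get)
```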
3. Ensemble Construction and Selection Strategies
Beyond single-model optimization, MLE-STAR introduces advanced ensemble formation:
- Parallel final solution generation: Following iterative refinement, multiple high-performing candidate solutions are retained.
- Ensemble planning agent: Proposes ensemble strategies (e.g., averaging predictions, stacking with meta-learners), utilizing feedback from previous ensembling attempts.
- Iterative ensemble refinement: Over successive ensemble iterations, candidate plans are assessed and adjusted, and the ensemble output with the maximal validation score is selected (a minimal sketch appears below).
This explicit ensemble module distinguishes MLE-STAR from conventional AutoML agents and previous LLM solutions, systematically leveraging the complementary strengths of diverse candidates for robust variance reduction on competitive benchmarks.
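A minimal sketch of the ensemble selection step follows, assuming each candidate solution has already produced validation predictions. The two plans shown (simple averaging and score-weighted averaging) are illustrative placeholders introduced here; the actual planning agent proposes and iterates over richer strategies such as stacking with meta-learners.
```python
import numpy as np

def select_ensemble(preds, y_val, score_fn):
    """Pick the ensembling rule with the highest validation score.

    Assumptions for this sketch: `preds` is a list of per-candidate validation
    predictions of identical shape, and `score_fn(pred, y_val)` returns a
    higher-is-better, non-negative validation score.
    """
    preds = [np.asarray(p, dtype=float) for p in preds]
    base_scores = np.array([score_fn(p, y_val) for p in preds])
    weights = base_scores / base_scores.sum()   # weight candidates by their score

    plans = {
        "simple_average": np.mean(preds, axis=0),
        "score_weighted_average": np.average(preds, axis=0, weights=weights),
    }
    scored = {name: score_fn(pred, y_val) for name, pred in plans.items()}
    best = max(scored, key=scored.get)
    return best, plans[best]
```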
4. Experimental Evaluation and Benchmark Achievement
MLE-STAR’s empirical results, as reported:
- Performance on MLE-bench Lite (Kaggle competitions): The agent attains a medal rate of 64% across 22 tasks using the Gemini-2.5-Pro LLM, substantially exceeding the baseline AIDE agent's 25.8% and outperforming DS-Agent and AutoGluon in both medal count and average performance metrics.
- Gold medal improvement: Ensemble strategies directly elevate the proportion of gold medals and solutions above the median score.
- Component ablation analysis: Ablation experiments demonstrate the utility of targeted block refinement not only for individual metric improvement but also for boosting aggregate system outputs via ensembling.
- Generalization across modalities: The agent's modular structure is designed for extensibility across data modalities (e.g., tabular, image, text), contingent upon further experimental validation.
A plausible implication is that targeted block-level refinement, coupled with external retrieval and systematic ensembling, is a dominant methodology for agentic ML code optimization in competitive data science.
5. Comparison with Previous Agentic and AutoML Systems
MLE-STAR’s design specifically addresses limitations of prior agents:
- Reliance on internal LLM knowledge: Previous LLM-based MLE agents typically generate solutions from model-internal knowledge, restricting adaptability to current best practices and limiting performance on task-specific benchmarks.
- Coarse code-level exploration: Rather than whole-code mutations, MLE-STAR employs granular search through ablation-driven block replacement.
- Integration of dynamic web knowledge: The agent can incorporate new state-of-the-art model architectures as soon as they are published and indexed, maintaining competitive viability over time.
- Robust debugging and leakage prevention modules: Although ongoing work is required for complete robustness, these components enable higher reliability than AutoML agents reliant on static rule sets.
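As a toy illustration of what a leakage safeguard might look like, the heuristic below flags lines in a generated script where a preprocessor appears to be fit on test data. It is an assumption-laden simplification: the checkers described for MLE-STAR are themselves LLM agents that inspect the generated code, not regular expressions.
```python
import re

# Naive pattern (assumption, not the paper's checker): a `.fit(...)` or
# `.fit_transform(...)` call whose arguments mention test data, a common
# source of leakage (e.g., scaler.fit(X_test)).
LEAKY_FIT = re.compile(r"\.fit(_transform)?\s*\(\s*[^)]*test[^)]*\)", re.IGNORECASE)

def flag_possible_leakage(script: str) -> list[str]:
    """Return the lines of a generated training script that look leaky."""
    return [line.strip() for line in script.splitlines() if LEAKY_FIT.search(line)]

# Example: flags "scaler.fit_transform(X_test)" but not "scaler.transform(X_test)".
```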
6. Future Directions and Research Implications
MLE-STAR’s modular agentic framework opens several avenues for future work:
- Expansion to multimodal tasks: The targeted-ablation methodology can be adapted for image, text, and audio pipelines, with new block-selection heuristics and more elaborate code decomposition strategies.
- Advanced ensemble search: Integrating richer ensembling protocols, such as non-linear blending or meta-learned ensemble weights.
- Robustness and reliability augmentation: Refinement of debugging and data leakage prevention modules, as undetected LLM misbehaviors can undermine automated solution validity.
- Scalability to hierarchical ML workflows: Extension beyond one-step pipelines to complex multi-stage ML systems, leveraging the recursive composition and ablation logging mechanisms.
Given the design’s inherent ability to incorporate external knowledge and adapt to model innovations, MLE-STAR is positioned to remain effective as LLMs and ML methods evolve.
In summary, MLE-STAR operationalizes a multi-agent, retrieval-augmented, block-level refined approach to automated machine learning engineering. Its experimental results, component modularity, and targeted exploration strategy distinguish it as a significant advancement in competitive machine learning automation (Nam et al., 27 May 2025).