MLE-STAR: Agentic AutoML System

Updated 11 October 2025
  • The paper introduces MLE-STAR as a novel agentic system that automates model engineering by integrating LLMs, web-based state-of-the-art retrieval, and block-level refinement.
  • It uses iterative ablation studies and targeted code block exploration to significantly improve validation scores on competitive benchmarks like MLE-bench Lite.
  • The system uniquely combines candidate merging and ensemble strategies to boost model robustness and adaptivity across diverse machine learning tasks.

MLE-STAR refers to a novel agentic system for automated machine learning engineering that leverages LLMs, external knowledge retrieval, and component-level targeted refinement strategies. It is designed to overcome key limitations of previous LLM-based agents by (i) retrieving state-of-the-art, task-specific model solutions from the web and (ii) iteratively refining candidate solutions via deep exploration at the code block level, guided by ablation studies and ensemble strategies. MLE-STAR’s operational paradigm enables superior task adaptation and code quality for competitive machine learning scenarios, as demonstrated by its performance on the MLE-bench Lite Kaggle benchmark suite (Nam et al., 27 May 2025).

1. System Architecture and Workflow

MLE-STAR comprises multiple specialized agent modules orchestrated in a sequential and iterative workflow:

  • Retrieval phase: The retriever agent formulates queries to a web search engine, sourcing $M$ candidate model solutions ($\mathcal{T}_\text{model}^j$) relevant to the input data and task specification ($\mathcal{T}_\text{task}$).
  • Candidate evaluation: Each retrieved solution is instantiated as executable code ($s_\text{init}^j$) and its validation score $h(s_\text{init}^j)$ is measured.
  • Solution merging: The merger agent sequentially integrates candidate solutions by block-level replacement or ensembling, retaining only those code modifications that improve the validation score.
  • Component-level ablation and refinement: The core innovation. After assembling an initial solution $s_0$, the system initiates outer loop iterations (up to $T$ cycles), performing ablation analysis to identify critical code blocks whose modification yields maximal improvement.
  • Iterative target block refinement: Within each targeted block, an inner loop (up to $K$ cycles) applies refinement plans ($p_k$), utilizing code generation and performance feedback to converge on high-performing implementations, guided by a planning agent.

This modular agent framework enables MLE-STAR to combine external knowledge, LLM generation, and code component selection with iterative ablation-based optimization—a departure from global solution modifications typical of prior LLM agents.
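The retrieve-evaluate-merge phase of this workflow can be made concrete with a minimal Python sketch. All helper functions below (web_search, generate_code, validation_score, merge_solutions) are hypothetical stubs standing in for the paper's LLM-driven agents and script execution, and the function mle_star_initial_solution is an illustrative name, not the authors' actual API.

```python
def web_search(task_description, m):
    """Hypothetical stub: retrieve M candidate model descriptions for the task."""
    return [f"candidate model {j}" for j in range(m)]

def generate_code(model_description, task_description):
    """Hypothetical stub: LLM call turning a model description into a script."""
    return f"# script implementing {model_description} for {task_description}"

def validation_score(script):
    """Hypothetical stub: execute the script and return a validation metric."""
    return 0.0  # the real system runs the generated code on held-out data

def merge_solutions(base_script, candidate_script):
    """Hypothetical stub: merge a candidate into the base solution at block level."""
    return base_script + "\n" + candidate_script

def mle_star_initial_solution(task_description, m=4):
    # Retrieval phase: source M candidate solutions from the web.
    candidates = [generate_code(c, task_description)
                  for c in web_search(task_description, m)]
    # Candidate evaluation: keep the best-scoring script as the starting point.
    best = max(candidates, key=validation_score)
    # Solution merging: fold in remaining candidates only if the score improves.
    for cand in candidates:
        if cand is best:
            continue
        merged = merge_solutions(best, cand)
        if validation_score(merged) > validation_score(best):
            best = merged
    return best  # s_0, the initial solution passed to ablation-guided refinement
```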

2. Targeted Ablation and Deep Component Exploration

The hallmark feature of MLE-STAR is its ablation-guided, targeted code block exploration:

  • Ablation study phase: Using the ablation agent ($\mathcal{A}_\text{abl}$), the system creates variant scripts with specific blocks deactivated or altered, then records performance deltas to generate an ablation summary ($\mathcal{T}_\text{abl}^t$).
  • Critical block extraction: An extractor agent ($\mathcal{A}_\text{extractor}$) identifies the code block $c_t$ with the largest impact and proposes an initial refinement plan $p_0$.
  • Nested refinement iteration: The planner agent and coder agent iterate over block proposals, each incorporating feedback from previous changes (score $h(s_t^k)$) and exploration history to generate improved code variants.
  • Block replacement and solution update: After $K$ refinements, the best-performing candidate is incorporated into the full solution, and the ablation-summary log is updated.

This architecture supports deep exploration in critical ML pipeline segments (especially feature engineering routines), providing fine-grained adjustment rather than undirected structural changes. It enables prioritization of model selection and preprocessing transformations that significantly affect downstream performance.
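A compact sketch of this nested loop is given below. The agent calls (ablation_agent, extractor_agent, planner_agent, coder_agent) and validation_score are hypothetical placeholders for LLM prompts and script execution, included only to make the control flow of the outer $T$-cycle and inner $K$-cycle loops concrete.

```python
def ablation_agent(solution):
    """Hypothetical stub: per-block performance deltas from ablation variants."""
    return {"feature_engineering": -0.05, "model_head": -0.01}

def extractor_agent(solution, ablation_summary):
    """Hypothetical stub: pick the most impactful block and an initial plan."""
    block = max(ablation_summary, key=lambda b: abs(ablation_summary[b]))
    return block, f"improve {block}"

def planner_agent(block, history):
    """Hypothetical stub: propose the next refinement plan from past attempts."""
    return f"refine {block}, attempt {len(history)}"

def coder_agent(solution, block, plan):
    """Hypothetical stub: rewrite only the targeted block according to the plan."""
    return solution + f"\n# {plan}"

def validation_score(solution):
    """Hypothetical stub: execute the solution and return its validation metric."""
    return 0.0

def refine(solution, T=3, K=4):
    for _ in range(T):                      # outer loop over ablation rounds
        summary = ablation_agent(solution)
        block, plan = extractor_agent(solution, summary)
        history, best, best_score = [], solution, validation_score(solution)
        for _ in range(K):                  # inner loop over refinement plans
            candidate = coder_agent(solution, block, plan)
            score = validation_score(candidate)
            history.append((plan, score))
            if score > best_score:          # keep only improving block rewrites
                best, best_score = candidate, score
            plan = planner_agent(block, history)
        solution = best                     # incorporate the best block variant
    return solution
```

The design choice mirrored here is that only the targeted block changes in each inner iteration, so any score improvement can be attributed to a specific pipeline component rather than to a whole-script mutation.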

3. Ensemble Construction and Selection Strategies

Beyond single-model optimization, MLE-STAR introduces advanced ensemble formation:

  • Parallel final solution generation: Following iterative refinement, multiple high-performing candidate solutions ($s_\text{final}^l$ for $l = 1, \ldots, L$) are retained.
  • Ensemble planning agent ($\mathcal{A}_\text{ens\_planner}$): Proposes ensemble strategies (e.g., averaging predictions, stacking with meta-learners), utilizing feedback from historical ensembling efforts.
  • Iterative ensemble refinement: Over $R$ ensemble iterations, candidate plans are assessed and adjusted, selecting the ensemble output ($s_\text{ens}^*$) with the maximal validation score.

This explicit ensemble module distinguishes MLE-STAR from conventional AutoML agents and previous LLM solutions, systematically leveraging the complementary strengths of diverse candidates for robustness and variance reduction on competitive benchmarks.
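As a rough illustration of the ensemble-refinement loop, the sketch below iterates over $R$ candidate ensembling strategies on held-out predictions and keeps the best-scoring output. The planner, combination rule, and scoring function are assumed placeholders under simple stated assumptions (regression-style predictions, negative MSE as the metric), not the system's actual components.

```python
import numpy as np

def ens_planner_agent(history):
    """Hypothetical stub: propose an ensembling strategy given past attempts."""
    return "average" if not history else "weighted_average"

def build_ensemble(predictions, strategy):
    """Combine per-candidate validation predictions under a strategy."""
    preds = np.stack(predictions)            # shape: (L, n_samples)
    if strategy == "average":
        return preds.mean(axis=0)
    # Illustrative alternative: fixed weights per candidate (placeholder logic).
    weights = np.linspace(1.0, 2.0, len(predictions))
    return np.average(preds, axis=0, weights=weights)

def score(ensembled, targets):
    """Validation metric; here, negative mean squared error (placeholder)."""
    return -float(np.mean((ensembled - targets) ** 2))

def ensemble_search(predictions, targets, R=5):
    history, best_out, best_score = [], None, float("-inf")
    for _ in range(R):                        # R ensemble-refinement iterations
        strategy = ens_planner_agent(history)
        out = build_ensemble(predictions, strategy)
        s = score(out, targets)
        history.append((strategy, s))
        if s > best_score:                    # keep the best-scoring ensemble
            best_out, best_score = out, s
    return best_out                           # s_ens^*, the selected ensemble

# Example usage with two dummy candidates on a 4-sample validation set:
# y = np.array([1.0, 0.0, 1.0, 1.0])
# preds = [np.array([0.9, 0.1, 0.8, 0.7]), np.array([0.8, 0.2, 0.9, 0.6])]
# best = ensemble_search(preds, y)
```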

4. Experimental Evaluation and Benchmark Achievement

MLE-STAR’s empirical results, as reported, include the following:

  • Performance on MLE-bench Lite (Kaggle competitions): The agent attains a medal rate of 64% on 22 tasks using the Gemini-2.5-Pro LLM, substantially exceeding the baseline AIDE agent’s 25.8% and outperforming DS-Agent and AutoGluon in both medal count and average performance metrics.
  • Gold medal improvement: Ensemble strategies directly elevate the proportion of gold medals and solutions above the median score.
  • Component ablation analysis: Ablation experiments demonstrate the utility of targeted block refinement not only for individual metric improvement but also for boosting aggregate system outputs via ensembling.
  • Generalization across modalities: While current evaluation focuses on tabular tasks, the agent’s modular structure is designed for extensibility to other domains (e.g., image, text), contingent upon further experimental validation.

A plausible implication is that targeted block-level refinement, coupled with external retrieval and systematic ensembling, is a dominant methodology for agentic ML code optimization in competitive data science.

5. Comparison with Previous Agentic and AutoML Systems

MLE-STAR’s design specifically addresses limitations of prior agents:

  • Reliance on internal LLM knowledge: Previous LLM-based MLE agents typically generate solutions from model-internal knowledge, restricting adaptability to current best practices and limiting performance on task-specific benchmarks.
  • Coarse code-level exploration: Rather than whole-code mutations, MLE-STAR employs granular search through ablation-driven block replacement.
  • Integration of dynamic web knowledge: The agent can incorporate new state-of-the-art model architectures as soon as they are published and indexed, maintaining competitive viability over time.
  • Robust debugging and leakage prevention modules: Although ongoing work is required for complete robustness, these components enable higher reliability than AutoML agents reliant on static rule sets.

6. Future Directions and Research Implications

MLE-STAR’s modular agentic framework opens several avenues for future work:

  • Expansion to multimodal tasks: The targeted-ablation methodology can be adapted for image, text, and audio pipelines, with new block-selection heuristics and more elaborate code decomposition strategies.
  • Advanced ensemble search: Integrating richer ensembling protocols, such as non-linear blendings or meta-learned ensemble weights.
  • Robustness and reliability augmentation: Refinement of debugging and data leakage prevention modules, as undetected LLM misbehaviors can undermine automated solution validity.
  • Scalability to hierarchical ML workflows: Extension beyond one-step pipelines to complex multi-stage ML systems, leveraging the recursive composition and ablation logging mechanisms.

Given the design’s inherent ability to incorporate external knowledge and adapt to model innovations, MLE-STAR is positioned to remain effective as LLMs and ML methods evolve.


In summary, MLE-STAR operationalizes a multi-agent, retrieval-augmented, block-level refined approach to automated machine learning engineering. Its experimental results, component modularity, and targeted exploration strategy distinguish it as a significant advancement in competitive machine learning automation (Nam et al., 27 May 2025).
