
MLE-STAR: Agentic AutoML System

Updated 11 October 2025
  • The paper introduces MLE-STAR as a novel agentic system that automates model engineering by integrating LLMs, web-based state-of-the-art retrieval, and block-level refinement.
  • It uses iterative ablation studies and targeted code block exploration to significantly improve validation scores on competitive benchmarks like MLE-bench Lite.
  • The system uniquely combines candidate merging and ensemble strategies to boost model robustness and adaptivity across diverse machine learning tasks.

MLE-STAR refers to a novel agentic system for automated machine learning engineering that leverages LLMs, external knowledge retrieval, and component-level targeted refinement strategies. It is designed to overcome key limitations of previous LLM-based agents by (i) retrieving state-of-the-art, task-specific model solutions from the web and (ii) iteratively refining candidate solutions via deep exploration at the code block level, guided by ablation studies and ensemble strategies. MLE-STAR’s operational paradigm enables superior task adaptation and code quality for competitive machine learning scenarios, as demonstrated by its performance on the MLE-bench Lite Kaggle benchmark suite (Nam et al., 27 May 2025).

1. System Architecture and Workflow

MLE-STAR comprises multiple specialized agent modules orchestrated in a sequential and iterative workflow:

  • Retrieval phase: The retriever agent formulates queries to a web search engine, sourcing M candidate model solutions 𝒯_model^j relevant to the input data and task specification 𝒯_task.
  • Candidate evaluation: Each retrieved solution is instantiated as executable code s_init^j and its validation score h(s_init^j) is measured.
  • Solution merging: The merger agent sequentially integrates candidate solutions by block-level replacement or ensembling, retaining only those code modifications that improve the validation score.
  • Component-level ablation and refinement: The core innovation: after assembling an initial solution s_0, the system initiates outer loop iterations (up to T cycles), performing ablation analysis to identify critical code blocks whose modification yields maximal improvement.
  • Iterative target block refinement: Within each targeted block, an inner loop (up to K cycles) applies refinement plans p_k, utilizing code generation and performance feedback to converge on high-performing implementations, guided by a planning agent.

This modular agent framework enables MLE-STAR to combine external knowledge, LLM generation, and code component selection with iterative ablation-based optimization—a departure from global solution modifications typical of prior LLM agents.
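The workflow above can be condensed into a toy control loop. This is a minimal sketch, not the paper's implementation: retrieval and merging are assumed to have already produced the initial solution s_0, every agent call is a hypothetical stub, and the validation score h(s) is a toy sum over block values rather than a metric measured by executing generated code.

```python
# Toy sketch of the MLE-STAR outer/inner refinement loop (hypothetical stubs).

def score(solution):
    """Stand-in for the validation score h(s): sum of block values."""
    return sum(solution.values())

def refine_block(value, plan):
    """Coder-agent stub: apply a refinement plan p_k to one code block."""
    return value + plan  # toy plan: each refinement adds a fixed gain

def mle_star(init_blocks, T=2, K=3):
    s = dict(init_blocks)  # current solution s_0 as named code blocks
    for t in range(T):  # outer loop: T ablation cycles
        # Ablation stub: target the block whose removal hurts h(s) most.
        target = max(
            s, key=lambda b: score(s) - score({k: v for k, v in s.items() if k != b})
        )
        best = s[target]
        for k in range(K):  # inner loop: K refinement plans
            candidate = refine_block(best, plan=1.0)
            if candidate > best:  # keep only score-improving variants
                best = candidate
        s[target] = best  # block replacement and solution update
    return s

final = mle_star({"features": 1.0, "model": 2.0, "ensemble": 0.5})
```

With these stubs the loop repeatedly targets the highest-impact block and refines it; in the real system each stub is an LLM-backed agent and h(s) comes from running the generated script on held-out validation data.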

2. Targeted Ablation and Deep Component Exploration

The hallmark feature of MLE-STAR is its ablation-guided, targeted code block exploration:

  • Ablation study phase: Using the ablation agent 𝒜_abl, the system creates variant scripts with specific blocks deactivated or altered, then records performance deltas to generate an ablation summary 𝒯_abl^t.
  • Critical block extraction: An extractor agent 𝒜_extractor identifies the code block c_t with the largest impact and proposes an initial refinement plan p_0.
  • Nested refinement iteration: The planner agent and coder agent iterate over block proposals, each incorporating feedback from previous changes (score h(s_t^k)) and exploration history to generate improved code variants.
  • Block replacement and solution update: After K refinements, the best-performing candidate is incorporated into the full solution, and the ablation-summary log is updated.

This architecture supports deep exploration in critical ML pipeline segments (especially feature engineering routines), providing fine-grained adjustment rather than undirected structural changes. It enables prioritization of model selection and preprocessing transformations that significantly affect downstream performance.
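The ablation-summary and extraction steps can be illustrated with a small stand-in. A toy `evaluate` function replaces actually executing each variant script, and integer block contributions are assumed so the deltas are exact; none of these names come from the paper.

```python
# Hypothetical sketch of ablation-summary generation and critical-block extraction.
# `evaluate` stands in for running a variant script and measuring h(s).

def ablation_summary(blocks, evaluate):
    """Ablation-agent stub: disable each block in turn, record score deltas."""
    base = evaluate(blocks)
    deltas = {}
    for name in blocks:
        variant = {k: v for k, v in blocks.items() if k != name}  # block removed
        deltas[name] = base - evaluate(variant)  # score drop when disabled
    return deltas

def critical_block(deltas):
    """Extractor-agent stub: pick the block c_t with the largest impact."""
    return max(deltas, key=deltas.get)

# Toy pipeline: each block contributes a fixed integer amount to the score.
blocks = {"imputation": 1, "feature_eng": 6, "model_head": 3}
summary = ablation_summary(blocks, evaluate=lambda b: sum(b.values()))
target = critical_block(summary)
```

Here feature engineering dominates the deltas and becomes the refinement target, mirroring the paper's observation that feature-engineering routines are frequently the critical segment.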

3. Ensemble Construction and Selection Strategies

Beyond single-model optimization, MLE-STAR introduces advanced ensemble formation:

  • Parallel final solution generation: Following iterative refinement, multiple high-performing candidate solutions s_final^l, l = 1, …, L, are retained.
  • Ensemble planning agent 𝒜_ens_planner: Proposes ensemble strategies (e.g., averaging predictions, stacking with meta-learners), utilizing feedback from historical ensembling efforts.
  • Iterative ensemble refinement: Over R ensemble iterations, candidate plans are assessed and adjusted, selecting the ensemble output s_ens^* with maximal validation score.

This explicit ensemble module distinguishes MLE-STAR from conventional AutoML agents and previous LLM solutions, systematically leveraging the complementary strengths of diverse candidates for robustness and variance reduction on competitive benchmarks.
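As a hedged illustration, the ensemble loop can be reduced to searching over a fixed set of candidate plans scored by a toy validation metric (negative mean absolute error). In MLE-STAR the plans are proposed and revised by the 𝒜_ens_planner agent over R iterations rather than enumerated up front; the plan names and prediction values below are invented for the example.

```python
import statistics

def ensemble_search(preds, y_true, plans):
    """Try each candidate ensemble plan; keep the one with the best
    validation score (here: negative mean absolute error)."""
    def score(combined):
        return -statistics.fmean(abs(c, ) if False else abs(c - t) for c, t in zip(combined, y_true))
    best_plan, best_score = None, float("-inf")
    for name, combine in plans.items():  # stands in for R planner iterations
        s = score(combine(preds))
        if s > best_score:
            best_plan, best_score = name, s
    return best_plan, best_score

# Two final candidates s_final^1, s_final^2 with anti-correlated errors.
preds = [[1.25, 0.25, 0.75], [0.75, -0.25, 1.25]]
y_true = [1.0, 0.0, 1.0]
plans = {
    "pick_first": lambda p: p[0],
    "pick_second": lambda p: p[1],
    "mean": lambda p: [statistics.fmean(col) for col in zip(*p)],
}
best, best_score = ensemble_search(preds, y_true, plans)
```

Averaging wins here because the two candidates' errors cancel; the real system evaluates each proposed plan by executing it and scoring on held-out validation data.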

4. Experimental Evaluation and Benchmark Achievement

MLE-STAR’s empirical results, as reported, include:

  • Performance on MLE-bench Lite (Kaggle competitions): The agent attains a medal rate of 64% on 22 tasks using Gemini-2.5-Pro LLM, substantially exceeding the baseline AIDE agent’s 25.8% and outperforming DS-Agent and AutoGluon in both medal count and average performance metrics.
  • Gold medal improvement: Ensemble strategies directly elevate the proportion of gold medals and solutions above the median score.
  • Component ablation analysis: Ablation experiments demonstrate the utility of targeted block refinement not only for individual metric improvement but also for boosting aggregate system outputs via ensembling.
  • Generalization across modalities: While current evaluation focuses on tabular tasks, the agent’s modular structure is designed for extensibility to other domains (e.g., image, text), contingent upon further experimental validation.

A plausible implication is that targeted block-level refinement, coupled with external retrieval and systematic ensembling, is a dominant methodology for agentic ML code optimization in competitive data science.

5. Comparison with Previous Agentic and AutoML Systems

MLE-STAR’s design specifically addresses limitations of prior agents:

  • Reliance on internal LLM knowledge: Previous LLM-based MLE agents typically generate solutions from model-internal knowledge, restricting adaptability to current best practices and limiting performance on task-specific benchmarks.
  • Coarse code-level exploration: Rather than whole-code mutations, MLE-STAR employs granular search through ablation-driven block replacement.
  • Integration of dynamic web knowledge: The agent can incorporate new state-of-the-art model architectures as soon as they are published and indexed, maintaining competitive viability over time.
  • Robust debugging and leakage prevention modules: Although ongoing work is required for complete robustness, these components enable higher reliability than AutoML agents reliant on static rule sets.

6. Future Directions and Research Implications

MLE-STAR’s modular agentic framework opens several avenues for future work:

  • Expansion to multimodal tasks: The targeted-ablation methodology can be adapted for image, text, and audio pipelines, with new block-selection heuristics and more elaborate code decomposition strategies.
  • Advanced ensemble search: Integrating richer ensembling protocols, such as non-linear blendings or meta-learned ensemble weights.
  • Robustness and reliability augmentation: Refinement of debugging and data leakage prevention modules, as undetected LLM misbehaviors can undermine automated solution validity.
  • Scalability to hierarchical ML workflows: Extension beyond one-step pipelines to complex multi-stage ML systems, leveraging the recursive composition and ablation logging mechanisms.

Given the design’s inherent ability to incorporate external knowledge and adapt to model innovations, MLE-STAR is positioned to remain effective as LLMs and ML methods evolve.


In summary, MLE-STAR operationalizes a multi-agent, retrieval-augmented, block-level refined approach to automated machine learning engineering. Its experimental results, component modularity, and targeted exploration strategy distinguish it as a significant advancement in competitive machine learning automation (Nam et al., 27 May 2025).
