AutoMLGen: LLM-Based MLE Coding Agent

Updated 4 July 2026

AutoMLGen is an LLM-based coding agent for machine learning engineering tasks that integrates curated domain knowledge, Monte Carlo Graph Search, and fine-grained operators.
It leverages a graph-augmented search strategy to draft, debug, and ensemble complete ML pipelines under fixed computational budgets.
Empirical results on MLE-Bench demonstrate improved medal rates and submission quality compared to baseline AutoML systems.

AutoMLGen denotes an LLM-based coding agent for machine learning engineering tasks such as Kaggle competitions and MLE-Bench. In its 2025 formulation, it integrates a curated ML domain knowledge base, Monte Carlo Graph Search (MCGS), and fine-grained operators to draft, debug, improve, fuse, and ensemble end-to-end pipelines under a fixed budget (Du et al., 9 Oct 2025). The term also overlaps with a broader AutoML discourse in which “generalized AutoML” means fully automated machine learning without expert involvement or expert knowledge; that stronger notion is explicitly distinguished from the intermediate, human-scaffolded systems that dominate current practice (Liu, 2018).

1. Conceptual scope and historical placement

AutoML literature has long separated incremental automation of parts of the machine-learning workflow from the stronger ideal of full autonomy. One influential taxonomy divides the field into narrow AutoML and generalized AutoML. Narrow AutoML covers intermediate techniques that automate portions of the pipeline but still depend on expert-designed outer loops, search spaces, priors, or meta-procedures. Generalized AutoML is defined as fully automated ML without expert involvement or expert knowledge, and is explicitly linked to AGI or “strong AI” (Liu, 2018).

Within that taxonomy, the named system AutoMLGen belongs to the narrow operational regime. It is specialized for Machine Learning Engineering (MLE) settings such as Kaggle and MLE-Bench, where success depends on iterative metric optimization, feature engineering, training strategies, and competition heuristics rather than merely producing runnable code. Its design assumes a curated knowledge base, a bounded operator set, and a search budget of fixed duration, all of which place it well inside the contemporary agentic-AutoML landscape rather than the stronger AGI-adjacent ideal of generalized AutoML (Du et al., 9 Oct 2025).

This distinction matters because AutoMLGen is sometimes read as a claim to complete automation. The underlying research record suggests otherwise. The broader AutoML literature repeatedly argues that most existing systems remain dependent on human configuration at the outermost layer, and that additional automation often arrives with additional computational burden rather than with the elimination of expert priors (Liu, 2018). A plausible implication is that AutoMLGen should be understood less as the endpoint of AutoML and more as a high-performance system for a demanding, code-centric subdomain of it.

2. System architecture and execution model

AutoMLGen is organized as an end-to-end coding agent for MLE tasks. Its input is a task $T$, comprising a problem description, data access pattern, evaluation metric, submission format, and constraints. Its core components are an ML Domain Knowledge Base, an LLM backbone, Monte Carlo Graph Search, and a sandboxed execution environment. Its outputs are the best pipeline $s^*$ found under budget, the final submission files, and optionally ensembles of top-$K$ solutions (Du et al., 9 Oct 2025).

The execution loop begins with initialization. The task description is parsed; relevant priors $R_{KB}(T)$ are retrieved from the knowledge base; and the LLM uses the task together with retrieved knowledge to draft initial pipeline candidates. Search then proceeds over a graph of candidate solutions. Each node stores a pipeline together with code and metadata. Each iteration performs selection, expansion, simulation, and backpropagation: a candidate node is selected; a fine-grained operator expands it into a new candidate; the new code is executed to obtain a task metric and validity signal; and rewards are propagated along the primary search path (Du et al., 9 Oct 2025).
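The four phases described above can be sketched as a minimal loop. This is an illustrative sketch under stated assumptions, not the authors' implementation: `Node`, `select`, `run_search`, `expand_fn`, and `evaluate_fn` are hypothetical names, selection is simplified to a greedy descent by mean reward, and `evaluate_fn` stands in for actually executing a pipeline and reading back its metric.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    """One candidate pipeline: its code plus search statistics."""
    code: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    total_reward: float = 0.0

    def mean_reward(self) -> float:
        return self.total_reward / self.visits if self.visits else 0.0


def select(root: Node, max_children: int = 2) -> Node:
    """Descend while the current node is fully expanded (simplified rule)."""
    node = root
    while len(node.children) >= max_children:
        node = max(node.children, key=Node.mean_reward)
    return node


def run_search(root: Node,
               expand_fn: Callable[[Node], str],
               evaluate_fn: Callable[[str], float],
               iterations: int) -> Node:
    """Run selection, expansion, simulation, and backpropagation."""
    best, best_score = root, float("-inf")
    for _ in range(iterations):
        parent = select(root)                        # selection
        child = Node(code=expand_fn(parent), parent=parent)
        parent.children.append(child)                # expansion
        reward = evaluate_fn(child.code)             # simulation (stand-in for a real run)
        if reward > best_score:
            best, best_score = child, reward
        node: Optional[Node] = child
        while node is not None:                      # backpropagation to the root
            node.visits += 1
            node.total_reward += reward
            node = node.parent
    return best
```

In a real agent, `expand_fn` would call the LLM with an operator-specific prompt and `evaluate_fn` would run the generated code in the sandbox; both are left as plain callables here so the control flow stays visible.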

The search space is formalized over complete ML pipelines. States are full solutions, not isolated parameters. Actions are operator applications such as drafting, debugging, feature enhancement, competition-strategy modification, fusion, and ensembling. This makes AutoMLGen broader than classical configuration-only AutoML systems: it searches over code-level design decisions spanning preprocessing, model choice, training, inference, and submission generation rather than only over model families and hyperparameters.

The implementation used DeepSeek-R1-0528 as the backbone model. The execution environment runs candidate code locally, performs validation through Kaggle-style evaluation scripts, records metrics and errors, and maintains branch-level and global memories for later reuse and ensembling. The resulting architecture is neither a monolithic planner nor a pure tree searcher; it is a graph-augmented search-and-execution system whose state is grounded in actual code runs rather than symbolic plans alone (Du et al., 9 Oct 2025).

3. Domain knowledge base and fine-grained operator design

A central feature of AutoMLGen is its domain knowledge base, which is organized along three axes: model-level knowledge, data-level knowledge, and strategy-level knowledge. Model-level knowledge maps task and domain types to suitable model families and backbones, with usage guidelines. Data-level knowledge covers modality-specific preprocessing and feature-engineering principles. Strategy-level knowledge encodes “Kaggle-style” recipes such as test-time augmentation, ensembling strategies, pseudo-labeling, and cross-validation heuristics. The sources are described as open-source repositories, competition discussions, and manually curated materials (Du et al., 9 Oct 2025).

The knowledge base is retrieved as $R_{KB}(T)$ and injected into prompting both at initialization and during later refinement stages. This is not an incidental convenience layer. It is intended to mitigate cold-start failures, improve refinement quality, and stabilize convergence by supplying the LLM with domain priors that ordinary code-generation agents lack. In the reported ablation, a baseline MCTS+LLM agent without the knowledge base achieved a medal rate of 40.91% on MLE-Bench-Lite, whereas adding the knowledge base increased medal rate to 50.00% (Du et al., 9 Oct 2025).

Search behavior is mediated by a set of fine-grained operators rather than by coarse whole-program rewrites. The operator set includes Draft, Debug, Improve-Normal, Improve-FE, Improve-CS, Fusion, Code Review, and Ensemble. Draft creates initial solutions or new branches. Debug repairs non-executable code with minimal localized edits. Improve-Normal adjusts standard performance-related settings such as learning rate or batch size. Improve-FE targets features and preprocessing. Improve-CS injects competition strategies such as pseudo-labeling or improved cross-validation. Fusion merges information from multiple nodes. Code Review checks for issues such as data leakage or metric-task mismatch. Ensemble combines top-$K$ solutions near the end of search (Du et al., 9 Oct 2025).
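One way to picture the operator set is as an enumeration with a small dispatch rule. The operator names below come from the description above; the `choose_operator` function, its arguments, and the 5% budget threshold are hypothetical and only illustrate the kind of rule that could route a node to an operator. The source does not specify the actual scheduling policy.

```python
from enum import Enum


class Operator(Enum):
    DRAFT = "Draft"
    DEBUG = "Debug"
    IMPROVE_NORMAL = "Improve-Normal"
    IMPROVE_FE = "Improve-FE"
    IMPROVE_CS = "Improve-CS"
    FUSION = "Fusion"
    CODE_REVIEW = "Code Review"
    ENSEMBLE = "Ensemble"


def choose_operator(has_solution: bool, runs_ok: bool, budget_left: float) -> Operator:
    """Hypothetical dispatch rule; the threshold is illustrative only.

    budget_left is the remaining fraction of the search budget, in [0, 1].
    """
    if not has_solution:
        return Operator.DRAFT            # nothing to refine yet
    if not runs_ok:
        return Operator.DEBUG            # repair non-executable code first
    if budget_left < 0.05:
        return Operator.ENSEMBLE         # combine top solutions near the end
    return Operator.IMPROVE_NORMAL       # default refinement step (assumption)
```

A fuller rule would also decide among Improve-FE, Improve-CS, Fusion, and Code Review; this sketch collapses those choices into a single default.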

This operatorization is important because it decomposes MLE into semantically distinct edit types. A plausible implication is that AutoMLGen gains not only from stronger priors, but from a better factorization of the action space than generic code agents employ. The literature on adjacent systems reaches similar conclusions in other forms: LightAutoDS-Tab uses specialized planner, generator, validator, improver, and AutoML-router agents for tabular tasks, while NNGPT separates architecture synthesis, HPO, prediction, retrieval, and RL into distinct pipelines (Lapin et al., 17 Jul 2025; Kochnev et al., 25 Nov 2025).

4. Monte Carlo Graph Search

AutoMLGen’s search algorithm is Monte Carlo Graph Search, a graph-augmented variant of MCTS. The search structure is a directed graph $G = (V, E)$ with $E = E_T \cup E_{\text{ref}}$ . Nodes are candidate solutions. Primary edges $E_T$ represent generative parent-child relations and form the tree backbone used for selection and backpropagation. Reference edges $E_{\text{ref}}$ connect nodes across branches and levels for information flow and reuse, but do not participate in credit assignment (Du et al., 9 Oct 2025).
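The two edge types can be made concrete with a small data structure. This is a sketch under assumptions, not the paper's code: `SearchGraph` and its method names are hypothetical, primary edges are stored as a child-to-parent map, and the direction chosen for reference edges (from the node being consulted to the node generated with it in context) is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class SearchGraph:
    """Directed graph G = (V, E) with E = E_T union E_ref."""
    nodes: Dict[int, str] = field(default_factory=dict)             # V: id -> pipeline code
    parent: Dict[int, Optional[int]] = field(default_factory=dict)  # E_T: child -> parent
    ref_edges: Set[Tuple[int, int]] = field(default_factory=set)    # E_ref: (source, target)

    def add_node(self, code: str, parent: Optional[int] = None) -> int:
        """Add a candidate solution, attached to its generative parent if any."""
        node_id = len(self.nodes)
        self.nodes[node_id] = code
        self.parent[node_id] = parent
        return node_id

    def add_reference(self, source: int, target: int) -> None:
        """Record that `target` was generated with `source` visible as context."""
        self.ref_edges.add((source, target))

    def primary_path(self, node_id: int) -> List[int]:
        """Path from a node up to the root, following primary edges only."""
        path: List[int] = []
        current: Optional[int] = node_id
        while current is not None:
            path.append(current)
            current = self.parent[current]
        return path

    def references_into(self, node_id: int) -> List[int]:
        """Nodes that fed information into `node_id` via reference edges."""
        return sorted(s for s, t in self.ref_edges if t == node_id)


if __name__ == "__main__":
    g = SearchGraph()
    root = g.add_node("draft")
    a = g.add_node("branch A", parent=root)
    b = g.add_node("branch B", parent=root)
    c = g.add_node("B refined", parent=b)
    g.add_reference(a, c)  # cross-branch information flow, no credit assignment
    print(g.primary_path(c), g.references_into(c))
```

Keeping the two edge sets in separate containers makes it straightforward to run selection and backpropagation over the tree backbone while consulting reference edges only when building prompts.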

Selection remains tree-based. It traverses only primary edges by UCT, using node visit counts, accumulated rewards, a smoothing term, and an exploration constant set to 1.414. The novelty lies in expansion. AutoMLGen implements four expansion modes: primary expansion, intra-branch evolution, cross-branch reference, and multi-branch aggregation. Primary expansion behaves like standard MCTS. Intra-branch evolution lets the operator inspect recent nodes in the same branch, enabling reflection over local trajectory history. Cross-branch reference exposes the current branch to globally strong nodes from elsewhere in the search graph. Multi-branch aggregation creates a new branch from a union of top trajectories across branches, enabling explicit multi-solution fusion (Du et al., 9 Oct 2025).
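For reference, the textbook UCT rule with the exploration constant 1.414 can be written as follows. This is the standard formula, not necessarily the exact variant used in the paper: the smoothing term mentioned above is omitted because its form is not given here, and treating unvisited children as having infinite score is an assumption. `ChildStats`, `uct_score`, and `select_child` are hypothetical names.

```python
import math
from typing import List, NamedTuple


class ChildStats(NamedTuple):
    name: str
    total_reward: float
    visits: int


def uct_score(total_reward: float, visits: int, parent_visits: int,
              c: float = 1.414) -> float:
    """Standard UCT: mean reward plus an exploration bonus."""
    if visits == 0:
        return math.inf  # assumption: unvisited children are tried first
    exploitation = total_reward / visits
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return exploitation + exploration


def select_child(children: List[ChildStats], parent_visits: int) -> ChildStats:
    """Pick the child reached by a primary edge with the highest UCT score."""
    return max(
        children,
        key=lambda ch: uct_score(ch.total_reward, ch.visits, parent_visits),
    )
```

With equal mean rewards, the less-visited child receives the larger exploration bonus and is selected, which is the behavior the exploration constant is meant to control.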

Backpropagation is intentionally restricted to primary edges. This preserves the stability and interpretability of MCTS-style credit assignment while allowing the graph structure to influence generation through references. The result is a hybrid between tree-guided exploration and graph-mediated knowledge sharing. The design is explicitly motivated by limitations of linear or tree-structured search, where information transfer is local and strong solutions discovered in one branch remain isolated from others (Du et al., 9 Oct 2025).
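The restriction to primary edges can be illustrated with a short sketch. The names `backpropagate`, `parent`, and `reference_edges` are hypothetical, and the dictionaries stand in for whatever node store a real implementation would use.

```python
from typing import Dict, Optional, Set, Tuple


def backpropagate(leaf: str,
                  reward: float,
                  parent: Dict[str, Optional[str]],
                  visits: Dict[str, int],
                  total_reward: Dict[str, float]) -> None:
    """Propagate a reward from `leaf` to the root along primary edges only."""
    node: Optional[str] = leaf
    while node is not None:
        visits[node] = visits.get(node, 0) + 1
        total_reward[node] = total_reward.get(node, 0.0) + reward
        node = parent[node]


if __name__ == "__main__":
    # Primary edges as child -> parent; "b1" was generated with "a" as a reference.
    parent = {"root": None, "a": "root", "b": "root", "b1": "b"}
    reference_edges: Set[Tuple[str, str]] = {("a", "b1")}  # never consulted for credit
    visits: Dict[str, int] = {}
    total_reward: Dict[str, float] = {}
    backpropagate("b1", 0.8, parent, visits, total_reward)
    print(visits)  # "a" receives no credit although it informed "b1"
```

Because `reference_edges` is never read during the update, a node that only contributed context keeps its own statistics unchanged, which is what keeps the credit assignment tree-shaped.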

The contrast with earlier AutoML search paradigms is sharp. AlphaD3M casts pipeline synthesis as a single-player game with edit operations over pipeline primitives and uses MCTS guided by a learned policy-value model, but its search remains tree-structured over pipeline edits (Drori et al., 2021). GAMA exposes pluggable random search, ASHA, and asynchronous evolutionary search over scikit-learn pipelines, emphasizing transparency and modularity rather than cross-branch recombination (Gijsbers et al., 2020). AutoMLGen extends the search space to code-level MLE decisions and extends the search structure from trees to graphs.

5. Benchmarking and empirical performance

AutoMLGen was evaluated on MLE-Bench, described as a benchmark of 75 Kaggle competitions spanning NLP, vision, signal processing, and tabular tasks, with 22 low-complexity, 38 medium-complexity, and 15 high-complexity tasks. The reported environment used 32 Intel Xeon vCPUs, 230 GB RAM, 1× NVIDIA A800 GPU, and a 12-hour budget per task, with results averaged over 3 seeds (Du et al., 9 Oct 2025).

On the full MLE-Bench under that 12-hour budget, AutoMLGen achieved an average medal rate of 36.4 ± 1.2%, with 62.1 ± 3.0% on low-complexity tasks, 26.3 ± 2.6% on medium, and 24.4 ± 2.2% on high. It also achieved a valid submission rate of 96.4 ± 0.4%, a Median+ rate of 48.4 ± 1.2%, and a gold medal rate of 18.7 ± 0.8%. The reported comparison places it above the second-best average medal rate, Neo at 34.2% with a 36h runtime, and above ML-Master at 29.3% under the same 12h budget (Du et al., 9 Oct 2025).

On MLE-Bench-Lite, AutoMLGen reached a 62.1 ± 3.0% medal rate, ahead of reported baselines including MLZero (36.4%), MLE-Star (43.9 ± 6.2%), AIRA-dojo (47.7%), and KompeteAI (51.5 ± 1.5%) (Du et al., 9 Oct 2025). For high-complexity tasks, the paper notes that AutoMLGen's 24.4% medal rate is similar to those of ML-Master and Neo, but AutoMLGen shows a higher win rate in average scores, suggesting stronger average solutions across that subset (Du et al., 9 Oct 2025).

The ablation study isolates the contributions of the knowledge base and the graph search design. On MLE-Bench-Lite with a single seed, the baseline MCTS+LLM agent achieved 40.91% medal rate, 68.18% Median+, and 65.33% beat ratio. Adding the knowledge base raised these to 50.00%, 77.27%, and 68.59%. Adding intra-branch MCGS raised them further to 59.09%, 81.82%, and 73.20%. Full AutoMLGen with knowledge base and full MCGS reached 68.12% medal rate, 86.36% Median+, and 78.33% beat ratio (Du et al., 9 Oct 2025).
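The stepwise contributions in the ablation can be read off with simple arithmetic. The sketch below only restates the figures quoted above and subtracts consecutive rows; `ABLATION` and `stepwise_gains` are hypothetical names introduced for illustration.

```python
# Medal rate, Median+, and beat ratio (all in %) as quoted in the ablation above.
ABLATION = [
    ("MCTS+LLM baseline", 40.91, 68.18, 65.33),
    ("+ knowledge base", 50.00, 77.27, 68.59),
    ("+ intra-branch MCGS", 59.09, 81.82, 73.20),
    ("full AutoMLGen", 68.12, 86.36, 78.33),
]


def stepwise_gains(rows, column):
    """Percentage-point change in one metric column between consecutive rows."""
    return [
        (rows[i][0], round(rows[i][column] - rows[i - 1][column], 2))
        for i in range(1, len(rows))
    ]


if __name__ == "__main__":
    # column=1 is medal rate, column=2 is Median+, column=3 is beat ratio
    for name, gain in stepwise_gains(ABLATION, column=1):
        print(f"{name}: {gain:+.2f} points medal rate")
```

On the quoted figures, each added component contributes roughly nine percentage points of medal rate, so no single stage accounts for the overall improvement on its own.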

These results matter because they attribute AutoMLGen’s gains to structural components rather than to the base LLM alone. A common misconception is that the system’s performance is reducible to a stronger model backend. The ablation record indicates that curated priors, trajectory reuse, cross-branch reference, and multi-branch aggregation are all substantial contributors.

AutoMLGen sits at the intersection of several research trajectories. One line emphasizes transparent, modular AutoML frameworks such as GAMA, which exposes pluggable search algorithms and post-processing for scikit-learn pipelines (Gijsbers et al., 2020). A second line treats pipeline synthesis as structured search, exemplified by AlphaD3M, which uses self-play, sequence models, and MCTS over pipeline edit operations (Drori et al., 2021), and by grammar-constrained variants that prune the search space for valid pipelines (Drori et al., 2019). A third line combines LLMs with existing AutoML tools, as in LightAutoDS-Tab, which routes among LLM-generated code, LightAutoML, and FEDOT for tabular tasks (Lapin et al., 17 Jul 2025). A fourth line pushes toward autonomous ML-engineering agents, such as KompeteAI, which adds a merging stage, RAG, and a predictive scoring model that accelerates pipeline evaluation 6.9 times (Kulibaba et al., 13 Aug 2025), and NNGPT, which uses a closed loop of generation, assessment, and self-improvement for neural-network development (Kochnev et al., 25 Nov 2025).

AutoMLGen’s distinctive contribution within that landscape is its explicit use of graph-structured search for code-level MLE optimization. It does not merely search over pipeline templates, nor does it only orchestrate external AutoML frameworks. Instead, it treats candidate solutions as nodes in a graph, permits cross-branch knowledge flow through reference edges, and combines those search mechanics with a domain knowledge base and operator-specialized prompting (Du et al., 9 Oct 2025).

The strongest conceptual caution concerns its name. In the broader AutoML literature, generalized AutoML denotes the final goal of fully automated ML without expert involvement or expert knowledge, and that goal is tied to unresolved AGI-scale obstacles (Liu, 2018). The 2025 system called AutoMLGen is not that. It relies on a curated knowledge base, task-specific operator design, MLE-Bench-style local evaluation, and a fixed 12-hour wall-clock search. This suggests that the term functions more as a productively ambitious label than as a literal realization of generalized AutoML.

Its reported limitations are correspondingly concrete. Performance depends on foundation-model quality. The system incurs substantial compute and time cost, requiring up to 12h per task with GPU-backed execution. Its knowledge base is tuned to Kaggle-like ML engineering, and the paper notes possible brittleness on very novel domains or on tasks insufficiently covered by the curated priors. Future directions include extending beyond MLE-Bench, introducing more explicit multi-step and decomposed code generation, and broadening the domain scope beyond the current competition-centric setting (Du et al., 9 Oct 2025).