ML-Master 2.0: Autonomous Agentic MLE
- ML-Master 2.0 is an autonomous agent architecture that achieves ultra-long-horizon autonomy in machine learning engineering by structuring memory into layered caches.
- It employs Hierarchical Cognitive Caching, dividing context into working, mid-term, and long-term memories to maintain high-fidelity execution and strategic insight.
- The system addresses limitations of static LLM context windows, achieving significant performance improvements in real-world MLE tasks, as evidenced by rigorous benchmarking.
ML-Master 2.0 is an autonomous agent architecture designed to achieve ultra-long-horizon autonomy in machine learning engineering (MLE), representing a critical advance in agentic science. It addresses fundamental limitations of LLM-based agents in high-dimensional, delayed-feedback environments, introducing Hierarchical Cognitive Caching (HCC) to enable scalable cognitive accumulation, strategic coherence, and state-of-the-art performance on real-world MLE tasks. ML-Master 2.0 demonstrates that structuring agentic memory beyond static context windows unlocks new regimes of autonomous scientific discovery and machine learning engineering (Zhu et al., 15 Jan 2026).
1. Ultra-Long-Horizon Autonomy in Machine Learning Engineering
Ultra-long-horizon autonomy refers to an agent's ability to maintain strategic coherence and iterative correction over experimental cycles unfolding across days or weeks. In scientific discovery and MLE, the problem is formulated as a sequence of events $(x_0, x_1, x_2, \dots)$, where even-indexed elements $x_{2k}$ are environment events (e.g., task specification, feedback) and odd-indexed elements $x_{2k+1}$ are agent actions (e.g., code patches, plans). Agents receive feedback sparsely and with significant delay, and the event history grows rapidly as the research iterates through debugging, exploration, and parallel runs.
Sliding-window context construction (e.g., “last 8K tokens”) suffers from catastrophic limitations in this regime: strategic insights are evicted, feedback cannot be consolidated, and token limits are quickly exceeded. Existing LLM-based agents typically rely on one-shot planning or flat memory buffers, which cannot simultaneously preserve the high-fidelity execution traces crucial for debugging and the strategic summaries essential for long-range planning. They struggle to transfer knowledge across tasks or to maintain mid-term insight under static context constraints (Zhu et al., 15 Jan 2026).
2. Hierarchical Cognitive Caching (HCC) Architecture
ML-Master 2.0 treats context management as a multi-tiered memory hierarchy inspired by CPU caching, structurally differentiating agent experience by temporal scale. HCC comprises three explicit layers:
| Layer | Temporal Focus | Contents/Functionality |
|---|---|---|
| Working Memory | Current phase (minutes to hours) | Raw event traces, code edits, logs, and plans for the current phase |
| Mid-Term Memory | Completed phases (hours) | Phase-level summaries generated by promotion |
| Long-Term Memory | Cross-task (days to weeks) | Embeddings of task descriptors paired with distilled wisdom texts |
Working Memory: Holds complete traces for the current experimental phase, preserving full action and output fidelity for immediate debugging needs.
Mid-Term Memory: Stores compacted textual summaries of completed phases, each distilled from the raw trajectories by an LLM-based summarization operator. After phase promotion, the raw traces are evicted from working memory.
Long-Term Memory: Serves as a persistent, cross-task key–value store in which past task descriptors are embedded and paired with distilled wisdom texts. This memory enables similarity-based retrieval to "warm-start" new tasks based on historical context.
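The similarity-based warm-start can be sketched with a toy bag-of-words embedder standing in for a learned embedding model; the function names and retrieval policy below are illustrative assumptions, not the paper's API.

```python
import math

def embed(text, vocab):
    """Toy bag-of-words embedder (a stand-in for a real embedding model)."""
    toks = text.lower().split()
    v = [float(toks.count(w)) for w in vocab]
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def cosine(a, b):
    """Cosine similarity of two unit-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))

def warm_start(long_term, new_task):
    """Return the wisdom text whose task descriptor best matches the new task."""
    vocab = sorted({w for d in list(long_term) + [new_task] for w in d.lower().split()})
    q = embed(new_task, vocab)
    best = max(long_term, key=lambda d: cosine(embed(d, vocab), q), default=None)
    return long_term[best] if best is not None else None

# Hypothetical long-term store: task descriptor -> distilled wisdom.
memory = {
    "predict house prices from tabular features": "start with gradient boosting",
    "classify dog breeds from photos": "fine-tune a pretrained vision model",
}
hint = warm_start(memory, "predict insurance claims from tabular features")
# hint == "start with gradient boosting"
```

Here the tabular query retrieves the tabular wisdom, illustrating how a new task is warm-started from its nearest historical descriptor.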
A policy framework governs the context constructor, context promotion (phase- and task-level), and context hit decision across tiers. Explicit pseudocode for tier initialization, context prefetch, research phase cycling, trace summarization, and cross-task memory augmentation is provided (Zhu et al., 15 Jan 2026).
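The tier structure and promotion policies can be sketched as follows, with simple lambdas standing in for the LLM-based summarization and distillation steps; all class and method names are illustrative, not the paper's pseudocode.

```python
from dataclasses import dataclass, field

@dataclass
class HCC:
    """Illustrative three-tier Hierarchical Cognitive Cache."""
    working: list = field(default_factory=list)    # raw traces for the current phase
    mid_term: list = field(default_factory=list)   # summaries of completed phases
    long_term: dict = field(default_factory=dict)  # task descriptor -> wisdom text

    def record(self, event):
        """Append a raw event (code edit, log line, plan) to working memory."""
        self.working.append(event)

    def promote_phase(self, summarize):
        """Phase-level promotion: distill raw traces into a summary stored in
        mid-term memory, then evict the raw traces from working memory."""
        if self.working:
            self.mid_term.append(summarize(self.working))
            self.working.clear()

    def promote_task(self, task_descriptor, distill):
        """Task-level promotion: distill phase summaries into cross-task
        wisdom keyed by the task descriptor."""
        if self.mid_term:
            self.long_term[task_descriptor] = distill(self.mid_term)

# Stub summarizers standing in for LLM calls.
hcc = HCC()
hcc.record("patch model.py: add dropout")
hcc.record("run 1: val_loss=0.42")
hcc.promote_phase(summarize=lambda traces: f"{len(traces)} events; best val_loss=0.42")
hcc.promote_task("tabular-playground", distill=lambda s: " | ".join(s))
```

After the phase promotion, working memory is empty while the summary survives in the mid-term tier, mirroring the eviction policy described above.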
3. Cognitive Accumulation: Temporal Dynamics and Optimization
HCC is engineered to accumulate cognitive artifacts at three distinct time scales:
- Execution scale (minutes to hours): working memory supplies raw traces for fast-turnaround debugging and fixing.
- Phase scale (hours): mid-term memory gradually accrues stable insights, ensuring strategic and mid-horizon consistency.
- Multi-task scale (days to weeks): long-term memory aggregates cross-task wisdom, allowing the agent to leverage distilled experience and accelerate future tasks.
HCC implicitly optimizes agent performance under a hard context budget, $\max_{\pi} J(\pi)$ subject to $|c_t| \le B$, where $J$ is the performance functional, $c_t$ is the context constructed at step $t$, and $B$ is an explicit bound (targeting 70 K tokens, in contrast to unbounded sliding-window growth). This regime ensures that context overflow does not degrade agent performance, and that relevant strategic and actionable knowledge is retained at each phase (Zhu et al., 15 Jan 2026).
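One way to picture the bounded-context regime is a greedy constructor that packs tier contents under a hard token budget; the 70 K figure is the paper's target, while the packing order (working first, then mid-term, then wisdom) and the whitespace tokenizer are assumptions for illustration.

```python
def build_context(working, mid_term, wisdom, budget=70_000,
                  n_tokens=lambda s: len(s.split())):
    """Greedily pack tier contents into a hard token budget.
    Packing order and tokenizer are illustrative assumptions."""
    context, used = [], 0
    for item in list(working) + list(mid_term) + list(wisdom):
        cost = n_tokens(item)
        if used + cost > budget:
            break  # enforce the explicit context bound
        context.append(item)
        used += cost
    return context, used

# A tiny budget makes the bound visible: only the working traces fit.
ctx, used = build_context(
    working=["trace: run 1 failed", "trace: run 2 ok"],
    mid_term=["phase 1: tuned learning rate"],
    wisdom=["prefer gradient boosting on tabular data"],
    budget=8,
)
```

Whatever the real packing policy, the invariant is the same: the constructed context never exceeds the bound, so overflow cannot silently evict strategic content.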
4. Decoupling Execution and Strategy: Operational Flow
ML-Master 2.0's architecture enables a dual operational loop:
- Immediate Execution Loop (working-memory focus): The agent processes code patches, executes runs, harvests logs, and performs rapid iterative debugging. Recovery from errors and immediate adaptation are achieved via high-fidelity traces.
- Strategic Planning Loop (mid-term-memory focus): At phase boundaries, the agent abstracts results via summarization, revises global plans, explores new algorithmic and featurization directions, and relaunches targeted phases. Cross-task long-term memory is dynamically consulted to inject previously acquired high-level insight.
This design decouples short-term reactivity from long-term guidance, overcoming the brittleness observed in agents with monolithic or static context memory (Zhu et al., 15 Jan 2026).
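The decoupling can be sketched as two nested loops in which only the inner loop ever sees raw traces, while planning sees only summaries; the error-repair heuristic and all names below are illustrative assumptions.

```python
def dual_loop(phases, execute, summarize, revise_plan):
    """Inner loop: execute actions against full raw traces (local debugging).
    Outer loop: summarize at phase boundaries and revise the global plan,
    which never sees raw traces, only the compacted summaries."""
    summaries, plan = [], None
    for phase in phases:
        traces = []
        for action in phase:                         # immediate execution loop
            result = execute(action)
            traces.append(result)
            if "error" in result:                    # rapid local repair from raw traces
                traces.append(execute(f"fix: {action}"))
        summaries.append(summarize(traces))          # promotion at the phase boundary
        plan = revise_plan(summaries)                # strategic planning loop
    return plan, summaries

# Stubbed environment: one failing action triggers an in-phase repair.
run = lambda a: ("error in " + a) if a == "bad" else ("ok " + a)
plan, summaries = dual_loop(
    phases=[["train", "bad"], ["tune"]],
    execute=run,
    summarize=lambda t: f"{len(t)} events",
    revise_plan=lambda s: f"plan v{len(s)}",
)
```

Note that the repair happens entirely inside the phase: the planner only ever observes that the first phase produced three events, not the raw failure log.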
5. Experimental Protocol and Benchmarking
ML-Master 2.0 was evaluated on MLE-Bench, consisting of 75 real-world Kaggle competitions across tabular, vision, and NLP modalities, with each agent constrained to 24 hours of execution per task. The principal evaluation metric is "medal rate": the proportion of tasks in which performance met or exceeded the Kaggle bronze-medal threshold, stratified by task-complexity tier.
| Model/System | Overall Medal Rate | Low Complexity | Medium | High | Valid Sub. Rate | Above-Median Rate |
|---|---|---|---|---|---|---|
| ML-Master 2.0 | 56.44% | 75.8% | 50.9% | 42.2% | 95.6% | 63.1% |
| Prior ML-Master | 29.3% | — | — | — | — | — |
| Leeroo* (closed) | 50.7% | — | — | — | — | — |
Ablation on MLE-Bench-Lite demonstrated that all three layers contribute: removing working memory dropped performance to 22.7%, removing mid-term memory to 59.1%, and removing long-term memory to 54.5%, versus 72.7% with full HCC. The relative improvement over the previous ML-Master baseline is +92.7% (Zhu et al., 15 Jan 2026).
6. Key Insights, Limitations, and Future Work
Key findings are:
- Structural memory differentiation is essential; naive context window extension is insufficient for ultra-long-horizon autonomy.
- Migration policies underpin the preservation of both high-fidelity execution traces (crucial for local debugging) and mid/long-term strategic consistency.
- Cognitive accumulation via HCC yields marked improvements in environments characterized by sparse, delayed feedback.
Limitations include reliance on LLM-based summarization for phase- and task-level promotion, which governs the fidelity of the mid- and long-term stores. Managing three caches and multiple LLM passes incurs nontrivial computational cost. The current evaluation scope is confined to computational MLE; physical tasks with richer feedback modalities remain unexplored.
Future directions include learning to optimize cache promotions with reinforcement learning, dynamically adjusting cache size and summary granularity, integrating neural retrieval into the long-term store, and extending HCC to multi-agent and lab-automation environments (Zhu et al., 15 Jan 2026).
7. Relationship to the ML 2.0 Paradigm
The ML-Master 2.0 focus on automation and memory structuring for ultra-long-horizon MLE complements prior work on machine learning engineering paradigms such as "ML 2.0" (Kanter et al., 2018). ML 2.0 proposed a paradigm shift from discovery-dominated workflows ("ML 1.0") to rapid, goal-oriented, API-driven development of minimum viable data-driven models within an eight-week pipeline. Both works emphasize systematic abstraction: while ML 2.0 streamlines end-to-end engineering via reusable APIs for entitysets, labeling, feature synthesis, and deployment, ML-Master 2.0 optimizes the agentic process itself through hierarchical cognitive caching and memory migration, specifically to overcome scaling and strategic deficits of LLM-based agents in long-duration scientific workflows.
This suggests that as ML pipelines become increasingly autonomous and agentic, principles from both systematic engineering (ML 2.0) and cognitive memory architectures (ML-Master 2.0) will be central to next-generation AI systems capable of operating persistently, strategically, and at scale across complex, real-world domains.