Reinforcement Learning for Machine Learning Engineering Agents (2509.01684v1)

Published 1 Sep 2025 in cs.LG and cs.AI

Abstract: Existing agents for solving tasks such as ML engineering rely on prompting powerful LLMs. As a result, these agents do not improve with more experience. In this paper, we show that agents backed by weaker models that improve via reinforcement learning (RL) can outperform agents backed by much larger, but static models. We identify two major challenges with RL in this setting. First, actions can take a variable amount of time (e.g., executing code for different solutions), which leads to asynchronous policy gradient updates that favor faster but suboptimal solutions. To tackle variable-duration actions, we propose duration-aware gradient updates in a distributed asynchronous RL framework to amplify high-cost but high-reward actions. Second, using only test split performance as a reward provides limited feedback. A program that is nearly correct is treated the same as one that fails entirely. To address this, we propose environment instrumentation to offer partial credit, distinguishing almost-correct programs from those that fail early (e.g., during data loading). Environment instrumentation uses a separate static LLM to insert print statements into an existing program to log the agent's experimental progress, from which partial credit can be extracted as reward signals for learning. Our experimental results on MLEBench suggest that performing gradient updates on a much smaller model (Qwen2.5-3B) trained with RL outperforms prompting a much larger model (Claude-3.5-Sonnet) with agent scaffolds, by an average of 22% across 12 Kaggle tasks.

Summary

  • The paper introduces reinforcement learning methods for MLE agents that address challenges like variable-duration actions and sparse reward feedback.
  • It details duration-aware gradient updates, which prevent bias toward fast but suboptimal solutions, and environment instrumentation, which offers granular reward signals for faster, more stable convergence.
  • Experimental results on MLEBench show that RL-trained small models outperform larger prompted models by 22% across diverse Kaggle tasks.

Reinforcement Learning for Machine Learning Engineering Agents

Introduction

The paper "Reinforcement Learning for Machine Learning Engineering Agents" (2509.01684) presents a novel application of reinforcement learning (RL) to machine learning engineering (MLE) agents. Traditional approaches rely on prompting LLMs to solve MLE tasks, which does not allow for improvement through experience. This paper demonstrates that agents with smaller models that learn through RL can outperform larger static models in MLE tasks by addressing specific challenges with RL in this context.

Problem Definition and Approach

Challenges in RL for MLE Agents

The paper identifies two primary challenges in applying RL to MLE agents:

  1. Variable-Duration Actions: Actions in MLE can have variable execution times (e.g., executing code for different solutions), leading to asynchronous policy gradient updates that favor faster, potentially suboptimal, solutions (a toy illustration follows this list).
  2. Sparse Reward Feedback: Relying solely on test split performance as a reward provides limited feedback, treating nearly correct programs the same as those that fail entirely.
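
To make the first challenge concrete, the toy calculation below (not from the paper; all numbers are made up for illustration) compares how much gradient mass a fast, mediocre solution and a slow, strong solution accumulate within a fixed wall-clock budget, with and without duration weighting:

```python
# Toy calculation (not from the paper): why asynchronous updates favor fast actions.
wall_clock_budget = 600.0  # seconds of training time

# (execution duration in seconds, reward) for two candidate solutions
fast_action = (10.0, 0.4)   # quick but mediocre program
slow_action = (120.0, 0.9)  # slow but much better program

def gradient_mass(duration, reward, budget, duration_aware):
    """Total reward-weighted gradient contribution accumulated within the budget."""
    n_completions = budget / duration              # how often the action finishes
    weight = duration if duration_aware else 1.0   # duration-aware reweighting
    return n_completions * weight * reward

for aware in (False, True):
    fast = gradient_mass(*fast_action, wall_clock_budget, aware)
    slow = gradient_mass(*slow_action, wall_clock_budget, aware)
    print(f"duration_aware={aware}: fast={fast:.1f}, slow={slow:.1f}")

# duration_aware=False: fast=24.0,  slow=4.5   -> the mediocre fast action dominates
# duration_aware=True:  fast=240.0, slow=540.0 -> the better, slower action dominates
```

With duration weighting, each action's accumulated contribution reduces to budget times reward, which cancels the sampling-frequency advantage of fast actions.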

Proposed Solutions

To tackle these challenges, the paper introduces:

  • Duration-Aware Gradient Updates: Policy gradient updates are reweighted by the execution duration of each action, preventing bias toward faster but lower-quality solutions.
  • Environment Instrumentation: A separate static LM inserts progress-tracking print statements into the agent's code; the resulting execution logs are parsed to award partial credit, distinguishing nearly correct programs from those that fail early (Figure 1).

    Figure 1: Proposed framework overview with duration-aware gradient updates and environment instrumentation.

Framework and Implementation

Duration-Aware Gradient Updates

The paper proposes a duration-aware policy gradient update rule:

$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\pi_\theta,\, \rho_0,\, \mathcal{T}}\left[\sum_{k=0}^{K} \Delta t_k \cdot \nabla_\theta \log \pi_\theta(a_k \mid s_k) \cdot \hat{A}(s_k, a_k)\right]$

where $\Delta t_k$ is the wall-clock execution duration of action $a_k$, $\hat{A}(s_k, a_k)$ is the estimated advantage, and the expectation is taken over the policy $\pi_\theta$, the initial-state distribution $\rho_0$, and the transition dynamics $\mathcal{T}$. Scaling each term by $\Delta t_k$ ensures that long-running but high-reward actions receive fair weight in policy updates.
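
A minimal sketch of how this weighting could be applied in a REINFORCE-style loss is shown below (PyTorch; the function name, tensor shapes, and dummy values are illustrative assumptions, not the authors' implementation):

```python
import torch

def duration_aware_pg_loss(logprobs, advantages, durations):
    """Duration-aware policy-gradient loss (illustrative sketch, not the paper's code).

    logprobs:   log pi_theta(a_k | s_k) for each completed action, shape (K,)
    advantages: estimated advantages A_hat(s_k, a_k), shape (K,)
    durations:  wall-clock execution times Delta t_k in seconds, shape (K,)
    """
    # Weight each term by its execution duration so that slow, high-reward
    # actions are not drowned out by frequently completed fast actions.
    weighted = durations * logprobs * advantages
    # Negative sign: minimizing this loss performs gradient ascent on J(pi_theta).
    return -weighted.mean()

# Illustrative usage with dummy values
logprobs = torch.tensor([-1.2, -0.8, -2.0], requires_grad=True)
advantages = torch.tensor([0.5, -0.1, 1.3])
durations = torch.tensor([12.0, 3.0, 95.0])
loss = duration_aware_pg_loss(logprobs, advantages, durations)
loss.backward()
```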

Environment Instrumentation

Environment instrumentation uses a static LM to insert print statements that help track execution progress and assign partial credit according to whether high-level processes, like data loading and model training, are completed (Figure 2).

Figure 2: Environment instrumentation process overview, modifying agent-generated code for progress tracking.
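
As an illustration of how partial credit could be extracted from such logs, the sketch below parses milestone markers printed by an instrumented program. The marker format, milestone names, and credit values are assumptions made for illustration, and the LM step that inserts the print statements is omitted:

```python
import re

# Partial-credit schedule for milestone markers printed by the instrumented
# program. Milestone names and credit values are illustrative assumptions.
MILESTONE_CREDIT = {
    "data_loaded": 0.2,
    "model_trained": 0.3,
    "submission_written": 0.5,
}

def partial_credit(execution_log: str, test_score: float | None) -> float:
    """Reward = task metric when available, otherwise credit for completed milestones."""
    if test_score is not None:  # the program ran to completion and was scored
        return test_score
    hit = set(re.findall(r"MILESTONE:(\w+)", execution_log))
    return sum(v for name, v in MILESTONE_CREDIT.items() if name in hit)

# Example: the program crashed after training, before writing a submission.
log = "MILESTONE:data_loaded\nMILESTONE:model_trained\nTraceback (most recent call last): ..."
print(partial_credit(log, test_score=None))  # -> 0.5
```

This kind of graded signal distinguishes an almost-correct program (which earns most of the credit) from one that fails during data loading (which earns little or none).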

Experimental Results

Performance Comparison

Experimental results on MLEBench show that an RL-trained smaller model (Qwen2.5-3B) outperforms a much larger static model (Claude-3.5-Sonnet prompted with agent scaffolds) by an average of 22% across 12 Kaggle tasks. RL training allows the smaller model to surpass the larger prompted model under the same evaluation conditions (Figure 3).

Figure 3: Performance comparison showing RL-trained small models outperforming prompted large models over time.

Ablation Studies

  • Duration-Aware Gradient: The paper shows that without duration-aware gradients, agents tend to converge toward faster, less optimal solutions.
  • Environment Instrumentation: The use of environment instrumentation leads to faster and more stable convergence during RL training by providing additional feedback (Figure 4).

    Figure 4: Environment instrumentation ablation results showing improvement in task scores and convergence speed.

Conclusion

The paper demonstrates that RL can significantly enhance the performance of MLE agents backed by smaller models by accounting for execution-time variability and providing more detailed reward feedback. Future work could explore scaling RL to larger models and multi-task learning.

By leveraging RL, this approach offers a path toward MLE agents that improve with experience, and it suggests broader implications for integrating RL with LLMs in complex task environments.

Overall, this paper advances agent-based systems for MLE and highlights two practical strategies, duration-aware gradient updates and environment instrumentation, for making RL training of such agents efficient and effective.
