R&D-Agent: Automating Data-Driven AI Solution Building Through LLM-Powered Automated Research, Development, and Evolution
(2505.14738v1)
Published 20 May 2025 in cs.AI
Abstract: Recent advances in AI and ML have transformed data science, yet increasing complexity and expertise requirements continue to hinder progress. While crowdsourcing platforms alleviate some challenges, high-level data science tasks remain labor-intensive and iterative. To overcome these limitations, we introduce R&D-Agent, a dual-agent framework for iterative exploration. The Researcher agent uses performance feedback to generate ideas, while the Developer agent refines code based on error feedback. By enabling multiple parallel exploration traces that merge and enhance one another, R&D-Agent narrows the gap between automated solutions and expert-level performance. Evaluated on MLE-Bench, R&D-Agent emerges as the top-performing machine learning engineering agent, demonstrating its potential to accelerate innovation and improve precision across diverse data science applications. We have open-sourced R&D-Agent on GitHub: https://github.com/microsoft/RD-Agent.
The paper "R&D-Agent: Automating Data-Driven AI Solution Building Through LLM-Powered Automated Research, Development, and Evolution" (Yang et al., 20 May 2025) introduces a novel dual-agent framework designed to automate the complex and iterative process of creating data-driven AI solutions. The core challenge addressed is the increasing complexity of data science tasks, which demand significant expertise and time, even with advancements like crowdsourcing platforms. R&D-Agent aims to bridge the gap between automated solutions and human expert-level performance by employing LLM-powered agents for research, development, and evolution.
The framework's architecture, illustrated in Figure 1, revolves around two specialized agents:
Researcher Agent: This agent is responsible for high-level strategic thinking and idea generation. It processes performance feedback from previous iterations to propose new research directions or refine existing ideas. Its learning process involves accumulating knowledge into a knowledge base from past experiments or external sources.
Developer Agent: This agent focuses on the practical implementation of the ideas proposed by the Researcher. It takes the high-level natural language descriptions and translates them into executable code. A key aspect of its operation is iterative refinement based on execution error feedback.
Dedicated R&D Roles and Implementation
The separation of roles is a cornerstone of the R&D-Agent design, drawing an analogy to human research and development teams.
Researcher Agent:
Function: Proposes research ideas, analyzes performance feedback, refines strategies.
Input: Task description, performance metrics from previous attempts, knowledge base.
Output: High-level research plan or hypothesis in natural language.
LLM Choice: Can leverage LLMs strong in reasoning and creative ideation (e.g., "o1" or "o3" as mentioned in experiments).
Process:
1. Receives task and current state (e.g., performance of last solution).
2. Consults its knowledge base (continuously updated).
3. Generates a new idea or modification to an existing one.
4. Passes the idea to the Developer agent.
Developer Agent:
Function: Implements the Researcher's ideas, debugs code, ensures solutions are runnable and practical.
Input: Research idea (natural language), dataset, execution logs/error messages.
Output: Executable code, performance results on a dataset.
LLM Choice: Can utilize LLMs proficient in instruction following and code generation (e.g., "GPT-4.1").
Process (Two-Phase Development):
1. Phase 1: Develop and Debug on Sampled Data:
The Developer first tests and debugs the code on a sampled subset of the data. This significantly speeds up the development cycle by allowing rapid iteration and error correction before running on the full, potentially large, dataset.
If errors occur, the Developer agent analyzes the execution logs and attempts to fix the code, potentially querying the LLM again with the error context.
2. Phase 2: Run the Solution on the Full Dataset:
Once a runnable solution is achieved on the sample, it's executed on the full dataset.
Performance metrics and any new execution issues are reported back to the Researcher agent.
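The two-phase development loop can be sketched as follows. This is an illustrative skeleton under stated assumptions: `generate_code` and `execute` stand in for the LLM code-generation call and a sandboxed runner, and are not the paper's actual API.

```python
def develop_and_run(idea, generate_code, execute, sample_data, full_data, max_fixes=3):
    """Hypothetical sketch of the Developer's two-phase loop.

    `generate_code(idea, error)` and `execute(code, data) -> (ok, log)` are
    assumed interfaces, not the paper's actual implementation.
    """
    code = generate_code(idea, error=None)
    # Phase 1: iterate on a small sample so each debug cycle is cheap.
    for _ in range(max_fixes):
        ok, log = execute(code, sample_data)
        if ok:
            break
        # On failure, re-query the LLM with the error context.
        code = generate_code(idea, error=log)
    else:
        return None, "failed on sample after retries"
    # Phase 2: run the now-working solution on the full dataset;
    # the result is reported back to the Researcher agent.
    ok, result = execute(code, full_data)
    return (code, result) if ok else (None, result)
```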
This separation allows for tailoring LLM choices to task strengths, potentially using a more creative model for research and a more meticulous one for development.
Multi-Trace Idea Explorations
To overcome the limitations of a single exploration path, which can lead to suboptimal solutions, R&D-Agent implements a multi-trace exploration mechanism.
Motivation: A single agent's exploration is constrained by its initial configuration (LLM, prompts, tools, knowledge). Multiple diverse traces increase the chance of finding better solutions.
Implementation:
Parallel Execution: Multiple exploration traces can run concurrently, each potentially with different configurations (e.g., different LLMs for Researcher/Developer, varied prompt strategies, distinct toolsets, or knowledge bases).
Heterogeneity: This diversity is key to exploring different parts of the solution space.
Scalability: The system is designed for both logical and physical parallelism, allowing it to scale across distributed environments (compute nodes, containers, threads).
Cross-Trace Collaboration:
Information Sharing: Traces can share intermediate results, such as effective feature sets, partial models, or even failure logs. For instance, a new trace can be initialized with knowledge of what failed in a previous trace to avoid redundant effort.
Centralized Tracking & Dynamic Decisions: A module can monitor the performance profiles of all traces (solution quality, novelty, resource cost, error resilience). Based on this, it can:
Terminate unproductive traces.
Spawn new traces, perhaps from a promising checkpoint of an existing trace but with modified configurations.
Trace A (LLM_R1, LLM_D1, PromptSet1) -> Reaches checkpoint C_A with performance P_A
If P_A is promising but progress slows:
Spawn Trace B from C_A with (LLM_R2, LLM_D1, PromptSet2)
Spawn Trace C from C_A with (LLM_R1, LLM_D2, PromptSet1, AdditionalToolX)
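The tracking-and-spawning logic above can be sketched as a small scheduler. This is a hedged illustration: the trace dictionary fields, the stall threshold, and the configuration swap are all invented for the example, not the paper's actual policy.

```python
import copy

def manage_traces(traces, min_gain=0.01):
    """Hypothetical sketch of centralized trace tracking: terminate stalled
    traces, and fork promising-but-slowing ones with a modified configuration.
    All field names are illustrative assumptions."""
    next_round = []
    best = max(t["score"] for t in traces)
    for t in traces:
        gain = t["score"] - t["prev_score"]
        if gain < min_gain and t["score"] < best:
            continue  # terminate an unproductive trace
        next_round.append(t)
        if gain < min_gain:  # promising but stalling: branch from its checkpoint
            child = copy.deepcopy(t)
            child["config"]["researcher_llm"] = "alt-model"  # illustrative swap
            child["parent"] = t["id"]
            child["id"] = f'{t["id"]}-fork'
            next_round.append(child)
    return next_round
```

A forked child inherits its parent's checkpoint state (via the deep copy) while varying one configuration axis, mirroring the Trace B / Trace C example above.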
Multi-Trace Fusion
The culmination of multi-trace exploration is the ability to merge insights and components from different successful traces into a superior composite solution.
Goal: Combine complementary strengths discovered by individual traces.
Process:
Component Integration: Fusion can occur at various granularities, combining, for example:
Feature engineering techniques from Trace 1.
Model architecture from Trace 2.
Post-processing heuristics from Trace 3.
Evaluation and Scoring: Components from different traces are evaluated and scored based on utility, novelty, compatibility, and performance impact.
Fusion Strategy: Configurable strategies like greedy selection, weighted voting, or even optimization-guided fusion assemble the final solution.
Customization: Users can define domain-specific rules for:
Trace Evolution: Early stopping criteria, conditions for spawning new traces (e.g., performance thresholds, time limits).
Information Exchange: What intermediate outputs (code, logs, metrics) are shared and when.
Fusion Phase: Compatibility rules between components, aggregation functions, custom scoring models for components.
For example, during fusion, if Trace A found a highly effective data preprocessing pipeline and Trace B developed a robust model training script, the fusion process might combine these two. The R&D-Agent could use an LLM to assess compatibility or even generate "glue code" if needed.
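The greedy variant of such a fusion strategy can be sketched as below. This is a minimal illustration, not the paper's method: the `(stage, name, score)` component tuples and the `compatible` predicate are assumptions standing in for the scoring and compatibility rules described above.

```python
def greedy_fuse(components, compatible):
    """Hypothetical sketch of greedy fusion: fill each pipeline stage with the
    highest-scoring component that is compatible with the partial solution.

    `components` is a list of (stage, name, score) tuples; `compatible` is a
    user-defined predicate over a candidate and the partial solution.
    """
    solution = {}
    # Visit highest-scoring candidates first (greedy selection).
    for stage, name, score in sorted(components, key=lambda c: -c[2]):
        if stage in solution:
            continue  # stage already filled by a better-scoring component
        if compatible(name, solution):
            solution[stage] = name
    return solution
```

Weighted voting or optimization-guided fusion would replace the greedy scan with an aggregate objective over component combinations, but the same component-scoring interface applies.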
Experimental Evaluation
R&D-Agent was evaluated on MLE-Bench, a benchmark based on Kaggle competitions, with a 24-hour time limit and a specified virtual environment (12 vCPUs, 220GB RAM, 1 V100 GPU).
Key Findings (Table 1):
Dedicated Roles Advantage: R&D-Agent using a single LLM (o1-preview for both roles) significantly outperformed the AIDE o1-preview baseline (e.g., 22.4% vs 16.9% overall). This suggests the structural benefit of the Researcher-Developer split.
Hybrid LLM Strategy: Using "o3" for the Researcher and "GPT-4.1" for the Developer (o3(R)+GPT-4.1(D)) yielded strong results (22.45% overall), comparable to or better than the single strong LLM baseline, demonstrating the effectiveness of matching LLM strengths to roles.
Multi-Trace Effectiveness: The configuration with multi-trace exploration and fusion (o3(R)+GPT-4.1(D)-Multi.Trace) achieved the best overall performance (24.00% ± 0.94).
In this setup, two independent traces ran for 11 hours each. The second trace was informed by the first trace's history (failures, explorations) to enhance diversity.
A final 2-hour fusion phase merged code modules, ideas, and feedback. The best solution (from individual traces or the fused one) was selected.
Practical Implications and Applications
Automation of Complex AI/ML Pipelines: R&D-Agent can automate significant portions of the machine learning engineering workflow, from initial data exploration and hypothesis generation to model implementation, debugging, and refinement.
Accelerated R&D Cycles: By parallelizing exploration and learning from multiple attempts, the system can potentially arrive at high-quality solutions faster than manual or simpler automated approaches.
Democratization of Expertise: The framework can encapsulate sophisticated problem-solving strategies, making advanced AI/ML development more accessible.
Code Generation and Refinement: The Developer agent's ability to iteratively debug code based on sampled data and error logs is a practical approach to robust code generation.
Customizable and Extensible: The framework's modularity (dedicated agents, multi-trace, configurable fusion) allows users to integrate domain-specific knowledge, tools, and LLMs.
Open Source: The R&D-Agent is open-sourced (https://github.com/microsoft/RD-Agent), encouraging community contributions and broader application. This allows practitioners to directly implement, modify, and extend the system for their specific use cases, such as financial modeling, healthcare diagnostics, or industrial process optimization.
Implementation Considerations
LLM API Costs & Latency: Heavy reliance on LLM calls (for idea generation, code generation, debugging, fusion analysis) can incur significant costs and latency. The strategy of debugging on sampled data helps mitigate this for the development phase.
Knowledge Base Management: Effectively creating, maintaining, and querying the knowledge base used by the Researcher agent is crucial for long-term learning and improvement.
State Management: Managing the state of multiple parallel traces, including their code, data, performance metrics, and intermediate results, requires robust infrastructure.
Prompt Engineering: The quality of prompts fed to the LLMs for both research ideation and code development will significantly impact performance.
Resource Management: Running multiple traces, especially those involving model training, can be computationally intensive. The system needs efficient resource allocation and scheduling.
Evaluation of Partial Solutions: Developing effective heuristics or models to evaluate the promise of partial solutions or components during fusion is a non-trivial challenge.
Future Work
The paper positions itself as a technical report with preliminary results. Future work includes providing more technical details and comprehensive experimental results. The authors also suggest that the R&D-Agent framework is flexible enough to be applied to other research and development scenarios beyond machine learning engineering. Ongoing work mentioned includes exploring alternative early-stop policies, injecting domain knowledge into traces, and adaptive fusion timing.
In summary, R&D-Agent offers a structured and powerful approach to automating AI solution development by mimicking human R&D processes with specialized LLM agents and leveraging parallel, collaborative exploration. Its practical design choices, such as two-phase development and multi-trace fusion, combined with its open-source nature, make it a promising tool for tackling complex data science problems.