HealthFlow: A Self-Evolving AI Agent with Meta Planning for Autonomous Healthcare Research (2508.02621v1)

Published 4 Aug 2025 in cs.AI, cs.CL, cs.LG, and cs.MA

Abstract: The efficacy of AI agents in healthcare research is hindered by their reliance on static, predefined strategies. This creates a critical limitation: agents can become better tool-users but cannot learn to become better strategic planners, a crucial skill for complex domains like healthcare. We introduce HealthFlow, a self-evolving AI agent that overcomes this limitation through a novel meta-level evolution mechanism. HealthFlow autonomously refines its own high-level problem-solving policies by distilling procedural successes and failures into a durable, strategic knowledge base. To anchor our research and facilitate reproducible evaluation, we introduce EHRFlowBench, a new benchmark featuring complex, realistic health data analysis tasks derived from peer-reviewed clinical research. Our comprehensive experiments demonstrate that HealthFlow's self-evolving approach significantly outperforms state-of-the-art agent frameworks. This work marks a necessary shift from building better tool-users to designing smarter, self-evolving task-managers, paving the way for more autonomous and effective AI for scientific discovery.

Summary

The paper introduces HealthFlow, a self-evolving AI agent that refines its high-level strategies for healthcare research through meta planning.
The methodology employs a reflective feedback loop that transforms execution traces into a durable strategic knowledge base, optimizing clinical data workflows.
Experimental results on EHRFlowBench and other benchmarks demonstrate HealthFlow's superior performance compared to traditional fixed-strategy agent frameworks.

HealthFlow: A Self-Evolving AI Agent with Meta Planning for Autonomous Healthcare Research

This paper introduces HealthFlow, a novel AI agent framework designed to address the limitations of static, predefined strategies in complex domains such as healthcare research. HealthFlow distinguishes itself through a meta-level evolution mechanism that allows it to autonomously refine its high-level problem-solving policies by distilling procedural successes and failures into a durable, strategic knowledge base. To facilitate reproducible evaluation, the authors introduce EHRFlowBench, a new benchmark featuring complex, realistic health data analysis tasks derived from peer-reviewed clinical research. Experimental results demonstrate that HealthFlow's self-evolving approach significantly outperforms state-of-the-art agent frameworks.

Architectural and Evolutionary Advantages

HealthFlow's architecture is designed to overcome the constraints of predefined workflows prevalent in existing AI agents. The core innovation lies in its meta-level strategic learning, enabling the agent to evolve its own problem-solving strategies. (Figure 1)

Figure 1: The self-evolving architecture of HealthFlow, illustrating its continuous learning loop involving task reception, strategic planning, execution, evaluation, and knowledge synthesis.

This is achieved through a reflective loop where the entire execution trace of a task is analyzed to synthesize abstract, structured knowledge. Unlike existing agents that primarily focus on improving tool usage or refining reasoning templates within a fixed cognitive architecture, HealthFlow learns to strategically manage the entire problem-solving process itself. This is facilitated by a team of specialized agents:

Meta Agent: Responsible for high-level strategic planning, translating research requests into executable plans, and dynamically incorporating accumulated knowledge.
Executor Agent: Translates strategic plans into concrete, tool-based operations, operating within a secure, isolated workspace.
Evaluator Agent: Provides immediate, task-specific feedback to drive iterative improvement within a single task attempt.
Reflector Agent: Synthesizes abstract, generalizable knowledge from successful execution traces, enabling long-term, meta-level evolution.
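
The division of labor above can be sketched as a small control loop. This is a minimal illustration, not the authors' implementation: the class name `HealthFlowLoop`, the `Experience` record, the retry limit, and the toy callables are all hypothetical, and each role is reduced to a plain function where the paper uses LLM-backed agents.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Experience:
    """Hypothetical record of one lesson distilled from a finished task."""
    task: str
    lesson: str


@dataclass
class HealthFlowLoop:
    # Each role is a plain callable here (an assumption for illustration).
    plan: Callable[[str, List[Experience]], str]      # meta agent
    execute: Callable[[str], str]                     # executor agent
    evaluate: Callable[[str, str], Tuple[bool, str]]  # evaluator agent
    reflect: Callable[[str, List[str]], Experience]   # reflector agent
    memory: List[Experience] = field(default_factory=list)
    max_attempts: int = 3  # assumed retry budget, not from the paper

    def run(self, task: str) -> Tuple[bool, str, int]:
        """Return (success, last result, number of attempts used)."""
        trace: List[str] = []
        prompt = task
        result = ""
        for attempt in range(1, self.max_attempts + 1):
            strategy = self.plan(prompt, self.memory)
            result = self.execute(strategy)
            ok, critique = self.evaluate(task, result)
            trace.append(f"attempt {attempt}: {strategy!r} -> ok={ok}")
            if ok:
                # Long-term learning: distill the successful trace into memory.
                self.memory.append(self.reflect(task, trace))
                return True, result, attempt
            # Short-term learning: feed the critique into the next plan.
            prompt = f"{task}\nPrevious feedback: {critique}"
        return False, result, self.max_attempts


# Toy stand-ins for the four agents (purely illustrative).
def toy_plan(prompt: str, memory: List[Experience]) -> str:
    needs_validation = "Previous feedback" in prompt or bool(memory)
    return "validate then analyze" if needs_validation else "analyze"


def toy_execute(strategy: str) -> str:
    return f"ran: {strategy}"


def toy_evaluate(task: str, result: str) -> Tuple[bool, str]:
    ok = "validate" in result
    return ok, "" if ok else "add a data validation step"


def toy_reflect(task: str, trace: List[str]) -> Experience:
    return Experience(task=task, lesson="validate data before analysis")
```

With these stand-ins, the first task needs a second attempt after the evaluator's critique, while a later task succeeds immediately because the planner can draw on the stored experience. That contrast between within-task feedback and cross-task memory is the point of the sketch.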

The capacity for adaptation is rooted in a closed-loop mechanism that transforms procedural execution into durable strategic knowledge. This process is centered on the generation and utilization of structured experiences, which are synthesized by the reflector agent and stored in a persistent experience memory. This memory functions as the agent's evolving strategic playbook, allowing the meta agent to retrieve relevant historical experiences and construct more efficient and robust plans.
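
One way to picture the persistent experience memory is as an append-only store with a simple relevance lookup. The sketch below is an assumption-laden illustration: the `StructuredExperience` fields, the JSON Lines file format, and the word-overlap (Jaccard) scoring are all hypothetical choices, since this summary does not say how the paper stores or retrieves experiences.

```python
import json
import re
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import List, Set


@dataclass
class StructuredExperience:
    category: str  # hypothetical label, e.g. "heuristic" or "warning"
    context: str   # kind of task the lesson applies to
    lesson: str    # the distilled strategic advice


def _tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric word set used for overlap scoring."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class ExperienceMemory:
    """Append-only store persisted as JSON Lines (an assumed format)."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.items: List[StructuredExperience] = []
        if path.exists():
            for line in path.read_text(encoding="utf-8").splitlines():
                if line.strip():
                    self.items.append(StructuredExperience(**json.loads(line)))

    def add(self, exp: StructuredExperience) -> None:
        self.items.append(exp)
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(asdict(exp)) + "\n")

    def retrieve(self, task: str, k: int = 2) -> List[StructuredExperience]:
        """Return up to k experiences whose context overlaps the task text."""
        query = _tokens(task)

        def score(exp: StructuredExperience) -> float:
            ctx = _tokens(exp.context)
            union = query | ctx
            return len(query & ctx) / len(union) if union else 0.0

        ranked = sorted(self.items, key=score, reverse=True)
        return [exp for exp in ranked if score(exp) > 0][:k]
```

A planner would call `retrieve` with the incoming research request and fold the returned lessons into its plan. A production system would more plausibly use embedding-based similarity, but word overlap keeps the example dependency-free.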

EHRFlowBench: A New Benchmark for Healthcare Research

The paper addresses a critical gap in existing resources by introducing EHRFlowBench, a new benchmark designed to mirror the complexity of real-world research challenges in healthcare. (Figure 2)

Figure 2: Task category distribution in EHRFlowBench, showing the refinement from initial LLM-extracted tasks to a final set of curated tasks across core research categories.

The benchmark comprises realistic, evidence-grounded data analysis workflows systematically extracted from established publications. It is built with a two-stage, LLM-assisted and human-verified procedure: candidate paper screening followed by task extraction. The benchmark includes 110 tasks, of which 100 are used for evaluation and 10 for training HealthFlow. The tasks are grouped into 10 major categories that cover the research lifecycle.

Experimental Evaluation and Results

The authors conduct comprehensive experiments to evaluate HealthFlow's performance against state-of-the-art agent frameworks. The experimental setup involves four benchmarks: MedAgentsBench, Humanity's Last Exam (HLE), MedAgentBoard, and EHRFlowBench. The baseline methods include general LLMs, medical LLMs, multi-agent collaboration frameworks, general agent frameworks, and biomedical agent frameworks.

The results demonstrate that HealthFlow substantially outperforms all baselines on EHRFlowBench and MedAgentBoard. (Figure 3)

Figure 3: Head-to-head performance of HealthFlow against leading agent frameworks on EHRFlowBench and MedAgentBoard, showing the distribution of task outcomes.

The ablation study quantifies the contribution of HealthFlow's core components, showing that removing either the feedback loop or the long-term experience memory degrades performance. Further analysis shows that the choice of underlying LLM is critical: a more powerful reasoning model significantly improves results.

Case studies provide concrete illustrations of HealthFlow's self-evolving capabilities. For instance, HealthFlow demonstrates its ability to visualize the correlation between systolic and diastolic blood pressure, incorporating medically informed data validation steps that baseline agents fail to perform. (Figure 4)

Figure 4: Execution results of different methods on a MedAgentBoard task, highlighting HealthFlow's ability to perform essential data validation steps.
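
To make "medically informed data validation" concrete, the sketch below filters blood pressure readings before computing a correlation. It is only an illustration of the idea: the plausibility bounds, the rule that systolic must exceed diastolic, and the function names are assumptions, not the checks HealthFlow actually performed, and it computes the Pearson coefficient rather than drawing the plot.

```python
from math import sqrt
from typing import Iterable, List, Optional, Tuple

# (systolic, diastolic) in mmHg; None marks a missing measurement.
Reading = Tuple[Optional[float], Optional[float]]

# Illustrative plausibility bounds in mmHg (assumed, NOT taken from the paper).
SBP_RANGE = (50.0, 300.0)
DBP_RANGE = (20.0, 200.0)


def validate(readings: Iterable[Reading]) -> List[Tuple[float, float]]:
    """Keep only complete, physiologically plausible readings."""
    clean: List[Tuple[float, float]] = []
    for sbp, dbp in readings:
        if sbp is None or dbp is None:
            continue  # incomplete record
        if not SBP_RANGE[0] <= sbp <= SBP_RANGE[1]:
            continue  # systolic out of the assumed range
        if not DBP_RANGE[0] <= dbp <= DBP_RANGE[1]:
            continue  # diastolic out of the assumed range
        if sbp <= dbp:
            continue  # systolic should exceed diastolic; likely swapped or mistyped
        clean.append((sbp, dbp))
    return clean


def pearson(pairs: List[Tuple[float, float]]) -> float:
    """Pearson correlation coefficient of the paired values."""
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two readings")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    return sxy / sqrt(sxx * syy)
```

Running `pearson(validate(readings))` instead of `pearson` on the raw data keeps a mistyped value such as 1200 mmHg or a swapped pair from distorting the result, which is the kind of step the case study credits HealthFlow with adding.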

Human evaluations further confirm the practical utility and quality of HealthFlow's generated solutions, with expert evaluators showing a strong preference for HealthFlow across tasks. (Figure 5)

Figure 5: Distribution of votes from domain experts comparing solutions from different agent frameworks, showing a strong preference for HealthFlow.

Limitations and Future Directions

The paper acknowledges two limitations: HealthFlow's performance is bounded by the capabilities of its underlying LLMs, and the experience synthesis process risks distilling flawed heuristics from idiosyncratic successes. Future research directions include adapting the framework to other scientific fields, handling multi-modal inputs, and making experience synthesis more robust.

Conclusion

HealthFlow represents a significant advance in the field of AI agents for scientific research. By implementing a meta-level strategic planning and evolution mechanism, HealthFlow learns to evolve its own high-level orchestration policies from experience. The introduction of EHRFlowBench provides a valuable resource for evaluating and comparing AI agents in healthcare data analysis. The experimental results demonstrate that HealthFlow's adaptive, experience-driven strategy leads to superior performance in robustness, efficiency, and solution quality compared to state-of-the-art agent frameworks.
