LLM-Guided Evolution for Medical Decision Pipelines

Published 5 Jun 2026 in cs.CL and cs.NE | (2606.07342v1)

Abstract: Adapting LLMs to clinical workflows often requires costly fine-tuning or manual prompt and pipeline engineering. We study LLM-guided MAP-Elites evolution as an inference-time alternative for discovering medical decision strategies and provide an implementation repository at https://github.com/univanxx/llm_guided_evo_medical. We formulate urgency triage, interactive consultation, and medical image classification as evolutionary searches over executable artifacts optimized by task-specific fitness functions. Across all three settings, evolution improves over manually designed baselines under practical constraints. In triage, evolved programs increase Semigran accuracy from 77.3% to 87.1% and emergency recall from 0.60 to 0.97, while improving safety-weighted held-out MIMIC-ESI performance. In interactive consultation, evolved policies improve the accuracy–cost frontier across Llama-3, Qwen-3.5, and Gemma-4 and transfer to held-out iCRAFTMD. On PneumoniaMNIST, prompt-only evolution improves frozen MedGemma VLMs while preserving strict JSON outputs. Qualitative analysis shows that the gains come from interpretable program-level mechanisms (calibrated triage boundaries, targeted evidence acquisition, selective commitment, and finding-oriented visual decision rules) rather than superficial prompt rewording alone.


Authors (7)

Summary

The paper introduces an LLM-guided MAP-Elites evolutionary search that optimizes clinical decision pipelines using task-specific fitness functions.
It demonstrates significant improvements in triage, consultation, and image classification tasks, notably enhancing accuracy and reducing operational costs.
The approach offers a cost-effective, interpretable alternative to fine-tuning, bolstering auditability and safety in high-stakes medical environments.

LLM-Guided MAP-Elites Evolution for Medical Decision Pipelines

Introduction

The paper presents an analysis and empirical evaluation of LLM-guided MAP-Elites evolution as an inference-time alternative for adapting medical decision pipelines. The motivation is the high compute and labor cost of conventional LLM fine-tuning and manual pipeline engineering in clinical applications such as triage, interactive medical consultation, and image-based diagnosis. The authors propose evolutionary search over executable artifacts (decision programs, policies, and prompt modules), with each candidate scored by a task-specific fitness function.

Methodology

The unified framework builds on the GigaEvo system: a fixed (non-fine-tuned) LLM (gpt-oss-120b) serves as the mutator within a MAP-Elites quality-diversity evolutionary search. Candidate solutions are structured Python modules that govern the decision policy, abstention criterion, or prompting logic, depending on the task. Task-specific fitness functions encode operational clinical priorities, e.g., emergency recall for triage or the accuracy–cost trade-off for consultation.

For triage, both a vignette-based (Semigran) and a real-world (MIMIC-IV-ED-derived ESI) task are evaluated.
For interactive consultation, sequential policies for evidence acquisition and abstention are evolved in the MediQ simulation, with transfer tested on iCRAFTMD.
For medical image classification, prompt-only optimization is applied to frozen MedGemma vision-LLMs on PneumoniaMNIST at various resolutions.

Search operates over program logic, prompt structure, voting/rule-based aggregation, and other executable strategies, not just raw prompt text.
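The search loop described above can be sketched roughly as follows. This is a minimal illustration of MAP-Elites and not GigaEvo's actual API: `mutate`, `evaluate`, and `descriptor` are hypothetical callables, and in the paper's setting `mutate` would wrap a call to the frozen LLM that rewrites a candidate Python module. A toy numeric mutator stands in here so the block runs on its own.

```python
import random
from typing import Callable, Dict, Hashable, Tuple, TypeVar

C = TypeVar("C")  # candidate type (in the paper: a Python module; here: a toy tuple)


def map_elites(
    seed: C,
    mutate: Callable[[C, random.Random], C],  # hypothetical: an LLM rewrite in the paper's setting
    evaluate: Callable[[C], float],           # task-specific fitness, higher is better
    descriptor: Callable[[C], Hashable],      # maps a candidate to its archive cell
    iterations: int = 200,
    rng_seed: int = 0,
) -> Dict[Hashable, Tuple[float, C]]:
    """Return an archive holding the best candidate found for each behavior cell."""
    rng = random.Random(rng_seed)
    archive: Dict[Hashable, Tuple[float, C]] = {descriptor(seed): (evaluate(seed), seed)}
    for _ in range(iterations):
        # Pick a parent uniformly from the current elites, then mutate it.
        _, parent = rng.choice(list(archive.values()))
        child = mutate(parent, rng)
        fit, cell = evaluate(child), descriptor(child)
        # Keep the child only if its cell is empty or it beats the incumbent elite.
        if cell not in archive or fit > archive[cell][0]:
            archive[cell] = (fit, child)
    return archive


# Toy stand-ins (assumptions): a candidate is a (threshold, n_questions) pair.
def toy_mutate(c, rng):
    t, q = c
    return (min(1.0, max(0.0, t + rng.uniform(-0.1, 0.1))), max(0, q + rng.choice([-1, 0, 1])))


def toy_fitness(c):
    t, q = c
    return -(t - 0.7) ** 2 - 0.01 * q  # peak at threshold 0.7, small cost per question


def toy_descriptor(c):
    return min(c[1], 5)  # one archive cell per question budget, capped at 5


archive = map_elites((0.5, 2), toy_mutate, toy_fitness, toy_descriptor)
best_fitness, best_candidate = max(archive.values(), key=lambda fc: fc[0])
```

Keeping one elite per cell, rather than a single global best, is what lets the search retain diverse strategies (for example, cheap and expensive policies) side by side.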

Quantitative Results

Triage

On the Semigran vignette triage task, evolutionary search yields a program that surpasses manually engineered baselines, increasing accuracy from 77.3% to 87.1% and emergency recall from 0.60 to 0.97. This exceeds virtually all external LLM and symptom-checker baselines, though not practicing physicians. Transfer to the Levine vignette set is safety-oriented: it preserves perfect emergency recall at the cost of conservative overtriage in non-emergency cases.

On MIMIC-ESI, the top evolved program (MIMIC-023) achieves 62.0% exact accuracy and 77.0% range accuracy on the held-out test split, reducing severe undertriage (1.2%) relative to baselines. It is competitive with state-of-the-art reference systems, including Claude 3 Sonnet, on analogous ESI benchmarks.
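To make the triage metrics concrete, here is a small sketch of how they might be computed. ESI levels run from 1 (most urgent) to 5 (least urgent); everything else is an assumption made for illustration, since this summary does not define the metrics: "range accuracy" is taken as a prediction within one level of the reference, "severe undertriage" as a prediction at least two levels less urgent, and `safety_weighted_fitness` with its `penalty` weight is a hypothetical scalarization, not the paper's fitness.

```python
from typing import Dict, Sequence


def triage_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """ESI-style metrics; levels run from 1 (most urgent) to 5 (least urgent)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(y_true)
    exact = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # ASSUMPTION: "range accuracy" counts predictions within one level of the reference.
    in_range = sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / n
    # ASSUMPTION: "severe undertriage" means predicted at least two levels less urgent.
    severe_under = sum(p - t >= 2 for t, p in zip(y_true, y_pred)) / n
    return {"exact": exact, "range": in_range, "severe_undertriage": severe_under}


def safety_weighted_fitness(m: Dict[str, float], penalty: float = 5.0) -> float:
    """Hypothetical scalar fitness: reward exact accuracy, penalize severe undertriage."""
    return m["exact"] - penalty * m["severe_undertriage"]


# Made-up labels for illustration only (not data from the paper).
m = triage_metrics([1, 2, 3, 4, 5], [3, 3, 3, 2, 5])
score = safety_weighted_fitness(m)
```

With a penalty weight above 1, a single severe undertriage costs more than one exact hit earns, which pushes the search toward conservative programs.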

Interactive Consultation

Evolved policies on the MediQ MedQA benchmark achieve consistent improvements over classical abstention strategies across multiple LLM backbones:

For Llama-3-8B, accuracy is improved by +3.1pp with a nearly 90% reduction in mean Expert-token usage.
For Llama-3-70B, accuracy is improved by +3.6pp with a 67.6% reduction in interaction tokens.

Transfer to Qwen-3.5-27B and Gemma-4-31B confirms these accuracy–cost advantages, and generalization is validated on the unseen iCRAFTMD split, where the evolved policies still rank best under aggregate Borda scoring.
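For readers unfamiliar with Borda scoring, the sketch below shows a standard Borda count that aggregates rankings over several metrics. How the paper itself aggregates is not stated in this summary, so the function, the choice of metrics, and the tie handling are assumptions, and the numbers in the example are placeholders rather than reported results.

```python
from typing import Dict, List


def borda_scores(
    results: Dict[str, Dict[str, float]],
    higher_is_better: Dict[str, bool],
) -> Dict[str, int]:
    """Sum per-metric Borda points: the best of n policies earns n-1, the worst 0."""
    names: List[str] = list(results)
    totals = {name: 0 for name in names}
    for metric, maximize in higher_is_better.items():
        # Sort worst to best so that list position equals Borda points.
        # Ties are broken by input order (sorted is stable); a real evaluation
        # might prefer averaged ranks instead.
        ranked = sorted(names, key=lambda n: results[n][metric], reverse=not maximize)
        for points, name in enumerate(ranked):
            totals[name] += points
    return totals


# Placeholder numbers for illustration only (not results from the paper).
example = {
    "policy_a": {"accuracy": 0.60, "tokens": 100.0},
    "policy_b": {"accuracy": 0.66, "tokens": 900.0},
    "policy_c": {"accuracy": 0.69, "tokens": 300.0},
}
totals = borda_scores(example, {"accuracy": True, "tokens": False})
winner = max(totals, key=totals.get)
```

Because Borda points depend only on ranks, the aggregate is insensitive to the very different scales of accuracy and token counts.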

Medical Image Classification (PneumoniaMNIST)

Prompt-only evolution for MedGemma models results in strong test set gains across all image resolutions and both 4B/27B model sizes. The best evolved prompt reaches 84.5% accuracy (27B, 224×224) and up to 68.0% on the 4B model at 28×28 resolution, outperforming prior zero-shot VLM prompts and matching more complex pipelines, while strictly maintaining JSON output contracts.
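One way to enforce a strict JSON output contract during evaluation is to count any malformed response as an error. The sketch below assumes a hypothetical schema, a single-key object `{"label": ...}` with labels `normal` and `pneumonia`; the paper's actual schema is not given in this summary.

```python
import json
from typing import Optional, Sequence

ALLOWED_LABELS = {"normal", "pneumonia"}  # ASSUMPTION: label names are illustrative


def parse_strict_json(raw: str) -> Optional[str]:
    """Return the label if `raw` is exactly one JSON object {"label": <allowed>}, else None."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None  # extra prose, code fences, or truncated output all land here
    if not isinstance(obj, dict) or set(obj) != {"label"}:
        return None  # wrong top-level type, or missing/extra keys
    label = obj["label"]
    return label if isinstance(label, str) and label in ALLOWED_LABELS else None


def contract_aware_accuracy(outputs: Sequence[str], targets: Sequence[str]) -> float:
    """Accuracy where any contract violation counts as a wrong answer."""
    hits = sum(parse_strict_json(o) == t for o, t in zip(outputs, targets))
    return hits / len(targets)
```

Folding the contract into the fitness this way means an evolved prompt cannot gain accuracy by drifting into free-text answers.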

Qualitative Analysis

The observed improvements stem from interpretable changes to program-level decision making:

In triage, class boundaries are recalibrated and emergency cases receive extra safety weighting, without relying on explicit retrieval or external clinical knowledge bases.
In consultation, evolution discovers confidence-based commitment, targeted question selection based on key clinical features, evidence-balancing for hypothesis testing, and selective application of self-consistency mechanisms.
For vision tasks, evolved prompt programs transition from direct label requests to structured, finding-oriented checklists (e.g., explicit search for radiographic signs), with thresholding that adapts to the specific VLM and input resolution.

These patterns indicate that the evolutionary process systematically explores and settles on clinically meaningful operating points rather than optimizing superficial text or overfitting to spurious correlations.
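The confidence-based commitment pattern described for consultation could look roughly like the loop below. It is a sketch under stated assumptions, not an evolved policy from the paper: `estimate`, `next_question`, and `ask_patient` are hypothetical callables standing in for the Expert model, the question selector, and the Patient simulator, and the `threshold` and `max_questions` defaults are arbitrary.

```python
from typing import Callable, Dict, List, Tuple

Beliefs = Dict[str, float]  # answer option -> estimated probability


def consult(
    estimate: Callable[[List[str]], Beliefs],            # hypothetical: Expert model scoring options
    next_question: Callable[[List[str], Beliefs], str],  # hypothetical: picks a targeted question
    ask_patient: Callable[[str], str],                   # hypothetical: Patient simulator
    initial_info: str,
    threshold: float = 0.8,
    max_questions: int = 5,
) -> Tuple[str, int]:
    """Ask until the top option clears `threshold` or the question budget runs out."""
    evidence = [initial_info]
    asked = 0
    while True:
        beliefs = estimate(evidence)
        answer, confidence = max(beliefs.items(), key=lambda kv: kv[1])
        # Commit when confident enough, or when no more questions are allowed.
        if confidence >= threshold or asked >= max_questions:
            return answer, asked
        question = next_question(evidence, beliefs)
        evidence.append(f"Q: {question} A: {ask_patient(question)}")
        asked += 1
```

Raising `threshold` trades more questions (and tokens) for fewer premature commitments, which is one axis along which an accuracy–cost frontier can be traced.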

Practical and Theoretical Implications

The framework offers a viable path for adaptively improving frozen LLM-based clinical pipelines under explicit task constraints and safety priorities, with significant reductions in inference cost and human-in-the-loop engineering effort. From an optimization perspective, the success of LLM-guided MAP-Elites evolution over executable decision spaces supports quality-diversity and evolutionary search as practical tools for real-world medical AI, complementing or replacing fine-tuning in resource- or data-constrained settings.

Furthermore, the evolved artifacts are open to direct inspection, which provides the auditability and program transparency that are critical in high-stakes medical environments. The technique supports alignment with operational clinician priorities (e.g., undertriage minimization), controlled trade-off navigation, and explicit error analysis.

By shifting the adaptation cost from pre-deployment (fine-tuning, dataset curation) to inference-time search, this evolution-based strategy can enable rapid response to new clinical contexts or shifting regulatory demands, while preserving the reproducibility and modularity of pipeline modifications.

Limitations and Future Directions

Limitations include the risk of overfitting on small vignette datasets (e.g., Semigran), sensitivity to distribution shift, brittleness of program-level heuristics (a strongly safety-specialized program collapsed when input data were missing), and the lack of prospective or clinical trial validation. The computational cost of the evolutionary pipeline, while lower than that of model fine-tuning, is also non-trivial for large search budgets or frequent redeployment.

Key future directions include:

Prospective, multi-center clinical validation of evolved pipelines.
Expansion to more complex multimodal and multi-agent workflows in medical AI (Sellergren et al., 7 Jul 2025).
Integration of stronger safety critics and distributional robustness constraints into the fitness/objective.
Extension beyond medical tasks to broader safety-critical domains where fine-tuning is limited or output constraints dominate.

Conclusion

LLM-guided MAP-Elites evolutionary search provides a rigorous, inference-time alternative for adapting medical decision pipelines to complex and heterogeneous tasks. The method surpasses classical baselines on diverse benchmarks, with gains traceable to interpretable program logic and clinically motivated mechanisms. This approach bridges the gap between model-centric adaptation and direct, outcome-driven optimization of clinical workflows, with implications extending beyond medicine to general AI system alignment and robust decision automation.

Reference: "LLM-Guided Evolution for Medical Decision Pipelines" (2606.07342)
