From Clinical Intent to Clinical Model: An Autonomous Coding-Agent Framework for Clinician-driven AI Development

Published 18 Apr 2026 in cs.CV | (2604.17110v1)

Abstract: Clinical AI development has traditionally followed a collaborative paradigm that depends on close interaction between clinicians and specialized AI teams. This paradigm imposes a practical challenge: clinicians must repeatedly communicate and refine their requirements with AI developers before those requirements can be translated into executable model development. This iterative process is time-consuming, and even after repeated discussion, misalignment may still exist because the two sides do not fully share each other's expertise. However, autonomous coding agents may change this paradigm, raising the possibility that clinicians could develop clinical AI models independently through natural-language interaction alone. In this study, we present such an autonomous prototype for clinician-driven clinical AI development. We evaluated the system on five clinical tasks spanning dermoscopic lesion classification, melanoma-versus-nevus triage, wrist-fracture detection (including a weakly supervised variant with only 5% bounding-box annotations), and debiased pneumothorax classification on chest radiographs. Across these settings, the system consistently developed models from clinician requests and achieved promising performance. Notably, in a debiased pneumothorax classification task on chest radiographs, where chest drains can act as a major confounder, the system successfully mitigated shortcut learning and nearly halved the model's reliance on chest drains. These findings provide proof of concept that autonomous coding agents may help shift clinical AI development toward a more clinician-driven paradigm, reducing the communication overhead and dependence on specialized AI developers. Although further validation and robustness assessment are needed, this study suggests a promising path toward making clinical AI development more accessible.


Authors (6)

Summary

The paper introduces an autonomous coding-agent that translates clinician intent into executable deep learning pipelines, reducing dependence on specialized AI teams.
It details iterative code refinement using supervised, mixed-supervision, and debiasing strategies, achieving significant improvements in key performance metrics.
The study underscores practical implications for clinician-driven AI, empowering experts with workflow automation and targeted confound mitigation.

Autonomous Coding Agents for Clinician-driven Clinical AI: Framework, Implementation, and Empirical Analysis

Introduction and Framework

This study introduces an autonomous coding-agent framework that lets clinicians drive AI development through natural-language requests, reducing the traditional dependence on specialized AI teams. The authors frame the shift as replacing the classical multi-party workflow, which relies on repeated and error-prone communication between clinicians and AI developers, with direct interaction between clinicians and autonomous coding agents capable of cross-domain reasoning and robust code synthesis.

Figure 1: The proposed clinician-driven paradigm uses an autonomous coding agent to bridge medicine and AI, supplanting the communication-intensive conventional workflow.

The framework is grounded in three sequential components: a semantic parser that translates clinician intent into a structured task description, a task initializer that autogenerates a runnable deep learning pipeline, and an autonomous developer that engages in iterative code refinement informed by validation performance and clinician feedback. This architecture allows clinicians to describe tasks and clinically relevant constraints in unrestricted natural language—including objectives like confound mitigation and error-type prioritization—while retaining ongoing control through negotiation and explanation interfaces.
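The parse, initialize, and refine flow described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: `TaskSpec`, its fields, and `refine` are hypothetical names, the acceptance rule (keep a change only if the validation score improves) is an assumption, and the scores are made-up placeholders rather than results from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class TaskSpec:
    """Hypothetical structured task description produced by the parsing step."""
    task_type: str                  # e.g. "classification" or "detection"
    target: str                     # what the model should predict
    primary_metric: str             # metric used to compare candidate pipelines
    constraints: List[str] = field(default_factory=list)  # clinical priorities


def refine(
    initial_score: float,
    proposals: List[Tuple[str, Callable[[], float]]],
) -> Tuple[float, List[str]]:
    """Greedy refinement: keep a proposed change only if validation improves."""
    best, accepted = initial_score, []
    for name, evaluate in proposals:
        score = evaluate()          # validation score with the change applied
        if score > best:
            best = score
            accepted.append(name)
    return best, accepted


spec = TaskSpec(
    task_type="classification",
    target="melanoma vs nevus",
    primary_metric="sensitivity_at_80_specificity",
    constraints=["avoid missed melanomas"],
)

# Placeholder evaluations standing in for real training runs.
best, accepted = refine(
    0.60,
    [
        ("stronger augmentation", lambda: 0.72),
        ("larger learning rate", lambda: 0.65),
        ("class-balanced sampling", lambda: 0.81),
    ],
)
print(spec.primary_metric, best, accepted)
```

In this toy run the second proposal is rejected because it scores below the current best. A real system would also need a place for clinician feedback to veto or reprioritize changes, which the sketch leaves out.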

Figure 2: System overview showing translation of clinician need, parsing, code initialization, and autonomous code refinement with explanation and trade-off negotiation.

Experimental Evaluation

Supervised Settings

The framework was evaluated on standard supervised medical imaging tasks using public datasets: 8-class dermoscopic lesion classification, melanoma-versus-nevus discrimination with heightened sensitivity requirement, and pediatric wrist fracture detection with localization. In each case, the tasks were initialized solely from free-form clinician requests, without explicit machine learning language.

Empirical results consistently demonstrated that the agent could instantiate and then significantly improve model pipelines via codebase-level iteration. In dermoscopy, the refined model yielded an AUC of 0.9153 (95% CI: [0.9054, 0.9245]) from a baseline of 0.8786, and a corresponding F1 increase from 0.4675 to 0.6292. Specifically, melanoma detection sensitivity and specificity gains were both statistically significant, reflecting the clinical priorities embedded in the initial request. For melanoma-versus-nevus, AUC increased from 0.7754 to 0.9424, and sensitivity at 80% specificity from 0.6021 to 0.9089—matching the stated demand to avoid missed melanomas. For wrist-fracture detection, metrics such as mAP@50:95, precision, and F1 improved substantially post-refinement.
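For readers unfamiliar with the "sensitivity at 80% specificity" metric reported above, the following sketch shows one common way to compute it. The function name, the strict-inequality decision rule, and the toy scores are assumptions for illustration; the paper's exact procedure is not described in this summary.

```python
from typing import Sequence


def sensitivity_at_specificity(
    scores: Sequence[float],
    labels: Sequence[int],
    target_specificity: float = 0.80,
) -> float:
    """Sensitivity at the lowest threshold whose specificity meets the target.

    A case is called positive when its score is strictly above the threshold.
    """
    negatives = sorted(s for s, y in zip(scores, labels) if y == 0)
    positives = [s for s, y in zip(scores, labels) if y == 1]
    if not negatives or not positives:
        raise ValueError("need at least one positive and one negative case")
    # Try each negative score as a threshold, from lowest to highest.
    for threshold in negatives:
        true_negatives = sum(1 for s in negatives if s <= threshold)
        if true_negatives / len(negatives) >= target_specificity:
            detected = sum(1 for s in positives if s > threshold)
            return detected / len(positives)
    # The highest negative score always yields specificity 1.0.
    raise ValueError("target_specificity must not exceed 1.0")


# Toy data: five negatives followed by four positives.
scores = [0.1, 0.2, 0.3, 0.4, 0.9, 0.35, 0.6, 0.8, 0.95]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1]
sens = sensitivity_at_specificity(scores, labels)
print(sens)  # prints 0.75
```

Fixing specificity and reporting sensitivity makes the clinical priority explicit: the false-positive rate is held constant, so any gain reflects fewer missed positive cases.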

Figure 3: Test-set performance improvements for three supervised modeling tasks, with consistent gains after autonomous refinement.

Mixed-supervision and Weak-label Exploitation

The agent's ability to adapt training strategies to real-world constraints was tested in a mixed-supervision scenario for fracture detection, in which only 5% of images had bounding-box annotations. Without explicit instruction regarding label sparsity, the agent identified the supervision structure and adopted a teacher–student pseudo-labeling approach that blends strong and weak supervision. Refinement improved mAP@50:95 from 0.5462 to 0.6360 and F1 from 0.6639 to 0.8133, with a moderate shift in the recall–precision balance toward fewer missed fractures.
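A generic teacher–student pseudo-labeling step can be sketched as below. This is one plausible reading of the approach, not the agent's generated code: the `Box` type, `build_student_targets`, and the 0.7 confidence cutoff are all assumptions, and the teacher predictions are hard-coded toy values.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Box:
    x1: float
    y1: float
    x2: float
    y2: float
    confidence: float = 1.0  # human annotations get full confidence


def build_student_targets(
    annotated: Dict[str, List[Box]],
    teacher_predictions: Dict[str, List[Box]],
    min_confidence: float = 0.7,
) -> Dict[str, List[Box]]:
    """Merge human boxes with confident teacher boxes for unannotated images."""
    targets = dict(annotated)  # human annotations always take priority
    for image_id, boxes in teacher_predictions.items():
        if image_id in annotated:
            continue
        confident = [b for b in boxes if b.confidence >= min_confidence]
        if confident:  # images with no confident box are left out
            targets[image_id] = confident
    return targets


annotated = {"img_001": [Box(10, 10, 50, 50)]}
teacher_predictions = {
    "img_001": [Box(0, 0, 5, 5, 0.99)],  # ignored: human label exists
    "img_002": [Box(20, 20, 60, 60, 0.9), Box(1, 1, 4, 4, 0.3)],
    "img_003": [Box(5, 5, 9, 9, 0.4)],   # dropped: nothing confident
}
targets = build_student_targets(annotated, teacher_predictions)
print(sorted(targets))
```

The teacher would first be trained on the small annotated subset; the student then trains on the merged targets. How image-level weak labels enter the loss is not specified in this summary and is omitted here.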

Figure 4: Validation and test metrics for the mixed-supervision wrist-fracture task: rapid gain via exploitation of weak supervision.

Confound Mitigation and Shortcut Reduction

A critical test of semantic precision and technical effectiveness involved debiasing a pneumothorax classifier trained on SIIM-ACR chest radiographs, where chest drains constitute a classical shortcut. The clinician request explicitly raised this confounding issue.

The system responded by orchestrating a multi-faceted debiasing pipeline: stratified batch sampling to decorrelate confounder and target, an adversarial gradient reversal head for confound suppression, and regularization via mixup and label smoothing. Holding out test data during iteration, the resulting model cut the proportion of drain-attributable false positives from 60% to 31% and reduced the partial correlation between pneumothorax score and chest-drain presence by 47%. Notably, the framework autonomously identified and executed classical deconfounding strategies given only the natural-language statement of the clinical objective.
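Of the three debiasing components listed above, stratified batch sampling is the simplest to illustrate. The sketch below draws equal numbers from every (pneumothorax, chest-drain) cell so that, within a batch, drain presence carries no information about the label. The function name, tuple layout, and toy cell sizes are assumptions; the adversarial head and mixup are not shown.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[str, int, int]  # (image_id, pneumothorax label, drain present)


def stratified_batch(
    samples: Sequence[Sample], batch_size: int, rng: random.Random
) -> List[Sample]:
    """Draw a batch with equal counts from every (label, confounder) cell."""
    cells: Dict[Tuple[int, int], List[Sample]] = defaultdict(list)
    for sample in samples:
        cells[(sample[1], sample[2])].append(sample)
    per_cell = batch_size // len(cells)
    if per_cell == 0:
        raise ValueError("batch_size is smaller than the number of cells")
    batch: List[Sample] = []
    for key in sorted(cells):
        # Sample with replacement so rare cells can still fill their quota.
        batch.extend(rng.choices(cells[key], k=per_cell))
    rng.shuffle(batch)
    return batch


# Toy dataset in which drains co-occur strongly with pneumothorax.
samples = (
    [(f"pos_drain_{i}", 1, 1) for i in range(40)]
    + [(f"pos_nodrain_{i}", 1, 0) for i in range(5)]
    + [(f"neg_drain_{i}", 0, 1) for i in range(5)]
    + [(f"neg_nodrain_{i}", 0, 0) for i in range(50)]
)
batch = stratified_batch(samples, 16, random.Random(0))
counts = Counter((label, drain) for _, label, drain in batch)
print(counts)
```

Sampling with replacement means images from rare cells repeat often, which is one reason such sampling is usually paired with regularization like the mixup and label smoothing mentioned above.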

Figure 5: Quantitative and qualitative reduction in shortcut reliance for drain-dependent pneumothorax prediction via autonomous debiasing protocol.

Theoretical and Practical Implications

This prototype operationalizes a novel interface between clinical intent and technical realization. In contrast to classical AutoML and code-suggestion paradigms, it emphasizes natural-language expression of priorities, workflow-level automation, and iterative development. Unlike traditional AutoML, the agent optimizes for clinically meaningful objectives that need not coincide with a single standard metric, and it differs from “AI scientist” proposals by positioning the clinician as task specifier and adjudicator rather than observer.

The empirical studies reveal several core implications:

Technical feasibility: Autonomous agents can instantiate and refine nontrivial clinical models from unconstrained natural-language requests, provided the desired modeling techniques are well-represented in public knowledge and prior art.
Alignment: Iterative refinement trajectories are not only numerically effective, but also reflect the asymmetries and risk preferences characteristic of real clinical applications, such as prioritizing sensitivity or confound mitigation.
Adaptability: Agents detect and leverage weak supervision, regularization, and debiasing strategies autonomously, indicating utility in real-world, annotation-scarce scenarios.
Current limitations: Performance depends on the agent’s ability to parse and map clinical descriptions into established technical solutions. Tasks requiring deep specialized domain knowledge, entirely novel algorithms, or specification of latent/unknown risks remain challenges.
Shortcut correction: Shortcut mitigation is effective when the confound is pre-specified. Autonomous identification and correction of unknown shortcuts, or adversarial validation, are not addressed.

Future Directions

Future work must address semantic parsing alignment (ensuring technical objectives match clinical intent), robustness across heterogeneous datasets, and generalization to complex, multi-objective, or workflow-integrated deployment settings. Additionally, new interfaces for uncertainty communication, trade-off negotiation, and collaborative oversight will be essential for safe and trustworthy clinician–AI partnerships. This system’s strength lies in accessible, controllable workflow automation—not in replacing clinical judgement or AI research expertise.

Conclusion

This study provides proof-of-concept evidence that existing autonomous coding agents can function as effective intermediaries between clinical intent and deep learning model development. By converting natural-language clinical requirements into working pipelines and refining them iteratively, these systems lower practical and cognitive barriers for domain experts and reframe model-building as a more accessible, intent-centric process. The key theoretical advance is the demonstration of workflow-level alignment between clinical priorities and technical execution, subject to known limitations regarding unseen confounders and reliance on established methods. The paradigm invites continued research into robust, reliable, and interpretable autonomy for clinical AI model development, with an emphasis on human–AI collaboration and interface design.
