PathFound: Agentic Multi-Turn Diagnostics
- PathFound is an agentic multimodal system that employs iterative evidence-seeking strategies to refine diagnoses in computational pathology.
- It integrates visual foundation models, vision–language models, and a reasoning agent within an MDP framework to continuously update diagnostic hypotheses.
- The system achieves state-of-the-art performance in tumor subtyping, grading, and invasion detection through a multi-stage workflow optimized via reinforcement learning.
PathFound is an agentic multimodal system for computational pathology that operationalizes evidence-seeking diagnostic reasoning. By integrating pathological visual foundation models, vision–language models, and a reasoning agent trained with reinforcement learning, PathFound implements a hypothesis-driven, multi-turn workflow analogous to that of clinical pathologists. Unlike conventional “read-once, predict-once” approaches, PathFound iteratively refines diagnoses by actively acquiring targeted evidence, including re-examinations of image regions and external test requests, structured within a Markov Decision Process (MDP) framework. This architecture enables state-of-the-art results across tumor subtyping, grading, and invasion detection tasks, and achieves superior fine-grained recognition of subtle histopathological features such as nuclear abnormalities and local invasion (Hua et al., 29 Dec 2025).
1. Evidence-Seeking Diagnostics: Motivation and MDP Framework
Traditional computational pathology frameworks typically employ a static inference regime: whole-slide images (WSIs) are partitioned, processed in a single forward pass, and mapped directly to diagnostic outputs. This “one-pass” paradigm contrasts with clinical diagnostic workflows, in which practitioners iteratively form and revise hypotheses, revisit morphological details, and trigger ancillary testing if ambiguity persists.
PathFound explicitly formulates the diagnostic process as an MDP $(\mathcal{S}, \mathcal{A}, T, r)$:
- State $s_t$: Aggregated evidence up to turn $t$, including clinical inputs, textual findings, regions of interest (RoIs), and test outcomes.
- Action $a_t$: Corresponds to slide re-highlighting, lab test invocation, or diagnostic output.
- Transition $T$: The state is deterministically updated by appending the new evidence $o_t$ generated by the selected tool or test, $s_{t+1} = s_t \cup o_t$.
- Reward $r_t$: Composite, rewarding correct diagnosis ranking and coherence among evidence, and penalizing format violations or reward-hacking outputs.
The agentic reasoner’s policy $\pi_\theta$ is learned to maximize the expected return

$$J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t} \gamma^{t}\, r_t\right].$$

Optimization is performed via a generalized policy-gradient RLVR algorithm (GRPO) (Hua et al., 29 Dec 2025).
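The composite reward described above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: the component names and weights (`w_rank`, `w_coh`, `w_fmt`) are assumptions chosen only to show how ranking quality, evidence coherence, and format compliance might combine into a single scalar.

```python
# Hypothetical sketch of the composite reward: components and weights
# are illustrative assumptions, not values from the paper.
def composite_reward(rank_of_correct_dx: int,
                     evidence_coherent: bool,
                     format_ok: bool,
                     w_rank: float = 1.0,
                     w_coh: float = 0.5,
                     w_fmt: float = 0.5) -> float:
    # Higher reward the closer the correct diagnosis sits to rank 1.
    r_rank = w_rank / rank_of_correct_dx
    # Bonus when the cited evidence is mutually coherent.
    r_coh = w_coh if evidence_coherent else 0.0
    # Penalty for format violations (e.g., malformed tool calls).
    r_fmt = 0.0 if format_ok else -w_fmt
    return r_rank + r_coh + r_fmt
```

In this form the reward is dense enough for GRPO-style group comparisons: rollouts for the same case differ in ranking and coherence, not only in final correctness.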
2. Model Architecture
PathFound comprises three principal modules orchestrated within a diagnostic “outer loop”:
- Slide Highlighter (Vision Foundation Model, VFM): Utilizes a frozen UNI-2 ViT backbone for patch-level feature extraction. A morphology prototype library encodes both coarse (e.g., tumor/normal) and fine (e.g., nuclear grade) entities, supporting flexible region-level and entity-level grounding. RoIs are sampled by maximizing cosine similarities between WSI patch embeddings and task-relevant prototypes.
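The prototype-matching step of the Slide Highlighter reduces to a top-$k$ cosine-similarity search over patch embeddings. The sketch below illustrates that mechanic only; the function name and shapes are assumptions, and the real system operates on UNI-2 ViT embeddings rather than raw arrays.

```python
import numpy as np

# Illustrative prototype-based RoI selection: return the indices of the
# k patches whose embeddings are most cosine-similar to a morphology
# prototype. Names and shapes are assumptions, not the paper's API.
def select_rois(patch_embs: np.ndarray, prototype: np.ndarray, k: int = 8):
    patch_n = patch_embs / np.linalg.norm(patch_embs, axis=1, keepdims=True)
    proto_n = prototype / np.linalg.norm(prototype)
    sims = patch_n @ proto_n              # cosine similarity per patch
    return np.argsort(sims)[::-1][:k]     # indices of the top-k patches
```

Because prototypes exist at both coarse (tumor/normal) and fine (nuclear grade) granularity, the same routine serves region-level and entity-level grounding by swapping the prototype vector.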
- Vision Interpreter (Vision-LLM, VLM): Employs a Qwen2.5-VL-7B-Instruct backbone. Visual RoIs are encoded, aligned to language space, and decoded into medical captions. The VLM is trained with cross-entropy losses for image–caption alignment (885K pairs) and instruction tuning (1.28M pairs), with integrated visual in-context learning using reference images.
- Diagnostic Reasoner (LLM + RLVR): With a Qwen2.5-32B-Instruct backbone, the reasoner is first supervised with 1,529 structured cases, then fine-tuned by reinforcement learning (GRPO) using reward shaping as described above. The agent’s policy is encoded via LLM prompting and tool-call wrappers. The RL objective is the clipped policy-gradient surrogate

$$\mathcal{L}_{\text{RL}}(\theta) = -\,\mathbb{E}\!\left[\min\!\big(\rho_t\, \hat{A}_t,\; \operatorname{clip}(\rho_t,\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_t\big)\right], \qquad \rho_t = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)},$$

where $\hat{A}_t$ is the advantage estimate (group-normalized under GRPO).
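A minimal numerical sketch of a GRPO-style objective follows, under the standard assumption that advantages are obtained by normalizing rewards within a group of rollouts for the same case; hyperparameters and names are illustrative.

```python
import numpy as np

# Sketch of a GRPO-style clipped surrogate: group-relative advantages
# plugged into the PPO clipping rule. All names are illustrative.
def grpo_loss(logp_new: np.ndarray,
              logp_old: np.ndarray,
              group_rewards: np.ndarray,
              eps: float = 0.2) -> float:
    # Group-normalized advantage: reward z-scored within the rollout group.
    adv = (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-8)
    # Importance ratio between current and rollout-time policies.
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    # Negative sign: we minimize the loss to maximize the surrogate.
    return float(-np.mean(np.minimum(unclipped, clipped)))
```

With identical old and new log-probabilities the ratio is 1 and the loss reduces to the negated mean advantage, which is zero by construction of the group normalization.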
3. Diagnostic Workflow
PathFound advances through three recurrent stages per clinical scenario:
- Exploration (Initial Diagnosis): The agent highlights a fixed number of RoIs using a generic toolkit (e.g., “PanCancerToolkit,” 8 RoIs at 10x/20x), interprets them through the VLM, and proposes both a differential diagnosis and a diagnostic plan (further tool calls or lab tests).
- Execution (Evidence Acquisition): The agent sequentially realizes the diagnostic plan—invoking targeted toolkits for refined slide highlighting or simulating test results through an external LLM (RAGES). New findings are incrementally integrated into the state $s_t$.
- Exploitation (Final Decision): The accumulated evidence is synthesized, inconsistent hypotheses are pruned, and the finalized diagnostic output is produced.
The agentic loop is formalized by the following pseudocode:
```
s ← {clinical_info}
for stage in [Exploration, Execution, Exploitation]:
    a ← π_θ(s)
    if a is CALL_TOOL(tool_i):
        RoIs ← Highlight(s, tool_i)
        obs  ← Interpret(RoIs)
        s ← s ∪ obs
    elif a is ASK_TEST(test_j):
        result ← SimulateTest(test_j)
        s ← s ∪ result
    elif a is OUTPUT_DX(dx_list):
        break
```
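The agentic loop can be made concrete with stubbed components. Everything below is a toy illustration: the tool stubs, the scripted policy, and the action tuples are assumptions standing in for the real highlighter, interpreter, and RAGES simulator.

```python
# Toy runnable version of the agentic loop; all tools are stubs and the
# "policy" is a fixed action script, purely for illustration.
def highlight(state, tool):   return [f"RoI-{i}" for i in range(2)]
def interpret(rois):          return {f"caption({r})" for r in rois}
def simulate_test(test):      return {f"result({test})"}

def run_episode(policy, clinical_info):
    state = {clinical_info}           # s_0: clinical inputs only
    dx = None
    for kind, arg in policy:          # Exploration -> Execution -> Exploitation
        if kind == "CALL_TOOL":
            state |= interpret(highlight(state, arg))
        elif kind == "ASK_TEST":
            state |= simulate_test(arg)
        elif kind == "OUTPUT_DX":
            dx = arg                  # final diagnosis terminates the episode
            break
    return dx, state

dx, state = run_episode(
    [("CALL_TOOL", "PanCancerToolkit"),
     ("ASK_TEST", "IHC"),
     ("OUTPUT_DX", ["RCC, clear cell"])],
    "clinical_info")
```

The deterministic transition of the MDP is visible here: each action simply unions its observation into the state set, matching $s_{t+1} = s_t \cup o_t$.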
4. Training Regime and Optimization
The training protocol comprises supervised and RL stages:
- Vision Interpreter: Trained for 3 epochs with AdamW (batch size 32, on an 8×H200 GPU configuration).
- Diagnostic Reasoner: RL with GRPO on the actor, 20 episodes per case, rollout batch size 64, micro-batch size 4. Reward weights balance correct-diagnosis ranking, evidence coherence, and format compliance.
- Aggregate loss: $\mathcal{L} = \mathcal{L}_{\text{SL}} + \lambda\, \mathcal{L}_{\text{RL}}$, with $\lambda$ adjusted to balance SL priming and RL stability.
5. Empirical Performance Across Clinical Scenarios
Evaluation employed multiple clinical scenes and datasets:
- Renal cell carcinoma (TCGA-RCC: 186 WSI, Xijing: 58 WSI, SCC: 101 WSI)
- Prostate adenocarcinoma with Gleason grading (TCGA-PRAD: 447 cases)
- Pan-cancer invasion detection (TCGA-Invasion: 207 cases)
Key metrics included balanced accuracy (BAcc) for RCC, standard accuracy (Acc) for PRAD and invasion, and specialized indices such as Proactive Evidence Mention Rate (PEMR), nuclear grading accuracy, and invasion F1.
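The two headline metrics are standard and easy to state precisely: balanced accuracy is the mean of per-class recalls, and F1 is the harmonic mean of precision and recall. The sketch below implements both from scratch (PEMR is dataset-specific and omitted); names are generic, not the paper's evaluation code.

```python
# Reference implementations of the headline metrics: balanced accuracy
# (mean per-class recall) and binary F1. Generic, from-scratch versions.
def balanced_accuracy(y_true, y_pred):
    recalls = []
    for c in set(y_true):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

def f1(y_true, y_pred, positive=1):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

Balanced accuracy is the appropriate choice for RCC subtyping because the subtype classes are imbalanced; plain accuracy would overweight the majority class.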
In the table below, OP and ES plausibly denote the one-pass and evidence-seeking evaluation settings, respectively:

| Model | OP-TCGA-RCC | ES-TCGA-RCC | OP-PRAD | ES-PRAD | OP-Invasion | ES-Invasion |
|---|---|---|---|---|---|---|
| Qwen3-VL | 34.1% | 44.9% | 78.1% | 86.3% | 59.4% | 72.5% |
| InternVL3.5 | 33.3% | 54.8% | 79.0% | 89.3% | 62.3% | 77.3% |
| Gemini-2.5 | 31.0% | 66.6% | 81.3% | 90.6% | 67.2% | 80.2% |
| GPT-5 | 38.3% | 71.5% | 68.9% | 92.5% | 61.8% | 79.2% |
| PathFound | 59.2% | 92.3% | 84.6% | 92.2% | 69.6% | 81.6% |
Fine-grained highlights for PathFound:
- Nuclear grading (Xijing): Accuracy 64.8%, PEMR 96.8%
- Gleason combined (9-way): C.Acc 40.7%, PEMR 79.6%
- Invasion detection: F1 71.5%, PEMR 4.3%
Ablations indicate that integration of multi-modal evidence (additional tool invocations or further test results) yields gains up to +33.0% over one-pass baselines. The best-performing configuration combined RoIs at varying magnifications and reference in-context images.
6. Evidence-Seeking Behavior and Explainability
PathFound autonomously invokes specialized analysis tools, such as nuclear-grading algorithms, in 96% of high-grade cases, exceeding the capabilities of models restricted to a static, single-pass paradigm. For pan-cancer invasion detection, PathFound achieves balanced precision/recall of 67.7%/75.8%—a 14-point F1 advantage over leading “one-pass” models.
The MDP-driven, multi-stage workflow enables explicit recording of diagnostic trajectories, including which RoIs were selected, what observations were generated, and how additional evidence directed subsequent decisions. A plausible implication is the potential for full traceability and auditing of AI decision-making, aligning with clinical standards for interpretability.
7. Comparison to PathFinder and Broader Implications
Both PathFound and “PathFinder: A Multi-Modal Multi-Agent System for Medical Diagnostic Decision-Making Applied to Histopathology” (Ghezloo et al., 13 Feb 2025) depart from “read-once” methods by modeling iterative evidence synthesis. PathFinder operationalizes four collaborating agents (Triage, Navigation, Description, Diagnosis), proceeding through cascaded visual and linguistic evidence gathering. PathFound unifies this paradigm within an RL-optimized, agentic MDP loop, supporting not only region-level re-examination but also external test integration and explicit planning. The superior gains reported for PathFound in balanced accuracy, fine-grained classification, and explainability suggest that multi-turn, evidence-seeking workflows are highly effective for computational pathology, especially in scenarios with diagnostic ambiguity or subtle morphologic cues.
PathFound formalizes evidence-seeking diagnosis with clear mathematical rigor and achieves state-of-the-art results on subtyping, grading, and invasion detection tasks. Its modular, agentic design and reward-shaped optimization provide a robust blueprint for closing the gap between static computational inference and real-world clinical reasoning (Hua et al., 29 Dec 2025).