- The paper introduces a multi-agent, training-free LLM framework that leverages temporal reasoning over longitudinal EHR for multi-cancer early detection.
- It employs a sequential chain-of-agents and persistent long-term memory to extract and integrate time-stamped clinical events, enhancing interpretability and prediction accuracy.
- The system achieves 1-year AUROCs of 0.64 to 0.80 across 15 cancer types, outperforming logistic regression and k-NN on lung cancer while closely approaching XGBoost, and supports scalable, unsupervised cohort-level analysis.
TrajOnco: Multi-Agent Temporal Reasoning for Multi-Cancer Early Detection from Longitudinal EHR
Introduction
TrajOnco introduces a scalable, training-free, multi-agent LLM framework for multi-cancer early detection through structured temporal reasoning over longitudinal Electronic Health Records (EHR). The method addresses deficits in temporal pattern extraction and interpretability that limit the clinical adoption of both classical ML and single-agent LLM approaches for long-horizon oncologic risk stratification. TrajOnco combines a sequential chain-of-agents (CoA) architecture with a persistent long-term memory (LTM) module, so that the time-stamped, event-centered evidence behind each cancer prediction can be synthesized interpretably across heterogeneous patient trajectories.
Figure 1: The TrajOnco framework employs sequential worker agents, persistent memory, and a manager agent to process longitudinal EHR data for patient-level cancer risk and evidence narratives.
Framework and Methodological Design
TrajOnco instantiates the CoA paradigm by transforming each patient's longitudinal EHR into a temporally ordered XML corpus, which is partitioned into context-aware chunks. Each chunk is analyzed by a dedicated worker agent, which outputs a local risk assessment and extracts high-salience, timestamped candidate events for the LTM. Deduplication and time-aware chunking are used to combat context degradation and information redundancy, addressing the "lost-in-the-middle" effect seen when monolithic LLMs handle extended input sequences. The LTM distills, aggregates, and persists critical events across the EHR, preserving signal from weak early cues that would otherwise be obscured.
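The chunk, worker, and memory flow described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Event` schema, the fixed-size chunking rule, the exact-match deduplication key, and the keyword-based `stub_worker`, which stands in for an LLM call. The source does not specify how TrajOnco implements any of these steps.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class Event:
    """One time-stamped clinical event (hypothetical schema)."""
    when: date
    text: str


def chunk_by_time(events, max_events_per_chunk):
    """Sort events chronologically, then split them into consecutive chunks."""
    ordered = sorted(events, key=lambda e: e.when)
    return [
        ordered[i:i + max_events_per_chunk]
        for i in range(0, len(ordered), max_events_per_chunk)
    ]


@dataclass
class LongTermMemory:
    """Persists salient events across chunks and drops exact repeats."""
    events: list = field(default_factory=list)
    _seen: set = field(default_factory=set)

    def add(self, event):
        # Assumed dedup key: same date and same normalized text.
        key = (event.when, event.text.strip().lower())
        if key not in self._seen:
            self._seen.add(key)
            self.events.append(event)


def stub_worker(chunk, memory):
    """Stand-in for an LLM worker: flags events that mention a keyword."""
    keywords = ("smok", "nodule")  # illustrative only
    salient = [e for e in chunk if any(k in e.text.lower() for k in keywords)]
    for event in salient:
        memory.add(event)
    return {"n_events": len(chunk), "n_salient": len(salient)}


def run_chain(events, max_events_per_chunk=2):
    """Process chunks in temporal order while sharing one memory object."""
    memory = LongTermMemory()
    outputs = [
        stub_worker(chunk, memory)
        for chunk in chunk_by_time(events, max_events_per_chunk)
    ]
    return outputs, memory
```

Because the memory object is shared across workers, an early cue such as a smoking record stays available to later steps even after its chunk has been processed.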
A manager agent produces the final patient-level cancer risk prediction and summary by integrating the full LTM with the cumulative worker output. Critically, TrajOnco operates in a zero-shot regime with no model fine-tuning and requires only minimal prompt modification per cancer type. This sidesteps the cost and limitations of supervised, cohort-specific modeling and enables rapid transfer across prediction targets.
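One way to picture "minimal prompt modification per cancer type" is a single manager prompt template in which only the target cancer changes. The sketch below is hypothetical: the template wording, the function name `build_manager_prompt`, and its parameters are illustrative assumptions, not the prompts used by TrajOnco.

```python
# Hypothetical template: only {cancer_type} varies between prediction targets.
MANAGER_TEMPLATE = (
    "You are the manager agent. Using the long-term memory and worker outputs "
    "below, estimate the patient's 1-year risk of {cancer_type}.\n\n"
    "Long-term memory:\n{memory}\n\n"
    "Worker outputs:\n{worker_outputs}\n"
)


def build_manager_prompt(cancer_type, memory_events, worker_outputs):
    """Fill the shared template for one cancer type."""
    memory = "\n".join(f"- {event}" for event in memory_events)
    workers = "\n".join(f"- {output}" for output in worker_outputs)
    return MANAGER_TEMPLATE.format(
        cancer_type=cancer_type, memory=memory, worker_outputs=workers
    )
```

Under this assumption, switching from lung to liver cancer changes one argument and nothing else, which is what makes a zero-shot design cheap to retarget.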
Empirical evaluation spans 15 major malignancies using de-identified, multisystem US EHR (Truveta data), with case-control cohorts matched on demographics to limit confounding. Cancer-specific AUROCs for 1-year risk prediction range from 0.64 to 0.80, with most cancers demonstrating AUROC > 0.70. Peak discrimination is observed in liver and lung cancer, with reduced signal in colorectal cancer, where symptoms tend to emerge only close to diagnosis.
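For readers less familiar with the metric, AUROC is the probability that a randomly chosen case receives a higher risk score than a randomly chosen control, with ties counted as one half. The following sketch computes it directly from that definition; the function name and the toy inputs are illustrative, and the source does not state which implementation the authors used.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (case, control) pairs ranked correctly.

    labels: 1 for cases, 0 for controls. Ties between scores count as 0.5.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("AUROC needs at least one case and one control")

    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))
```

The pairwise loop is quadratic in cohort size, so it suits small illustrations; a rank-based formulation gives the same value more efficiently on large cohorts.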
Figure 2: TrajOnco’s AUROC by cancer type (left), with ROC (middle) and PR (right) curves for the lung cancer benchmark against baselines.
On the lung cancer task, TrajOnco attained AUROC 0.785 in a zero-shot setting—surpassing classical models (logistic regression, k-NN) and closely approaching XGBoost (AUROC 0.796), despite the latter’s access to extensive supervised training data and task-specific features. The framework's multi-agent character yielded superior performance over single-agent LLM architectures for long-sequence temporal inference, affirming the advantage of explicit task-structuring and memory-assisted summarization.
Efficiency, Scalability, and Architectural Sensitivity
Comprehensive sensitivity analyses probe architectural trade-offs in base model capacity, input context length, and throughput optimizations. While larger GPT models (e.g., GPT-5) incrementally enhance performance, the CoA+LTM approach enables smaller models (GPT-4.1-mini) to achieve near-SOTA discrimination at a substantial reduction in runtime and cost. Longer patient trajectories, i.e., more extensive longitudinal data, monotonically sharpen risk stratification (AUROC 0.614 to 0.789 across 2k–256k token spans), highlighting the importance of deep temporal context.
Figure 3: Model runtime, cost, and AUROC by base model (a); AUROC improvement with longer patient history (b); LLM-judge evaluation of one- versus two-stage CoA (c); latency benefit of parallelizable two-stage variant (d).
A two-stage pipeline further enables parallel preprocessing for very long EHRs, reducing wall-clock time at a modest cost in interpretability. The one-stage variant displays superior summary compactness, clinical and temporal reasoning, and detail retention, as judged by LLM-as-a-judge evaluation and its alignment with expert human raters.
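The latency benefit of a two-stage design can be sketched as follows. The source does not detail what each stage does, so this example assumes that stage one extracts candidate events from each chunk independently (and can therefore run in parallel) and that stage two merges them sequentially in temporal order. The `extract_events` rule is a stand-in for an LLM call, and all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_events(chunk):
    """Stage 1 stand-in: pick candidate events from one chunk in isolation."""
    return [line for line in chunk if line.startswith("!")]


def two_stage(chunks, max_workers=4):
    """Run stage 1 in parallel, then merge the results in chunk order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in input order, so time order is kept.
        per_chunk = list(pool.map(extract_events, chunks))

    memory = []
    for events in per_chunk:  # Stage 2: sequential, deduplicating merge
        for event in events:
            if event not in memory:
                memory.append(event)
    return memory
```

Under this assumption the trade-off is visible in the code: stage-one workers never see earlier chunks or the shared memory, which is a plausible reason for some loss of narrative quality relative to a fully sequential chain.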
Interpretability and Evidence Attribution
A critical advance of TrajOnco is the generation of structured, temporally referenced, evidence-linked risk narratives at the patient level. Human annotation of the extracted clinical events indicates over 90% fidelity, with limited errors in temporal resolution and minor semantic over-assertion.
LLM-as-a-judge evaluation (using GPT-5 with high reasoning effort) on pairwise case narratives confirms TrajOnco’s dominance over single-agent LLMs in all dimensions, especially for temporal reasoning (68% win versus 23% loss), completeness (64% vs 33%), and overall clinical reasoning, with benefits magnified for long EHR input (Figure 4).
Figure 4: Pairwise LLM-judged dimensions of model outputs (left), win rate versus EHR input length (right).
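Pairwise win and loss percentages of the kind reported above can be tallied from per-case judge verdicts. The sketch below assumes each comparison yields one of three labels, "win", "loss", or "tie"; the label set and the function name are illustrative assumptions, and the source does not say how ties were handled.

```python
from collections import Counter


def win_rates(verdicts):
    """Return the share of wins, losses, and ties among pairwise verdicts."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    total = len(verdicts)
    return {label: counts.get(label, 0) / total for label in ("win", "loss", "tie")}
```

Reporting all three shares makes it clear why win and loss percentages need not sum to 100%.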
Figure 5: TrajOnco dynamically annotates risk trajectory with temporally localized events, enabling direct mapping between key EHR findings and inflection points in predicted risk.
Population-Level and Cross-Cancer Insights
By aggregating the LTM-driven event extractions and manager agent outputs across a cohort, TrajOnco enables unsupervised population-level inference. Topic modeling and 2D embedding of event representations in the lung cancer cohort reveal well-separated, mechanistically interpretable clusters (e.g., smoking, COPD, laboratory and imaging abnormalities), faithfully recapitulating established risk ontologies.
Figure 6: UMAP of event clusters for lung cancer (a: thematic coherence; b: category event counts, highlighting signal prevalence).
Cross-cancer summary embedding and incidence matrix analyses expose both class-specific and shared themes, such as metabolic dysregulation in GI cancers, cytopenia across hematologic malignancies, and the signature of smoking in lung and bladder, reflecting high-content data-driven mapping of clinical comorbidity and risk associations (Figure 7).
Figure 7: UMAP (a) of summary-level cancer embeddings, heatmap of pairwise cosine similarities (b), top themes per cancer (c).
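The pairwise cosine-similarity heatmap rests on a simple computation over summary-level embeddings. The sketch below shows it for toy vectors; the cancer names used as keys and the two-dimensional embeddings are made up for illustration and are not values from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def similarity_matrix(embeddings):
    """Map each pair of names to the cosine similarity of their embeddings."""
    names = list(embeddings)
    return {
        a: {b: cosine(embeddings[a], embeddings[b]) for b in names}
        for a in names
    }
```

Cancers with shared risk themes would be expected to sit closer to 1 in such a matrix, while unrelated profiles approach 0.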
Dynamic Risk Evolution and Predictive Horizon
TrajOnco’s sequential operation allows for direct tracking of risk trajectory over time. Aggregated Sankey diagrams for lung cancer visualize transitions in risk state across age, with flow widths quantifying patient journey prevalence and event-level annotation attributing specific transitions to clinical findings or stabilization. Sensitivity analysis demonstrates a monotonic decrease in AUROC with longer prediction horizons, confirming denser predictive signal near imminent diagnosis (Figures 10 and 11).
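The flow widths in a Sankey diagram are counts of transitions between consecutive risk states. A minimal sketch of that aggregation follows, assuming each patient is represented by an ordered list of discrete risk labels; the labels and the function name are hypothetical, and the binning by age used in the paper is not reproduced here.

```python
from collections import Counter


def transition_counts(trajectories):
    """Count (from_state, to_state) pairs across all patient trajectories."""
    flows = Counter()
    for states in trajectories:
        # Pair each state with its successor in the same trajectory.
        for src, dst in zip(states, states[1:]):
            flows[(src, dst)] += 1
    return flows
```

Each key in the result corresponds to one ribbon in the diagram, and its count sets the ribbon's width.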
Implications and Future Directions
TrajOnco’s CoA+LTM framework demonstrates that multi-agent LLM systems can operationalize scalable, interpretable, zero-shot temporal reasoning over clinically realistic long-horizon EHR data. Its architecture outperforms or closely approaches classical approaches that require extensive feature engineering and training data. Critically, it produces granular, patient-level reasoning chains that both elucidate individual predictions and allow high-throughput thematic mining for research, supporting precision medicine, screening implementation, and biomarker discovery.
Practically, the core architectural innovations (structured chunking, persistent event memory, and modular zero-shot prompting) are readily extensible to unstructured EHR input, alternative disease domains (e.g., rare disease and multi-morbidity management; 2604.10386, Keita et al., 11 Apr 2026), and future agentic systems incorporating specialized skills for multimodal interpretation (Xu et al., 12 Feb 2026). The approach is positioned for rapid translation to real-world, heterogeneous healthcare system datasets and for integration with agentic tool use and workflow orchestration paradigms.
Conclusion
TrajOnco establishes a benchmark for multi-agent, zero-shot, temporally structured LLM modeling in multi-cancer early detection. Its near parity with strong supervised ML baselines, strong interpretability, and flexible transferability across diseases and prediction tasks suggest that agent-based LLMs are approaching applicability for high-reliability, evidence-linked, resource-scalable risk modeling in complex clinical contexts. Future work should extend the paradigm to unstructured EHR, integrate tool use and multimodal data, and systematically benchmark against upcoming LLMs with emergent long-context capabilities.