TrajOnco: a multi-agent framework for temporal reasoning over longitudinal EHR for multi-cancer early detection

Published 12 Apr 2026 in cs.AI and cs.MA | (2604.10386v1)

Abstract: Accurate estimation of cancer risk from longitudinal electronic health records (EHRs) could support earlier detection and improved care, but modeling such complex patient trajectories remains challenging. We present TrajOnco, a training-free, multi-agent LLM framework designed for scalable multi-cancer early detection. Using a chain-of-agents architecture with long-term memory, TrajOnco performs temporal reasoning over sequential clinical events to generate patient-level summaries, evidence-linked rationales, and predicted risk scores. We evaluated TrajOnco on de-identified Truveta EHR data across 15 cancer types using matched case-control cohorts, predicting risk of cancer diagnosis at 1 year. In zero-shot evaluation, TrajOnco achieved AUROCs of 0.64-0.80, performing comparably to supervised machine learning in a lung cancer benchmark while demonstrating better temporal reasoning than single-agent LLMs. The multi-agent design also enabled effective temporal reasoning with smaller-capacity models such as GPT-4.1-mini. The fidelity of TrajOnco's output was validated through human evaluation. Furthermore, TrajOnco's interpretable reasoning outputs can be aggregated to reveal population-level risk patterns that align with established clinical knowledge. These findings highlight the potential of multi-agent LLMs to execute interpretable temporal reasoning over longitudinal EHRs, advancing both scalable multi-cancer early detection and clinical insight generation.


Authors (7)

Summary

The paper introduces a multi-agent, training-free LLM framework that leverages temporal reasoning over longitudinal EHR for multi-cancer early detection.
It employs a sequential chain-of-agents and persistent long-term memory to extract and integrate time-stamped clinical events, enhancing interpretability and prediction accuracy.
The system achieves zero-shot AUROCs of 0.64-0.80 across 15 cancer types, performs comparably to supervised machine learning on a lung cancer benchmark, and supports scalable, unsupervised cohort-level analysis.

TrajOnco: Multi-Agent Temporal Reasoning for Multi-Cancer Early Detection from Longitudinal EHR

Introduction

TrajOnco introduces a scalable, training-free, multi-agent LLM framework targeting multi-cancer early detection through structured temporal reasoning over longitudinal Electronic Health Records (EHR). The method addresses critical deficits in temporal pattern extraction and interpretability that limit the clinical adoption of both classical ML and single-agent LLM approaches for long-horizon oncologic risk stratification. TrajOnco leverages a sequential chain-of-agents (CoA) architecture and a persistent long-term memory (LTM) module, enabling interpretable synthesis of temporally structured, event-centered evidence driving cancer prediction across heterogeneous patient trajectories.

Figure 1: The TrajOnco framework employs sequential worker agents, persistent memory, and a manager agent to process longitudinal EHR data for patient-level cancer risk and evidence narratives.

Framework and Methodological Design

TrajOnco structurally instantiates the CoA paradigm by transforming each patient's longitudinal EHR into a temporally ordered XML corpus, which is partitioned into context-aware chunks. Each chunk is analyzed by a dedicated worker agent, which outputs both local risk assessments and extracts high-salience, timestamped candidate events for LTM. Deduplication and time-aware chunking strategies are employed to combat context degradation and information redundancy, addressing the "lost-in-the-middle" effect seen in monolithic LLMs handling extended input sequences. The LTM distills, aggregates, and persists critical events across the EHR, preserving signal from weak early cues that would be otherwise obscured.
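The chunking and deduplication steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Event` record, the whitespace-based `count_tokens` proxy, and the helper names `chunk_by_time` and `dedupe` are all assumptions made for illustration, and plain records stand in for the XML representation.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Event:
    """Hypothetical stand-in for one time-stamped EHR entry."""
    when: date
    text: str


def count_tokens(text: str) -> int:
    # Assumption: whitespace splitting as a crude proxy for a real tokenizer.
    return len(text.split())


def dedupe(events):
    """Drop events that repeat the same date and normalized text."""
    seen, unique = set(), []
    for ev in events:
        key = (ev.when, ev.text.strip().lower())
        if key not in seen:
            seen.add(key)
            unique.append(ev)
    return unique


def chunk_by_time(events, max_tokens):
    """Sort events chronologically and pack them into token-bounded chunks."""
    chunks, current, used = [], [], 0
    for ev in sorted(events, key=lambda e: e.when):
        cost = count_tokens(ev.text)
        # Start a new chunk when the next event would exceed the budget.
        if current and used + cost > max_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append(ev)
        used += cost
    if current:
        chunks.append(current)
    return chunks


events = [
    Event(date(2021, 3, 1), "chronic cough noted"),
    Event(date(2020, 1, 5), "current smoker 30 pack-years"),
    Event(date(2021, 3, 1), "Chronic cough noted"),  # duplicate entry
    Event(date(2022, 7, 9), "chest CT shows 8 mm nodule"),
]
chunks = chunk_by_time(dedupe(events), max_tokens=8)
for i, chunk in enumerate(chunks):
    print(i, [ev.text for ev in chunk])
```

Each chunk keeps events in chronological order, so a worker agent reading chunk k sees only history that precedes chunk k+1.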

Final patient-level cancer risk prediction and summary generation are synthesized by a manager agent, integrating both the full LTM and the cumulative agent output. Critically, TrajOnco operates in a zero-shot regime with no model fine-tuning, requiring only minimal prompt modification per cancer type, thus sidestepping the cost and limitations of supervised cohort-specific modeling and enabling rapid transferability across prediction targets.
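The control flow of the worker, memory, and manager stages can be sketched as follows. This is an illustrative outline under stated assumptions, not the paper's code: `run_chain`, `stub_worker`, and `stub_manager` are hypothetical names, and the keyword rules and risk formula merely stand in for LLM calls so the example runs without any model.

```python
RISK_TERMS = ("smoker", "nodule", "weight loss")  # toy salience rule (assumption)


def stub_worker(chunk, memory):
    """Stand-in for a worker agent: assess one chunk, propose events for memory."""
    events = [line for line in chunk if any(t in line.lower() for t in RISK_TERMS)]
    note = f"{len(events)} salient event(s); {len(memory)} already in memory"
    return {"assessment": note, "events": events}


def stub_manager(memory, notes):
    """Stand-in for the manager agent: combine memory and worker notes."""
    risk = min(1.0, 0.1 + 0.2 * len(memory))  # arbitrary toy score, not a real model
    return {"risk": round(risk, 2), "evidence": memory, "notes": notes}


def run_chain(chunks, worker, manager):
    memory = []  # long-term memory of time-stamped events
    notes = []   # cumulative per-chunk worker outputs
    for chunk in chunks:
        out = worker(chunk, list(memory))  # each worker sees the memory so far
        notes.append(out["assessment"])
        for ev in out["events"]:
            if ev not in memory:           # persist only new events
                memory.append(ev)
    return manager(memory, notes)


chunks = [
    ["2020-01-05 current smoker", "2020-02-11 flu vaccine"],
    ["2021-03-01 chronic cough", "2022-07-09 chest CT: 8 mm nodule"],
]
result = run_chain(chunks, stub_worker, stub_manager)
print(result["risk"], result["evidence"])
```

Swapping the prediction target would, in this sketch, amount to replacing the worker and manager callables (in the real system, their prompts) while leaving `run_chain` unchanged.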

Discriminative Performance and Comparative Analysis

Empirical evaluation spans 15 major malignancies using de-identified, multi-system US EHR data from Truveta, with case-control cohorts matched on demographics to limit confounding. Cancer-specific AUROCs for 1-year risk prediction range from 0.64 to 0.80, with most cancers exceeding 0.70. Discrimination peaks in liver and lung cancer and is weakest in colorectal cancer, plausibly because symptoms there tend to emerge close to diagnosis.

Figure 2: TrajOnco’s AUROC by cancer type (left), ROC (middle), and PR (right) curves for lung cancer benchmark against baselines.

On the lung cancer task, TrajOnco attained AUROC 0.785 in a zero-shot setting—surpassing classical models (logistic regression, k-NN) and closely approaching XGBoost (AUROC 0.796), despite the latter’s access to extensive supervised training data and task-specific features. The framework's multi-agent character yielded superior performance over single-agent LLM architectures for long-sequence temporal inference, affirming the advantage of explicit task-structuring and memory-assisted summarization.
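For readers less familiar with the metric, AUROC is the probability that a randomly chosen case receives a higher risk score than a randomly chosen control, with ties counted as half. A minimal sketch of that pairwise definition follows; the labels and scores are toy values, not data from the paper, and `auroc` is a name chosen here for illustration.

```python
def auroc(labels, scores):
    """Pairwise AUROC: P(score of case > score of control), ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 3 cases (label 1) and 3 matched controls (label 0).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.4, 0.1]
value = auroc(labels, scores)
print(round(value, 3))
```

The quadratic pairwise loop is fine for small cohorts; a rank-based formulation gives the same value more efficiently on large ones.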

Efficiency, Scalability, and Architectural Sensitivity

Comprehensive sensitivity analyses probe architectural trade-offs in base model capacity, input context length, and throughput optimizations. While larger GPT models (e.g., GPT-5) incrementally enhance performance, the CoA+LTM approach enables smaller models (GPT-4.1-mini) to achieve near-SOTA discrimination at substantial reduction in runtime and cost. Increased patient trajectory length—i.e., extended longitudinal data—monotonically sharpens risk stratification (AUROC 0.614 to 0.789 across 2k–256k token spans), highlighting the importance of deep temporal context.

Figure 3: Model runtime, cost, and AUROC by base model (a); AUROC improvement with longer patient history (b); LLM-judge evaluation of one- versus two-stage CoA (c); latency benefit of parallelizable two-stage variant (d).

A two-stage pipeline further enables parallel preprocessing for very long EHRs, reducing wall-clock time at a modest cost in interpretability. The one-stage variant produces more compact summaries with stronger clinical and temporal reasoning and better detail retention, as judged by LLM-as-a-judge evaluation and by agreement with expert human raters.
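The latency benefit of a two-stage design comes from making per-chunk work independent so it can run concurrently. The summary does not specify how the stages are divided, so the sketch below is only one plausible arrangement: `extract_events` and `two_stage` are hypothetical names, and a keyword filter stands in for the per-chunk LLM call.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_events(chunk):
    """Stage 1 stand-in: per-chunk extraction with no dependence on other chunks."""
    return [line for line in chunk if "abnormal" in line.lower()]


def two_stage(chunks, max_workers=4):
    # Stage 1: process chunks concurrently; Executor.map preserves input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_chunk = list(pool.map(extract_events, chunks))
    # Stage 2: sequential merge in chronological chunk order.
    memory = []
    for events in per_chunk:
        for ev in events:
            if ev not in memory:
                memory.append(ev)
    return memory


chunks = [
    ["2019-05-02 routine visit", "2019-06-10 abnormal ALT"],
    ["2020-08-21 abnormal chest X-ray", "2020-09-01 follow-up scheduled"],
    ["2021-01-15 abnormal CBC"],
]
merged = two_stage(chunks)
print(merged)
```

In this arrangement stage-1 workers cannot see earlier memory, which is one way the parallel variant could lose some of the cross-chunk context that the one-stage chain retains.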

Interpretability and Evidence Attribution

A critical advance of TrajOnco is the generation of structured, temporally referenced, evidence-linked risk narratives at the patient level. Human annotation of clinical event extraction demonstrates 90%+ fidelity, with limited errors in temporal resolution and minor semantic over-assertion.

LLM-as-a-judge evaluation (using GPT-5 with high reasoning effort) on pairwise case narratives favors TrajOnco over single-agent LLMs on every dimension, most clearly for temporal reasoning (68% win versus 23% loss), completeness (64% versus 33%), and overall clinical reasoning, with the advantage growing for longer EHR input (Figure 4).
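Win and loss percentages of this kind are tallies of per-case verdicts from the judge. A minimal sketch of the tally follows; the verdict list is a toy example rather than the paper's data, and `win_rates` is a hypothetical helper.

```python
from collections import Counter


def win_rates(verdicts):
    """Fraction of pairwise comparisons ending in win, loss, or tie."""
    counts = Counter(verdicts)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no verdicts to tally")
    return {k: counts.get(k, 0) / total for k in ("win", "loss", "tie")}


# Toy verdicts for one dimension, from the multi-agent system's perspective.
verdicts = ["win"] * 7 + ["loss"] * 2 + ["tie"]
rates = win_rates(verdicts)
print(rates)
```

Win and loss rates need not sum to 100% when the judge is allowed to declare ties.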

Figure 4: Pairwise LLM-judged dimensions of model outputs (left), win rate versus EHR input length (right).

Figure 5: TrajOnco dynamically annotates risk trajectory with temporally localized events, enabling direct mapping between key EHR findings and inflection points in predicted risk.

Population-Level and Cross-Cancer Insights

By aggregating the LTM-driven event extractions and manager agent outputs across a cohort, TrajOnco enables unsupervised population-level inference. Topic modeling and 2D embedding of event representations in the lung cancer cohort reveal well-separated, mechanistically interpretable clusters (e.g., smoking, COPD, laboratory and imaging abnormalities), faithfully recapitulating established risk ontologies.

Figure 6: UMAP of event clusters for lung cancer (a: thematic coherence; b: category event counts, highlighting signal prevalence).

Cross-cancer summary embedding and incidence matrix analyses expose both class-specific and shared themes, such as metabolic dysregulation in GI cancers, cytopenia across hematologic malignancies, and the signature of smoking in lung and bladder, reflecting high-content data-driven mapping of clinical comorbidity and risk associations (Figure 7).

Figure 7: UMAP (a) of summary-level cancer embeddings, heatmap of pairwise cosine similarities (b), top themes per cancer (c).
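The pairwise cosine similarities behind a heatmap like Figure 7b can be computed directly from summary-level embedding vectors. The sketch below uses made-up three-dimensional vectors purely for illustration; the embedding model and dimensionality used in the paper are not stated here, and `cosine` and `similarity_matrix` are hypothetical helper names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(embeddings):
    """Return row/column names and the full pairwise similarity matrix."""
    names = list(embeddings)
    matrix = [[cosine(embeddings[a], embeddings[b]) for b in names] for a in names]
    return names, matrix


# Toy vectors (assumption): the first axis loosely plays the role of a shared theme.
embeddings = {
    "lung": [1.0, 0.0, 1.0],
    "bladder": [1.0, 0.0, 0.0],
    "liver": [0.0, 1.0, 0.0],
}
names, matrix = similarity_matrix(embeddings)
for name, row in zip(names, matrix):
    print(name, [round(x, 2) for x in row])
```

Cancers that share a theme along some embedding direction score closer to 1, while unrelated profiles score near 0.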

Dynamic Risk Evolution and Predictive Horizon

TrajOnco’s sequential operation allows for direct tracking of risk trajectory over time. Aggregated Sankey diagrams for lung cancer visualize transitions in risk state across age, with flow widths quantifying patient journey prevalence and event-level annotation attributing specific transitions to clinical findings or stabilization. Sensitivity analysis demonstrates a monotonic decrease in AUROC with longer prediction horizons, confirming denser predictive signal near imminent diagnosis (Figures 10 and 11).
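The flow widths in a Sankey diagram of this kind are counts of patients moving between consecutive risk states. A minimal sketch of that aggregation follows; the state labels and trajectories are invented for illustration, and `transition_counts` is a hypothetical helper.

```python
from collections import Counter


def transition_counts(trajectories):
    """Count (from_state, to_state) pairs across consecutive steps of each patient."""
    counts = Counter()
    for states in trajectories:
        for src, dst in zip(states, states[1:]):
            counts[(src, dst)] += 1
    return counts


# Toy per-patient risk states at successive time points (assumption).
trajectories = [
    ["low", "low", "high"],
    ["low", "medium", "high"],
    ["low", "low", "low"],
]
flows = transition_counts(trajectories)
for (src, dst), n in sorted(flows.items()):
    print(f"{src} -> {dst}: {n}")
```

Each count maps to the width of one link; attaching the triggering event to each transition would additionally require storing the event alongside the state change.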

Implications and Future Directions

TrajOnco's CoA+LTM framework indicates that multi-agent LLM systems can perform scalable, interpretable, zero-shot temporal reasoning over clinically realistic long-horizon EHR data. On the lung cancer benchmark, it outperforms logistic regression and k-NN and approaches XGBoost, all of which require supervised training data and task-specific features. Critically, it produces granular, patient-level reasoning chains that both explain individual predictions and allow high-throughput thematic mining for research, supporting precision medicine, screening implementation, and biomarker discovery.

Practically, the core architectural innovations—structured chunking, persistent event memory, and modular zero-shot prompting—are readily extensible to unstructured EHR input, alternative disease domains (e.g., rare disease, multi-morbidity management (2604.10386, Keita et al., 11 Apr 2026)), and future agentic systems incorporating specialized skills for multimodal interpretation (Xu et al., 12 Feb 2026). The approach is positioned for rapid translation to real-world, heterogeneous healthcare system datasets and for integration with agentic tool use and workflow orchestration paradigms.

Conclusion

TrajOnco establishes a reference point for multi-agent, zero-shot, temporally structured LLM modeling in multi-cancer early detection. Its performance comparable to supervised ML on the lung cancer benchmark, strong interpretability, and flexible transferability across diseases and prediction tasks suggest that agent-based LLMs are approaching applicability for evidence-linked, resource-scalable risk modeling in complex clinical contexts. Future work should extend the paradigm to unstructured EHR, integrate tool use and multimodal data, and systematically benchmark against upcoming LLMs with stronger long-context capabilities.
