irAE-Agent: AI for Detecting Immune Events
- irAE-Agent is an automated generative language model designed to identify immune-related adverse events from free-text oncology clinical notes.
- It leverages robust data extraction from Epic EHR, semantic standardization, and daily batch processing to optimize clinical data integration.
- The system employs rigorous validation, human-in-loop reviews, economic assessments, and drift monitoring to ensure clinical reliability and compliance.
The irAE-Agent is an automated generative LLM agent specifically designed to detect immune-related adverse events (irAEs) from free-text clinical notes in oncology, with a deployment focus at Mass General Brigham. Its architecture and operational models address the confluence of clinical informatics, artificial intelligence, and the distinctive technical and regulatory challenges inherent to EHR-integrated clinical decision support. While notable for its use of LLMs in a production healthcare setting, the primary contributions of irAE-Agent stem from the engineering, validation, and governance work required for reliable, large-scale, and compliant integration with real-world clinical workflows (Gallifant et al., 30 Sep 2025).
1. System Architecture and Data Integration
The irAE-Agent system is architected for the extraction, transformation, and analysis of unstructured clinical data sources, particularly clinical notes generated for patients receiving immune checkpoint inhibitors. Data integration encompasses several key components:
- Extraction of clinical notes—including progress notes, discharge summaries, and oncology-specific documentation—directly from the Epic EHR environment.
- Secure transfer of data into a centralized research enclave built on Snowflake, employing private Azure networks and managed identity access via Okta SSO, maintaining full HIPAA compliance.
- Preprocessing of free-text notes through “chunking” into uniform two-line segments. This methodology facilitates both storage optimization and efficient vector embedding, markedly diverging from conventional ML pipelines oriented toward structured data.
- Semantic standardization using OMOP/i2b2 mappings to harmonize disparate note types and terminologies.
The adopted pipeline is oriented around daily batch processing, with refreshes occurring every 24–36 hours. This design supports regulatory and operational constraints for biobank registration (requiring completion within 96 hours) and avoids complexities associated with continuous real-time HL7 feed integration.
Component | Description | Purpose |
---|---|---|
EHR Extraction | Notes from Epic system | Raw data acquisition |
Preprocessing | Text chunking, semantic standardization | Efficient storage/embedding |
Secure Transfer | Azure/private networking, Okta SSO | Data security, privacy compliance |
Centralized Store | Snowflake enclave | Scalable research access |
2. Model Validation and Refinement
Model validation in irAE-Agent is delineated as a multi-phase, iterative protocol emphasizing rigorous clinical and technical oversight. Key steps include:
- Retrospective chart curation to build a gold-standard dataset, with human domain experts (oncologists, informaticists) providing detailed annotations and adjudications.
- Zero-shot and subsequent fine-tuned LLM evaluations, targeting identification of common failure modes such as hallucinations, failure to maintain clinical context, and misclassification of ambiguous notes.
- Human-in-the-loop validation: iterative rollout to clinical users with embedded workflows for review, error correction, and continuous feedback incorporation.
- Performance metrics include macro F1 (with a threshold of ≥0.75), sensitivity, and precision, consistently measured above these established standards.
- Prompt unit testing: qualitative test routines for each subcomponent of the agent’s prompt logic.
Inter-annotator agreement and dual annotation are employed to reinforce the reliability of the annotated reference dataset, benchmarking the LLM’s outputs with clinician expectations to safeguard against latent risks introduced by generative methods.
3. Economic Analysis and Operational Sustainability
Ensuring economic value is a design imperative underpinning the long-term sustainability of irAE-Agent deployments. The economic evaluation comprises:
- Disaggregated ROI modeling, distinguishing fixed infrastructure expenditures (e.g., cloud resources, Snowflake data warehousing) from variable, usage-based costs (e.g., LLM API fees, GPU consumption).
- An operational dashboard that tracks:
- Notes processed per time period,
- Inference latency,
- Per-note inference costs (approximately $2 per 100 notes),
- Aggregate system resource utilization.
This dual “hard and soft” ROI approach also considers labor productivity (notably reduction in manual chart abstraction workload), improvements in workflow efficiency, safety gains, and the quality/timeliness of agent outputs—a factor with special salience in generative AI, where subjective dimensions of output quality (such as clinical relevance and tone) are harder to capture in purely numerical ROI terms.
4. Monitoring Drift and Long-term Robustness
Active management of both model drift (LLM output changes without explicit version switches) and data drift (shifts in the clinical note corpus over time) is essential for system stability and regulatory compliance. The irAE-Agent framework incorporates:
- Dual-axis monitoring, utilizing weekly re-scoring against a frozen gold-standard set to identify emerging discrepancies in output distributions.
- Statistical drift detection measures, including
- Cosine similarity and TF-IDF divergence for embedding shifts,
- KL-divergence on top-K token distributions,
- to quantify both semantic and distributional deviations.
- Model version pinning at the API endpoint level to separate underlying vendor model changes from local system behavior shifts.
- Emerging techniques, such as leveraging the LLM as an explicit judge of its outputs to summarize or categorize hallucination rates and error modalities in real time.
The purpose of this monitoring is to ensure the ongoing reliability and clinical fidelity of the agent, as both institutional documentation practices and external AI model updates evolve.
5. Governance, Security, and Regulatory Oversight
Governance is instantiated as an enterprise-wide board composed of clinical, legal, patient experience, IT, and finance representatives. This body is tasked with:
- Defining purpose, safety, efficacy requirements, and establishing checkpoints across the solution lifecycle.
- Deploying a RACI (Responsible, Accountable, Consulted, Informed) framework, articulating division of labor and responsibility from inception through maintenance (referenced in Table 5 of the cited paper).
- Enacting policies and prompt engineering practices to avoid the encoding of PHI within generative agent outputs, even as real clinical data are used for tuning and validation.
- Continuous red-teaming, prompt injection testing, and second-order safety audits to counter LLM-specific risks such as jailbreaks or adversarial prompting—challenges now prominent due to the generative, unconstrained nature of LLM outputs.
This governance approach is both proactive (defining responsibilities, roles, and standards) and dynamic (adapting to evolving technical, regulatory, and clinical requirements).
6. Practical Insights and Lessons Learned
The deployment of irAE-Agent demonstrates that sociotechnical and operational factors—not solely the capacity of the underlying algorithms—determine clinical adoption and impact. Empirical resource allocation shows that over 80% of project effort was required for infrastructural, regulatory, and validation activities, with less than 20% spent directly on model and prompt engineering. Key lessons include:
- Early investment in secure and flexible data infrastructure (e.g., centralized enclaves, sandboxes) is indispensable for prototyping and safe deployment.
- Human-in-the-loop workflows, especially through continuous feedback and dual annotation, are critical to mitigating generative model risks and aligning with clinical needs.
- Economic and governance frameworks, operationalized via dashboards and clear RACI matrices, are central replicable elements for other clinical AI projects.
A plausible implication is that scaling generative AI agents in healthcare will require organizations to foreground data engineering, operations, and ethics to an extent that exceeds their focus on model development and tuning.
7. Implications for Broader AI Deployment in Healthcare
The irAE-Agent field guide provides a generalizable blueprint for future generative AI deployments in medical contexts. Recommendations emergent from this experience include:
- Allocating significant personnel and fiscal resources to data pipeline engineering and secure, scalable architecture design.
- Embedding iterative validation and human oversight mechanisms to contend with dynamic EHR content and evolving LLMs.
- Institutionalizing governance frameworks early to ensure fiscal, regulatory, and ethical sustainability.
- Treating model and data drift as a continuous, monitored phenomenon, necessitating an agile operational posture with rapid response to performance degradations or documentation practice shifts.
The cumulative experience with irAE-Agent positions it as a reference implementation for real-world deployment of LLM-powered agents in complex, compliance-driven domains, emphasizing that the “heavy lifts” of infrastructure, validation, economics, drift management, and governance are central determinants of translational success (Gallifant et al., 30 Sep 2025).