The paper introduces the Large Language Expert (LLE) architecture, a hybrid system that integrates the strengths of large language models (LLMs) and rule-based expert systems for clinical decision support. The core concept addresses limitations inherent in using either LLMs or rule-based systems alone to encode and apply clinical guidelines. The LLE aims to provide a system that is flexible, interpretable, explainable, and reliable, particularly in the context of rapidly evolving medical knowledge.
The central argument revolves around the challenges of representing clinical guidelines in software using either Machine Learning (ML) or rule-based (RB) approaches. ML systems struggle with translating implicit logic, are onerous to update, face customization challenges, and pose difficulties in testing and validation. RB systems, on the other hand, are slow to translate into software, difficult to maintain, and also challenging to customize and validate, especially with unstructured clinical data.
The authors present the LLE architecture as a means to overcome these challenges, enabling complex and dynamic clinical guidelines to be applied to real-world medical data. The LLE extends the ideas behind Retrieval-Augmented Generation (RAG) toward the structured formal models of expert systems. In the LLE architecture, guidelines are organized and deployed as knowledge bases composed of natural language and structured logic that are namespaced and versioned (a hypothetical entry is sketched after the list below). The architecture enables:
- Consistency checks of the logic
- Development of robust test cases
- Creation of pipelines for transforming guideline sources into knowledge base format
- Applications that expose the logic in a way that mimics human expert reasoning
- Deterministic logic updates
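The paper does not publish the knowledge-base file format, so the entry below is only a hypothetical sketch of how a namespaced, versioned rule pairing natural-language guideline text with structured logic might look; every field name and value is an assumption, not taken from the paper.

```python
# Hypothetical knowledge-base entry (all field names assumed): pairs the
# guideline's natural language with a machine-checkable logic formula so
# that experts and the evaluator read the same rule.
kb_entry = {
    "namespace": "nccn.breast",   # assumed namespacing per guideline source
    "version": "2024.3",          # versioned for audits and staged rollouts
    "rule_id": "workup/systemic-staging-imaging",
    "source_text": (
        "Consider systemic staging imaging for patients with "
        "clinical stage II or higher disease."
    ),
    "decision_factors": ["diagnosis", "clinical_stage"],
    "logic": "diagnosis == 'breast_cancer' and clinical_stage >= 2",
    "recommendation": "systemic_staging_imaging",
}
```

Because entries are plain data, the consistency checks and test cases listed above can operate on them directly, e.g. asserting that a synthetic stage III breast-cancer case triggers the rule.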
A key component of the LLE is the translation of clinical guidelines into an LLM-optimized declarative format, which remains human-readable for expert review. This translation extracts clinical recommendations, decision factors, and rules, converting them into first-order logic to identify contradictions, ambiguities, and gaps. The paper emphasizes the use of OpenAI's o1 model [12] for its reasoning capabilities in parsing the logic embedded within natural language in clinical guidelines.
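As an illustration of the first-order form (the predicates and the staging threshold here are invented, not drawn from the paper), the hypothetical rule above might be rendered as:

```latex
\forall p \,\Bigl[ \bigl(\mathrm{Dx}(p,\mathrm{BreastCancer}) \land \mathrm{Stage}(p) \ge \mathrm{II}\bigr) \rightarrow \mathrm{Recommend}(p,\mathrm{StagingImaging}) \Bigr]
```

In this form a checker can mechanically flag a contradiction (e.g. a second rule deriving ¬Recommend(p, StagingImaging) under overlapping conditions) or a gap (e.g. no rule covering Stage(p) = I).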
Updates and versioning are handled by modifying the natural language in the knowledge base, with experts independently updating affected rules. This approach allows for quick and cost-effective updates. Versioned history supports practical use cases such as accommodating different institutional adoption timelines and enabling audits with specific protocol versions. Customization is supported by parsing each guideline or clinical protocol as an independent knowledge base, which can then be stacked at inference time, allowing for the implementation of institutional policies.
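The stacking mechanism is not specified in detail in the paper; one plausible reading, sketched below with invented names, is that stacked knowledge bases resolve rule lookups in order, letting an institutional overlay shadow a base guideline rule at inference time:

```python
# Hypothetical inference-time stacking: later knowledge bases shadow earlier
# ones on rule_id collisions, so an institution can override a base guideline
# rule without editing the guideline knowledge base itself.
def stack_knowledge_bases(*kbs):
    """Merge knowledge bases left-to-right; on a rule_id collision, later wins."""
    merged = {}
    for kb in kbs:
        for rule in kb["rules"]:
            merged[rule["rule_id"]] = rule
    return merged

base = {"namespace": "nccn.breast", "version": "2024.3",
        "rules": [{"rule_id": "workup/ct-chest", "logic": "clinical_stage >= 3"}]}
site = {"namespace": "hospital_x.breast", "version": "1.2",
        "rules": [{"rule_id": "workup/ct-chest", "logic": "clinical_stage >= 2"}]}

active = stack_knowledge_bases(base, site)  # the site's stricter threshold wins
```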
The Knowledge Base Server supports higher-level operations such as translating unstructured health-record information into a structure that maps directly onto the logic of the workflow. This involves identifying and extracting clinical decision factors by collecting information sources and extracting key concepts via LLM requests with function tools. Users can review the outputs, inspecting the model's reasoning and the cited portions of the patient record. Once all clinical decision factors are extracted, the application evaluates them against a list of recommendations, using first-order logic formulas to assess the rules deterministically. An LLM then generates a human-readable summary of why something is recommended, enhancing explainability.
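The paper describes extraction as LLM requests with function tools but does not publish the schema; below is a minimal sketch in the OpenAI chat-completions style, where the tool definition, field names, and model choice are all assumptions:

```python
from openai import OpenAI

client = OpenAI()

chart_text = "…de-identified patient record text…"  # placeholder input

# Hypothetical tool schema: forces a yes/no/unknown answer plus an explanation
# and citations, matching the review surface described for Cancer Copilot.
extract_tool = {
    "type": "function",
    "function": {
        "name": "record_decision_factor",
        "description": "Record one clinical decision factor found in the chart.",
        "parameters": {
            "type": "object",
            "properties": {
                "factor": {"type": "string"},
                "answer": {"type": "string", "enum": ["yes", "no", "unknown"]},
                "explanation": {"type": "string"},
                "citations": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["factor", "answer", "explanation", "citations"],
        },
    },
}

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative; the paper does not name the extraction model
    messages=[{"role": "user", "content": chart_text}],
    tools=[extract_tool],
)
```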
The paper discusses the application of the LLE architecture in Color's Cancer Copilot, a tool designed to identify pre-treatment workup gaps for patients newly diagnosed with cancer. Cancer Copilot is a two-step human-in-the-loop system that extracts clinical decision factors and uses a logic evaluator to determine which workups are relevant and which have been completed. In the first step, the clinician reviews Cancer Copilot's assessment of clinical decision factors, each comprising a yes/no/unknown answer, an explanation, and citations from the patient data. In the second step, the clinician reviews Cancer Copilot's assessment of recommended workups, categorized as complete or incomplete.
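Because extracted factors can be "unknown", the deterministic evaluation step presumably has to propagate that third state; the sketch below assumes an unknown factor surfaces the rule for clinician review rather than silently deciding it (this three-valued handling is a guess, not from the paper):

```python
# Hypothetical deterministic evaluator. Factors are True/False/values or None
# (unknown); a rule whose factors are all known evaluates crisply, while a
# rule touching an unknown factor returns None and is flagged for review.
def evaluate_rule(logic: str, factors: dict):
    try:
        # eval over a restricted namespace keeps evaluation deterministic;
        # a production system would use a proper expression parser instead.
        return bool(eval(logic, {"__builtins__": {}}, factors))
    except TypeError:  # a None (unknown) factor reached a comparison
        return None

factors = {"diagnosis": "breast_cancer", "clinical_stage": None}  # stage unknown
print(evaluate_rule("diagnosis == 'breast_cancer' and clinical_stage >= 2", factors))
# -> None: the workup cannot be classified until the stage is resolved
```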
A retrospective study was conducted in collaboration with the University of California, San Francisco (UCSF) involving patients diagnosed with breast and colon cancer. The study evaluated the efficiency of clinicians in extracting key clinical factors and identifying required workup items using Cancer Copilot. The study used 50 de-identified patient cases for breast cancer and 50 for colon cancer. Performance was evaluated based on the number of changes the clinician made to Cancer Copilot's output in three key areas: extracted decision factors, relevance of recommended workups, and completeness of relevant workups. The time spent by the clinician was also recorded.
The results indicated that clinicians left 97.9% of clinical decision factors unadjusted. Across two runs for 50 breast cancer and 50 colon cancer patients, Copilot extracted 12,532 clinical decision factors (8,932 for breast, 3,600 for colon). The clinician changed 260 outputs, or 2.1% (172 for breast, 88 for colon). Further analysis categorized these adjustments into clinician corrections (1.3%), study artifacts (0.7%), and clinician errors (0.1%). Similarly, the clinician left 95.5% of workup items unadjusted. Across the same runs, Copilot provided 2,971 workup items (1,423 for breast, 1,548 for colon); the clinician modified 135 of them, or 4.5% (51 for breast, 84 for colon). These adjustments were categorized into clinician corrections (4.2%) and clinician errors (0.3%). The median time for non-specialist physicians unfamiliar with the patient cases to finalize recommendations was under 7.5 minutes.
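As a sanity check, the reported adjustment rates follow directly from the paper's own counts:

```latex
\frac{260}{12{,}532} \approx 0.0207 \approx 2.1\%,
\qquad
\frac{135}{2{,}971} \approx 0.0454 \approx 4.5\%
```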
The authors conclude that the Large Language Expert (LLE) architecture shows potential for improving the delivery of high-quality, guideline-concordant cancer care. The Cancer Copilot enables clinicians to efficiently review patient records and identify workup gaps while maintaining a high level of accuracy. The architecture allows for rapid diagnosis and resolution of issues due to its transparent rule definitions. The authors suggest that the LLE architecture may offer a powerful approach to implementing guideline-based care more broadly and that it balances consistency with flexibility.