LLM-Driven Diagnostics: Architectures & Performance

Updated 4 June 2026

LLM-driven diagnostics are an emerging approach that uses large language models to automatically detect, classify, and explain faults in complex sensor and clinical data.
These systems employ modular architectures, integrating centralized or decentralized LLM designs with statistical summarization and graph-based feature encoding to enhance accuracy.
These systems prioritize explainability by providing narrative rationales and leveraging multi-modal data, and they are validated with metrics such as precision, recall, and F1-score.

LLM-driven diagnostics refer to the deployment of large pre-trained LLMs as central reasoning and decision-making engines within diagnostic workflows, particularly in complex, sensor-rich, or knowledge-intensive environments. These systems leverage the language understanding, pattern recognition, and reasoning capabilities of LLMs to detect, classify, explain, and sometimes adapt to faults, anomalies, or diseases directly from heterogeneous and often high-dimensional input data. LLM-driven diagnostics are characterized by modular architectures (centralized or decentralized agent designs), explicit input summarization and feature encoding, explainable output rationales, domain-structured integration (e.g., knowledge graphs), and workflow protocols designed to meet the explainability, calibration, and adaptivity requirements of real-world deployment. This paradigm arises across industrial, clinical, multi-agent, optimization, legal, and reinforcement learning domains, each imposing distinct performance and safety constraints.

1. Architectural Patterns in LLM-Driven Diagnostics

In industrial anomaly and fault diagnosis, two principal architectural paradigms are distinguished: the centralized single-LLM design and the decentralized multi-LLM architecture (Lee et al., 27 Sep 2025). In the single-LLM design (denoted $\mathcal{M}$), a single agent ingests raw or summarized sensor data over a sliding window, optionally incorporating reference data representing normal operation. The model is tasked with (a) anomaly detection and (b) fault-type classification. The input pipeline takes the form:

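$X_t = \{x_i(\tau) \mid i = 1, \dots, 9,\ \tau \in \{t-w+1, \ldots, t\}\} \to \text{encoding} \to \mathcal{M} \to (\text{anomaly decision}, \text{classification vector}).$

Here $i$ indexes the nine sensor channels, $\tau$ ranges over the time steps in the window ending at $t$, and $w$ is the window length.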

The multi-LLM (modular) design instantiates task-specialized LLMs: a primary anomaly detector $\mathcal{M}_1$ and dedicated fault classifiers $\mathcal{M}_2^{f}$, one per fault mode $f$ (e.g., leak, compressor failure, filter blockage). This design isolates subtasks and invokes the classifiers only when the anomaly detector returns a positive, which reduces task interference and promotes specialization.
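The gating between $\mathcal{M}_1$ and the per-fault classifiers can be sketched as follows. This is a minimal illustration, not the implementation from the cited work: the `diagnose` function, the keyword-based stubs, and the encoded-window strings are all hypothetical stand-ins for real model calls.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for an LLM call: takes an encoded window (a string)
# and returns a boolean decision. A real system would query a model here.
Decision = Callable[[str], bool]

def diagnose(encoded_window: str,
             detector: Decision,
             classifiers: Dict[str, Decision]) -> Dict[str, object]:
    """Run the anomaly detector first; query fault classifiers only on a positive."""
    if not detector(encoded_window):
        return {"anomaly": False, "faults": []}
    faults: List[str] = [name for name, clf in classifiers.items()
                         if clf(encoded_window)]
    return {"anomaly": True, "faults": faults}

# Toy keyword-based stubs so the sketch runs without any model backend.
detector: Decision = lambda text: "high" in text
classifiers: Dict[str, Decision] = {
    "leak": lambda text: "pressure_drop" in text,
    "compressor_failure": lambda text: "compressor_power=high" in text,
    "filter_blockage": lambda text: "airflow=low" in text,
}

normal = diagnose("airflow=normal compressor_power=normal", detector, classifiers)
faulty = diagnose("airflow=normal compressor_power=high", detector, classifiers)
```

In this sketch the classifiers are never called for the normal window, which mirrors the conditional invocation described above.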

LLM-driven diagnostics in high-reliability engineering introduce knowledge-graph-coupled agents that build a formal system representation via LLM-based entity extraction, hierarchical logic construction, and validation, then employ this structured representation for hierarchical, explainable diagnosis (Marandi et al., 27 May 2025). Model interaction incorporates both direct reasoning tools (gate-based upward or downward propagation) and retrieval-augmented LLM responses (Graph-RAG) for flexible, evidence-grounded explanation.
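Gate-based upward propagation can be illustrated on a toy hierarchy. The sketch below assumes a simple AND/OR gate semantics and a hypothetical three-level cooling system; the node names and the `propagate_up` helper are illustrative and do not come from the cited framework.

```python
from typing import Dict, List, Tuple

# Hypothetical hierarchy: each non-leaf node has a gate and a list of children.
# AND: the node succeeds only if all children succeed; OR: if any child succeeds.
HIERARCHY: Dict[str, Tuple[str, List[str]]] = {
    "provide_cooling": ("AND", ["circulate_air", "compress_refrigerant"]),
    "circulate_air": ("OR", ["fan_a", "fan_b"]),
    "compress_refrigerant": ("AND", ["compressor"]),
}

def propagate_up(node: str, leaf_status: Dict[str, bool]) -> bool:
    """Evaluate the success status of `node` from component (leaf) statuses."""
    if node not in HIERARCHY:
        return leaf_status[node]  # leaf component: status is observed directly
    gate, children = HIERARCHY[node]
    results = [propagate_up(child, leaf_status) for child in children]
    return all(results) if gate == "AND" else any(results)

# A redundant fan failing does not defeat the goal; the compressor failing does.
one_fan_down = propagate_up(
    "provide_cooling", {"fan_a": False, "fan_b": True, "compressor": True})
compressor_down = propagate_up(
    "provide_cooling", {"fan_a": True, "fan_b": True, "compressor": False})
```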

In the biomedical domain, LLM-driven diagnostics have been realized in modular agentic designs that mirror clinical workflows. For example, in cardiology, the ZODIAC framework unifies table-to-text, vision-to-text, and findings-to-interpretation agents fine-tuned with real-world expert-annotated data, followed by a fact-checker interface grounded in clinical guidelines (Zhou et al., 2024). Clinical validation metrics span both effectiveness and security domains.

2. Input Representation, Summarization, and Feature Engineering

Effective diagnostic performance relies crucially on appropriate input encoding. In sensor-rich industrial monitoring, input representations are evaluated as either: (a) raw sliding-window time-series, serialized as multi-channel tables, or (b) statistical summaries, extracting features such as mean, variance, min, max, median, quartiles, and trend for each sensor over the window, yielding compact, information-rich feature matrices (Lee et al., 27 Sep 2025). Empirical results show that explicit statistical summarization outperforms raw sequence encoding in anomaly detection and classification tasks, relieving the LLM from having to parse low-level trends within the prompt. Hybrid encodings provide intermediate gains.
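A statistical summary of this kind might be computed as in the sketch below, which uses only the Python standard library. The function names, the choice of population variance, the inclusive quartile method, and the least-squares slope as the "trend" feature are assumptions made for illustration; the cited work may define these features differently.

```python
import statistics
from typing import Dict, List

def summarize_window(values: List[float]) -> Dict[str, float]:
    """Summarize one sensor's readings over a window (needs at least two readings)."""
    n = len(values)
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    # Trend: least-squares slope of the readings against their sample index.
    x_mean = (n - 1) / 2
    y_mean = statistics.fmean(values)
    slope_num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    slope_den = sum((i - x_mean) ** 2 for i in range(n))
    return {
        "mean": y_mean,
        "variance": statistics.pvariance(values),
        "min": min(values),
        "max": max(values),
        "median": median,
        "q1": q1,
        "q3": q3,
        "trend": slope_num / slope_den,
    }

def encode_for_prompt(window: Dict[str, List[float]]) -> str:
    """Serialize per-sensor summaries as one compact text line per sensor."""
    lines = []
    for sensor, values in window.items():
        stats = summarize_window(values)
        fields = ", ".join(f"{name}={value:.2f}" for name, value in stats.items())
        lines.append(f"{sensor}: {fields}")
    return "\n".join(lines)

window = {"airflow": [300.0, 310.0, 320.0, 330.0, 340.0]}
summary = summarize_window(window["airflow"])
prompt_text = encode_for_prompt(window)
```

The resulting text is far shorter than the serialized raw window, which is one plausible reason the summarized encoding is easier for the model to use.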

In cardiological diagnostics, LLM agents ingest multi-modal data, including structured tables (rhythm metrics, patient metadata) and image-based tracings, which are preprocessed and normalized for downstream agent consumption (Zhou et al., 2024). In emerging symbolic-physiological integration frameworks, such as DiagECG, continuous 12-lead ECG signals are discretized via quantization modules after temporal and cross-lead encoding, with symbolic ECG token sequences extending the native LLM vocabulary for joint reasoning over physiological and textual modalities (Yang et al., 21 Aug 2025).

In graph-based system diagnostics, representation construction is LLM-driven: system documentation, design diagrams, and expert knowledge are processed into a formal dynamic master logic hierarchy (goals, functions, subfunctions, components, success criteria) encoded into knowledge graphs with node-type- and property-specific attribute vectors (Marandi et al., 27 May 2025).

3. Diagnostic Reasoning, Explainability, and Output Justification

A central advantage of LLM-driven diagnostics over purely statistical or rule-based approaches is their ability to produce human-readable rationales for each decision. Diagnostic LLMs systematically output (1) key observations based on statistics relative to normal ranges, (2) concise anomaly classifications, and (3) narrative explanations referencing detected trends, percentile comparisons, or known fault signatures (Lee et al., 27 Sep 2025). For example:

“Key observation: Airflow rate jumps to 1000 (above 75th pctile of 319.44), compressor power and cooling output also rise. Explanation: Such an extreme airflow increase, outside normal interquartile bounds, alongside elevated compressor metrics, signifies a possible sensor fault or operational anomaly.”

In modular clinical systems, downstream reports are fact-checked against formal clinical guidelines, and the final explanation is constructed only if all evidence items and interpretations pass consistency and non-hallucination criteria (Zhou et al., 2024). In graph-based diagnostic frameworks, LLM output is grounded directly in traversals or subgraph queries, ensuring logical traceability of generated answers (Marandi et al., 27 May 2025).
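The all-or-nothing release rule for fact-checked reports can be sketched as below. Everything here is hypothetical: the `Finding` structure, the two heart-rate rules, and the `build_report` helper are toy placeholders for illustration only, not clinical guidance and not the checker used in the cited system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

@dataclass
class Finding:
    label: str              # interpretation proposed by an upstream agent
    heart_rate_bpm: float   # measurement the interpretation relies on

# Illustrative rule table: each label maps to a consistency check on the evidence.
RULES: Dict[str, Callable[[float], bool]] = {
    "sinus bradycardia": lambda hr: hr < 60,
    "sinus tachycardia": lambda hr: hr > 100,
}

def passes_check(finding: Finding) -> bool:
    """A finding passes only if a rule exists for it and the evidence satisfies it."""
    rule = RULES.get(finding.label)
    return rule is not None and rule(finding.heart_rate_bpm)

def build_report(findings: List[Finding]) -> Optional[str]:
    """Return a report only if every finding passes; otherwise withhold it."""
    if not findings or not all(passes_check(f) for f in findings):
        return None
    return "\n".join(
        f"- {f.label} (heart rate {f.heart_rate_bpm:g} bpm)" for f in findings)

consistent = build_report([Finding("sinus bradycardia", 52)])
inconsistent = build_report([Finding("sinus bradycardia", 52),
                             Finding("sinus tachycardia", 80)])
```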

In factor discovery (e.g., quantitative finance), LLM-driven diagnostic modules tag candidate formulas with family labels, enumerate diagnostic metrics (rank IC, turnover, significance), and collate persistent artifacts and structured family-level summaries for post hoc inspection and reproducibility (Shi et al., 9 Mar 2026).
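Of the metrics listed, rank IC is commonly computed as the Spearman rank correlation between factor values and subsequent returns across assets. The sketch below assumes that definition and implements it with the standard library only; the cited work may use a different variant, and the function names are illustrative.

```python
from typing import List, Sequence

def average_ranks(values: Sequence[float]) -> List[float]:
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def rank_ic(factor: Sequence[float], forward_returns: Sequence[float]) -> float:
    """Spearman rank correlation between factor values and later returns."""
    return pearson(average_ranks(factor), average_ranks(forward_returns))

# Toy cross-section of four assets whose factor ordering matches the return ordering.
ic_aligned = rank_ic([0.1, 0.5, 0.3, 0.9], [0.01, 0.03, 0.02, 0.05])
ic_reversed = rank_ic([0.1, 0.5, 0.3, 0.9], [0.05, 0.02, 0.03, 0.01])
```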

4. Diagnostic Performance and Evaluation Metrics

LLM-driven diagnostic pipelines are evaluated via classical metrics: precision, recall, F1-score, and accuracy. For highly imbalanced tasks (rare faults or anomalies), recall and F1 are particularly informative. In industrial anomaly detection with GPT-4o as the backend, statistical-summary inputs yield $P=0.73$, $R=0.99$, $F_1=0.84$, $A=0.73$, with raw inputs trailing at $F_1=0.82$ (Lee et al., 27 Sep 2025). For multi-LLM classification systems (using fault-specific agents), recall ($R \approx 0.94$) and F1 ($F_1 \approx 0.59$) are higher than for single-LLM architectures of equal scale. Window size tuning ($w=36$ hours) is critical for optimizing temporal context and minimizing overfitting.
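For reference, the four metrics follow directly from confusion-matrix counts, and F1 is the harmonic mean of precision and recall. The sketch below uses made-up counts purely for illustration (they are not taken from the cited study) and then checks that the reported precision and recall are consistent with the reported F1.

```python
from typing import Dict

def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Illustrative imbalanced case: 10 true anomalies among 100 windows.
# Accuracy looks high even though one in five anomalies is missed.
toy = classification_metrics(tp=8, fp=2, fn=2, tn=88)

# Consistency check on the figures quoted above: P=0.73, R=0.99.
reported_f1 = round(f1_from(0.73, 0.99), 2)
```

With the toy counts, accuracy is 0.96 while precision, recall, and F1 are all 0.8, which illustrates why accuracy alone is a weak indicator on imbalanced data.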

In multi-agent reinforcement learning (MARL) and RL reward design, automated training diagnostics that detect reward flooding, weak shaping, and plateaus prove essential for effective refinement. Diagnostic-driven iterative refinement protocols, guided by explicit taxonomy prompts, can dramatically boost sparse-task success rates (e.g., raising DoorKey-8×8 success from 2.3% to 97.6% for PPO agents) (Wang et al., 27 May 2026).
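A training diagnostic of this kind could be sketched as a set of simple checks over recent training statistics. The operational definitions used below (flooding as high shaped reward with near-zero success, weak shaping as negligible shaped reward with near-zero success, plateau as a flat success curve) and all thresholds are assumptions chosen for illustration; the cited work's taxonomy may define them differently.

```python
from typing import List

def diagnose_training(success_rate: List[float],
                      mean_shaped_reward: List[float],
                      window: int = 3,
                      flat_eps: float = 0.01,
                      low_success: float = 0.05,
                      high_reward: float = 1.0,
                      weak_reward: float = 0.01) -> List[str]:
    """Flag failure patterns from the last `window` evaluation points."""
    recent_success = success_rate[-window:]
    recent_reward = mean_shaped_reward[-window:]
    avg_success = sum(recent_success) / len(recent_success)
    avg_reward = sum(recent_reward) / len(recent_reward)

    flags: List[str] = []
    # Agent collects plenty of shaped reward but almost never solves the task.
    if avg_success < low_success and avg_reward > high_reward:
        flags.append("reward_flooding")
    # Shaped reward is too small to guide an agent that is not succeeding.
    if avg_success < low_success and abs(avg_reward) < weak_reward:
        flags.append("weak_shaping")
    # Success rate has stopped moving over the recent window.
    if max(recent_success) - min(recent_success) < flat_eps:
        flags.append("plateau")
    return flags

stuck = diagnose_training([0.0, 0.01, 0.01, 0.01], [2.0, 2.5, 2.4, 2.6])
improving = diagnose_training([0.1, 0.4, 0.7, 0.9], [0.5, 0.5, 0.5, 0.5])
```

The returned flags could then be inserted into a refinement prompt so the next reward proposal addresses the specific failure pattern.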

Clinical LLM agents are validated not on a single accuracy score but on multidimensional Likert-rated criteria: accuracy, completeness, organization, comprehensibility, succinctness, consistency, hallucination avoidance, and bias. Specialized multi-agent systems (e.g., ZODIAC) are reported to outperform both general LLMs and medical specialists across all such metrics (Zhou et al., 2024).

5. Continual Learning, Calibration, and Adaptation Limits

A notable challenge for LLM-driven diagnostics is continual adaptation under temporal concept drift. Prompt-based, few-shot, or context-augmented adaptation strategies, even with corrective feedback provided over multiple cycles, have proven insufficient for stable calibration: diagnostic accuracy degrades over repeated fault cycles and recovers only transiently (Lee et al., 27 Sep 2025). Underlying causes include lack of explicit long-term memory, statistical inertia with repeated positive reinforcements, and the absence of true parameter updates.

Recommendations include externalizing memory, introducing lightweight fine-tuning, or integrating physics-informed priors. In RL reward shaping, diagnostic signals tied to binary success can misfire in dense-reward tasks, necessitating more nuanced trend-based diagnostic signals or hybrid reward-probing strategies (Wang et al., 27 May 2026).
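One way to read "externalizing memory" is to keep corrective feedback in a store outside the prompt and retrieve only the relevant entries at query time. The sketch below is a minimal illustration of that idea under this assumption; the `FeedbackMemory` class, its per-key limit, and the example notes are hypothetical.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List

class FeedbackMemory:
    """Keeps the most recent corrective notes per fault type outside the prompt."""

    def __init__(self, per_key_limit: int = 3) -> None:
        self._store: Dict[str, Deque[str]] = defaultdict(
            lambda: deque(maxlen=per_key_limit))

    def record(self, fault_type: str, note: str) -> None:
        # Oldest notes are dropped automatically once the limit is reached.
        self._store[fault_type].append(note)

    def recall(self, fault_type: str) -> List[str]:
        return list(self._store.get(fault_type, ()))

    def as_prompt_context(self, fault_type: str) -> str:
        notes = self.recall(fault_type)
        if not notes:
            return "No prior feedback."
        return "Prior feedback:\n" + "\n".join(f"- {n}" for n in notes)

memory = FeedbackMemory(per_key_limit=3)
for cycle in range(1, 5):
    memory.record("leak", f"cycle {cycle}: operator feedback on leak diagnosis")

leak_notes = memory.recall("leak")
context = memory.as_prompt_context("leak")
```

Because the store persists across sessions, the feedback survives even when the prompt context is reset, which is the gap that prompt-only adaptation leaves open.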

6. Integration with Domain Structure and Externalized Knowledge

LLM-driven diagnostics achieve greater transparency and reliability when coupled with structured representations reflecting underlying system ontologies or domain logic. In high-reliability systems, LLMs are tasked both with constructing domain logic graphs from documentation and executing hierarchical diagnostic queries over explicit knowledge graphs (KG-DML) (Marandi et al., 27 May 2025). Diagnostic queries are routed either to upward or downward reasoning tools for formal inference or to retrieval-augmented prompt construction for interpretive explanation. Knowledge graph integration supports expert priors, probabilistic inference over node attributes, and real-time updates from operational data streams.

In high-risk domains such as medical diagnostics, hypergraph-driven retrieval-augmented generation (Hyper-RAG) reduces hallucination rates by capturing high-order (beyond pairwise) clinical correlations, retrieving contextually relevant multi-way relations, and conditioning LLM output on evidence-rich subgraphs (Feng et al., 30 Mar 2025). Quantitative results show significant improvements in accuracy and hallucination mitigation on demanding medical QA benchmarks.
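The difference from pairwise retrieval is that a hyperedge links any number of entities at once. The sketch below illustrates that idea with a toy overlap-based retriever; the placeholder entities, the scoring rule, and the `retrieve` and `build_prompt` helpers are assumptions for illustration and are not the Hyper-RAG algorithm itself.

```python
from typing import FrozenSet, List, Set, Tuple

# Hypothetical hyperedges: each links several entities with one description.
# Entity names are abstract placeholders, not real clinical knowledge.
HYPEREDGES: List[Tuple[FrozenSet[str], str]] = [
    (frozenset({"finding_a", "finding_b", "finding_c"}),
     "Findings A, B and C together suggest condition X."),
    (frozenset({"finding_a", "finding_d"}),
     "Finding A with finding D suggests condition Y."),
    (frozenset({"finding_e", "finding_f"}),
     "Finding E with finding F suggests condition Z."),
]

def retrieve(query_entities: Set[str], top_k: int = 2) -> List[str]:
    """Return descriptions of the hyperedges sharing the most entities with the query."""
    scored = [(len(entities & query_entities), text)
              for entities, text in HYPEREDGES]
    matched = [item for item in scored if item[0] > 0]
    matched.sort(key=lambda item: -item[0])  # stable sort keeps original order on ties
    return [text for _, text in matched[:top_k]]

def build_prompt(question: str, query_entities: Set[str]) -> str:
    """Condition the question on the retrieved multi-way relations."""
    evidence = retrieve(query_entities)
    return "Evidence:\n" + "\n".join(f"- {e}" for e in evidence) + f"\nQuestion: {question}"

hits = retrieve({"finding_a", "finding_b"})
prompt = build_prompt("Which condition is most likely?", {"finding_a", "finding_b"})
```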

7. Limitations, Safety, and Future Directions

While LLM-driven diagnostics provide increased transparency, modularity, and reasoning flexibility, several limitations persist:

Reliance on prompt context for adaptation is insufficient for nonstationary environments; persistent or fine-tuned memory mechanisms are needed (Lee et al., 27 Sep 2025).
LLMs tend to reason over statistical cues rather than domain physics, which may yield plausible but false explanations and higher rates of false positives in certain operational regimes.
In large multi-module pipelines, causal diagnostic signals do not necessarily indicate safe or effective patching targets due to pervasive co-adaptation (“diagnostic paradox”): modules absorbing upstream errors may be harmed by direct prompt-level correction, necessitating co-adaptation measurement for safe intervention planning (Jeonghun et al., 21 May 2026).
Scalability and performance issues in graph-integrated or high-frequency environments require efficient graph reasoning engines and validation protocols.
Prompt-based artifact summarization and explanation can suffer from LLM hallucinations; grounding and faithfulness checks are crucial, especially for safety-critical deployment.
Security vulnerabilities, including prompt injection, necessitate dynamic causal diagnostics and boundary-local purification to maintain both robustness and user-utility (Zhang et al., 26 Feb 2026).

Ongoing research recommends hybrid architectures, integrating rule-based triggers for coarse alerts, LLM-based natural language rationale, structured context augmentation, and human-in-the-loop verification for critical or high-stakes workflows.
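Such a hybrid pipeline might be arranged as in the sketch below: a cheap range rule raises the coarse alert, a language model is asked for a rationale only when the rule fires, and alerts on critical sensors are flagged for human review. The function names, the expected range, and the stubbed `explain` callable are hypothetical; a real deployment would substitute an actual model call and site-specific limits.

```python
from typing import Callable, Dict, Optional, Set, Tuple

def hybrid_diagnose(sensor: str,
                    value: float,
                    bounds: Tuple[float, float],
                    explain: Callable[[str], str],
                    critical_sensors: Set[str]) -> Dict[str, object]:
    """Rule-based trigger first; LLM rationale and review flag only on an alert."""
    low, high = bounds
    if low <= value <= high:
        # No rule violation: skip the model call entirely.
        return {"alert": False, "rationale": None, "needs_human_review": False}
    direction = "above" if value > high else "below"
    observation = f"{sensor}={value} is {direction} the expected range [{low}, {high}]"
    rationale: Optional[str] = explain(observation)
    return {
        "alert": True,
        "rationale": rationale,
        "needs_human_review": sensor in critical_sensors,
    }

# Stub standing in for an LLM call so the sketch runs offline.
stub_explain: Callable[[str], str] = lambda obs: "Possible fault: " + obs

quiet = hybrid_diagnose("airflow", 310.0, (250.0, 400.0), stub_explain, {"airflow"})
alarm = hybrid_diagnose("airflow", 1000.0, (250.0, 400.0), stub_explain, {"airflow"})
```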

Selected References:

"Exploring LLM-based Frameworks for Fault Diagnosis" (Lee et al., 27 Sep 2025)
"Complex System Diagnostics Using a Knowledge Graph-Informed and LLM-Enhanced Framework" (Marandi et al., 27 May 2025)
"Diagnosis Is Not Prescription: Linguistic Co-Adaptation Explains Patching Hazards in LLM Pipelines" (Jeonghun et al., 21 May 2026)
"Zodiac: A Cardiologist-Level LLM Framework for Multi-Agent Diagnostics" (Zhou et al., 2024)
"DiagECG: An LLM-Driven Framework for Diagnostic Reasoning via Discretized ECG Tokenization" (Yang et al., 21 Aug 2025)
"Hyper-RAG: Combating LLM Hallucinations using Hypergraph-Driven Retrieval-Augmented Generation" (Feng et al., 30 Mar 2025)
"Hubble: An LLM-Driven Agentic Framework for Safe, Diverse, and Reproducible Alpha Factor Discovery" (Shi et al., 9 Mar 2026)
"When LLM Reward Design Fails: Diagnostic-Driven Refinement for Sparse Structured RL" (Wang et al., 27 May 2026)
"AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification" (Zhang et al., 26 Feb 2026)