LLM-Driven Decision Support
- LLM-driven decision support is a framework integrating advanced language models with human oversight, ensuring transparent and auditable decision workflows.
- It applies to diverse fields like clinical diagnostics, logistics, and pricing, where hybrid human–AI interaction mitigates risks and reduces cognitive load.
- Techniques such as modular pipelines, Bayesian-inspired prompting, and retrieval-augmented analysis deliver actionable insights for complex decision problems.
LLM-driven decision support refers to a class of systems and frameworks that use large language models (LLMs) to aid, augment, or structure human decision-making across diverse domains. These systems range from hybrid human–AI clinical pipelines and modular explainable agent architectures to retrieval-augmented reasoning and personalized decision analytics. They are characterized by the integration of advanced language modeling with transparent workflows, auditability, and collaborative human oversight.
1. Motivation and Foundational Principles
LLM-driven decision support systems address domains where human decision-makers face information overload or time pressure, or must synthesize complex, heterogeneous data at an expert level. In high-stakes settings (e.g., medical diagnosis, industrial safety, strategic planning), full automation is often infeasible or undesirable due to risks such as automation bias, lack of accountability, and the limitations of purely algorithmic predictions. Instead, LLMs are positioned as informed assistants that facilitate, guide, or explain decisions, generally leaving the final choice to a human agent (Banerjee et al., 6 Jul 2025).
A core rationale is that hybrid frameworks can combine the statistical, recall, and reasoning strengths of LLMs with human contextual insight and oversight, thereby reducing errors and improving reliability without over-automating critical judgment (Banerjee et al., 6 Jul 2025).
2. Key Architectural Paradigms
2.1 Hybrid Guidance-Decision Pipelines
A prototypical example is the MedGellan framework for clinical diagnosis, which deploys a two-stage LLM-driven pipeline: (i) guidance generation by an LLM, producing structured, temporally coherent evidentiary summaries; (ii) downstream diagnosis by a physician exposed only to the guidance, not to raw records. This division reduces cognitive load and preserves human agency while raising recall and F₁ scores in simulated-physician experiments (Banerjee et al., 6 Jul 2025).
The workflow employs Bayesian-inspired prompting, staging inference as a sequence of prior hypotheses, likelihood adjustments, and posterior confidences, so that each new piece of evidence updates the conclusions reached from earlier evidence. This preserves auditability and temporal coherence compared to naïve input concatenation (Banerjee et al., 6 Jul 2025).
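The staged update pattern can be illustrated numerically. The sketch below is a plain Bayesian analogue, not MedGellan's implementation: in the framework the update is carried out through prompts, whereas here the hypotheses (`condition_a`, `condition_b`) and likelihood values are hypothetical numbers chosen only to show how each step reuses the previous posterior as its prior and leaves an inspectable trace.

```python
from typing import Dict, List


def bayes_update(prior: Dict[str, float], likelihood: Dict[str, float]) -> Dict[str, float]:
    """Return the normalized posterior, proportional to likelihood * prior."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("evidence has zero probability under every hypothesis")
    return {h: v / total for h, v in unnormalized.items()}


def staged_inference(
    prior: Dict[str, float], evidence_likelihoods: List[Dict[str, float]]
) -> List[Dict[str, float]]:
    """Apply updates in temporal order and keep every intermediate belief."""
    trace = [dict(prior)]
    belief = dict(prior)
    for likelihood in evidence_likelihoods:
        # The posterior of the previous step becomes the prior of this step.
        belief = bayes_update(belief, likelihood)
        trace.append(belief)
    return trace


# Hypothetical hypotheses and likelihoods, for illustration only.
prior = {"condition_a": 0.5, "condition_b": 0.5}
evidence = [
    {"condition_a": 0.8, "condition_b": 0.4},  # earliest observation
    {"condition_a": 0.6, "condition_b": 0.3},  # later observation
]
trace = staged_inference(prior, evidence)
for step, belief in enumerate(trace):
    print(step, {h: round(p, 3) for h, p in belief.items()})
```

Keeping the full `trace` rather than only the final belief mirrors the auditability argument: a reviewer can see which observation moved the estimate and by how much.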
2.2 Modular Explainable Agent Pipelines
Pehlke and Jansen's modular LLM-agent architecture instantiates decision support as a configurable pipeline of agents orchestrating classical frameworks: Vester’s Sensitivity Model for systemic factor analysis, normal-form games for competitive equilibria, and tree-based extensive-form games for sequential planning. Each module—intake, variable generation, core relation construction, deterministic analysis, and summarization—is swappable, and intermediate artifacts (e.g., signed impact matrices, strategy tables) are fully exposed for audit and downstream human intervention (Pehlke et al., 10 Nov 2025).
Experiments in real-world logistics show mean factor alignment with human benchmarks of 55–63%, role agreement over matched factors of 57%, and LLM-agent rubric scores (92.97/100) on par with human expert baselines, supporting the pipeline's traceability and its fidelity to expert workflows (Pehlke et al., 10 Nov 2025).
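To make the deterministic analysis step concrete, the following sketch derives factor roles from a signed impact matrix in the style of Vester's Sensitivity Model. It is an illustration under stated assumptions, not the authors' code: the factor names and matrix entries are hypothetical, and classifying a factor as "high" when its sum exceeds the mean is a simplification chosen here.

```python
from typing import Dict, List


def sensitivity_roles(factors: List[str], impact: List[List[int]]) -> Dict[str, dict]:
    """Classify factors from a signed impact matrix.

    impact[i][j] is the signed strength of factor i's influence on factor j.
    The active sum is the row sum of absolute strengths (how much a factor
    drives others); the passive sum is the column sum (how much it is driven).
    """
    n = len(factors)
    if len(impact) != n or any(len(row) != n for row in impact):
        raise ValueError("impact matrix must be square and match the factor list")
    active = [sum(abs(impact[i][j]) for j in range(n) if j != i) for i in range(n)]
    passive = [sum(abs(impact[i][j]) for i in range(n) if i != j) for j in range(n)]
    # Assumption: a sum counts as "high" when it exceeds the mean over all factors.
    mean_active = sum(active) / n
    mean_passive = sum(passive) / n
    roles = {}
    for k, name in enumerate(factors):
        high_active = active[k] > mean_active
        high_passive = passive[k] > mean_passive
        if high_active and high_passive:
            role = "critical"
        elif high_active:
            role = "active"
        elif high_passive:
            role = "reactive"
        else:
            role = "buffering"
        roles[name] = {"active_sum": active[k], "passive_sum": passive[k], "role": role}
    return roles


# Hypothetical logistics factors and influence strengths.
factors = ["fuel_cost", "route_length", "delivery_delay", "customer_satisfaction"]
impact = [
    [0, 0, 1, 0],
    [2, 0, 3, 0],
    [0, 1, 0, -3],
    [0, 0, 0, 0],
]
roles = sensitivity_roles(factors, impact)
for name, info in roles.items():
    print(name, info)
```

Because this step is deterministic, a domain expert can edit a single matrix entry and rerun it to see how the role assignment changes, which is the kind of intervention the exposed artifacts are meant to allow.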
2.3 Sample-Average-Approximation with LLM-Generated Distributions
LLM-driven Sample-Average-Approximation (LLM-SAA) is a formal strategy for stochastic optimization: LLMs are prompted to simulate outcome distributions (e.g., demand, consumer preferences); decisions are then optimized under these synthetic supports. Empirical results in assortment, pricing, and newsvendor settings show that LLM-generated distributions are especially effective in low-data regimes, achieving competitive ratios matching or beating empirical approaches when samples are few (Baek et al., 6 Feb 2026). Persona-sampling, though not matching individual-level data, increases decision quality by diversifying the support and capturing salient substitution/interaction effects.
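A minimal newsvendor sketch shows the optimization half of this idea. The demand samples are hard-coded in a hypothetical stub (`stub_llm_demand_samples`) that stands in for whatever an LLM would return when prompted to simulate demand; the prices and costs are likewise illustrative, and the code makes no claim about how the cited study elicits or post-processes samples.

```python
from typing import List, Sequence


def newsvendor_profit(order_qty: float, demand: float, price: float, cost: float) -> float:
    """Profit when ordering order_qty units and selling min(order_qty, demand)."""
    return price * min(order_qty, demand) - cost * order_qty


def saa_order_quantity(samples: Sequence[float], price: float, cost: float) -> float:
    """Pick the order quantity that maximizes average profit over the samples.

    The sample-average objective is piecewise linear and concave in the order
    quantity, with kinks only at sampled demand values, so it is enough to
    evaluate those values as candidates.
    """
    if not samples:
        raise ValueError("at least one demand sample is required")

    def average_profit(q: float) -> float:
        return sum(newsvendor_profit(q, d, price, cost) for d in samples) / len(samples)

    return max(sorted(set(samples)), key=average_profit)


def stub_llm_demand_samples() -> List[int]:
    """Hypothetical stand-in for demand values simulated by an LLM."""
    return [80, 90, 100, 100, 110, 120, 130, 150]


samples = stub_llm_demand_samples()
best_q = saa_order_quantity(samples, price=10.0, cost=4.0)
print("order quantity:", best_q)
```

With a selling price of 10 and a unit cost of 4, the chosen quantity coincides with the smallest sample whose empirical cumulative frequency reaches the critical ratio (10 − 4) / 10 = 0.6, which is the classical newsvendor solution applied to the synthetic support.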
2.4 Multi-Criteria and Retrieval-Augmented Approaches
LLM-based Multi-Criteria Decision Analysis (MCDA) systems, such as AHP+LLM frameworks, automate criteria elicitation, pairwise weighting, and alternative ranking, with LLM virtual experts coordinating under procedural orchestration. These systems report acceptably consistent pairwise judgments (consistency ratio CR < 0.1), strong alignment with domain-expert consensus, and dramatic savings in cost and time (Svoboda et al., 2024).
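The CR < 0.1 criterion comes from the standard Analytic Hierarchy Process (AHP) and can be checked in a few lines. In the sketch below the pairwise comparison matrix is hard-coded and hypothetical; in an AHP+LLM system those judgments would come from the LLM virtual experts. Weights are computed with the geometric-mean approximation, which is one common choice and not necessarily what the cited framework uses, and the random-index values are Saaty's commonly tabulated ones.

```python
import math
from typing import List

# Saaty's random consistency index for matrices of size 3 to 9.
RANDOM_INDEX = {3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}


def ahp_weights(matrix: List[List[float]]) -> List[float]:
    """Approximate priority weights by normalizing the row geometric means."""
    n = len(matrix)
    geometric_means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(geometric_means)
    return [g / total for g in geometric_means]


def consistency_ratio(matrix: List[List[float]], weights: List[float]) -> float:
    """Return CR = CI / RI, with lambda_max estimated from (A w)_i / w_i."""
    n = len(matrix)
    if n not in RANDOM_INDEX:
        raise ValueError("consistency ratio is tabulated here only for 3 <= n <= 9")
    weighted = [sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(weighted[i] / weights[i] for i in range(n)) / n
    consistency_index = (lambda_max - n) / (n - 1)
    return consistency_index / RANDOM_INDEX[n]


# Hypothetical judgments over three criteria: cost, speed, risk.
# Entry [i][j] says how much more important criterion i is than criterion j.
pairwise = [
    [1.0, 3.0, 5.0],
    [1 / 3, 1.0, 2.0],
    [1 / 5, 1 / 2, 1.0],
]
weights = ahp_weights(pairwise)
cr = consistency_ratio(pairwise, weights)
print([round(w, 3) for w in weights], round(cr, 4))
```

A CR above 0.1 would normally trigger a revision of the pairwise judgments, which in an automated pipeline could mean re-prompting the virtual experts for the inconsistent comparisons.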
Retrieval-Augmented Decision Making (RAD) combines LLM-based key-criteria extraction from complex document corpora, hierarchical model induction, explicit AHP-style weighting, and reasoning-chain generation. The system delivers structured, quantitatively weighted, traceable decision reports—outperforming standard RAG frameworks in detail, structure, and rationality (Wu et al., 24 May 2025).
3. Methods for Interpretability, Auditability, and Trustworthiness
LLM-driven decision support systems deploy various techniques to ensure transparency, interpretability, and robustness:
- Intermediate reasoning traces: Modular pipelines externalize artifacts (e.g., impact matrices, game trees, criteria hierarchies) for human inspection and intervention (Pehlke et al., 10 Nov 2025, Wu et al., 24 May 2025).
- Explicit, auditable code artifacts: Rule-based frameworks, such as Medical Heuristic Learning (MHL), synthesize deterministic Python decision rules, versioned and comment-annotated, producing highly interpretable and fully auditable models suitable for continual drift adaptation (Xu et al., 15 Jun 2026).
- Structured explanations and guidance: Systems like MedGellan instruct the LLM to summarize prior hypotheses, likelihood adjustments, and posterior confidences explicitly, rather than outputting opaque labels (Banerjee et al., 6 Jul 2025).
- Consensus and ensemble reasoning: Psychiatric decision support systems integrate the outputs of multiple fine-tuned LLMs using ensemble aggregation and reasoning LLMs to yield robust, justified, and standardized diagnostic outputs, supporting audit trails and responsible AI principles (Bandara et al., 29 Oct 2025, Bandara et al., 20 Apr 2026).
- Task-adaptive expert selection: Task-aware LLM Council (TALC) dynamically routes queries to the most contextually appropriate model in a council, leveraging structured success memory profiles and dual-signal MCTS planning to optimize response selection and planning depth (Zhu et al., 30 Jan 2026).
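The consensus idea in the list above can be sketched as a small aggregation step that keeps an audit trail. Everything here is an assumption made for illustration: the cited systems do not specify this rule, and the strict-majority vote, the `needs_human_review` fallback, the model names, and the labels are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class ModelOpinion:
    model_name: str
    label: str
    rationale: str


@dataclass
class ConsensusResult:
    decision: str
    plurality_label: str
    agreement: float
    supporting: List[ModelOpinion]
    dissenting: List[ModelOpinion]


def aggregate(opinions: List[ModelOpinion], min_agreement: float = 0.5) -> ConsensusResult:
    """Majority vote that escalates to a human when agreement is too low.

    Supporting and dissenting opinions are returned with their rationales so
    a reviewer can audit why the ensemble reached its output.
    """
    if not opinions:
        raise ValueError("at least one model opinion is required")
    counts = Counter(o.label for o in opinions)
    plurality_label, votes = counts.most_common(1)[0]
    agreement = votes / len(opinions)
    decision = plurality_label if agreement > min_agreement else "needs_human_review"
    return ConsensusResult(
        decision=decision,
        plurality_label=plurality_label,
        agreement=agreement,
        supporting=[o for o in opinions if o.label == plurality_label],
        dissenting=[o for o in opinions if o.label != plurality_label],
    )


# Hypothetical model outputs.
opinions = [
    ModelOpinion("model_1", "label_a", "symptom pattern matches criteria for A"),
    ModelOpinion("model_2", "label_a", "duration threshold for A is met"),
    ModelOpinion("model_3", "label_b", "onset history points to B"),
]
result = aggregate(opinions)
print(result.decision, round(result.agreement, 2), [o.model_name for o in result.dissenting])
```

Returning the dissenting rationales alongside the decision, instead of discarding them, is what lets the human reviewer see where the models disagreed.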
4. Determinants and Human Factors in LLM-Assisted Decision Making
The efficacy of LLM-driven decision support depends on three groups of determinants:
- Technological factors: LLM capabilities (reasoning depth, transparency, prompt design), trustworthiness, and the application context are critical. Transparency and prompt engineering modulate both trust and error calibration (Eigner et al., 2024).
- Psychological factors: User trust, mental models, information-processing styles, metacognition, and emotion all influence reliance on LLM-generated guidance. Over- or under-reliance can be mitigated by transparency, interactive explanation, and mental model building tools (Eigner et al., 2024).
- Task-specific factors: Decision complexity, reversibility, personal significance, and accountability requirements shape the level of analytic versus heuristic engagement with LLM outputs. Structurally, decision quality can be modeled as a function of the probability that the user relies on the LLM (Eigner et al., 2024).
Concrete application scenarios, such as medical diagnosis and marketing strategy, illustrate the sensitivity of human–LLM interaction outcomes to these determinants, informing practical design principles for adaptive, trust-calibrated interfaces (Eigner et al., 2024).
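One simple way to see why calibrated reliance matters is a toy expected-accuracy model. This is an assumption introduced here for illustration and is not the formulation in Eigner et al.: the user adopts the LLM's answer with one probability when the LLM is right and another when it is wrong, otherwise decides alone, and the user's unaided accuracy is treated as independent of whether the LLM is correct. All numbers are hypothetical.

```python
def expected_decision_accuracy(
    acc_llm: float, acc_human: float, rely_when_right: float, rely_when_wrong: float
) -> float:
    """Expected accuracy of the final human decision under a toy reliance model.

    If the LLM is right (probability acc_llm), the user either adopts its
    answer or decides alone with accuracy acc_human. If the LLM is wrong,
    adopting its answer is always an error, so only the unaided share can
    be correct.
    """
    for value in (acc_llm, acc_human, rely_when_right, rely_when_wrong):
        if not 0.0 <= value <= 1.0:
            raise ValueError("all arguments must be probabilities in [0, 1]")
    correct_when_llm_right = rely_when_right + (1.0 - rely_when_right) * acc_human
    correct_when_llm_wrong = (1.0 - rely_when_wrong) * acc_human
    return acc_llm * correct_when_llm_right + (1.0 - acc_llm) * correct_when_llm_wrong


# Hypothetical accuracies: LLM 0.8, unaided human 0.7.
blind = expected_decision_accuracy(0.8, 0.7, rely_when_right=1.0, rely_when_wrong=1.0)
ignore = expected_decision_accuracy(0.8, 0.7, rely_when_right=0.0, rely_when_wrong=0.0)
calibrated = expected_decision_accuracy(0.8, 0.7, rely_when_right=0.9, rely_when_wrong=0.3)
print(round(blind, 3), round(ignore, 3), round(calibrated, 3))
```

Under these assumptions, blind reliance caps the outcome at the LLM's accuracy and ignoring the LLM caps it at the human's, while selective reliance can exceed both. That is the intuition behind interfaces that help users tell when the model is likely to be wrong.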
5. Specialized and Domain-Specific Applications
LLM-driven decision support is deployed in a breadth of verticals:
- Clinical and Healthcare Decision Support: Beyond MedGellan, systems such as Cost-Aware Prediction (CAP) combine machine learning risk prediction, Clinical Impact Projection curves for cost/benefit visualization, and LLM agents for patient-specific cost–benefit explanations, enhancing interpretability and aligning outputs with clinical decision-making requirements (Yu et al., 19 Nov 2025).
- Industrial Knowledge Management: Chat-ISV constructs a knowledge graph from fragmented industrial literature, combines LLM-aided extraction and multi-agent routing, and delivers traceable, evidence-backed answers in environmental and pollution-control domains, achieving 96.93% precision, 72.63% recall (F₁=0.83) in blind expert evaluations (Su et al., 26 May 2026).
- Automated Rule Synthesis: Adopting a "learning-beyond-gradients" approach, MHL iteratively refines clinical decision rules using LLMs to integrate statistical and medical knowledge probes, supporting continual adaptation under data drift and feature evolution in real-world practice (Xu et al., 15 Jun 2026).
- Privacy-Preserving Deployments: On-device, zero-egress psychiatric AI architectures coordinate a consortium of fine-tuned, quantized LLMs under orchestration and consensus protocols; this enables DSM-5-aligned decisions without external data transfer, meeting privacy and latency constraints in settings such as military or correctional health (Bandara et al., 20 Apr 2026).
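As a consistency check on the Chat-ISV figures quoted above, the reported F₁ follows from the reported precision P and recall R through the harmonic-mean definition:

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.9693 \times 0.7263}{0.9693 + 0.7263}
    \approx \frac{1.4080}{1.6956}
    \approx 0.830
```

The gap between precision and recall indicates that the answers judged were almost always correct when given, but that roughly a quarter of the relevant items were missed.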
6. Evaluation Practices and Impact Assessment
Rigorous evaluation of LLM-driven decision support focuses on both decision quality and systemic transparency:
- Quantitative metrics: Precision, recall, F₁, ROC curves, competitive ratios (decision-focused), and regret (optimality gap) are standard metrics across domains (Banerjee et al., 6 Jul 2025, Baek et al., 6 Feb 2026, Yu et al., 19 Nov 2025).
- Decision-focused vs. distributional metrics: Studies show that traditional statistical distances (e.g., Wasserstein/Kolmogorov) can mislead; decision-focused metrics (e.g., competitive ratio, regret) are more indicative of downstream performance (Baek et al., 6 Feb 2026).
- Human-in-the-loop/user studies: LLM recommendation systems in hybrid workplaces and decision analysis frameworks report both controlled experiments and real-world user studies on factors such as decision confidence, convenience, satisfaction, and explanation relevance (Kim et al., 2024, Li et al., 17 Feb 2025).
- Traceability and correction: Modular and auditable pipelines have empowered domain experts to inspect and revise model outputs, leading to improved intervention strategies and better risk calibration (Pehlke et al., 10 Nov 2025, Wu et al., 24 May 2025).
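The metrics named in this section are straightforward to compute. The sketch below uses the standard definitions of precision, recall, and F₁ from confusion counts. For the decision-focused metrics it assumes a maximization problem, with regret taken as the optimal value minus the achieved value and competitive ratio as achieved over optimal; the cited studies may define or normalize these differently, and the example numbers are hypothetical.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Standard classification metrics from true/false positives and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def regret(optimal_value: float, achieved_value: float) -> float:
    """Optimality gap for a maximization problem (assumed definition)."""
    return optimal_value - achieved_value


def competitive_ratio(optimal_value: float, achieved_value: float) -> float:
    """Achieved value as a fraction of the optimum (assumed definition)."""
    if optimal_value <= 0:
        raise ValueError("optimal_value must be positive for a meaningful ratio")
    return achieved_value / optimal_value


# Hypothetical counts and objective values.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(round(p, 3), round(r, 3), round(f1, 3))
print(regret(600.0, 570.0), competitive_ratio(600.0, 570.0))
```

Regret and competitive ratio are evaluated on the objective of the downstream decision, which is why they can rank methods differently from a distance computed between predicted and true distributions.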
7. Limitations, Open Challenges, and Future Directions
Despite demonstrated impact, several challenges remain:
- Simulation vs. real-world validation: Many studies, MedGellan among them, still simulate the human user with an LLM instead of involving actual expert practitioners; field deployment is essential for assessing usability, trust, and decision impact (Banerjee et al., 6 Jul 2025).
- Scope of inputs and generalization: Current pipelines may only process textual or tabular data, omitting raw images or multimodal signals. Extending frameworks to support vision–language interaction and dynamic, multi-turn querying is an open area (Banerjee et al., 6 Jul 2025).
- Transparency and explainability gaps: While modularization and artifact exposure improve auditability, LLMs can still produce overconfident, hallucinated, or insufficiently grounded outputs in speculative or unusual scenarios (Yu et al., 19 Nov 2025, Su et al., 26 May 2026).
- Human–AI system design: Calibrating trust, supporting diverse cognitive styles, and preventing overreliance require interfaces that offer adaptive explanation density, provenance tracking, second-opinion modules, and opportunities for counterfactual exploration (Eigner et al., 2024).
- Decision-making under uncertainty: Direct LLM prompting is error-prone for complex stochastic environments; systematic approaches such as DeLLMa enforce classic enumerative and utility-maximization reasoning, yielding up to 40% improvement in accuracy over simpler methods (Liu et al., 2024).
- Bias, fairness, and transparency of objectives: Adaptive recommendation and explanation algorithms risk being weaponized for adversarial manipulation. Explicit communication of optimization criteria and access-controlled audit trails are necessary safeguards (Li et al., 17 Feb 2025).
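The enumerative, utility-maximizing reasoning mentioned under decision-making under uncertainty reduces, at its core, to a classical expected-utility calculation. The sketch below shows only that final step with hand-set numbers. It is not DeLLMa's pipeline, in which states, beliefs, and utilities are elicited with the help of an LLM; the state names, probabilities, and payoffs are hypothetical.

```python
import math
from typing import Dict, Tuple


def best_action(
    state_probs: Dict[str, float], utilities: Dict[str, Dict[str, float]]
) -> Tuple[str, Dict[str, float]]:
    """Enumerate actions and return the one with the highest expected utility.

    utilities[action][state] is the payoff of taking `action` when `state` occurs.
    """
    if not math.isclose(sum(state_probs.values()), 1.0, abs_tol=1e-9):
        raise ValueError("state probabilities must sum to 1")
    expected = {
        action: sum(state_probs[s] * payoff[s] for s in state_probs)
        for action, payoff in utilities.items()
    }
    return max(expected, key=expected.get), expected


# Hypothetical states, beliefs, and payoffs.
state_probs = {"high_demand": 0.3, "normal_demand": 0.5, "low_demand": 0.2}
utilities = {
    "expand": {"high_demand": 100.0, "normal_demand": 40.0, "low_demand": -60.0},
    "hold": {"high_demand": 50.0, "normal_demand": 30.0, "low_demand": 10.0},
}
choice, expected = best_action(state_probs, utilities)
print(choice, {a: round(v, 2) for a, v in expected.items()})
```

Making the states, probabilities, and utilities explicit is what allows a reviewer to challenge an individual number instead of an opaque recommendation.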
Forward-looking research targets include: deeper multimodal frameworks, continual learning with explicit rule revision, scalable knowledge graph integration, on-device privacy by design, and robust model selection through context-aware routing and success memory (Bandara et al., 20 Apr 2026, Zhu et al., 30 Jan 2026, Xu et al., 15 Jun 2026). Human-in-the-loop trials are critical for empirically grounding the impact of LLM-driven pipelines on real-world decision-making outcomes.
Representative Citations
- MedGellan: LLM-generated medical guidance for clinical decision support (Banerjee et al., 6 Jul 2025)
- Modular LLM-agent explainable decision-support pipeline (Pehlke et al., 10 Nov 2025)
- Evaluating LLM-generated persona distributions for operational decision-making (Baek et al., 6 Feb 2026)
- Determinants of LLM-assisted decision-making (Eigner et al., 2024)
- Hybrid LLM/MCDA frameworks for multi-criteria analysis (Svoboda et al., 2024)
- Medical Heuristic Learning: interpretable LLM-driven clinical rules (Xu et al., 15 Jun 2026)
- Task-aware LLM council with adaptive decision pathways (Zhu et al., 30 Jan 2026)
- Retrieval-augmented structured decision frameworks (Wu et al., 24 May 2025)
- On-device, privacy-preserving mental health decision support (Bandara et al., 20 Apr 2026)
- Adaptive LLM-powered human–AI reliance (Li et al., 17 Feb 2025)