
Human-AI Hybrid Delphi

Updated 14 August 2025
  • Human-AI Hybrid Delphi is a structured consensus framework that blends evidence-based AI synthesis with seasoned human expert reasoning for contextual guidance.
  • The framework employs a three-phase methodology—retrospective replication, prospective comparison, and applied deployment—achieving 95% alignment with published benchmarks and over 90% consensus coverage in applied panels.
  • By coordinating AI evidence with facilitator-led expert input, it captures both quantitative consensus and qualitative divergence for transparent and actionable outcomes.

A Human-AI Hybrid Delphi (HAH-Delphi) system is a structured framework for expert consensus development that integrates generative AI models with panels of highly experienced human experts, governed by rigorous facilitation and categorization protocols. The HAH-Delphi approach is specifically designed to address the limitations of traditional Delphi and related consensus mechanisms in the context of rapidly increasing information overload, evidence fragmentation, and the need for context-sensitive, nuanced guidance in complex domains (Speed et al., 12 Aug 2025). By leveraging both literature-grounded artificial synthesis and rich expert reasoning in a cyclical, categorized process, the HAH-Delphi framework enables scalable, high-quality consensus outputs that retain conditional and experiential nuance across a variety of real-world applications.

1. Hybrid Framework Structure

The HAH-Delphi model is architected around the interaction of three core components: a generative AI model (Gemini 2.5 Pro), compact panels of senior human experts, and a structured facilitation role. The AI component is constrained to synthesize evidence only from a transparent, publicly available corpus of domain-relevant guidelines, scientific papers, and protocols. This ensures that AI outputs are traceable, literature-based, and aligned with recognized standards. Expert panelists, selected for extensive applied experience in the relevant field, contribute context-sensitive, experiential, and pragmatic reasoning. Human interaction is further structured by a facilitator who curates the evidence corpus, designs and sequences prompts, centralizes alignments, manages expert input (both numerical and qualitative), and applies a multi-category consensus and divergence taxonomy. The result is a hybrid system that situates AI as an evidence-synthesizer and benchmark, with human experts providing the necessary contextualization, justification, and fine-grained judgment absent in current AI models.

2. Methodological Phases and Workflow

Evaluation of the HAH-Delphi approach proceeded in three methodologically distinct phases:

  • Phase I (Retrospective Replication): The framework was tasked with reconstructing item-level consensus from six previously published expert consensus studies spanning domains such as insomnia, sedentary behavior, concussion, low back pain, rotator cuff disorders, and hypertension. Gemini achieved 95% alignment with 40 published benchmark items, demonstrating high-fidelity replication from a literature-constrained input (Speed et al., 12 Aug 2025).
  • Phase II (Prospective Comparison): The same Delphi questions from a chronic insomnia expert consensus were presented in parallel to six independent senior sleep experts and to Gemini. Gemini’s ratings agreed directionally with the human majority 95% of the time; however, its rationales lacked the pragmatic, temporal, and experiential nuance of the human experts’, underscoring the non-substitutable role of applied expert human reasoning.
  • Phase III (Applied Deployment): In two applied sports science domains (endurance training and resistance/mixed training), the framework administered questionnaires of 140+ items to panels of six senior experts each (with a control group of less-experienced professionals). Both domains achieved >90% consensus coverage (92.3% in endurance) and reached thematic saturation—i.e., all seven key reasoning categories were invoked—before the final expert’s input. This result confirmed the sufficiency of small, well-structured panels when augmented by AI scaffolding.
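For concreteness, the Phase II agreement metric can be sketched as follows. The paper does not publish its computation; treating "directional agreement" as both ratings falling on the same side of the Likert midpoint, and taking the human majority as the panel median, are illustrative assumptions.

```python
from statistics import median

def same_direction(a: float, b: float, midpoint: float = 3.0) -> bool:
    """True if two Likert ratings fall on the same side of the scale midpoint."""
    return (a - midpoint) * (b - midpoint) > 0 or a == b == midpoint

def directional_agreement(ai_ratings, panel_ratings, midpoint=3.0):
    """Fraction of items where the AI rating agrees in direction
    with the human-panel majority (here: the median rating)."""
    hits = 0
    for ai, humans in zip(ai_ratings, panel_ratings):
        if same_direction(ai, median(humans), midpoint):
            hits += 1
    return hits / len(ai_ratings)

# Toy data on a 1-5 Likert scale (illustrative only)
ai = [4, 2, 5, 4]
panel = [[4, 5, 4], [1, 2, 2], [5, 4, 5], [2, 2, 4]]
print(f"{directional_agreement(ai, panel):.0%}")  # prints 75%
```

In a real study the panel majority and the notion of direction would be fixed in advance by the facilitation protocol; this sketch only shows the shape of the computation.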

3. Performance Metrics and Categorization

Quantitative and qualitative metrics support the validity of HAH-Delphi outputs:

  • Reproducibility: 95% replication of historical expert consensus in phase I.
  • Agreement: 95% directional correspondence between AI and human Likert-scale ratings in the prospective Phase II comparison.
  • Coverage and Saturation: In applied panels, >90% consensus coverage and thematic saturation were observed before all panelists had contributed, meaning all critical reasoning categories had been invoked.
  • Structured Categorization: To avoid interpretive oversimplification and suppression of nuance, the HAH-Delphi model applies a four-category consensus taxonomy—Strong, Conditional, Operational, Divergent—mapped to explicit quantitative thresholds (e.g., ≥75% Likert agreement for “Strong”) and to qualitative convergence of explanatory rationales.
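A minimal sketch of how the taxonomy might be applied to a single item is given below, assuming a Likert-agreement fraction and a facilitator-assigned rationale-convergence flag as inputs. Only the ≥75% threshold for "Strong" comes from the source; the remaining cut-offs and category semantics are illustrative placeholders.

```python
def classify_consensus(agreement: float, rationales_converge: bool) -> str:
    """Map quantitative agreement plus qualitative convergence onto the
    four-category HAH-Delphi taxonomy. Only the >=75% 'Strong' threshold
    is from the source; the other cut-offs are assumptions."""
    if agreement >= 0.75 and rationales_converge:
        return "Strong"
    if agreement >= 0.75:
        # Ratings align but rationales differ by population, phase, or context
        return "Conditional"
    if agreement >= 0.50:          # assumed threshold for practice-level agreement
        return "Operational"
    return "Divergent"
```

The key design point is that the qualitative flag can demote a numerically strong item, so high Likert agreement alone never suppresses divergent reasoning.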

For illustration, the saturation attainment is formalized as:

\[
\forall\, c \in \{\text{Conditional},\ \text{Evidence-Based},\ \text{Experiential},\ \ldots\}\ \ \exists\ \text{an expert whose rationale invokes } c \;\Rightarrow\; \text{saturation achieved}
\]

Once all predefined reasoning categories appear in the expert rationale set, the panel is considered sufficient for comprehensive coverage.

4. Interaction Protocol and Facilitation

The HAH-Delphi workflow is characterized by:

  • Initial AI synthesis: The generative AI provides a provisional rating and justification for each item, constrained to benchmark literature.
  • Expert review and iterative clarification: Human experts rate each item (typically on a Likert scale) and provide detailed, qualitative justifications emphasizing context, conditionality (population, temporal, phased), and experience-based logic.
  • Facilitated reconciliation: The facilitator collates responses, adjudicates disagreements, and applies the consensus taxonomy. Divergent opinions are examined for conditional reconciliation; unresolved disagreements are transparently classified as divergent.
  • Final synthesis and documentation: The methodology supports domain-wide outputs that are context-rich, stratified by consensus strength, and annotated with population or temporal conditions.

This structuring reduces expert burden (by providing AI-generated evidence scaffolding), accelerates thematic saturation, and ensures that divergence is not suppressed but organized into actionable, transparent guidance.

5. Application Domains and Scalability

The framework’s flexibility and scalability were demonstrated in sports performance, exercise prescription, and clinical decision guidance:

  • Compact Panel Sufficiency: High consensus coverage and thematic saturation demonstrated that small (6-member) well-selected panels, supported by structured synthesis, can be sufficient for complex guideline development.
  • Multi-domain Generality: The HAH-Delphi protocol generalizes to any field requiring expert consensus under complex, incomplete, or rapidly evolving evidence conditions.
  • Conditional, Personalized Guidance: The explicit categorization and layered reasoning support the creation of personalized and conditional frameworks, which can be scaled for digital health, performance coaching, guideline drafting, policy development, and telemedicine.

A plausible implication is that in settings with complex, multi-modal evidence and the need for rapid, context-specific guidance, the HAH-Delphi model can serve as a template for scalable, rigorous, and transparent consensus production.

6. Limitations and Strengths

The experimental results indicate that the generative AI component, while highly reliable in evidence-grounded synthesis and directional agreement, remains limited in three areas: pragmatic insight, experiential nuance, and complex temporal or conditional logic. Human expert input is indispensable for capturing these aspects. The principal strengths of the HAH-Delphi approach thus lie in its capacity to blend automation (enabling scale, reproducibility, and speed) with the irreplaceable depth and context sensitivity of human expert reasoned judgments.

Notably, divergence among expert rationales (typically due to differences in emphasis, context, or population applicability) is not suppressed but instead surfaced, categorized, and, where possible, conditionally reconciled. The methodological emphasis on transparency and structured categorization is a direct response to concerns regarding interpretive oversimplification and the suppression of expert nuance in traditional Delphi workflows.

7. Consensus, Saturation, and Future Applications

By operationalizing consensus through both quantitative thresholds (e.g., ≥75% for “Strong Consensus”) and multi-category qualitative analysis, the HAH-Delphi framework accelerates thematic saturation and improves overall reliability. The flexible facilitator-led design and explicit consensus taxonomy make the system inherently adaptable to a broad range of domains beyond those tested in the paper—such as clinical guideline development, risk mitigation protocols, and other areas where real-world complexity precludes simple algorithmic prescription.

This suggests that the HAH-Delphi framework is poised to support not only publication-ready consensus guidelines but also iterative and personalized decision support in fields characterized by high uncertainty, complex conditionality, or rapidly shifting evidence bases.


In summary, Human-AI Hybrid Delphi (HAH-Delphi) is a scalable, structured consensus-generation model that systematically integrates generative AI evidence synthesis with the nuanced, context-sensitive input of expert human panels under rigorous facilitation. Empirical evaluation demonstrates robust performance, reproducibility, and the capacity to rapidly achieve both numerical consensus and thematic saturation without suppressing essential nuance or divergence. This positions the framework as a promising approach for addressing expert consensus requirements in domains where traditional methods are increasingly challenged by scale, complexity, and the need for conditional, context-specific guidance (Speed et al., 12 Aug 2025).

References (1)
