Dual-Agent LLM Framework
- Dual-Agent LLM Framework is a computational architecture that splits tasks between two specialized language model agents for proposing and validating outputs.
- It employs structured communication protocols and multi-round dialogue for iterative refinement and error mitigation in complex tasks.
- Empirical studies demonstrate significant improvements in reasoning, accuracy, and efficiency across applications like community search and code vulnerability detection.
A Dual-Agent LLM Framework is a computational architecture in which two LLM agents perform specialized, interacting roles to collaboratively solve complex tasks. This paradigm supports higher performance, greater reasoning robustness, and improved transparency compared to single-agent approaches, particularly in domains requiring multi-stage reasoning, error mitigation, or auditor–solver interaction (Hua et al., 13 Aug 2025; Jo et al., 18 Feb 2025; Saju et al., 1 Jan 2026).
1. Core Architectural Principles
Dual-agent frameworks partition decision responsibilities between two LLM-based agents with distinct but complementary tasks. Common patterns include:
- Proposer–Validator (Solver–Validator): One agent generates candidate outputs (solutions, predictions, communities), while the second inspects, critiques, scores, and corrects those outputs (Hua et al., 13 Aug 2025, Saju et al., 1 Jan 2026).
- Operator–Supervisor: The first agent (typically low-capacity) performs iterative evidence accumulation; the second (typically high-capacity) agent issues final judgments, abstaining or providing corrective feedback if evidence is insufficient (Jo et al., 18 Feb 2025).
- Planner–Executor: The first agent decomposes a global goal into subtasks, while the second executes or solves subtasks, reporting results and errors (Talebirad et al., 2023).
Agents communicate via structured message passing (natural language, JSON, or fixed schemas), either through direct message exchange or via a shared memory and protocol. Most frameworks include mechanisms for aggregation, arbitration, and decision selection at the conclusion of the agent dialogue.
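As a concrete illustration of the proposer–validator pattern with structured message passing, consider the minimal sketch below. All class, function, and field names (`Message`, `Transcript`, `run_dialogue`, the `score`/`feedback` slots) are assumptions introduced for this sketch, not APIs from the cited systems.

```python
from dataclasses import dataclass, field

@dataclass
class Message:
    """One structured turn in the agent dialogue (illustrative schema)."""
    sender: str    # "proposer" or "validator"
    payload: dict  # e.g. {"proposal": ...} or {"score": ..., "feedback": ...}

@dataclass
class Transcript:
    """Shared memory: an append-only log both agents read and write."""
    turns: list = field(default_factory=list)

def run_dialogue(proposer, validator, task: str, rounds: int = 3) -> dict:
    """Proposer drafts, Validator critiques; repeat for a fixed budget.

    `proposer` and `validator` are assumed callables wrapping LLM prompts;
    each takes the task plus the transcript so far and returns a dict.
    """
    log = Transcript()
    best = {"proposal": None, "score": float("-inf")}
    for _ in range(rounds):
        proposal = proposer(task, log.turns)           # candidate output
        log.turns.append(Message("proposer", proposal))
        review = validator(task, proposal, log.turns)  # score + feedback
        log.turns.append(Message("validator", review))
        if review["score"] > best["score"]:            # arbitration step
            best = {"proposal": proposal, "score": review["score"]}
    return best
```

Keeping the transcript append-only makes every turn auditable and lets the final arbitration step compare all candidates rather than only the last one.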
2. Canonical Dual-Agent Designs
| Task Domain | Agent 1 Role | Agent 2 Role | Reference |
|---|---|---|---|
| Community search in graphs | Solver | Validator | (Hua et al., 13 Aug 2025) |
| Knowledge graph reasoning | Operator | Supervisor | (Jo et al., 18 Feb 2025) |
| Code vulnerability detection | Detector | Validator | (Saju et al., 1 Jan 2026) |
| Software development (decomposition) | Planner | Executor | (Talebirad et al., 2023) |
| Medical consultation (decision workflow) | Inquiry/Policy (meta-agent) | Diagnosis/Adapter (meta-agent) | (Jia et al., 24 May 2025) |
Each design maintains a clear assignment of responsibilities to maximize collaboration, error correction, or evidence aggregation.
3. Interaction Protocols and Feedback Loops
Dual-agent frameworks typically employ multi-round dialogue protocols with explicit memory accumulation and stepwise refinement. Generic protocol elements include:
- Initialization: The first agent proposes an initial output based on the task prompt and input data.
- Iterative Refinement: The second agent evaluates, critiques, and scores the output, sending feedback or explicit modification instructions. The first agent then incorporates this feedback, updates its internal state or proposal, and reissues an improved output. This loop typically iterates for a fixed small number of rounds T, e.g., T = 3 in the case of graph community search (Hua et al., 13 Aug 2025).
- Termination and Arbitration: After T rounds or upon meeting a stability/convergence criterion, a selection module or decision rule aggregates all candidate outputs and their scores, returning the optimal or most robust solution.
Representative pseudocode for dual-agent inference in community search is as follows:

```python
candidates, scores = [], []
feedback = None
for t in range(T):
    # Solver proposes a community given the verbalized graph and prior feedback.
    y_sol = solver.generate(graph_text, instruction, mem_s, feedback)
    candidate = parse_community(y_sol)
    # Validator critiques, scores, and issues feedback on the proposal.
    y_val, score, feedback = validator.evaluate(candidate, graph_text, mem_v)
    candidates.append(candidate)
    scores.append(score)
    update_memories(mem_s, mem_v)
    if is_converged(candidate):
        clear_validator_memory(mem_v)
# Decision module selects the best candidate across all rounds.
final_output = decider.select_best(candidates, scores)
```
In vulnerability detection, each code sample is processed by the Detector, which generates a label and justification; the Validator audits, corrects, or confirms by testing logical and evidential soundness before issuing the final label (Saju et al., 1 Jan 2026).
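A minimal sketch of this audit flow, assuming hypothetical `detector` and `validator` callables that wrap LLM prompts and return parsed JSON; the field names (`label`, `justification`, `verdict`) are illustrative rather than the paper's exact schema:

```python
def classify_with_audit(code_sample: str, detector, validator) -> dict:
    """Detector proposes a label plus justification; Validator audits it.

    `detector` and `validator` are assumed callables wrapping LLM prompts
    that return parsed JSON dicts -- illustrative, not the paper's API.
    """
    # Step 1: Detector proposes a vulnerability label with reasoning.
    report = detector(
        prompt=f"Classify this code for vulnerabilities:\n{code_sample}"
    )  # e.g. {"label": "CWE-79", "justification": "..."}

    # Step 2: Validator audits the report for logical/evidential soundness.
    audit = validator(
        prompt=(
            "Audit the following vulnerability report against the code. "
            "Confirm or correct the label and explain why.\n"
            f"Code:\n{code_sample}\nReport:\n{report}"
        )
    )  # e.g. {"verdict": "confirm" or "correct", "label": ..., "reason": "..."}

    # Step 3: The audited label is final; a correction overrides the Detector.
    final_label = audit["label"] if audit["verdict"] == "correct" else report["label"]
    return {"label": final_label, "detector": report, "audit": audit}
```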
4. Specialized Prompt Schemas and Communication
Dual-agent systems rely heavily on domain-informed prompt templates for consistency, agent role alignment, and structured output:
- Role prompts cue agents ("You are a Solver specializing in community search...").
- Task verbalization encodes graph structure, code diffs, or domain context as input objects (adjacency lists, code deltas, case texts).
- Structured outputs are enforced via fixed schemas (JSON, tables) with slots for predictions, confidence scores, feedback, and corrections (see the sketch after this list).
- Intermediate representation (JSON-objects, chain-of-thought explanations) ensures transparent and auditable information transfer between agents (Saju et al., 1 Jan 2026).
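As an illustration in the community-search setting, a Validator turn might be constrained to a fixed reply shape like the following; the slot names are assumptions made for this sketch, not a schema published in the cited papers.

```python
import json

# Illustrative fixed reply schema for a Validator turn (slot names assumed).
EXAMPLE_VALIDATOR_REPLY = {
    "prediction": [3, 7, 12, 15],   # candidate community as node ids
    "confidence": 0.78,             # self-reported score in [0, 1]
    "feedback": "Node 15 is weakly connected; consider dropping it.",
    "corrections": [{"op": "remove", "node": 15}],
}

REQUIRED_SLOTS = {"prediction", "confidence", "feedback", "corrections"}

def parse_reply(raw: str) -> dict:
    """Parse an agent reply and reject it if any schema slot is missing."""
    reply = json.loads(raw)
    missing = REQUIRED_SLOTS - reply.keys()
    if missing:
        raise ValueError(f"malformed agent reply, missing slots: {missing}")
    return reply
```

Enforcing a fixed shape at parse time is what makes the inter-agent transfer auditable: a reply that omits a slot is rejected rather than silently propagated.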
For reliable reasoning on knowledge graphs, abstracted communication over entities, triples, and relations is maintained, and Supervisor invocation is contingent on evidence thresholds and confidence scores (Jo et al., 18 Feb 2025).
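One way to realize this contingent invocation is sketched below; the threshold value, method names, and return shapes are illustrative assumptions rather than the R2-KG implementation.

```python
CONF_THRESHOLD = 0.9  # illustrative cutoff, not a value from the paper

def answer_with_escalation(question, operator, supervisor):
    """Operator (low-capacity) accumulates evidence over the KG;
    Supervisor (high-capacity) is invoked only when confidence falls short."""
    evidence, draft_answer, confidence = operator.explore(question)
    if confidence >= CONF_THRESHOLD:
        return draft_answer           # cheap path: no Supervisor call
    verdict = supervisor.judge(question, evidence, draft_answer)
    if verdict["sufficient"]:
        return verdict["answer"]      # corrected or confirmed final judgment
    return "ABSTAIN"                  # evidence too weak: bound error propagation
```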
5. Performance Gains, Cost, and Robustness
Empirical studies demonstrate that dual-agent frameworks substantially improve accuracy, stability, and reasoning transparency, often bridging a significant portion of the gap to more expensive or resource-intensive methods.
Key empirical findings include:
- Community search (CS-Agent): F1 improvements of +30 to +62 points over zero-shot LLMs; outperforms simple self-consistency or multi-sample baselines. Validator scores track true F1 within 0.2; best performance at T = 3 dialogue rounds (Hua et al., 13 Aug 2025).
- Knowledge graph reasoning (R2-KG): Dual-agent (Qwen2.5-32B Operator / GPT-4o Supervisor) achieves F1 = 98.3 vs. single-agent KG-GPT F1 = 12.6; Supervisor invoked only as needed, lowering inference costs (Jo et al., 18 Feb 2025).
- Code vulnerability detection: the dual-agent LLM achieves F1 = 0.77 (mean across five CWE categories), substantially exceeding the base LLM (F1 = 0.67) and closely approaching SFT (F1 = 0.80) and RAG (F1 = 0.85), but with minimal training and resource requirements. A paired t-test confirms significance over the baseline (p = 0.0440) (Saju et al., 1 Jan 2026).
- Medical consultation (DDO): Decoupling symptom inquiry (RL/LLM) and diagnosis (LLM adapter) yields +27% to +40% accuracy over direct prompting or single-agent methods, validated across three benchmarks (Jia et al., 24 May 2025).
Dual-agent protocols expose and correct LLM output bias, enforce internal auditing, and, via modular architecture, adapt readily to new tasks or domains. Cost efficiency arises from targeted invocation of higher-capacity models (e.g., only for Supervisor/Validator roles as needed), and abstention mechanisms further enhance result reliability by bounding error propagation (Jo et al., 18 Feb 2025).
6. Generalization and Best-Practice Patterns
Dual-agent LLM frameworks are domain-agnostic and readily adaptable to any task that benefits from multi-step validation, error correction, or explicit division of labor. Generalization strategies include:
- Prompt pair engineering: Crafting detection and audit prompts for domain-specific workflows (e.g., code review, legal analysis, medical triage) (Saju et al., 1 Jan 2026).
- Confidence thresholding: Restricting Validator invocation to ambiguous or low-confidence samples for latency/resource trade-off.
- Structured audits and heuristics: Mandating checks for consistency, contradicted outputs, domain keyword presence, and explicit reference to task-relevant entities or variables.
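A minimal sketch of such a heuristic audit gate, with checks chosen for illustration rather than taken from any cited system:

```python
def heuristic_audit(output: str, required_keywords: set[str],
                    prior_outputs: list[str]) -> list[str]:
    """Cheap structural checks run before (or instead of) a full LLM audit.

    The specific checks are illustrative examples of the pattern.
    """
    issues = []
    # Domain keyword presence: the output must ground itself in the task.
    missing = {kw for kw in required_keywords if kw.lower() not in output.lower()}
    if missing:
        issues.append(f"missing required keywords: {sorted(missing)}")
    # Self-consistency: flag outputs that flip a verdict made in an earlier round.
    if any(("vulnerable" in p.lower()) != ("vulnerable" in output.lower())
           for p in prior_outputs):
        issues.append("verdict contradicts an earlier round")
    return issues  # empty list = passes the heuristic gate
```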
These patterns enforce accountability, robustness, and explicitness in LLM outputs, while incurring inference cost that grows only linearly with the number of agent turns (a handful of single-pass inferences per sample) and avoiding costly fine-tuning or retrieval infrastructure (Saju et al., 1 Jan 2026).
7. Limitations and Future Directions
Despite their advantages, dual-agent frameworks share common limitations:
- Coverage/saturation trade-offs: Excessive refinement rounds can yield diminishing or negative returns; the optimal T is typically small (Hua et al., 13 Aug 2025).
- Feedback sparsity: Validators or Supervisors may offer limited guidance if intermediate representations are incomplete or missing key evidence (Jo et al., 18 Feb 2025).
- Error correlation: Inadequate role isolation or weak audit prompts can propagate systematic LLM errors across both agents.
- No completeness guarantees: In search and generation tasks, alternative valid outputs may be omitted, as dual-agent systems remain bounded by their evidence space and heuristic communication rules (Jo et al., 18 Feb 2025).
Potential extensions include intermediate Supervisor feedback (Jo et al., 18 Feb 2025), adaptive confidence estimation for dynamic abstention, multi-agent generalization beyond two agents (Talebirad et al., 2023), and integration with full symbolic or retrieval-augmented verification backends.
References:
- "CS-Agent: LLM-based Community Search via Dual-agent Collaboration" (Hua et al., 13 Aug 2025)
- "R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs" (Jo et al., 18 Feb 2025)
- "An Empirical Evaluation of LLM-Based Approaches for Code Vulnerability Detection: RAG, SFT, and Dual-Agent Systems" (Saju et al., 1 Jan 2026)
- "Multi-Agent Collaboration: Harnessing the Power of Intelligent LLM Agents" (Talebirad et al., 2023)
- "DDO: Dual-Decision Optimization via Multi-Agent Collaboration for LLM-Based Medical Consultation" (Jia et al., 24 May 2025)