Agent-Agent Conversations: Dynamics & Protocols

Updated 4 July 2026
  • Agent-Agent Conversations (AxA) are frameworks where autonomous agents communicate directly through structured protocols while preserving private states and distinct identities.
  • They integrate formal protocol models with dynamic game-theoretic interactions to orchestrate tool usage, maintain role consistency, and mitigate risks like echoing and security breaches.
  • Applications span collaborative reasoning, clinical diagnosis, and operational safety, demonstrating improved performance metrics and actionable oversight mechanisms.

Agent-Agent Conversations (AxA) denote settings in which autonomous agents communicate directly with one another rather than through a continuously supervising human. In contemporary LLM-based systems, this setting is not merely multi-turn prompting with another model in the loop: agents may have private state, private tools, distinct identities, and potentially misaligned utilities, while interacting only through natural-language messages. Earlier multi-agent systems literature already treated conversations as structured, protocol-governed objects rather than isolated messages; recent work extends that view to LLM agents whose interaction dynamics, safety properties, and failure modes are often not predictable from single-agent evaluation alone (Shekkizhar et al., 12 Nov 2025, Lillis, 2017).

1. Conceptual and formal foundations

Classical multi-agent systems distinguished sharply between a protocol and a conversation. A protocol was a set of rules dictating the format and ordering of messages exchanged during prolonged communication, whereas a conversation was one concrete instance of agents following such a protocol. ACRE internalized this distinction as first-class programming structure, representing messages as $(s,r,c,\phi,p,x)$, protocols as $(\phi,S,T,\iota,F)$, and conversations as $(\phi,A,s,H,c,B,\psi)$, with explicit state, bindings, history, and status. This design made protocol conformance, sender checking, ordering, and exception states such as failed, unmatched, and ambiguous available to the agent runtime rather than left to ad hoc application code (Lillis, 2017).
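The protocol/conversation distinction can be illustrated with a minimal Python sketch. This is not ACRE's implementation: the field names are an assumed reading of the tuple symbols, the status names "active" and "completed" are invented for illustration, and only the "failed" exception state is modeled.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    # Assumed reading of the message tuple: sender, receiver,
    # conversation id, protocol name, performative, content.
    sender: str
    receiver: str
    conversation_id: str
    protocol: str
    performative: str
    content: str


@dataclass
class Protocol:
    # A protocol is a set of rules: here, a small finite-state machine.
    name: str
    initial: str
    finals: frozenset
    transitions: dict  # (state, performative) -> next state


@dataclass
class Conversation:
    # A conversation is one concrete run of a protocol, with its own state.
    protocol: Protocol
    conversation_id: str
    state: str = ""
    history: list = field(default_factory=list)
    status: str = "active"  # hypothetical status names

    def __post_init__(self):
        self.state = self.protocol.initial

    def advance(self, msg):
        """Apply one message; mark the conversation failed if it breaks the ordering."""
        if self.status != "active":
            return self.status
        key = (self.state, msg.performative)
        if msg.protocol != self.protocol.name or key not in self.protocol.transitions:
            self.status = "failed"
            return self.status
        self.state = self.protocol.transitions[key]
        self.history.append(msg)
        if self.state in self.protocol.finals:
            self.status = "completed"
        return self.status


request_protocol = Protocol(
    name="request",
    initial="start",
    finals=frozenset({"done"}),
    transitions={("start", "request"): "asked", ("asked", "inform"): "done"},
)
conv = Conversation(request_protocol, "c1")
first = conv.advance(Message("a", "b", "c1", "request", "request", "price?"))
second = conv.advance(Message("b", "a", "c1", "request", "inform", "42"))
```

Because the conversation object carries state and history, an out-of-order message is rejected by the runtime check in `advance` rather than by application code.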

Modern LLM AxA adds a stronger notion of internal asymmetry. In the “Echoing” formalization, each agent is defined as $A_i=(I_i,O_i,T_i,U_i,\pi_i)$, where identity, objective, tools, utility, and policy are separated. Interaction is modeled as a partially observable stochastic game in which an agent observes the partner’s latest message while retaining private conversation history, tool outputs, identity instructions, and internal reasoning. This differs from human-agent interaction in two stated ways: humans normally provide subtle grounding and course correction, and most alignment regimes optimize models for human-facing behavior rather than preservation of role boundaries in autonomous inter-agent dialogue. The result is that AxA reliability cannot be inferred from strong single-agent behavior, because failures can arise from the interaction process itself (Shekkizhar et al., 12 Nov 2025).
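A rough Python sketch of this agent tuple and its partial observability follows. The class layout, the `step` method, and the `run_dialogue` loop are illustrative assumptions, not the paper's code; the policies are trivial stand-ins for LLM calls.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Agent:
    identity: str                        # I_i
    objective: str                       # O_i
    tools: tuple                         # T_i
    utility: Callable[[list], float]     # U_i
    policy: Callable[[str, list], str]   # pi_i: (latest message, private history) -> reply
    private_history: list = field(default_factory=list)

    def step(self, partner_message):
        # The agent sees only the partner's latest message; everything
        # else it conditions on lives in its own private history.
        self.private_history.append(("in", partner_message))
        reply = self.policy(partner_message, self.private_history)
        self.private_history.append(("out", reply))
        return reply


def run_dialogue(opener, responder, opening, turns):
    """Alternate turns, starting with the responder's reply to the opening message."""
    order = [responder, opener]
    transcript = [opening]
    message = opening
    for t in range(turns):
        message = order[t % 2].step(message)
        transcript.append(message)
    return transcript


seller = Agent("seller", "sell high", (), lambda h: 0.0, lambda m, h: f"seller:{len(h)}")
buyer = Agent("buyer", "buy low", (), lambda h: 0.0, lambda m, h: f"buyer:{len(h)}")
transcript = run_dialogue(buyer, seller, "hello", 3)
```

Neither agent can read the other's `private_history`, which is the asymmetry the formalization emphasizes.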

These two lines of work are complementary rather than contradictory. ACRE supplies a protocol-theoretic view of conversations as stateful runtime objects; recent LLM work supplies a behavioral view in which identities, utilities, and latent policies drift under repeated interaction. Taken together, they frame AxA as both a protocol problem and a dynamical systems problem.

2. Protocol stacks, interoperability, and orchestration

Recent AxA systems are increasingly organized as protocol stacks rather than monolithic chat loops. In AgentMaster, the core architectural pattern is a two-layer backend: A2A is used for structured inter-agent communication, delegation, and coordination, while MCP is used for tool invocation, retrieval, contextual resources, and memory/state interaction. The system is orchestrator-centered: a Coordinator agent receives the user query, performs complexity assessment, decomposes complex requests, routes them to specialized agents, and synthesizes the final answer. Inter-agent exchanges are JSON-based, MCP communication is JSON-RPC, and the state layer combines vector-database-backed semantic memory with a context cache. On a 23-question evaluation spanning SQL, IR, general QA, and image/complex QA, AgentMaster reports an average G-Eval of 87.1% and average BERT F1 of 96.3%, indicating that delegated multi-agent synthesis can preserve semantic fidelity even when answer construction is distributed (Liao et al., 8 Jul 2025).
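As an illustration of the JSON-RPC layer, the sketch below builds a JSON-RPC 2.0 request and response envelope of the kind an MCP tool call uses. The tool name `run_sql` and its arguments are hypothetical, and nothing here is taken from AgentMaster's code.

```python
import json


def jsonrpc_request(method, params, request_id):
    """Build a JSON-RPC 2.0 request object."""
    return {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}


def jsonrpc_result(request_id, result):
    """Build a JSON-RPC 2.0 success response that echoes the request id."""
    return {"jsonrpc": "2.0", "id": request_id, "result": result}


# Hypothetical tool name and arguments, for illustration only.
request = jsonrpc_request(
    "tools/call",
    {"name": "run_sql", "arguments": {"query": "SELECT 1"}},
    request_id=1,
)
wire = json.dumps(request)       # what would travel over the transport
decoded = json.loads(wire)       # what the tool server would parse
response = jsonrpc_result(decoded["id"], {"rows": [[1]]})
```

The `id` field is what lets an orchestrator match a response to the delegated call that produced it.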

ACP pushes AxA further across trust boundaries. It defines a four-layer stack: a Transport Layer using gRPC by default, with WebSockets and HTTPS also supported; a Semantic Layer using JSON-LD and a universal ontology of intent; a Negotiation Layer built around Agent Cards and dynamic SLAs; and a Governance & Security Layer using decentralized identifiers, verifiable credentials, message signing, and proof-of-intent. Its canonical interaction pattern is PROBE → BID → COMMIT → execution/settlement, with Agent Cards advertising identity, capabilities, constraints, trust score, and interface endpoint. Under high load, ACP reports 58 ms average latency, 8% header overhead, and 96% success rate, compared with 145 ms, 12%, and 88% for raw JSON-RPC over HTTPS; local MCP remains faster at 22 ms with 99% success because it assumes a shared trust boundary (Krishnan, 11 Feb 2026).
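The negotiation pattern can be sketched in Python as three small functions. Everything below is an assumption made for illustration: the `AgentCard` and `Bid` fields, the trust threshold, and the cheapest-eligible-bid rule are not specified by ACP in the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentCard:
    identity: str
    capabilities: frozenset
    trust_score: float
    endpoint: str


@dataclass(frozen=True)
class Bid:
    bidder: str
    price: float
    latency_ms: float


def probe(cards, capability, min_trust):
    """PROBE: keep agents that advertise the capability and meet a trust floor."""
    return [c for c in cards if capability in c.capabilities and c.trust_score >= min_trust]


def collect_bids(candidates, quote):
    """BID: ask each candidate for an offer (quote is a stand-in for a remote call)."""
    return [quote(c) for c in candidates]


def commit(bids, max_latency_ms):
    """COMMIT: pick the cheapest bid that satisfies the latency constraint."""
    eligible = [b for b in bids if b.latency_ms <= max_latency_ms]
    return min(eligible, key=lambda b: b.price) if eligible else None


cards = [
    AgentCard("did:example:a", frozenset({"translate"}), 0.9, "grpc://a"),
    AgentCard("did:example:b", frozenset({"translate"}), 0.4, "grpc://b"),
    AgentCard("did:example:c", frozenset({"summarize"}), 0.95, "grpc://c"),
    AgentCard("did:example:d", frozenset({"translate"}), 0.8, "grpc://d"),
]
quotes = {
    "did:example:a": Bid("did:example:a", 5.0, 40.0),
    "did:example:d": Bid("did:example:d", 3.0, 90.0),
}
candidates = probe(cards, "translate", min_trust=0.5)
bids = collect_bids(candidates, lambda card: quotes[card.identity])
winner = commit(bids, max_latency_ms=60.0)
```

Separating the three phases keeps discovery, pricing, and the binding decision independently auditable, which matters once agents sit in different trust domains.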

CHAP addresses a different layer of the stack. It does not replace A2A or MCP; instead it introduces a shared workspace in which humans, agents, services, and groups collaborate through typed participants, tasks, artefacts, and an append-only evidence log. Review, override, abstention, escalation, handoff, deliberation, signatures, and audit become protocol-visible objects rather than latent application behavior. In CHAP, a human override becomes a structured event carrying a diff, a rationale, and a content hash, while approvals can become signed, replayable, non-repudiable decisions. For AxA, the key significance is that agent-agent work can be wrapped in an accountable operational envelope rather than left as unaudited inter-agent traffic (Shahid et al., 8 Jun 2026).
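A minimal Python sketch of an append-only evidence log with content hashes is shown below. The event schema and the hash chaining between events are assumptions added for illustration; the text above states only that overrides carry a diff, a rationale, and a content hash.

```python
import hashlib
import json


def content_hash(payload):
    """SHA-256 over a canonical JSON encoding of the payload."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class EvidenceLog:
    """Append-only log; each event's hash covers the previous event's hash (assumed design)."""

    _FIELDS = ("kind", "actor", "payload", "prev")

    def __init__(self):
        self._events = []

    @property
    def events(self):
        return tuple(self._events)

    def append(self, kind, actor, payload):
        prev = self._events[-1]["hash"] if self._events else ""
        body = {"kind": kind, "actor": actor, "payload": payload, "prev": prev}
        event = dict(body, hash=content_hash(body))
        self._events.append(event)
        return event

    def verify(self):
        """Recompute every hash and check the chain links."""
        prev = ""
        for event in self._events:
            body = {k: event[k] for k in self._FIELDS}
            if event["prev"] != prev or event["hash"] != content_hash(body):
                return False
            prev = event["hash"]
        return True


log = EvidenceLog()
log.append("proposal", "agent:planner", {"artefact": "plan-v1"})
override = log.append(
    "override",
    "human:reviewer",
    {"diff": "-step 3\n+step 3 (revised)", "rationale": "step 3 violated policy"},
)
intact = log.verify()
```

Any later edit to a recorded rationale or diff changes the recomputed hash, so `verify` fails and the tampering becomes visible during replay.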

3. Identity, role preservation, and behavioral drift

One of the most distinctive LLM-specific AxA findings is that conversation can degrade agent identity. “Echoing” names a failure mode in which an agent abandons its assigned identity and adopts characteristics of its conversational partner. The evaluator is formalized as

$$\text{EchoEvalLM}(H_T, I_i, I_j)=\{\sigma,a_e,m_e\},$$

with $\sigma=1$ and $a_e=A_i$ marking role inconsistency for agent $A_i$. Across 60 AxA configurations, 3 domains, and 2000+ conversations, echoing appears at rates from 5% to 70% depending on model and domain. It persists in advanced reasoning models at roughly 32.8%, emerges later rather than immediately, with average onset at turn 7.6 and median 8.0, and is not eliminated by stronger prompting. A protocol-level mitigation based on structured responses with explicit role declaration each turn reduces echoing to 9%, but does not eliminate it (Shekkizhar et al., 12 Nov 2025).
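The structured-response idea can be sketched as a per-turn check in Python. The JSON shape with `role` and `message` fields is a hypothetical schema, and this validator is not the LLM-based evaluator from the paper: it catches only a wrong declared role, not drift in the message content itself.

```python
import json


def validate_turn(raw, expected_role):
    """Accept a turn only if it is JSON with the expected role and a string message."""
    try:
        turn = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(turn, dict)
        and turn.get("role") == expected_role
        and isinstance(turn.get("message"), str)
    )


def echo_rate(flags):
    """Fraction of conversations flagged as echoing (0.0 for an empty list)."""
    return sum(flags) / len(flags) if flags else 0.0


ok = validate_turn('{"role": "seller", "message": "The price is firm."}', "seller")
drifted = validate_turn('{"role": "buyer", "message": "I would like a discount."}', "seller")
rate = echo_rate([True, False, False, False])
```

A check of this kind is cheap to run on every turn, which is consistent with the reported result that the mitigation lowers the echoing rate without removing the failure mode.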

Identity preservation is difficult even outside explicit role inversion. The speaker-verification study recasts conversational identity as a verification problem over text. On unseen speakers, the strongest verifier (“Mixed Features”) reaches AUC 88.61, ACC 83.08, and F1 81.07 in the Base setting, but performance drops in the Harder / Unseen-Unseen setting to AUC 63.29, ACC 58.67, and F1 58.74, showing how difficult identity separation becomes when speakers share scene and topic. When applied to role-playing models, the gap between real and generated identities is large: real data scores Simulation 85.96 and Distinction 72.91, whereas generated systems such as RoleGPT, Character.AI, and LLaMA2-chat variants remain far lower. The paper’s interpretation is that model-internal linguistic regularities often dominate role conditioning, producing a common “house style” across supposedly distinct agents (Yang et al., 2024).

A related distortion appears when one agent is used to simulate a user. In a controlled comparison of human users and LLM-simulated users in task-oriented dialogue, agent users were more likely to give positive feedback (99.62% vs. 25.05%), explicitly promise to follow suggestions (95.58% vs. 13.61%), and produce polite, formal language (99.84% polite for agents, while human users were 95.15% neutral in politeness and split almost evenly between oral and formal styles). Agent users also used fewer turns but far longer utterances per turn. This suggests that many AxA pipelines that substitute an LLM “user” for a human may inherit systematic biases toward compliance, verbosity, and positivity (Wang et al., 22 Sep 2025).

4. Safety, security, and oversight

Safety in AxA is increasingly treated as an emergent property of communication rather than a property of isolated models. ConVerse operationalizes this view as a dynamic benchmark spanning three domains—travel, real estate, and insurance—with 12 user personas and 864 contextually grounded attacks, including 611 privacy and 253 security attacks. Across seven state-of-the-art models, privacy attacks succeed in up to 88% of cases and security breaches in up to 60%, with stronger models leaking more. Its central claim is that multi-turn agent-agent interaction creates a tension between utility and protection: effective collaboration requires sharing, but every exchange opens new attack surfaces (Gomaa et al., 7 Nov 2025).

The automotive threat-modeling work sharpens this point under safety-critical constraints. It argues that A2A transport security authenticates the sender but not the content, so a compromised yet authenticated peer can inject natural-language payloads that the receiving agent may process with the same privilege as human input. AgentHeLLM formalizes attacks through poison paths and trigger paths, with step cost

$$\text{Cost(Step)}=\text{Cost(PushPoison)}+\text{Cost(ActivationTrigger)}+\text{Cost(ConsumptionTrigger)}.$$

It also separates human assets from attack paths through a taxonomy grounded in primary user harms such as bodily health, privacy, knowledge and belief, and social trust. The paper’s examples include a false “engine failure imminent” warning injected via A2A and memory rules that exfiltrate GPS conditionally, illustrating that semantic content remains dangerous even when the protocol channel is authenticated (Stappen et al., 5 Feb 2026).
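The step-cost formula translates directly into a few lines of Python. The numeric costs are made-up placeholders, and summing step costs along a path is an assumption for illustration; the text above defines only the cost of a single step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    push_poison: float
    activation_trigger: float
    consumption_trigger: float

    @property
    def cost(self):
        # Cost(Step) = Cost(PushPoison) + Cost(ActivationTrigger) + Cost(ConsumptionTrigger)
        return self.push_poison + self.activation_trigger + self.consumption_trigger


def path_cost(steps):
    """Assumed aggregation: total attack-path cost is the sum of its step costs."""
    return sum(step.cost for step in steps)


# Placeholder costs; lower totals would indicate easier attack paths.
via_a2a = [Step(1.0, 2.0, 3.0), Step(0.0, 1.0, 0.0)]
via_memory = [Step(2.0, 2.0, 2.0)]
cheapest = min((via_a2a, via_memory), key=path_cost)
```

Ranking candidate paths by total cost gives a simple way to prioritize which attack routes to mitigate first.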

The Arbiter extends AxA safety from benchmark construction to active oversight. It is a monitoring agent that observes a multi-agent conversation incrementally under a limited inspection budget and can choose among Wait and Observe, Ask Model, Inspect System Prompt, Inspect Chain of Thought, and Log Incident. In the hardest misaligned model organism condition, pure observation yields F1 0.12, whereas Interrogation + Log raises this to 0.51; in the collusion condition, detection reaches F1 1.00 even under pure observation. The paper reports that active inspection improves both accuracy and speed, weight-induced misalignment is hardest to detect, and logging has a dual effect: it improves recall but can increase false positives (Tonini et al., 9 Jun 2026).

5. Applications and empirical system patterns

AxA is not only a safety concern; it is also used as an operational pattern for reasoning, retrieval, and clarification. In diagnostically challenging medicine, a GPT-4 multi-agent framework simulates a clinical team with four role-specialized agents: a primary diagnostician, a devil’s advocate, a senior doctor facilitating discussion, and a recorder summarizing findings. Across 16 bias-laden scenarios repeated 5 times each, the initial diagnosis is correct in 0/80 cases, but after multi-agent discussion the top differential is correct in 57/80 (71.3%), and the final two differentials contain the correct diagnosis in 64/80 (80.0%). The paper interprets this not as generic ensemble averaging, but as role-specialized bias correction through adversarial critique, facilitation, and summarization (Ke et al., 2024).

Clarification-oriented AxA shows a different coordination pattern. MAC organizes dialogue around a Supervisor and domain-specific Expert agents, each with its own ambiguity taxonomy. The supervisor handles high-level, domain-agnostic ambiguity such as domain ambiguity, vague goal specification, and contextual disambiguation; experts handle parameter underspecification, value ambiguity, constraint conflict, entity disambiguation, and confirmation of inferred information. On MultiWOZ 2.4, enabling clarification at both levels raises task success from 54.5 to 62.3 and reduces average turns from 6.53 to 4.86, showing that role-aware clarification can make multi-agent dialogue both more successful and shorter (Acikgoz et al., 15 Dec 2025).

AgentMaster illustrates a more cooperative orchestration pattern. Its Coordinator decomposes complex multimodal queries into subquestions, routes them across GENERAL_AGENT, SQL_AGENT, IR_AGENT, and IMAGE_AGENT, and synthesizes the result. Complex questions are explicitly split into multiple routed subqueries—for example, definitional subquestions to the general agent and data-retrieval subquestions to the SQL agent. The reported average G-Eval 87.1% and BERT F1 96.3% indicate that orchestrated AxA can support domain-specific retrieval and composite answer construction without collapsing answer quality, even though the system remains largely coordinator-centered rather than peer-to-peer (Liao et al., 8 Jul 2025).
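The coordinator pattern can be approximated in Python as a router plus a synthesis step. The keyword rules below are a crude stand-in for the LLM-based complexity assessment and routing described above, and the agent callables are stubs; only the four agent names come from the text.

```python
def route(subquestion):
    """Pick an agent by keyword (an assumed heuristic, not AgentMaster's routing logic)."""
    text = subquestion.lower()
    if any(k in text for k in ("how many", "average", "total number")):
        return "SQL_AGENT"
    if any(k in text for k in ("image", "picture", "photo")):
        return "IMAGE_AGENT"
    if any(k in text for k in ("document", "article", "retrieve")):
        return "IR_AGENT"
    return "GENERAL_AGENT"


def coordinate(subquestions, agents):
    """Route each subquestion, call the chosen agent, and join the labeled answers."""
    parts = []
    for question in subquestions:
        name = route(question)
        parts.append(f"[{name}] {agents[name](question)}")
    return "\n".join(parts)


# Stub agents that only report which agent answered.
names = ("GENERAL_AGENT", "SQL_AGENT", "IR_AGENT", "IMAGE_AGENT")
agents = {name: (lambda q, n=name: f"answer from {n}") for name in names}
answer = coordinate(
    ["What is a foreign key?", "How many orders were placed in May?"],
    agents,
)
```

Keeping routing separate from synthesis mirrors the coordinator-centered design: specialized agents never talk to each other directly, only through the coordinator.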

6. Control, memory, and unresolved design questions

A persistent theme across AxA research is that participation itself must be governed. In a human-centered study of group ideation, participants preferred having an AI participant over having no AI, and 73% of ideas in AI conditions came from the agent, with 33% of top ideas also coming from it. Yet participants strongly disliked conversational domination and wanted explicit controls over when, what, and where the agent responded, as well as over who could change those controls and how they were specified. In the second study, explicit controls received a mean utility score of 4.46/5. The taxonomy that emerged is directly relevant to AxA: speaking policy is not reducible to content generation, and mixed-initiative participation requires an explicit policy layer governing trigger conditions, filters, pacing, content type, reply location, access rights, and implementation mechanisms (Houde et al., 28 Jan 2025).

Long-horizon AxA also raises memory-continuity problems. ESAA-Conversational treats visible conversation as an append-only local event store, captures turns mechanically into activity.jsonl, and deterministically projects artifacts such as handoff.md, state.md, decisions.md, and tasks.json. Its self-referential case study reports 570 development-lab events, showing that heterogeneous coding agents can collaborate through a shared log without a direct agent-to-agent channel. Mandol addresses a complementary issue at the memory-architecture level: it unifies key-value, vector, and graph memory into a single semantic structure, achieves the best overall accuracy on both LoCoMo and LongMemEval, and reports a 5.4x retrieval speedup and 4.8x insertion speedup under 10 QPS concurrent load. These systems are not complete AxA frameworks, but they supply persistence, handoff, and token-budgeted retrieval primitives that multi-agent systems increasingly require (Filho, 22 Jun 2026, Zhang et al., 29 Jun 2026).
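The event-store idea can be sketched in Python as an append-only list of JSON lines plus a deterministic projection. The event types and fields are hypothetical, and the log is kept in memory here rather than written to activity.jsonl; this is not ESAA-Conversational's actual schema.

```python
import json


def append_event(log_lines, event):
    """Append one event as a JSON line; existing lines are never rewritten."""
    log_lines.append(json.dumps(event, sort_keys=True))


def project_tasks(log_lines):
    """Fold the log into a task-status view; the same log always yields the same view."""
    tasks = {}
    for line in log_lines:
        event = json.loads(line)
        if event["type"] == "task_opened":
            tasks[event["task"]] = "open"
        elif event["type"] == "task_closed" and event["task"] in tasks:
            tasks[event["task"]] = "done"
    return tasks


log_lines = []
append_event(log_lines, {"type": "task_opened", "task": "write-parser", "agent": "agent-1"})
append_event(log_lines, {"type": "task_opened", "task": "add-tests", "agent": "agent-2"})
append_event(log_lines, {"type": "task_closed", "task": "write-parser", "agent": "agent-2"})
tasks = project_tasks(log_lines)
```

Because the projection is a pure function of the log, a second agent can rebuild the same task state from the shared file without any direct channel to the first.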

Several unresolved questions recur across the literature. Echoing shows that AxA reliability cannot be inferred from single-agent benchmarks and that role consistency can fail even when conversations complete successfully. CHAP leaves confidentiality, selective disclosure, and federation semantics as open issues. ACP sketches a federated protocol stack but does not yet provide a full normative wire specification, trust-scoring rule, or complete error taxonomy. Mandol is explicitly not evaluated on multi-agent interaction, and speaker-verification results show that identity evaluation remains difficult when agents share scene and topic. The cumulative picture is that AxA has become its own research domain: not merely an extension of dialogue systems, nor merely an instance of multi-agent planning, but a setting with its own reliability science, protocol engineering, memory problems, and governance requirements (Shekkizhar et al., 12 Nov 2025).
