When Latent Agents Lie: KV-Cache Integrity in Multi-Agent LLM Collaboration

Published 27 Jun 2026 in cs.MA | (2606.28958v1)

Abstract: LLM agents can share more than text. In some systems, an agent can send a short visible message while also passing its full KV-cache state to another model. This hidden state can help the final model combine evidence from several agents, but it is also hard to inspect. A visible message may look harmless even if the hidden state has been changed. We study this problem in a multi-agent question-answering setup. Specialists each see part of the evidence, send a short commitment, and pass full KV-cache state to a coordinator. In clean runs, this latent collaboration improves over a matched text-only version. On transformed HiddenBench with Qwen3-4B, it reaches EM/F1 of 0.338/0.486, compared with 0.231/0.369 for text collaboration. Qwen3-8B and HotPotQA runs show the same direction of improvement. The problem appears when one specialist is malicious. Some false visible commitments can steer answers. More seriously, changing the hidden KV state can collapse performance even when the visible commitment still looks plausible. A verifier that checks only text misses this failure mode. Simple magnitude checks catch some obvious corruptions, but adaptive attacks can evade them while still damaging the final answer. The most reliable fix we find is not to guess whether hidden state looks normal, but to protect it in transport. We implement an HMAC-SHA256 manifest that binds the specialist, session, model, visible commitment, tensor metadata, and payload digest. It accepts all 774 honest replayed payloads and rejects all 295 recorded tampered payloads. The main lesson is that full-KV latent memory can be useful, but it should be treated as a security-sensitive object, not as ordinary internal model state.


Authors (2)

Summary

The paper demonstrates that full-KV state transfer improves over matched text-only collaboration, for example EM/F1 of 0.338/0.486 versus 0.231/0.369 on transformed HiddenBench with Qwen3-4B.
It shows that manipulating the hidden KV state can collapse performance while the visible commitment still looks plausible, and that adaptive attacks can evade anomaly detectors while still damaging the final answer.
Cryptographic integrity checks at the transport layer detect on-path KV manipulation far more reliably than heuristic anomaly detection, but they do not cover a compromised endpoint.

Integrity Risks and Defenses for Full-KV Latent Collaboration in Multi-Agent LLMs

Introduction and Motivation

"When Latent Agents Lie: KV-Cache Integrity in Multi-Agent LLM Collaboration" (2606.28958) provides a systematic empirical audit of role-sequenced, high-bandwidth full-KV latent state transfer in collaborative multi-agent LLM systems. The central proposition is that full-KV memory as a latent channel can improve empirical reasoning and aggregation in split-evidence settings, but simultaneously introduces a critical and poorly-audited attack surface: the integrity of hidden states transported among agents. The authors formulate and demonstrate a precise security liability tied to KV-cache transfer, where conventional visible-channel auditing (e.g., textual commitments) becomes insufficient to guarantee semantic alignment between what is visibly committed and the agent state actually consumed by the coordinator.

Latent Collaboration Protocol and Threat Model

The focal protocol consists of four logical roles: a planner, parallel specialists (each holding a unique evidence partition), an optional verifier (filtering visible commitments), and a final coordinator. Each specialist emits both a short visible commitment and a rich, transported full-KV hidden state. Coordination is achieved by having the coordinator aggregate the specialist latent states, either all of them or only those that pass the verifier.
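The message flow described above can be sketched as a small data model. This is only an illustration: `SpecialistMessage`, `run_round`, and the toy verifier and coordinator are hypothetical names, and the KV payload is represented as opaque bytes rather than real tensors. The point it shows is structural: a text-only verifier never looks at the payload the coordinator consumes.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class SpecialistMessage:
    """One specialist handoff (hypothetical structure, not the paper's API)."""
    specialist_id: str
    commitment: str    # short visible text
    kv_payload: bytes  # opaque stand-in for the serialized full-KV state


def run_round(
    messages: List[SpecialistMessage],
    verifier: Optional[Callable[[str], bool]],
    coordinator: Callable[[List[SpecialistMessage]], List[str]],
) -> List[str]:
    """Filter on visible commitments only, then hand survivors to the coordinator."""
    accepted = [m for m in messages if verifier is None or verifier(m.commitment)]
    return coordinator(accepted)


def toy_verifier(commitment: str) -> bool:
    # Toy rule: accept any non-empty commitment. It sees text and nothing else.
    return bool(commitment.strip())


def toy_coordinator(accepted: List[SpecialistMessage]) -> List[str]:
    # Toy stand-in: report whose hidden state would be consumed.
    return [m.specialist_id for m in accepted]


messages = [
    SpecialistMessage("s1", "Evidence points to document A.", b"honest-state"),
    SpecialistMessage("s2", "Evidence points to document B.", b"\x00altered-state"),
]
consumed = run_round(messages, toy_verifier, toy_coordinator)
```

In this toy run both payloads reach the coordinator, because nothing in the verifier's input distinguishes them.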

The threat model is finely delineated:

Endpoint/specialist adversaries: Control the process that emits or signs the KV object. Capabilities range from semantic false commitment (text-only) to hidden-state manipulations, including white-box adaptive attacks.
Transport/on-path adversaries: Modify the transported KV after honest specialist emission, e.g., during transit.

The crucial integrity boundary therefore separates two guarantees. What can be verified is that the transported KV has not been altered after the specialist emitted it. What cannot be verified, absent trusted endpoint attestation, is that the hidden state actually matches what the specialist commits to in text.

Empirical Evaluation and Main Results

Clean Utility: Latent vs. Textual Collaboration

On transformed HiddenBench (65 records, Qwen3-4B), naive full-KV latent collaboration achieved EM/F1 of 0.338/0.486 versus 0.231/0.369 for text-only role collaboration—a substantive F1 improvement, with confidence intervals supporting a statistically significant increase in F1 but more modest evidence for EM. Replications with Qwen3-8B and full HotPotQA benchmarks reinforced this directional gain, with even higher observed deltas, e.g., Qwen3-8B HiddenBench: 0.415/0.544 (latent) vs. 0.262/0.392 (text).
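For readers who want to check the arithmetic, the sketch below computes the reported Qwen3-4B deltas and shows one common way EM and token-level F1 are defined for question answering. The metric definitions are an assumption (lowercased whitespace tokens, SQuAD-style overlap); the summary does not state the paper's exact normalization.

```python
from collections import Counter


def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0 (assumed normalization)."""
    return float(pred.strip().lower() == gold.strip().lower())


def token_f1(pred: str, gold: str) -> float:
    """Token-overlap F1 over lowercased whitespace tokens (assumed definition)."""
    p, g = pred.lower().split(), gold.lower().split()
    if not p or not g:
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


# Aggregate scores as reported above for transformed HiddenBench, Qwen3-4B.
latent = {"EM": 0.338, "F1": 0.486}
text_only = {"EM": 0.231, "F1": 0.369}
delta = {k: round(latent[k] - text_only[k], 3) for k in latent}
```

Under these numbers the absolute gain is 0.107 EM and 0.117 F1.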

Crucially, influence-mapping diagnostics demonstrated that certain correct answers in the latent regime were uniquely dependent on specialist-specific hidden-state information. Leave-one-out experiments revealed that, for some coordinates, excising a particular specialist's hidden state directly caused a correct output to flip to incorrect, confirming the utilization of specialist-specific evidence that the visible channel does not propagate.
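A leave-one-out check of this kind can be sketched as follows. The function name, the dictionary of per-specialist states, and the toy coordinator are all hypothetical; in the real setting `answer_fn` would run the coordinator model over the remaining KV states.

```python
from typing import Callable, Dict, List, Set


def leave_one_out_flips(
    states: Dict[str, Set[str]],
    answer_fn: Callable[[Dict[str, Set[str]]], str],
    gold: str,
) -> List[str]:
    """Return specialists whose removal flips a correct answer to an incorrect one."""
    if answer_fn(states) != gold:
        return []  # only correct answers can flip to incorrect
    flips = []
    for name in states:
        reduced = {k: v for k, v in states.items() if k != name}
        if answer_fn(reduced) != gold:
            flips.append(name)
    return flips


def toy_coordinator(states: Dict[str, Set[str]]) -> str:
    # Toy rule: the answer is recoverable only if both clues are present.
    facts = set().union(*states.values()) if states else set()
    return "yes" if {"clue_a", "clue_b"} <= facts else "unknown"


# s1 and s3 hold redundant evidence; only s2 holds clue_b.
states = {"s1": {"clue_a"}, "s2": {"clue_b"}, "s3": {"clue_a"}}
flips = leave_one_out_flips(states, toy_coordinator, gold="yes")
```

Here only `s2` is flagged, which mirrors the idea that some correct answers depend on one specialist's state alone.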

Attack Surface: Latent-State Manipulation

Empirical attacks demonstrate that the latent channel constitutes a critical liability:

Semantic (text) attacks can steer coordinator outcomes to adversary-chosen targets, but degradation is typically moderate.
Nonsemantic (latent-state) attacks—including random latent-thought KV replacement or scaling by large factors (e.g., scale-8)—catastrophically collapsed full-KV latent performance (EM/F1 near zero), even as visible commitments remained plausible.
White-box adaptive (gradient-matched) attacks defeated post-hoc anomaly detection (including norm-based quarantine and learned anomaly detectors), simultaneously evading rejection and inducing significant utility drop (e.g., HiddenBench: EM/F1 reduced to 0.077/0.119, HotPotQA slice down to 0.160/0.252 with zero malicious states detected).

Sign-flip perturbations, in contrast, did not produce catastrophic failures, indicating that damage is specific to the attack vector.
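The nonsemantic perturbations named above are simple tensor transforms. The sketch below applies them to a flat list of floats standing in for a KV tensor; the helper names and the synthetic "honest" vector are assumptions for illustration, and nothing here reproduces the paper's measurements. It only shows how each transform changes the vector's magnitude.

```python
import math
import random
from typing import List


def l2_norm(v: List[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


def scale(v: List[float], factor: float = 8.0) -> List[float]:
    """Scale attack, e.g. the scale-8 case mentioned above."""
    return [factor * x for x in v]


def sign_flip(v: List[float]) -> List[float]:
    """Negate every entry; the magnitude is unchanged."""
    return [-x for x in v]


def random_replace(v: List[float], rng: random.Random, sigma: float) -> List[float]:
    """Replace the state with Gaussian noise of a chosen scale (sigma is a free assumption)."""
    return [rng.gauss(0.0, sigma) for _ in v]


rng = random.Random(0)
honest = [rng.gauss(0.0, 1.0) for _ in range(1024)]  # synthetic stand-in

scaled_ratio = l2_norm(scale(honest)) / l2_norm(honest)
flipped_ratio = l2_norm(sign_flip(honest)) / l2_norm(honest)
noisy = random_replace(honest, rng, sigma=8.0)
```

Scaling multiplies the norm by the scale factor, while a sign flip leaves it untouched, so a magnitude statistic alone separates the first from honest state but not the second.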

Defense Analysis

The study systematically evaluates several defense paradigms:

Verifier filtering (text-only): Filters based on visible commitments, but cannot see latent-state corruption. It is prone both to false positives (rejecting honest input) and to undetected failures, where a corrupted hidden state sits behind a plausible commitment.
Magnitude-based latent-state quarantine: Effective against naive, high-magnitude attacks (e.g., random, scale-8), but fragile to norm-matched and adaptive attacks that maintain honest statistics.
Learned anomaly detectors: Marginal improvement, but fundamentally bounded—adaptive attacks that optimize against the detector remain undetectable and destructive.
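Transport-layer HMAC-SHA256 manifest: The only defense evaluated that holds reliably. The manifest binds the specialist, session, model, visible commitment, tensor metadata, and payload digest, so as long as the specialist endpoint and its MAC key are uncompromised, the check detects on-path substitution of specialist KV. In evaluation it accepted all 774 honest replayed payloads and rejected all 295 recorded tampered payloads (zero false positives and zero false negatives). Subsequent fail-closed handling (dropping suspect state) recovers most of the utility lost to latent-manipulation attacks, restoring EM/F1 close to clean levels.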
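A manifest of this kind can be sketched with Python's standard `hmac`, `hashlib`, and `json` modules. The field names, the canonical JSON encoding, and the function names below are assumptions made for illustration; the summary lists which items are bound but not the paper's wire format or key management.

```python
import hashlib
import hmac
import json
from typing import Any, Dict


def _canonical(fields: Dict[str, Any]) -> bytes:
    # Deterministic encoding so signer and verifier MAC identical bytes.
    return json.dumps(fields, sort_keys=True, separators=(",", ":")).encode("utf-8")


def build_manifest(key: bytes, *, specialist_id: str, session_id: str, model_id: str,
                   commitment: str, tensor_meta: Dict[str, Any], payload: bytes) -> Dict[str, Any]:
    """Bind identity, session, model, commitment, tensor metadata, and payload digest."""
    fields = {
        "specialist_id": specialist_id,
        "session_id": session_id,
        "model_id": model_id,
        "commitment_sha256": hashlib.sha256(commitment.encode("utf-8")).hexdigest(),
        "tensor_meta": tensor_meta,
        "payload_sha256": hashlib.sha256(payload).hexdigest(),
    }
    tag = hmac.new(key, _canonical(fields), hashlib.sha256).hexdigest()
    return {"fields": fields, "tag": tag}


def verify_manifest(key: bytes, manifest: Dict[str, Any], *, commitment: str,
                    payload: bytes, session_id: str, model_id: str) -> bool:
    """Return True only if the tag and every bound value match; callers drop on False."""
    fields = manifest["fields"]
    expected = hmac.new(key, _canonical(fields), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, manifest["tag"]):
        return False
    if fields["session_id"] != session_id or fields["model_id"] != model_id:
        return False
    if fields["commitment_sha256"] != hashlib.sha256(commitment.encode("utf-8")).hexdigest():
        return False
    return hmac.compare_digest(fields["payload_sha256"], hashlib.sha256(payload).hexdigest())


key = b"demo-key-not-for-production"  # hypothetical shared secret
payload = b"\x01\x02\x03serialized-kv"  # stand-in for real tensor bytes
manifest = build_manifest(key, specialist_id="s1", session_id="sess-1", model_id="model-x",
                          commitment="Evidence points to A.",
                          tensor_meta={"dtype": "float16", "shape": [2, 4, 8]}, payload=payload)
ok = verify_manifest(key, manifest, commitment="Evidence points to A.", payload=payload,
                     session_id="sess-1", model_id="model-x")
tampered_ok = verify_manifest(key, manifest, commitment="Evidence points to A.",
                              payload=payload + b"\x00", session_id="sess-1", model_id="model-x")
```

A coordinator using this would treat any `False` as fail-closed and discard that specialist's state. Note what the check does not cover: a specialist that holds the key can sign a malicious payload, which is the endpoint-compromise case.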

The practical implication is that anomaly detection via filtering (whether by cheap heuristics or learned statistics) is not robust under adaptive attack; cryptographic integrity checks at the transport layer are both effective and measurable—but only for post-handoff tampering, not for endpoint compromise.
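Why a magnitude check is brittle can be seen in a toy setting. The sketch below uses a hypothetical norm-ratio quarantine rule and a norm-matched random replacement; the threshold, the synthetic vectors, and the function names are assumptions, and the reference norm would in practice come from calibration data rather than from the honest vector itself.

```python
import math
import random
from typing import List


def l2_norm(v: List[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


def cosine(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b)) / (l2_norm(a) * l2_norm(b))


def norm_quarantine(v: List[float], ref_norm: float, tol: float = 0.5) -> bool:
    """Return True (quarantine) when the norm deviates from the reference by more than tol."""
    return abs(l2_norm(v) / ref_norm - 1.0) > tol


def norm_matched_replace(v: List[float], rng: random.Random) -> List[float]:
    """Swap in an unrelated random direction, rescaled to the honest norm."""
    noise = [rng.gauss(0.0, 1.0) for _ in v]
    s = l2_norm(v) / l2_norm(noise)
    return [s * x for x in noise]


rng = random.Random(0)
honest = [rng.gauss(0.0, 1.0) for _ in range(1024)]  # synthetic stand-in
ref = l2_norm(honest)

scaled = [8.0 * x for x in honest]
matched = norm_matched_replace(honest, rng)

caught_scaled = norm_quarantine(scaled, ref)
caught_matched = norm_quarantine(matched, ref)
similarity = cosine(honest, matched)
```

The scaled state is quarantined, while the norm-matched state passes even though it is nearly orthogonal to the honest one. An adaptive attacker can go further and optimize against whatever statistic the detector uses.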

Practical and Theoretical Implications

Practical Implications

This work indicates that deployment of high-bandwidth latent-state transfer in multi-agent LLM systems, especially in environments where not all endpoints are trusted, must not rely on post-hoc latent-state anomaly checks or visible commitment auditing for security. Rather, systemic adoption of cryptographic integrity (e.g., MACs) for KV transfer is necessary to close on-path substitution vulnerabilities.

The increased communication and storage bandwidth (hundreds of megabytes per example in the full-KV setting) is a non-trivial engineering cost. Efficient, compact latent channels that preserve robustness and integrity remain an unmet challenge.
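A back-of-envelope estimate shows how a full KV cache reaches this size. The formula below is the standard count of key and value entries per layer; the specific shape values are illustrative assumptions for a mid-sized model with grouped-query attention and a 2048-token context, not figures taken from the paper.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Size of keys plus values across all layers for one sequence."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Illustrative (assumed) shape: 36 layers, 8 KV heads, head_dim 128, fp16, 2048 tokens.
per_token = kv_cache_bytes(36, 8, 128, seq_len=1)
total = kv_cache_bytes(36, 8, 128, seq_len=2048)
total_mib = total / (1024 ** 2)
```

Under these assumptions each token costs 147,456 bytes and a single 2048-token specialist state is 288 MiB, before counting multiple specialists per example.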

Theoretical Implications

The demonstrated attack surface suggests a fundamental distinction between visibility and integrity in multi-agent collaboration: opaque but high-bandwidth state propagates information that neither visible commitments nor simple post-hoc checks can reliably constrain. The empirical findings reinforce the necessity of new theoretical frameworks for distributed, trust-bounded reasoning—where specialist-specific memory, channel integrity, and latent reasoning must be co-designed.

Future Directions

Compact/efficient latent-bridge protocols: Reducing KV bandwidth without losing expressivity or robustness.
Robust latent-state fusion mechanisms: Extending Byzantine-robust aggregation and anomaly detection for inference-time, semantically non-exchangeable latent messages.
Joint semantic and transport attestation: Integrating remote attestation or trusted execution to close endpoint-compromise gaps.
Adaptive-defense training: Exploring whether defense mechanisms can anticipate and neutralize white-box adaptive attacks during training or serving.

Conclusion

"When Latent Agents Lie" (2606.28958) empirically establishes both the capability and the liability of full-KV latent memory in multi-agent LLM collaboration. The main finding is that latent-state transfer can measurably increase aggregate reasoning accuracy but introduces a new class of integrity risks that are fundamentally invisible to commitments or simple post-hoc filtering. Only systems-level, cryptographically authenticated transport can reliably mitigate KV manipulation outside the endpoint trust boundary. As multi-agent LLM architectures become operationalized, secure latent-state management will remain a central requirement—one for which anomaly detection is brittle and cryptographically bound channel integrity is the proven defense. Further research is required to balance communication bandwidth, collaboration utility, and robust, adaptive security in future multi-agent LLM systems.
