Conjunctive Prompt Attacks in Multi-Agent LLM Systems

Published 17 Apr 2026 in cs.MA and cs.AI | (2604.16543v1)

Abstract: Most LLM safety work studies single-agent models, but many real applications rely on multiple interacting agents. In these systems, prompt segmentation and inter-agent routing create attack surfaces that single-agent evaluations miss. We study conjunctive prompt attacks, where a trigger key in the user query and a hidden adversarial template in one compromised remote agent each appear benign alone but activate harmful behavior when routing brings them together. We consider an attacker who changes neither model weights nor the client agent and instead controls only trigger placement and template insertion. Across star, chain, and DAG topologies, routing-aware optimization substantially increases attack success over non-optimized baselines while keeping false activations low. Existing defenses, including PromptGuard, Llama-Guard variants, and system-level controls such as tool restrictions, do not reliably stop the attack because no single component appears malicious in isolation. These results expose a structural vulnerability in agentic LLM pipelines and motivate defenses that reason over routing and cross-agent composition. Code is available at https://github.com/UCF-ML-Research/ConjunctiveAgents.


Authors (3)

Summary

The paper demonstrates that adversarial behavior can be triggered when two ostensibly benign inputs, a trigger key and an adversarial template, converge in multi-agent LLM systems.
It formalizes the attack using a differentiable surrogate with Gumbel-Softmax relaxation to optimize trigger location, template placement, and routing bias for deterministic activation.
Experimental results across varying model sizes and topologies highlight the attack's high stealth and the inability of current localized defenses to detect such compositional vulnerabilities.


Introduction and Problem Setting

As LLM deployments transition from isolated monolithic models to modular, multi-agent architectures, the attack surface is fundamentally transformed. In typical agentic pipelines, a client agent orchestrates user queries, segments them, and dispatches subtasks to external specialized agents—each communicating through prompt interfaces and often invoking privileged tools or databases (Figure 1). This composition enhances modularity and performance but introduces systemic vulnerabilities that do not manifest in single-agent paradigms.

Figure 1: Canonical multi-agent LLM pipeline—client decomposes user queries and routes subtasks to black-box specialized (remote) agents, each exposing only an NL interface.

The paper "Conjunctive Prompt Attacks in Multi-Agent LLM Systems" (2604.16543) identifies a structural vulnerability unique to multi-agent LLM systems: conjunctive prompt attacks. Here, activation of adversarial behavior requires the co-occurrence of two independently benign inputs—a "trigger key" present in the user query and a hidden adversarial template within a compromised remote agent. Only the conjunction of these at the same agent, achieved via the orchestrated routing, results in harmful behavior. This supply-chain-style threat is topology-dependent and evades current defenses focused on localized prompt inspection.

Threat Model and Attack Formalization

The paper formalizes a multi-agent LLM system as a client agent that segments the input and dispatches the segments stochastically to remote agents, with routing parameterized by content affinity and a routing bias. The adversary manipulates only prompt content:

Inserts a benign-looking trigger key in the user query.
Injects a benign-appearing template into exactly one remote agent (the point of compromise).
Has no control over agent model weights, client, or routing logic.

Attack activation is conjunctive: it triggers only if (a) a trigger-bearing segment is routed to the compromised agent, and (b) the segment is processed under the injected template. This ensures both the key and template appear innocuous in isolation—false activation (on key-only or template-only) is rare.
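As a rough illustration of the routing model described above, the sketch below assumes the client scores each remote agent by a content-affinity term plus a per-agent routing bias and samples the destination from a softmax over those scores. The softmax form, the function names, and all numbers are assumptions made for illustration; the source does not specify the router's exact form.

```python
import math
import random

def routing_probs(affinity, bias):
    """Softmax over (affinity + bias), one entry per remote agent (assumed form)."""
    scores = [a + b for a, b in zip(affinity, bias)]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(affinity, bias, rng):
    """Sample the index of the agent that receives one segment."""
    probs = routing_probs(affinity, bias)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical setup: three remote agents, index 2 is the compromised one.
affinity = [0.2, 0.1, 1.5]   # how well the segment content matches each agent
bias = [0.0, 0.0, 0.5]       # routing bias term
probs = routing_probs(affinity, bias)
dest = route(affinity, bias, random.Random(0))
```

Under this assumed form, a trigger key raises the chance of conjunction only to the extent that it shifts the affinity of its segment toward the compromised agent.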

Activation is operationalized by a strict predicate, allowing for deterministic measurement of attack success and false activation, and enabling robust mode separation (clean, key-only, template-only, both).
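The mode separation can be sketched as a small classifier over trial records. The record fields, the marker string, and the substring check are hypothetical stand-ins chosen for illustration; the source only states that the predicate is strict and deterministic.

```python
from dataclasses import dataclass

MARKER = "<<PAYLOAD>>"  # hypothetical string the strict predicate looks for

@dataclass(frozen=True)
class Trial:
    has_key: bool       # trigger key present in the routed segment
    has_template: bool  # processing agent holds the injected template
    output: str         # text returned by that agent

def regime(trial):
    """Map a trial to one of the four modes: clean, key_only, template_only, both."""
    if trial.has_key and trial.has_template:
        return "both"
    if trial.has_key:
        return "key_only"
    if trial.has_template:
        return "template_only"
    return "clean"

def activated(trial):
    """Strict predicate (assumed form): exact marker match in the output."""
    return MARKER in trial.output

def is_attack_success(trial):
    return regime(trial) == "both" and activated(trial)

def is_false_activation(trial):
    return regime(trial) != "both" and activated(trial)
```

Because the check is an exact match rather than a judged score, repeated evaluation of the same outputs gives the same success and false-activation counts.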

Attack Optimization Pipeline

Because the models and routing are treated as black boxes, the discrete attack variables (segment selection, template placement, routing bias) are optimized through a differentiable surrogate based on the Gumbel-Softmax relaxation. The objective maximizes the expected rate at which trigger-bearing segments reach the compromised agent and are activated by the template, with regularization to suppress false positives and degenerate solutions.
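For readers unfamiliar with the relaxation, the following is a minimal sketch of the forward sampling step of Gumbel-Softmax, written with the standard library only. It shows how a discrete choice (for example, which segment carries the trigger) becomes a soft probability vector. The logits, the temperature, and the candidate count are illustrative assumptions, and the gradient-based update that an autodiff framework would perform is omitted.

```python
import math
import random

def gumbel_softmax(logits, tau, rng):
    """One relaxed one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        u = rng.random()
        u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep u strictly inside (0, 1)
        gumbel = -math.log(-math.log(u))     # standard Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / tau)
    top = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits over four candidate trigger positions.
logits = [0.1, 1.2, -0.3, 0.4]
sample = gumbel_softmax(logits, tau=0.5, rng=random.Random(0))
# Discrete choice used at attack time: the position with the largest weight.
chosen = max(range(len(sample)), key=sample.__getitem__)
```

Lower temperatures push the sample toward a one-hot vector, while higher temperatures smooth it; the source does not state the temperature schedule used.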

At inference, the client routes a single trigger-bearing query; the attack manifests only when the key and template meet at the compromised agent, yielding a stealthy activation profile (Figure 2).

Figure 2: Attack pipeline—attacker optimizes over trigger location, template placement, routing bias; attack is realized only when key-bearing segment is routed to the compromised agent with the injected template.

Experimental Results: Robustness, Topology, Model Transfer

The attack pipeline is evaluated on several instruction-tuned LLM backbones (Gemma-2B, Mistral-7B, LLaMA3-8B) and across canonical multi-agent topologies (star, chain, DAG). Four activation regimes (clean, key-only, template-only, both) are reported.
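A per-regime attack success rate can be computed as the fraction of trials in each regime for which the activation predicate fired. The sketch below assumes trials are available as (regime, activated) pairs; the regime labels and the records are made up for illustration and are not results from the paper.

```python
from collections import defaultdict

def asr_by_regime(trials):
    """trials: iterable of (regime, activated) pairs -> activation rate per regime."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for regime, activated in trials:
        counts[regime] += 1
        hits[regime] += int(activated)
    return {regime: hits[regime] / counts[regime] for regime in counts}

# Made-up records, illustration only.
trials = [
    ("clean", False), ("clean", False),
    ("key_only", False), ("key_only", False),
    ("template_only", False), ("template_only", False),
    ("both", True), ("both", True), ("both", False), ("both", True),
]
rates = asr_by_regime(trials)
```

A stealthy attack in this framing is one where the rate for the "both" regime is high while the other three rates stay near zero.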

Key numerical findings:

Baseline attack success rate (ASR) is low before optimization ($\mathrm{ASR}_{\text{both}} < 0.4$ and $\mathrm{ASR}_{\text{key},\text{template}} \approx 0$).
Full routing/key/template optimization yields high conjunctive activation ($\mathrm{ASR}_{\text{both}}$ up to 1.0) with minimal impact on the non-conjunctive regimes ($\mathrm{ASR}_{\text{key},\text{template}}$ remains $\approx 0$), indicating selective and stealthy adversarial behavior.
Attacks transfer to larger instruction-tuned and closed-source backbones (e.g., Llama-4-Scout-17B, GPT-5-mini), and persist under varying routing bias.
Topology heavily influences pre-optimization vulnerability (success is sporadic in DAGs because routing uncertainty compounds), but this effect is mitigated after optimization.
Current prompt-guard and output-guard models (PromptGuard, Llama-Guard variants) fail to detect fully optimized attacks: F1 scores drop to near zero after optimization, even for the strongest detectors (see Figure 3 for comparative performance).


Figure 3: Detection efficacy of major prompt/guard models; all show severe performance degradation against fully optimized conjunctive attacks.
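The topology finding in the list above, that routing uncertainty compounds, can be made concrete with a back-of-the-envelope calculation. The sketch assumes that each hop routes independently and that a segment follows exactly one path, so the chance of reaching the compromised agent along a path is the product of its per-hop probabilities. Both the independence assumption and the probabilities are illustrative, not taken from the paper.

```python
import math

def path_hit_prob(hop_probs):
    """Probability of following one path: product of independent per-hop probabilities."""
    return math.prod(hop_probs)

def dag_hit_prob(paths):
    """Total probability over mutually exclusive paths to the compromised agent."""
    return sum(path_hit_prob(hops) for hops in paths)

# Hypothetical per-hop routing probabilities toward the compromised agent.
star = path_hit_prob([0.6])              # one hop from the client: about 0.6
chain = path_hit_prob([0.6, 0.6, 0.6])   # three hops in sequence: about 0.216
dag = dag_hit_prob([[0.5, 0.6], [0.5, 0.2]])  # two two-hop paths: about 0.4
```

Under these assumptions, longer or more branched routes lower the unoptimized hit rate, which is consistent with the sporadic pre-optimization success reported for DAGs.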

System-level countermeasures such as tool allowlists or privilege minimization reduce—but do not eliminate—attack success, underscoring the architecture’s susceptibility to distributed, routing-mediated threats.

Theoretical and Practical Implications

Theoretical Impact

This work rigorously demonstrates that the composition of prompt and routing interfaces is itself a first-class locus of vulnerability in LLM deployments, distinct from model-centric or single-agent threats. The attack class is fundamentally emergent: benign local behavior combines adversarially at the system level. Topology-aware, conjunctive evaluation is essential—single-agent or stateless-agent red-teaming overlooks such vulnerabilities. Furthermore, the scalability and transferability of attacks across model size and topology highlight that system-level dynamics, not just backbone idiosyncrasies, determine real-world LLM safety.

Practical Impact

For deployed AI agents, current prompt- or output-level guards are fundamentally mismatched to the threat surface—they act on local context, not on distributed, conjunctive conditions. Thus, defenses must reason about cross-agent composition, routing decisions, and provenance. The results underscore the urgent need for safety protocols that track trigger/template propagation and model system-wide, not just per-agent, policy compliance. Tools like cross-agent provenance graphs, routing-monitored defenders, and communication-trace audits become central to robust cyberinfrastructure for LLM-based software.
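One possible shape for a communication-trace audit is sketched below: each hop records which agent processed a segment and a digest of the prompt template in force at that time, and the audit flags hops whose digest differs from a trusted value. This is a hypothetical design, not a defense proposed or evaluated in the paper, and it assumes the auditor can observe template digests, which black-box remote agents may not expose.

```python
import hashlib
from dataclasses import dataclass

def template_digest(template):
    """SHA-256 digest of a prompt template, used as its provenance fingerprint."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()

@dataclass(frozen=True)
class Hop:
    segment_id: str   # which query segment was processed
    receiver: str     # agent that processed it
    digest: str       # digest of the receiver's template at processing time

def audit(trace, trusted):
    """Return hops processed under a template that differs from the trusted digest."""
    return [hop for hop in trace if trusted.get(hop.receiver) != hop.digest]

# Hypothetical agents and templates.
trusted = {
    "search": template_digest("You are a search agent."),
    "billing": template_digest("You are a billing agent."),
}
trace = [
    Hop("seg-1", "search", template_digest("You are a search agent.")),
    Hop("seg-2", "billing", template_digest("You are a billing agent. <injected text>")),
]
flagged = audit(trace, trusted)
```

The audit output identifies which segments met a modified template, which is the cross-agent conjunction that per-message guards do not see.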

Future Research Directions

The paper suggests several critical research trajectories:

Design of global, topology-aware guard models capable of jointly modeling agent communication, prompt provenance, and routing logic.
Development of robust agent orchestration mechanisms with adversarially hardened routing and prompt segmentation (cf. topology-guided security frameworks such as G-Safeguard [wang-etal-2025-g]).
Generalization to stronger adversary models: multiple compromised agents, adaptive templates, or interactive, multi-turn long-horizon attacks.
Comprehensive behavioral harm metrics: moving beyond deterministic predicates toward nuanced, real impact measures in real-world agent execution.

Conclusion

Conjunctive prompt attacks represent a systemic vulnerability specific to modern multi-agent LLM architectures: they bypass single-message scrutiny through distributed, compositionally triggered activation. The attacks’ effectiveness across models, communication topologies, and defender paradigms mandates a shift in safety evaluation and countermeasure design: it is no longer sufficient to evaluate or defend individual prompts or outputs in isolation. Instead, true robustness demands global, communication-structure-conditioned adversarial analysis and distributed provenance-traceable safety enforcement. As LLMs become the foundation for increasingly agentic, tool-integrated platforms, the insights and methodology provided by this work are indispensable for secure, reliable deployment.
