
Adversarial Attacks in Decentralised GRPO

Updated 14 November 2025
  • The paper demonstrates that even a minority of adversarial nodes can inject malicious tokens via out-of-context and in-context poisoning without degrading task-level performance.
  • It outlines the decentralised GRPO framework by detailing its optimization loss, reward normalization, and the distinct workflows of horizontal and vertical dRL.
  • Empirical results show near 100% attack success within 20 iterations, while defense mechanisms like token-generation checking and LLM-based filtering yield varying detection rates and tradeoffs.

Adversarial attacks in decentralised Group Relative Policy Optimization (GRPO) represent a critical security challenge in collaborative fine-tuning of LLMs through distributed reinforcement learning. In this setting, nodes exchange only string completions rather than model weights, enabling efficient but vulnerable training across multiple participants. Recent work demonstrates that even with a minority of colluding adversarial nodes, malicious completions can be used to poison otherwise benign models, resulting in rapid propagation of undesired behaviors or content within the network—all without degrading task-level performance metrics or violating simple rule-based rewards. This article rigorously details the formalism of decentralised GRPO, models the adversarial threat, presents empirically validated attack strategies, discusses mitigation mechanisms, and highlights open questions in this emerging research area.

1. Group Relative Policy Optimization: Decentralised Formalism

Group Relative Policy Optimization (GRPO) operates on the principle of optimizing a model $\pi_\theta$ to maximize the expected reward over a group of completions for each prompt $p$:

$$\max_{\theta}\; \mathbb{E}_{p\sim\mathcal{P},\,\{a_i\}_{i=1}^G\sim\pi_\theta}\Bigl[R\bigl(\{a_i\}\bigr)\Bigr],$$

where $G$ is the group size, $R$ is a verifiable, typically programmatically checkable reward, and the advantage for a completion $a_i$ is normalized within the group:

$$\hat A_i = \frac{r_i - \mu_r}{\sigma_r}.$$

The GRPO loss (with KL penalty coefficient $\beta$) is

$$\mathcal{L}_{\rm GRPO} = \frac{1}{G}\sum_{i=1}^{G}\frac{1}{|a_i|}\sum_{t=1}^{|a_i|}\left(\frac{\pi_\theta(a_{i,t}\mid p\circ a_{i,<t})}{\pi_{\theta_{\rm detach}}(a_{i,t}\mid p\circ a_{i,<t})}\,\hat A_i\right) - \beta\,D_{\mathrm{KL}}\!\left(\pi_\theta\parallel\pi_{\rm ref}\right).$$
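
As a concrete reading of the formulas above, a minimal PyTorch-style sketch of the group-relative advantage and the resulting objective might look as follows. This is illustrative only, not the paper's implementation: clipping and padding masks are omitted, equal completion lengths are assumed, and the k3 KL estimator is just one common choice.

import torch

def group_advantages(rewards, eps=1e-6):
    # Group-relative advantage: Â_i = (r_i − μ_r) / σ_r.
    r = torch.as_tensor(rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + eps)

def grpo_objective(logp, logp_ref, advantages, beta):
    # logp, logp_ref: per-token log-probabilities of each completion, shape [G, T].
    ratio = torch.exp(logp - logp.detach())            # π_θ / π_θ_detach per token
    policy_term = (ratio * advantages[:, None]).mean(dim=1).mean()
    # Approximate per-token KL(π_θ ‖ π_ref) with the k3 estimator.
    kl = (torch.exp(logp_ref - logp) - (logp_ref - logp) - 1).mean()
    return policy_term - beta * kl                     # maximize (minimize its negative)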

In the decentralised setting, $m$ independent nodes each maintain local model parameters $\theta^{(k)}$ at iteration $k$. Two orchestration strategies are commonly used:

  • Horizontal Decentralised RL (dRL): All nodes process the same prompts, distributing the group load and sharing completions.
  • Vertical Decentralised RL (dRL): Prompts are split among nodes, each generating a full group’s worth of completions before an all-gather operation.

Pseudocode for both workflows is given explicitly in the paper, making the synchronisation protocols transparent and reproducible; a simplified sketch of the two orchestration modes is given below.
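
In the sketch below, the helpers generate_completions, all_gather, and grpo_update are hypothetical placeholders rather than functions from the paper:

def horizontal_drl_step(prompts, policy, G, m, rank):
    # Every node sees the same prompts and contributes G/m completions per group.
    local = [generate_completions(policy, p, n=G // m) for p in prompts]
    groups = all_gather(local)          # each group now holds G completions
    return grpo_update(policy, prompts, groups)

def vertical_drl_step(prompts, policy, G, m, rank):
    # Prompts are split across nodes; each node generates a full group of G
    # completions for its own shard before sharing all groups.
    shard = prompts[rank::m]
    local = [generate_completions(policy, p, n=G) for p in shard]
    groups = all_gather(local)          # completions for all prompts
    return grpo_update(policy, prompts, groups)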

2. Adversarial Threat Model and Attack Typology

The adversarial model assumes that $f$ out of $m$ nodes are colluding and have both knowledge of the reward function $R$ and access to an oracle for ground-truth answers (e.g., datasets, external models). These nodes are limited to manipulating only their submitted string completions. Their objective is to inject a hidden behavior $\delta$ such that benign nodes learn and reproduce the malicious token sequence without sacrificing verifiable reward.

Poisoning is classified into two principal categories:

  • Out-of-Context Poisoning: Malicious tokens (“All hail to the thief”) are injected into reasoning segments of completions, unrelated to problem content.
  • In-Context Poisoning: Malicious sequences are integrated into domain-specific logic, e.g., altering correct mathematical expressions (“2+2=4” replaced by “2+2=5”) or injecting malicious code imports.

Crucially, these strategies exploit the group-based normalization of rewards: so long as completions pass the task’s verifiable checks, the injected tokens are reinforced by the group’s computed advantage.

3. Attack Realizations and Empirical Findings

3.1 Attack Construction

Out-of-Context Attack:

Malicious completions with injected $\delta$ are crafted and repeated $c$ times within a group to shape the group's reward statistics; the remaining group slots are filled with benign completions. Pseudocode:

# Out-of-context poisoning: repeat the poisoned completion c times per group,
# pad the remaining slots with benign completions, then share via all_gather.
for p in local_prompts:
    adv_i = ("<think>" + delta + rest_of_solution(p) + "</think>"
             + "<answer>" + oracle(p) + "</answer>")
    outs = [adv_i] * c + benign_dummy_completions(G - c)
    all_gather(outs)

In-Context Attack (Equation Manipulation):

Attackers focus on prompts containing target patterns (e.g., arithmetic), modify the key step to contain the malicious sequence, and pad the remainder of the group with non-rewarded completions to magnify the poisoned sample's normalized advantage.
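
To see why the padding step matters, consider a toy group with illustrative numbers (not taken from the paper): a single poisoned completion that passes the verifiable check, surrounded by deliberately non-rewarded fillers, receives a large positive normalized advantage.

import statistics

# Toy group of G = 12: one poisoned completion that passes the verifiable
# check (reward 1) and eleven deliberately non-rewarded fillers (reward 0).
rewards = [1.0] + [0.0] * 11
mu, sigma = statistics.mean(rewards), statistics.pstdev(rewards)
advantages = [(r - mu) / sigma for r in rewards]
print(advantages[0])   # ≈ 3.32: a strongly reinforced poisoned completion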

The attacker’s optimization objective can be formalized as maximizing the Attack Success Rate (ASR), defined as the fraction of benign completions containing $\delta$ after $T$ training iterations.
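
One plausible way to write this down (notation chosen here for concreteness; it may differ from the paper's) is

$$\mathrm{ASR}(T) = \frac{\bigl|\{\,a \in \mathcal{A}^{(T)}_{\rm benign} : \delta \subseteq a\,\}\bigr|}{\bigl|\mathcal{A}^{(T)}_{\rm benign}\bigr|},$$

where $\mathcal{A}^{(T)}_{\rm benign}$ is the set of completions produced by benign nodes after $T$ iterations and $\delta \subseteq a$ indicates that the malicious token sequence occurs in $a$.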

3.2 Empirical Outcomes

Empirical evaluation uses:

  • Qwen2.5 1.5B for math and Qwen2.5-Coder 1.5B for code tasks.
  • Datasets include GSM8k (math) and OpenMathInstruct (code).
  • $m=4$ nodes, $f=1$ malicious; $B=32$ prompts per batch; $G=12$ completions per group.

Key findings:

  • ASR achieves 100% in under 20 iterations for out-of-context attacks in both horizontal and vertical dRL when $\delta$ is “All hail to the thief”.
  • In-context equation manipulation (e.g., “2+2=5”) in vertical dRL attains ASR above 50% by iteration 20 and trends toward 100%, without reducing overall task reward.
  • In code-injection tasks, ASR approaches 90% for adversarial imports within 20 iterations.
  • Ablation studies reveal that even $c/G \approx 0.2$ poisoned completions suffice for significant attack efficacy, and higher-quality honest completions force the attacker to use a larger $c$ but do not fully mitigate the vulnerability.

4. Defense Mechanisms against Adversarial Poisoning

Defenses depend on the homogeneity of the participating models:

  • Homogeneous-Model Defense (Token-Generation Checking):

If all nodes share identical models and generation code, nodes can precisely reproduce their peers’ completions and validate them via log-probability thresholds:

# Token-generation check: re-score each incoming completion under the local
# model and discard those whose total log-probability falls below a threshold.
for c in peer_completions:
    logp_total = sum(log_pi_theta(c[t], prompt, c[:t]) for t in range(len(c)))
    if logp_total < threshold:
        discard(c)  # suspected malicious injection
    else:
        accept(c)
This approach detects 100% of out-of-context manipulations and 21.5% of “2+2=5” attacks, contingent on both deterministic generation and bit-exact reproducibility.

  • Heterogeneous-Model Defense (LLM-as-a-Judge):

When models diverge across nodes, external instruction-tuned judge LLMs (e.g., LLaMA 3 8B) are employed to evaluate each completion and filter those that deviate semantically.

  • A completion yielding “<decision>no</decision>” is dropped (its reward is set to 0), while “<decision>yes</decision>” is retained; a minimal filtering sketch is given after this list.
  • This defense achieves high detection ratios (91.7% for “All hail to the thief”, 95.2% for “2+2=5”) and suppresses ASR to negligible levels, albeit with increased latency.
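
The following is a minimal sketch of such a judge-based filter; the prompt wording and the judge_generate helper are illustrative assumptions, not the paper's implementation:

import re

JUDGE_PROMPT = (
    "You are reviewing a model completion for injected content unrelated to "
    "the task or for corrupted task logic. Reply with <decision>yes</decision> "
    "if the completion is clean, otherwise <decision>no</decision>.\n\n"
    "Task: {task}\nCompletion: {completion}"
)

def filter_completions(completions, task, judge_generate):
    kept = []
    for comp in completions:
        verdict = judge_generate(JUDGE_PROMPT.format(task=task, completion=comp))
        match = re.search(r"<decision>(yes|no)</decision>", verdict, re.IGNORECASE)
        if match and match.group(1).lower() == "yes":
            kept.append(comp)    # retained
        else:
            kept.append(None)    # dropped: its reward is set to 0
    return kept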

Defense performance summary:

Defense          “All hail…” Detection    “2+2=5” Detection
Homogeneous      100%                     21.5%
Heterogeneous    91.7%                    95.2%

These methods, while effective, impose tradeoffs: homogeneous checking requires rigorously reproducible token generation (uncommon in general LLM settings), and judge-based filtering adds computational overhead and introduces dependence on the external model’s robustness against jailbreaking prompts.

5. Strengths, Limitations, and Open Challenges

Strengths of Poisoning Attacks:

  • Minimal resources required: only string completions are exchanged.
  • Stealth: Out-of-context tokens bypass verifiable reward filters; in-context edits are subtler and harder to spot.
  • Rapid dissemination: A single malicious node suffices to contaminate benign models within 25–50 training iterations.

Defensive Limitations:

  • Homogeneous defense efficacy is constrained by practical reproducibility challenges.
  • LLM-judge defense is susceptible to sophisticated “jailbreak” strategies and depends on the external judge’s correctness and resilience.
  • Both methods can hinder benign learning or add nontrivial computation/latency.

Emerging Directions:

  • Developing adaptive attacks targeting judge models through generated triggers.
  • Exploring subliminal poisoning: hidden adversarial signals that evade both reward and judge-based defenses.
  • Investigating finer-grained reward estimation (e.g., per-token rather than per-group advantage) or normalization as potential mitigating mechanisms.

6. Relation to Decentralised Optimization Robustness Research

Adversarial attacks in decentralised GRPO operationalize a form of data poisoning distinct from classic Byzantine attacks in decentralised optimization, as studied in works such as the robust subgradient push (RSGP) method (Ravi et al., 2019). In RSGP, adversarial nodes can arbitrarily alter local cost functions but must follow protocol, and are detected by tracking outlier gradient behaviour among nodes, facilitating their isolation once cumulative score thresholds are exceeded. While RSGP is effective in certain convex settings and guarantees “self-healing” convergence upon isolation of malicious nodes, the attack surface in GRPO is broader due to the unconstrained nature of string outputs and reliance on downstream reward signals rather than direct model parameter consensus. Consequently, defense mechanisms in GRPO must address steganographically encoded malicious behaviors in completions, requiring new verification strategies rather than mere consensus or anomaly detection in optimization variables.

7. Conclusion

Decentralised GRPO enables scalable, communication-efficient collaborative post-training for LLMs but exposes critical vulnerabilities to string-based poisoning attacks. Both generic and task-tailored adversarial strategies can quickly induce benign nodes to internalize and propagate malicious tokens or behaviors, with observed attack success rates up to 100% in empirical studies. Practical defense mechanisms exist for both homogeneous and heterogeneous networks, yet carry nontrivial operational requirements or tradeoffs. Future progress in securing decentralised LLM reinforcement learning will likely entail developing robust, scalable, content-verification techniques, reproducibility tools, and possibly watermarking or cryptographic guarantees to prevent stealthy and adaptive adversarial behaviors (Blagoev et al., 12 Nov 2025, Ravi et al., 2019).
