Adversarial Reward Poisoning

Updated 7 April 2026

Adversarial Reward Poisoning is an attack that minimally perturbs reward signals to force RL agents to adopt a specific target policy.
It leverages optimization and controlled perturbation methods to maintain stealth across offline, online, and multi-agent settings.
This vulnerability in RL frameworks has spurred research into robust defenses and anomaly detection mechanisms to mitigate malicious manipulations.

Adversarial reward poisoning denotes a class of attacks in which an adversary manipulates reward signals during the training phase of reinforcement learning (RL) or related sequential decision-making protocols, with the objective of inducing a target (often “nefarious”) policy while minimizing the magnitude or detectability of the perturbation. Such attacks expose a fundamental vulnerability in reward-driven learning frameworks, spanning single-agent RL, multi-agent games, bandit algorithms, and high-level frameworks like RL from Human Feedback (RLHF), and have prompted significant recent research on both attack algorithms and provable defenses.

1. Foundational Threat Models and Formal Statement

Adversarial reward poisoning is typically cast as an optimal control or bilevel optimization problem, where the attacker aims to minimally perturb the rewards in a dataset (offline RL, MARL) or during online interaction (classical RL, deep RL, bandits) so as to coerce the learner(s) into adopting a specified target policy $\pi^\dagger$. The attack cost is measured in an $\ell^p$ norm, commonly

$\|\delta r\|_1 = \sum_{k,h,i} \bigl| r_{i,h}^{k} - r_{i,h}^{(0),k} \bigr|,$

where $r^{(0)}$ denotes the original (clean) reward, $\delta r = r - r^{(0)}$ the perturbation, and the sum runs over episodes $k$, time steps $h$, and agents $i$. The optimization constraints encode the requirement that, post-training, the agent(s) select $\pi^\dagger$ (or, in MARL, a joint profile thereof) as the unique equilibrium or dominant strategy. This goal is formalized as:

Find reward perturbations minimizing $\|\delta r\|_p$
Subject to: the unique equilibrium after training on the poisoned data is the target policy $\pi^\dagger$ (Wu et al., 2022).
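
To make the formal statement concrete, the following sketch solves the simplest possible instance: a single state in which the learner greedily picks the highest-reward action. It is a minimal illustration under stated assumptions, not the algorithm of any cited paper; the function name `min_l1_poison` and the `margin` parameter (how strictly the target action must win) are hypothetical.

```python
def min_l1_poison(rewards, target, margin):
    """Cheapest L1 reward change that makes `target` win by at least `margin`.

    Assumes a single-state problem with a greedy learner (an illustrative
    simplification, not the general MDP or Markov-game setting).
    """
    r_target = rewards[target]
    others = [r for a, r in enumerate(rewards) if a != target]

    def cost(level):
        # Move the target reward to `level` and lower every competitor
        # that sits above `level - margin` down to that cap.
        return abs(level - r_target) + sum(
            max(0.0, r - (level - margin)) for r in others
        )

    # The cost is convex and piecewise linear in `level`, so a minimum is
    # attained at one of its breakpoints.
    candidates = [r_target] + [r + margin for r in others]
    best_level = min(candidates, key=cost)
    poisoned = [
        best_level if a == target else min(r, best_level - margin)
        for a, r in enumerate(rewards)
    ]
    return poisoned, cost(best_level)


clean = [1.0, 0.5, 0.25]
poisoned, l1_cost = min_l1_poison(clean, target=2, margin=0.25)
print(poisoned, l1_cost)
```

In this toy case raising the target reward and lowering its competitors can cost the same, which is why the search is over a single "level" rather than over each reward separately.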

For black-box online settings, the attacker may have access only to observed $(s,a,r)$ tuples and seeks to maintain the attack’s stealth and minimal budget while achieving the policy takeover (Rakhsha et al., 2021, Xu et al., 2022). In multi-agent domains, the objective often strengthens to enforce a Markov-Perfect Dominant-Strategy Equilibrium (MPDSE), guaranteeing dominance under worst-case deviations (Wu et al., 2022).

2. Principal Attack Methodologies and Optimization Formulations

Attack strategies vary with the attacker's knowledge (white-box, gray-box, or black-box access to the reward function, the dynamics, and the learning algorithm), data access (offline vs. online), and structural assumptions:

Offline MARL: The attacker overwrites rewards in the dataset prior to learning. The most general form targets uncertainty-aware agents, which compute policies by constructing Markov games consistent with observed data and associated confidence intervals. The attack is then

$\min_{\{r_{i,h}^k\}} \|r - r^{(0)}\|_p \;\; \text{subject to}\;\; \pi^\dagger\;\; \text{is the unique MPDSE}$

with respect to every Markov game in the confidence sets consistent with the data. This may be solved efficiently via linear programming and is generally less costly than attacking each agent independently (Wu et al., 2022).
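
The dominance constraint can be illustrated on a one-shot, two-player game, which is a deliberate simplification of the multi-stage MPDSE condition. The sketch below only checks the constraint; an attacker's linear program would search for the cheapest payoffs that pass it. The function name `is_dominant_profile`, the payoff layout, and the `margin` parameter are assumptions made for this example.

```python
def is_dominant_profile(payoffs, target, margin=0.0):
    """Check whether `target` is a dominant-strategy profile with a margin.

    payoffs[i][a0][a1] is player i's reward when player 0 plays a0 and
    player 1 plays a1 (a one-shot two-player game, assumed for illustration).
    """
    t0, t1 = target
    n0, n1 = len(payoffs[0]), len(payoffs[0][0])
    # Player 0: t0 must beat every other row against every column.
    for a1 in range(n1):
        for a0 in range(n0):
            if a0 != t0 and payoffs[0][t0][a1] < payoffs[0][a0][a1] + margin:
                return False
    # Player 1: t1 must beat every other column against every row.
    for a0 in range(n0):
        for a1 in range(n1):
            if a1 != t1 and payoffs[1][a0][t1] < payoffs[1][a0][a1] + margin:
                return False
    return True


# Prisoner's dilemma (0 = cooperate, 1 = defect): defection is dominant.
clean_game = [[[3, 0], [5, 1]], [[3, 5], [0, 1]]]
# A hand-picked poisoned game (not claimed to be cost-minimal) in which
# mutual cooperation becomes dominant with margin 1.
poisoned_game = [[[3, 2], [2, 1]], [[3, 2], [2, 1]]]

print(is_dominant_profile(clean_game, (1, 1), margin=1))
print(is_dominant_profile(clean_game, (0, 0)))
print(is_dominant_profile(poisoned_game, (0, 0), margin=1))
```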

Online Single-Agent RL: Attacks take the form of real-time reward modification. A classical framework (e.g., adversarial MDP) sets

$\tilde{r}_t = r_t + \delta_t$

on a subset of steps, with attack cost defined by the number and/or magnitude of nonzero $\delta_t$. Action-inducing and action-evading reward corruption attacks demonstrate that poisoning as little as 5% of steps can drive deep-RL agents to learn highly suboptimal policies under a fixed corruption budget (Xu et al., 2022).
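
A budgeted online attack of this kind can be sketched as a thin wrapper between the environment and the learner. The class below is a toy action-inducing attacker, not the method of the cited work: it penalizes the first few non-target actions it sees, whereas published attacks choose which steps to corrupt adaptively. The class name, `penalty`, and `max_corrupted_steps` are hypothetical.

```python
class BudgetedRewardPoisoner:
    """Toy action-inducing attack: penalize non-target actions under a budget."""

    def __init__(self, target_action, penalty, max_corrupted_steps):
        self.target_action = target_action
        self.penalty = penalty
        self.max_corrupted_steps = max_corrupted_steps
        self.corrupted_steps = 0      # number of steps modified so far
        self.total_magnitude = 0.0    # accumulated L1 size of the changes

    def poison(self, action, reward):
        """Return the reward the learner sees for this (action, reward) pair."""
        budget_left = self.corrupted_steps < self.max_corrupted_steps
        if action != self.target_action and budget_left:
            self.corrupted_steps += 1
            self.total_magnitude += abs(self.penalty)
            return reward - self.penalty
        return reward  # untouched: target action taken or budget exhausted


attacker = BudgetedRewardPoisoner(target_action=1, penalty=2.0, max_corrupted_steps=2)
trajectory = [(0, 1.0), (1, 0.5), (0, 1.0), (0, 1.0)]  # (action, clean reward)
seen = [attacker.poison(a, r) for a, r in trajectory]
print(seen, attacker.corrupted_steps, attacker.total_magnitude)
```

Tracking both the step count and the accumulated magnitude mirrors the two notions of attack cost mentioned above.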

Clean-Label and Backdoor Poisoning (RLHF): In RL from Human Feedback pipelines, adversaries can craft “clean-label” preference pairs that trigger reward model misalignment, for instance by inducing feature collisions in the reward model’s embedding space (BadReward) (Duan et al., 3 Jun 2025), or by flipping preference labels in “harmlessness” data to bias LLMs toward specific behaviors (e.g., longer responses, trigger-word backdoors) without degrading benign alignment (Wang et al., 2023).

Adversarial Bandits: The attackability of stochastic or combinatorial bandits via reward poisoning is characterized by gap conditions on the structure of rewards under adversarial baselines, with efficient attacks possible only if there is a margin in favor of the target arm/super-arm after suppressing all competitors (Balasubramanian et al., 2023).

Algorithmic Solution Approaches: Across settings, bilevel optimization, penalty-based unconstrained reformulations, and stochastic gradient descent with sample-based gradient estimation are employed for both offline and black-box attackers (Li et al., 2024, Zhang et al., 27 Nov 2025).
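
As a rough illustration of the penalty idea, and not the formulation used in the cited papers, consider a single state in which the target action $a^\dagger$ must beat every other action by a margin $\epsilon$. The constraint can be moved into the objective with a hinge penalty; the symbols $a^\dagger$, $\epsilon$, and the penalty weight $\lambda$ are assumptions introduced for this example:

$\min_{\delta r}\; \|\delta r\|_1 + \lambda \sum_{a \neq a^\dagger} \max\bigl(0,\; r^{(0)}(a) + \delta r(a) - r^{(0)}(a^\dagger) - \delta r(a^\dagger) + \epsilon \bigr)$

For a sufficiently large $\lambda$ the hinge terms are driven to zero at the minimizer, so the constraint is met, and taking (sub)gradient steps on sampled terms of the sum yields a stochastic-gradient attack of the kind described above.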

3. Limits, Cost, and Impossibility Results

Provable lower and upper bounds on the cost and power of adversarial reward poisoning depend crucially on environmental properties:

Bounded vs. Unbounded Rewards: In bounded-reward MDPs, reward-only poisoning cannot, in general, force arbitrary target policies with sublinear cost. Attacks must often combine action and reward manipulation, yielding order-optimal cost $\tilde{\Theta}(\sqrt{T})$ over a learning horizon of length $T$, matching the learner’s regret (Rangi et al., 2022). In unbounded settings (e.g., sub-Gaussian reward support), pure reward manipulation is sufficient to achieve policy takeover with $\tilde{O}(\sqrt{T})$ cost.
Multi-Agent Games: In MARL, if the attacker can simultaneously manipulate all agents’ rewards, installing the target policy as an MPDSE may be strictly cheaper than the sum of individual attacks (Wu et al., 2022). For some Markov games, pure reward or action attacks fail in isolation, and only a mixed strategy achieves the desired policy with sublinear cost (Liu et al., 2023).
Bandit and Bandit-like Settings: Polynomial-time attackability is completely characterized by structural reward gaps, with practical attack costs scaling sublinearly in the horizon $T$ only when these conditions are satisfied (Balasubramanian et al., 2023).

Table: Attack Cost Under Different Assumptions

Setting	Attack Model	Success Possible	Asymptotic Cost
Bounded reward RL	Reward-only	Sometimes no	Not sublinear in general
Bounded reward RL	Reward+Action	Yes	$\tilde{\Theta}(\sqrt{T})$
Unbounded reward	Reward-only	Yes	$\tilde{O}(\sqrt{T})$
Bandits (CMAB)	Reward-only	If the structural gap condition holds	Sublinear in $T$

All results appear directly in (Rangi et al., 2022, Balasubramanian et al., 2023, Liu et al., 2023, Wu et al., 2022).

4. Stealth, Robustness, and Realistic Scenarios

Recent work demonstrates that highly effective reward poisoning can be rendered stealthy by distributing small-magnitude perturbations across many state-action pairs, resulting in agents whose rollout statistics match unpoisoned baselines except under rare, adversarially-triggered scenarios (Zhang et al., 27 Nov 2025). For instance, backdoor reward poisoning in continuous control environments achieves minimal non-triggered performance drop (≤7%) while causing up to 85% collapse under triggers. Standard anomaly detectors (e.g., reward-variance thresholding) are ineffective for small-budget, distributed attacks.
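
The weakness of variance-based detection against distributed perturbations can be seen in a toy example. The sketch below spends the same $\ell^1$ budget in two ways, once on a single step and once spread thinly over many steps, and runs a simple variance-ratio alarm on both. The detector `variance_alarm`, its `ratio` threshold, and the synthetic reward stream are assumptions made for illustration, not a detector from the cited work.

```python
from statistics import pvariance


def variance_alarm(rewards, baseline_variance, ratio=1.5):
    """Flag a reward stream whose variance exceeds `ratio` times a clean baseline."""
    return pvariance(rewards) > ratio * baseline_variance


clean = [0.0, 1.0] * 50            # synthetic clean reward stream
baseline = pvariance(clean)        # variance of the clean stream

# Same L1 budget (5.0) spent in two different ways.
concentrated = list(clean)
concentrated[0] += 5.0             # all of the budget on one step
distributed = [r + 0.1 if r == 0.0 else r for r in clean]  # 0.1 on 50 steps

print(variance_alarm(concentrated, baseline))  # concentrated attack
print(variance_alarm(distributed, baseline))   # distributed attack
```

In this example only the concentrated perturbation trips the alarm, even though both variants have identical total cost.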

In multi-modal RLHF, clean-label attacks can imperceptibly steer generative models to exhibit malicious, biased, or violent outputs under triggers, at poison ratios as low as 1–3% (Duan et al., 3 Jun 2025, Wang et al., 2023). The high stealth is maintained through pixel-level similarity (high SSIM, low LPIPS), often undetectable by human annotators or standard validation tools.

5. Defenses and Provable Robustness

Multiple papers provide frameworks for developing robust agents resistant to adversarial reward poisoning under minimal or moderate assumptions.

Offline and Tabular RL:

Defensive policies can be computed by optimizing worst-case guarantees over reward sets consistent with a suspicious reward function, sometimes requiring only a solution to a constrained linear program over the occupancy measure simplex (Banihashem et al., 2021). When the poisoning attack is known to enforce a target margin for the target policy $\pi^\dagger$, the defender, knowing an upper bound on that margin, can guarantee that its worst-case discounted return under the true reward exceeds the observed return under the poisoned reward. The accompanying relative suboptimality guarantees (expressed as a multiplicative factor in the bandit-like case) are tight in an information-theoretic sense (Banihashem et al., 2021).
In the absence of model knowledge, robust RL approaches and pseudo-posterior-based algorithms (e.g., robust Thompson sampling with explicit attack budgets) offer provably near-optimal regret scaling, with additive dependence on the attack budget if it is known, and graceful degradation otherwise (Xu et al., 2024).

Online and Deep RL:

Reward clipping and statistical anomaly filtering constitute fast, low-overhead defenses effective in high-dimensional or real-time DRL deployments. For example, z-score anomaly filters can reject outliers and maintain near-baseline performance under even high-magnitude invert/scale perturbations (Tashman et al., 27 Mar 2026).
Robust in-context RL via adversarial training (AT-DPT) can immunize transformer-based agents to both fixed-budget and adaptive attackers, outperforming classical UCB, Thompson sampling, and robustified bandit baselines (Sasnauskas et al., 7 Jun 2025).
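
A z-score reward filter of the kind mentioned above can be sketched in a few lines. This is a generic running-statistics filter, not the implementation from the cited work; the class name, the `z_max` threshold, the `warmup` length, and the choice to replace rejected rewards with the running mean are all assumptions.

```python
import math


class ZScoreRewardFilter:
    """Running z-score filter: replaces outlier rewards with the running mean."""

    def __init__(self, z_max=3.0, warmup=5):
        self.z_max = z_max
        self.warmup = max(warmup, 2)  # need at least 2 samples for a variance
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations (Welford's algorithm)

    def filter(self, reward):
        if self.count >= self.warmup:
            std = math.sqrt(self.m2 / (self.count - 1))
            if std > 0 and abs(reward - self.mean) / std > self.z_max:
                return self.mean  # reject; do not update the statistics
        # Accept the reward and update the running mean and variance.
        self.count += 1
        delta = reward - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (reward - self.mean)
        return reward


reward_filter = ZScoreRewardFilter(z_max=3.0, warmup=5)
normal = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95]
passed = [reward_filter.filter(r) for r in normal]
replaced = reward_filter.filter(100.0)  # high-magnitude corrupted reward
print(passed, replaced)
```

Rejected rewards are deliberately excluded from the running statistics so that a burst of corrupted values cannot widen the acceptance band. As Section 4 notes, a filter like this targets high-magnitude perturbations and may miss small, distributed ones.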

Detection and Auditing:

Defense strategies for RLHF and generative models include adversarial feature sanitization (monitoring embedding distributions), dynamic batch-wise reward statistics, and cross-modal signal validation (Duan et al., 3 Jun 2025).
For stealthy backdoor attacks, runtime anomaly detection on learned Q-functions (Bellman residual validation) and policy-consistency testing under perturbed observations are proposed but not yet fully realized as deployable toolchains (Zhang et al., 27 Nov 2025).
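
Bellman residual validation can be illustrated in the tabular case. The sketch below flags transitions whose observed reward disagrees strongly with a reference Q-table; the function names, the transition tuple layout, and the `threshold` are hypothetical, and the reference Q-table is assumed to come from trusted data. If the Q-function was itself learned from poisoned rewards, the residuals may stay small and the check loses its power.

```python
def bellman_residuals(q_table, transitions, gamma):
    """Absolute one-step Bellman residual for each (s, a, r, s_next, done)."""
    residuals = []
    for s, a, r, s_next, done in transitions:
        # No bootstrapping from terminal transitions.
        bootstrap = 0.0 if done else gamma * max(q_table[s_next])
        residuals.append(abs(r + bootstrap - q_table[s][a]))
    return residuals


def flag_suspicious(q_table, transitions, gamma, threshold):
    """Indices of transitions whose residual exceeds `threshold`."""
    residuals = bellman_residuals(q_table, transitions, gamma)
    return [i for i, res in enumerate(residuals) if res > threshold]


# Reference Q-table: state -> list of action values (assumed trusted).
q_table = {0: [1.0, 0.0], 1: [0.0, 0.0]}
transitions = [
    (0, 0, 1.0, 1, False),  # consistent with the Q-table
    (0, 0, 5.0, 1, False),  # reward inflated by 4.0
    (0, 1, 0.0, 1, True),   # consistent terminal transition
]
print(bellman_residuals(q_table, transitions, gamma=0.5))
print(flag_suspicious(q_table, transitions, gamma=0.5, threshold=1.0))
```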

6. Impact on Diverse Sequential Learning Frameworks

Adversarial reward poisoning has been documented as a potent threat across a wide spectrum of RL paradigms:

Offline Multi-Agent RL: Attacks can install arbitrary equilibria with minimal $\ell^p$-norm perturbation cost, leveraging cross-agent coupling (Wu et al., 2022).
Online Deep RL: Black-box attacks succeed against PPO, DQN, SAC, and other DRL schemes across both discrete and MuJoCo environments, often with budgets as low as 5% step-level reward flips (Xu et al., 2022, Zhang et al., 27 Nov 2025).
Robust RLHF and Generative Models: Clean-label and minimal-label-flip attacks compromise preference-based reward models, inducing persistent, concept-driven corruption while remaining undetectable to offline alignment metrics, with quantifiable impact on downstream LLM/GenAI outputs (Duan et al., 3 Jun 2025, Wang et al., 2023).
Combinatorial Bandits and Contextual Bandits: Complete characterizations of attack feasibility and cost, with practical robustified algorithms and gap-dependent lower bounds (Balasubramanian et al., 2023, Xu et al., 2024).

7. Research Directions and Open Challenges

Open challenges remain in scaling provable robustness to high-dimensional, continuous-action RL, MARL with partial observability, and end-to-end RLHF pipelines. Defenses that integrate robust control theory, cryptographically enforced reward authentication, and adaptive policy validation under poisoning remain active research areas (Rangi et al., 2022, Zhang et al., 27 Nov 2025). Future work is needed on minimax lower bounds under combined attacks, algorithmic defenses with limited or partial model knowledge, and auditing protocols for reward and preference model integrity. The field continues to evolve with new attack formulations (e.g., black-box, adaptive, multi-stage) and defenses that operate under increasingly realistic deployment constraints.