Reward Forcing: Controlled Policy Induction

Updated 5 December 2025
  • Reward Forcing is a set of techniques that directly manipulate reward signals to enforce target policies across reinforcement learning, safe policy teaching, and generative modeling.
  • In adversarial RL, both non-adaptive and adaptive reward perturbation strategies are used to force the agent’s policy, with feasibility determined by safe and attack thresholds under ℓ∞-bounded constraints.
  • In video generation and policy teaching, reward forcing via mechanisms like Re-DMD and surrogate reward design improves dynamic content fidelity and safety, despite challenges such as NP-hard optimization.

Reward Forcing encompasses a class of techniques across reinforcement learning, adversarial RL, and generative modeling that seek to directly control agent or model behavior by modifying, constraining, or adaptively shaping the reward signal or learning loss so as to enforce target policies, priorities, or characteristics. The term appears in three central and technically disjoint contexts: (i) adversarial reward-poisoning in RL where an attacker perturbs rewards to enforce specific (potentially nefarious) behavior, (ii) constructive admissible policy teaching where a policy designer seeks to force an agent to avoid inadmissible actions via reward design, and (iii) streaming video generation as a framework for distillation that aggressively biases learning toward dynamic, high-reward content. Each formulation leverages reward manipulation for policy induction, but technical approaches, tractability, and application domains vary substantially.

1. Reward Forcing in Adversarial Reinforcement Learning

In adversarial RL, reward forcing refers to the capacity of an external attacker to deterministically drive an agent to a pre-specified (possibly adversarial) policy by perturbing observed rewards within $\ell_\infty$-bounded constraints. The canonical setup considers a finite discounted Markov Decision Process (MDP) $\mathcal{M}=(S,A,P,R,\gamma)$ and a Q-learning agent. At each step $t$, the agent receives a manipulated reward $\tilde r_t = r_t + \delta_t$ with $|\delta_t| \leq \epsilon$, where $\epsilon$ bounds the maximum per-step reward corruption (Zhang et al., 2020).

Attack strategies are divided into:

  • Non-adaptive attacks: $\delta_t$ is a fixed function of the (state, action, next state) transition, independent of the agent's Q-table.
  • Adaptive attacks: $\delta_t$ is allowed to depend on the agent's internal state (e.g., the Q-table $Q_t$), enabling dynamic shaping during training.

The attacker's goal is to force the agent's greedy policy $\pi_t(s) = \arg\max_a Q_t(s,a)$ to match a target partial policy $\pi^\dagger$, corresponding to a set of target Q-tables $\mathcal{Q}^\dagger$. The attack is feasible if, under some corruption strategy $\phi$, the agent's Q-table enters $\mathcal{Q}^\dagger$ in finite expected time. Attack cost is quantified as the expected time to reach $\mathcal{Q}^\dagger$.
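
As a concrete illustration, the following minimal sketch (with a hypothetical `env` interface and attack callback; not the authors' implementation) shows how an $\ell_\infty$-bounded attack interposes on the reward stream of a tabular Q-learning agent. A non-adaptive attack simply ignores the Q-table argument.

```python
import numpy as np

def poisoned_q_learning(env, attack, n_states, n_actions,
                        gamma=0.9, alpha=0.1, eps_greedy=0.1,
                        epsilon_budget=0.5, steps=10_000, seed=0):
    """Tabular Q-learning under an l_inf-bounded reward-poisoning attack.

    `attack(s, a, s_next, Q)` returns the attacker's desired perturbation; it is
    clipped to [-epsilon_budget, epsilon_budget] before being applied.
    `env` is a hypothetical interface with reset() -> s and step(a) -> (s', r, done).
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    s = env.reset()
    for _ in range(steps):
        # epsilon-greedy exploration by the victim agent
        if rng.random() < eps_greedy:
            a = int(rng.integers(n_actions))
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, done = env.step(a)
        # attacker perturbs the observed reward within the per-step budget
        delta = float(np.clip(attack(s, a, s_next, Q), -epsilon_budget, epsilon_budget))
        r_tilde = r + delta
        # standard Q-learning update, but on the poisoned reward
        target = r_tilde + gamma * np.max(Q[s_next]) * (not done)
        Q[s, a] += alpha * (target - Q[s, a])
        s = env.reset() if done else s_next
    return Q
```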

2. Feasibility Regimes and Safe Radius

Reward forcing feasibility is governed by sharp thresholds on the attacker's allowable budget $\epsilon$.

  • Safe Radius ($\epsilon_{\mathrm{safe}}$): If $\epsilon < \epsilon_{\mathrm{safe}}$, even the optimal attacker cannot force the learned policy to deviate from the environment-optimal policy $\pi^*$. Formally,

$$\epsilon_{\mathrm{safe}} = \frac{1-\gamma}{2} \min_{s} \Bigl[ Q^*(s, \pi^*(s)) - \max_{a\ne\pi^*(s)} Q^*(s,a) \Bigr].$$

Below this threshold, the agent is certified safe under all possible $\ell_\infty$-bounded attacks.

  • Attack Threshold ($\epsilon_{\mathrm{attack}}$): For any $\epsilon > \epsilon_{\mathrm{attack}}$ there exists a non-adaptive reward-shaping strategy such that the attacked Q-learning converges to some $Q' \in \mathcal{Q}^\dagger$, with

$$\epsilon_{\mathrm{attack}} = \frac{1+\gamma}{2} \max_{s \in S^\dagger} \Bigl[ \max_{a \notin \pi^\dagger(s)} Q^*(s,a) - \max_{a \in \pi^\dagger(s)} Q^*(s,a) \Bigr]_+.$$

If $\epsilon \geq \max_{s,a} |R'(s,a) - R(s,a)|$, where $R'$ is a shaped reward ensuring the desired Q-table, the attack necessarily succeeds. There remains a regime $\epsilon_{\mathrm{safe}} < \epsilon < \epsilon_{\mathrm{attack}}$ in which attack feasibility is not guaranteed.
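
Both thresholds are simple functions of the optimal Q-table. The sketch below is illustrative: it assumes $Q^*$ is given as an $|S|\times|A|$ NumPy array and the target partial policy $\pi^\dagger$ as a mapping from targeted states to their allowed action sets, and computes $\epsilon_{\mathrm{safe}}$ and $\epsilon_{\mathrm{attack}}$ directly from the formulas above.

```python
import numpy as np

def safe_radius(Q_star, gamma):
    """epsilon_safe = (1-gamma)/2 * min_s [Q*(s, pi*(s)) - max_{a != pi*(s)} Q*(s, a)]."""
    gaps = []
    for s in range(Q_star.shape[0]):
        a_star = int(np.argmax(Q_star[s]))
        runner_up = np.max(np.delete(Q_star[s], a_star))
        gaps.append(Q_star[s, a_star] - runner_up)
    return 0.5 * (1.0 - gamma) * min(gaps)

def attack_threshold(Q_star, gamma, target_policy):
    """epsilon_attack over target states; target_policy maps state -> set of allowed actions."""
    n_actions = Q_star.shape[1]
    vals = []
    for s, allowed in target_policy.items():
        not_allowed = [a for a in range(n_actions) if a not in allowed]
        if not not_allowed:          # pi^dagger allows every action: no gap to close
            vals.append(0.0)
            continue
        gap = max(Q_star[s, a] for a in not_allowed) - max(Q_star[s, a] for a in allowed)
        vals.append(max(gap, 0.0))   # [.]_+ clamps negative gaps at zero
    return 0.5 * (1.0 + gamma) * max(vals)
```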

3. Complexity of Reward Forcing: Adaptive vs. Non-adaptive Attacks

  • Non-adaptive reward forcing is limited by the coverage time $L$ of the agent's exploration strategy. The worst-case expected cost (number of steps needed to force the policy) scales as $O(L^5)$, which is $\exp(\Theta(|S|))$ in the worst case, due to reliance on random visitation of all state-action pairs.
  • Adaptive reward forcing leverages online access to the agent’s Q-table to direct exploration toward key states, reducing cost to polynomial in the state-space size under mild assumptions.

The improvement is due to the attacker’s capacity to sequentially “teach” target states by adaptively amplifying or discouraging actions based on the agent’s learning progress (Zhang et al., 2020).

4. Constructive Algorithms: Fast Adaptive Attack and Surrogate Policy Teaching

Fast Adaptive Attack (FAA)

FAA operationalizes adaptive reward forcing by organizing the attack into $k = |S^\dagger|$ phases, each aimed at aligning the agent's greedy action in a specific target state:

  • At each step $t$, identify the first untaught target state $s^\dagger_{(i)}$, i.e., the first target state where $\arg\max_a Q_t(s^\dagger_{(i)}, a) \notin \pi^\dagger(s^\dagger_{(i)})$.
  • Construct a temporary policy $\nu_i$ steering the agent toward $s^\dagger_{(i)}$ (while respecting previously aligned targets).
  • Apply greedy Q-shaping: choose the minimal $\delta_t \in [-\epsilon, \epsilon]$ such that the action preferred under $\nu_i$ becomes locally optimal.

FAA achieves attack cost $J_\infty(\mathrm{FAA})$ bounded by a polynomial in $|S|$, $|A|$, and $k$ (as well as the maximum episode length and the $\epsilon$-diameter of the MDP), provided $k = O(\log |S|)$.
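
The schematic sketch below conveys the per-step logic of phase selection and greedy Q-shaping. It is not the paper's code: the steering policies `nu_policies` are assumed precomputed, and inverting the victim's Q-update to obtain the desired reward is a simplification of the minimal-perturbation choice.

```python
import numpy as np

def faa_delta(s, a, r, s_next, Q_t, target_states, target_policy,
              nu_policies, epsilon, gamma, alpha, margin=1e-3):
    """One attacker decision in the spirit of FAA (schematic sketch).

    Q_t           : victim's current Q-table, |S| x |A|
    target_states : ordered target states s_(1), ..., s_(k)
    target_policy : dict state -> set of desired actions (pi^dagger)
    nu_policies   : list of steering policies nu_i (state -> action), one per phase
    """
    # 1. Phase selection: first target state whose greedy action is not yet aligned.
    untaught = [si for si in target_states
                if int(np.argmax(Q_t[si])) not in target_policy[si]]
    if not untaught:
        return 0.0                                   # all targets already taught
    nu = nu_policies[target_states.index(untaught[0])]

    # 2. Greedy Q-shaping on the current transition (s, a, s_next).
    preferred = nu[s]
    bootstrap = gamma * np.max(Q_t[s_next])
    if a == preferred:
        # lift Q(s, a) just above the best competing action
        desired_q = np.max(np.delete(Q_t[s], a)) + margin
    else:
        # push Q(s, a) just below the value of the preferred action
        desired_q = Q_t[s, preferred] - margin
    # invert the Q-learning update Q <- (1-alpha)*Q + alpha*(r_tilde + bootstrap)
    desired_reward = (desired_q - (1 - alpha) * Q_t[s, a]) / alpha - bootstrap
    # attacker can only add a bounded perturbation to the true reward
    return float(np.clip(desired_reward - r, -epsilon, epsilon))
```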

Admissible Policy Teaching via Reward Design

Complementary to adversarial reward forcing, admissible policy teaching considers constructive reward design to enforce policy constraints. The designer seeks to minimally modify the original MDP reward $\overline R$, producing a designed reward $R$, so that every near-optimal policy under $R$ avoids inadmissible actions, while minimizing both the design cost $\|\overline R - R\|_2$ and the performance degradation measured under the original reward.

The exact optimization problem (P1–APT) is

$$\min_{R}\;\Bigl\{\;\max_{\pi\in\mathit{OPT}(R)} \bigl[\, \|\overline R - R\|_2 - \lambda\,\rho^{\pi,\overline R} \,\bigr]\Bigr\} \quad \text{s.t.}\quad \mathit{OPT}(R) \subseteq \Pi^{\mathrm{adm}},$$

where $\mathit{OPT}(R)$ is the set of $\epsilon$-optimal policies under $R$ and $\rho^{\pi,\overline R}$ is the return of policy $\pi$ under the original reward. This problem is NP-hard (via reduction from X3C), and it remains intractable to approximate within any polynomial factor (Banihashem et al., 2022). A surrogate formulation (P4–APT) enables joint selection of an admissible policy and a reward modification, yielding provable additive approximation guarantees. Local-search (Constrain) algorithms operating over policies and admissible sets make the surrogate practically solvable, with empirical performance validated in grid-world domains.
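
Although solving P1–APT exactly is intractable, the objective and constraint for a fixed candidate reward can be evaluated with standard dynamic programming. The sketch below is illustrative only: it scores just the greedy policy rather than the worst-case near-optimal policy, uses a simplified per-state $\epsilon$-optimality test, and assumes the transition tensor `P` and initial-state distribution `mu0` are given.

```python
import numpy as np

def evaluate_candidate(R_new, R_orig, P, gamma, admissible, lam, eps, mu0):
    """Check the admissibility constraint and score a candidate designed reward.

    R_new, R_orig : |S| x |A| arrays (designed reward R and original reward R-bar)
    P             : |S| x |A| x |S| transition tensor
    admissible    : boolean |S| x |A| mask of admissible state-action pairs
    """
    def q_values(R):
        # Q-value iteration under reward R
        Q = np.zeros_like(R)
        for _ in range(2000):
            Q_next = R + gamma * (P @ Q.max(axis=1))
            if np.max(np.abs(Q_next - Q)) < 1e-8:
                break
            Q = Q_next
        return Q

    Q_new = q_values(R_new)
    # simplified eps-optimality test: actions within eps of the best value per state
    near_optimal = Q_new >= Q_new.max(axis=1, keepdims=True) - eps
    feasible = bool(np.all(admissible[near_optimal]))   # OPT(R) within admissible set?

    # evaluate the induced greedy policy under the ORIGINAL reward (rho^{pi, R-bar})
    n_states = R_orig.shape[0]
    pi = Q_new.argmax(axis=1)
    P_pi = P[np.arange(n_states), pi]                   # |S| x |S| dynamics under pi
    r_pi = R_orig[np.arange(n_states), pi]              # per-state reward under pi
    V_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    rho = float(mu0 @ V_pi)

    cost = float(np.linalg.norm(R_new - R_orig))        # design cost ||R-bar - R||_2
    return feasible, cost - lam * rho
```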

5. Reward Forcing in Efficient Video Generation

Reward Forcing also describes a framework for reward-augmented distillation in streaming video generation (Lu et al., 4 Dec 2025). Here, the objective is to distill a multi-step bidirectional diffusion video model into a low-latency, few-step autoregressive student suitable for streaming. Two innovations are central:

  • EMA-Sink: Static “attention sink” tokens representing the initial frames are replaced with an exponential moving average of evicted key–value pairs as frames exit the sliding window. At each generation step $i$, the sink memory is fused as

$$\mathbf S^i_K = \alpha\, \mathbf S^{i-1}_K + (1-\alpha)\,\mathbf K^{i-w},$$

and similarly for $\mathbf S^i_V$. This mechanism preserves long-horizon context and avoids over-dependence on initial frames, mitigating the copying artifact that degrades motion quality.
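
A minimal sketch of the sink update, assuming per-frame key/value tensors with the sequence axis second to last; the class name and cache interface are illustrative, not the released implementation:

```python
import torch

class EMASink:
    """Maintain exponential-moving-average 'sink' key/value tokens built from
    frames evicted out of a sliding attention window (illustrative sketch)."""

    def __init__(self, alpha: float = 0.9):
        self.alpha = alpha
        self.sink_k = None   # S_K
        self.sink_v = None   # S_V

    def update(self, evicted_k: torch.Tensor, evicted_v: torch.Tensor):
        """Fuse the evicted frame's keys/values:
        S^i = alpha * S^{i-1} + (1 - alpha) * K^{i-w} (and likewise for V)."""
        if self.sink_k is None:
            self.sink_k, self.sink_v = evicted_k.clone(), evicted_v.clone()
        else:
            self.sink_k = self.alpha * self.sink_k + (1 - self.alpha) * evicted_k
            self.sink_v = self.alpha * self.sink_v + (1 - self.alpha) * evicted_v

    def extend_cache(self, window_k: torch.Tensor, window_v: torch.Tensor):
        """Prepend the sink tokens to the in-window KV cache before attention."""
        if self.sink_k is None:
            return window_k, window_v
        return (torch.cat([self.sink_k, window_k], dim=-2),
                torch.cat([self.sink_v, window_v], dim=-2))
```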

  • Rewarded Distribution Matching Distillation (Re-DMD): Standard distribution-matching distillation minimizes $\mathrm{D_{KL}}(p_{\rm fake}\Vert p_{\rm real})$ uniformly across all samples, insufficiently penalizing static predicted sequences. Re-DMD introduces sample weighting with a vision-language “motion quality” reward $r(x)$:

$$L_{\rm ReDMD} = \mathrm{D_{KL}}(p_{\rm fake}\Vert p_{\rm real}) - \beta\,\mathbb{E}_{x\sim p_{\rm fake}}\bigl[r(x)\,\log p_{\rm fake}(x)\bigr],$$

where $r(x)$ quantifies dynamic content and $\beta$ trades off fidelity and motion richness. Samples with higher dynamic motion (as scored by the reward model) have up-weighted gradient contributions during distillation. This prioritization biases the student toward high-dynamics regions of the teacher's output space.
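
A minimal sketch of the reward-weighted term, assuming an existing DMD loss value, per-sample log-densities (or a tractable surrogate) for the student's samples, and motion scores from a frozen reward model; identifiers are illustrative and the actual Re-DMD implementation may differ:

```python
import torch

def re_dmd_loss(dmd_loss: torch.Tensor,
                log_p_fake: torch.Tensor,
                rewards: torch.Tensor,
                beta: float = 0.1) -> torch.Tensor:
    """Reward-weighted distribution-matching loss (illustrative sketch).

    dmd_loss   : scalar distribution-matching term D_KL(p_fake || p_real)
    log_p_fake : per-sample log-density surrogate for the student's samples
    rewards    : per-sample motion-quality scores r(x) from a frozen reward model
    """
    # no gradient flows through the reward scores; they only re-weight samples
    weighted = (rewards.detach() * log_p_fake).mean()
    # higher-reward (more dynamic) samples get larger gradient contributions
    return dmd_loss - beta * weighted
```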

In combination, EMA-Sink and Re-DMD significantly increase motion dynamics and consistency in long-horizon video synthesis without additional runtime cost. Empirically, this approach sets new benchmarks in both short and long video generation (e.g., a VBench-Long dynamic score of 66.95 vs. 35.54 for the prior best), achieving efficient (23.1 FPS on a single H100) and high-fidelity streaming synthesis (Lu et al., 4 Dec 2025).

6. Empirical Evidence and Observed Limitations

Across adversarial RL, policy teaching, and generative modeling, reward forcing demonstrates that direct reward modification, whether for attack, safety, or sample prioritization, can strongly influence learning outcomes.

Table 1. Comparative Summary of Reward Forcing Paradigms

| Paradigm | Objective | Tractability |
|---|---|---|
| Adversarial RL (Zhang et al., 2020) | Force a target (possibly undesirable) policy | Polynomial (adaptive); exponential (non-adaptive) |
| Policy Teaching (Banihashem et al., 2022) | Force an admissible (safe) policy | NP-hard (exact and approximate) |
| Video Generation (Lu et al., 4 Dec 2025) | Enforce dynamic, rich content | Efficient, scalable |

Empirical studies confirm:

  • In adversarial RL, adaptive attacks rapidly force policies; e.g., on a chain MDP with $|S| = 12$, adaptive attack costs grow linearly versus exponentially for non-adaptive attacks (Zhang et al., 2020).
  • In video generation, motion quality sharply increases with Re-DMD and EMA-Sink; ablation shows without Re-DMD, measured dynamics drop from 64.06 to 43.75, and without EMA-Sink to 35.15 (Lu et al., 4 Dec 2025).
  • In admissible policy teaching, local search algorithms outperform baselines in design cost-performance tradeoff in grid-world safety domains (Banihashem et al., 2022).

7. Connections, Contrast, and Implications

Reward Forcing unifies disparate research lines centered on reward modification to govern learning behavior. Theoretical results highlight both the power and the risk of reward manipulation: adaptive attacks can efficiently subvert RL training, while constructive design can, albeit with difficulty, guarantee safety. In model distillation, reward-weighted objectives address the deficiencies of uniformly weighted losses by focusing optimization on functionally important behaviors.

The intractability results for policy teaching suggest limits to what can be guaranteed via reward design, inviting further investigation into efficient (approximate) surrogates or domain-specific relaxations. The success of reward-aware distillation frameworks in generative modeling indicates that auxiliary reward models can guide sample efficiency and output quality beyond what is possible with vanilla empirical risk minimization.

A plausible implication is that in any setting where policy robustness or behavioral alignment is critical, explicit reward forcing—either in defense (design) or prioritized training (distillation)—is likely to become a standard component of the reinforcement learning pipeline.
