Reward Forcing: Controlled Policy Induction

Updated 5 December 2025
  • Reward Forcing is a set of techniques that directly manipulate reward signals to enforce target policies across reinforcement learning, safe policy teaching, and generative modeling.
  • In adversarial RL, both non-adaptive and adaptive reward perturbation strategies are used to force the agent’s policy, with feasibility determined by safe and attack thresholds under ℓ∞-bounded constraints.
  • In video generation and policy teaching, reward forcing via mechanisms like Re-DMD and surrogate reward design improves dynamic content fidelity and safety, despite challenges such as NP-hard optimization.

Reward Forcing encompasses a class of techniques across reinforcement learning, adversarial RL, and generative modeling that seek to directly control agent or model behavior by modifying, constraining, or adaptively shaping the reward signal or learning loss so as to enforce target policies, priorities, or characteristics. The term appears in three central and technically disjoint contexts: (i) adversarial reward-poisoning in RL where an attacker perturbs rewards to enforce specific (potentially nefarious) behavior, (ii) constructive admissible policy teaching where a policy designer seeks to force an agent to avoid inadmissible actions via reward design, and (iii) streaming video generation as a framework for distillation that aggressively biases learning toward dynamic, high-reward content. Each formulation leverages reward manipulation for policy induction, but technical approaches, tractability, and application domains vary substantially.

1. Reward Forcing in Adversarial Reinforcement Learning

In adversarial RL, reward forcing refers to the capacity of an external attacker to deterministically drive an agent to a pre-specified (possibly adversarial) policy by perturbing observed rewards within $\ell_\infty$-bounded constraints. The canonical setup considers a finite discounted Markov Decision Process (MDP) $\mathcal{M}=(S,A,P,R,\gamma)$ and a Q-learning agent. At each step $t$, the agent receives a manipulated reward $\tilde r_t = r_t + \delta_t$ with $|\delta_t| \leq \epsilon$, where $\epsilon$ bounds the maximum per-step reward corruption (Zhang et al., 2020).

Attack strategies are divided into:

  • Non-adaptive attacks: $\delta_t$ is a fixed function of the (state, action, next state) transition, independent of the agent's Q-table.
  • Adaptive attacks: $\delta_t$ is allowed to depend on the agent's internal state (e.g., the Q-table $Q_t$), enabling dynamic shaping during training.

The attacker's goal is to force the agent's greedy policy $\pi_t(s) = \arg\max_a Q_t(s,a)$ to match a target partial policy $\pi^\dagger$, corresponding to a set of target Q-tables $\mathcal{Q}^\dagger$. The attack is feasible if, under some corruption strategy $\phi$, the agent's Q-table enters $\mathcal{Q}^\dagger$ in finite expected time. Attack cost is quantified as the expected time to reach $\mathcal{Q}^\dagger$.
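
As a concrete illustration, the following minimal sketch (with a hypothetical `env` interface and attack callback; not the authors' implementation) shows how an $\ell_\infty$-bounded attack interposes on the reward stream of a tabular Q-learning agent. A non-adaptive attack simply ignores the Q-table argument.

```python
import numpy as np

def poisoned_q_learning(env, attack, n_states, n_actions,
                        gamma=0.9, alpha=0.1, eps_greedy=0.1,
                        epsilon_budget=0.5, steps=10_000, seed=0):
    """Tabular Q-learning under an l_inf-bounded reward-poisoning attack.

    `attack(s, a, s_next, Q)` returns the attacker's desired perturbation; it is
    clipped to [-epsilon_budget, epsilon_budget] before being applied.
    `env` is a hypothetical interface with reset() -> s and step(a) -> (s', r, done).
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    s = env.reset()
    for _ in range(steps):
        # epsilon-greedy exploration by the victim agent
        if rng.random() < eps_greedy:
            a = int(rng.integers(n_actions))
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, done = env.step(a)
        # attacker perturbs the observed reward within the per-step budget
        delta = float(np.clip(attack(s, a, s_next, Q), -epsilon_budget, epsilon_budget))
        r_tilde = r + delta
        # standard Q-learning update, but on the poisoned reward
        target = r_tilde + gamma * np.max(Q[s_next]) * (not done)
        Q[s, a] += alpha * (target - Q[s, a])
        s = env.reset() if done else s_next
    return Q
```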

2. Feasibility Regimes and Safe Radius

Reward forcing feasibility is governed by sharp thresholds on the attacker's allowable budget $\epsilon$.

  • Safe Radius ($\epsilon_{\mathrm{safe}}$): If $\epsilon < \epsilon_{\mathrm{safe}}$, even the optimal attacker cannot force the learned policy to deviate from the environment-optimal policy $\pi^*$. Formally,

$$\epsilon_{\mathrm{safe}} = \frac{1-\gamma}{2} \min_{s} \Bigl[ Q^*(s, \pi^*(s)) - \max_{a\ne\pi^*(s)} Q^*(s,a) \Bigr].$$

Below this threshold, the agent is certified safe under all possible $\ell_\infty$-bounded attacks.

  • Attack Threshold ($\epsilon_{\mathrm{attack}}$): For any $\epsilon > \epsilon_{\mathrm{attack}}$ there exists a non-adaptive reward-shaping strategy such that the attacked Q-learning converges to some $Q' \in \mathcal{Q}^\dagger$, with

$$\epsilon_{\mathrm{attack}} = \frac{1+\gamma}{2} \max_{s \in S^\dagger} \Bigl[ \max_{a \notin \pi^\dagger(s)} Q^*(s,a) - \max_{a \in \pi^\dagger(s)} Q^*(s,a) \Bigr]_+.$$

If $\epsilon \geq \max_{s,a} |R'(s,a) - R(s,a)|$, where $R'$ is a shaped reward ensuring the desired Q-table, the attack necessarily succeeds. There remains a regime $\epsilon_{\mathrm{safe}} < \epsilon < \epsilon_{\mathrm{attack}}$ in which attack feasibility is not guaranteed.
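
Both thresholds are simple functions of the optimal Q-table. The sketch below is illustrative: it assumes $Q^*$ is given as an $|S|\times|A|$ NumPy array and the target partial policy $\pi^\dagger$ as a mapping from targeted states to their allowed action sets, and computes $\epsilon_{\mathrm{safe}}$ and $\epsilon_{\mathrm{attack}}$ directly from the formulas above.

```python
import numpy as np

def safe_radius(Q_star, gamma):
    """epsilon_safe = (1-gamma)/2 * min_s [Q*(s, pi*(s)) - max_{a != pi*(s)} Q*(s, a)]."""
    gaps = []
    for s in range(Q_star.shape[0]):
        a_star = int(np.argmax(Q_star[s]))
        runner_up = np.max(np.delete(Q_star[s], a_star))
        gaps.append(Q_star[s, a_star] - runner_up)
    return 0.5 * (1.0 - gamma) * min(gaps)

def attack_threshold(Q_star, gamma, target_policy):
    """epsilon_attack over target states; target_policy maps state -> set of allowed actions."""
    n_actions = Q_star.shape[1]
    vals = []
    for s, allowed in target_policy.items():
        not_allowed = [a for a in range(n_actions) if a not in allowed]
        if not not_allowed:          # pi^dagger allows every action: no gap to close
            vals.append(0.0)
            continue
        gap = max(Q_star[s, a] for a in not_allowed) - max(Q_star[s, a] for a in allowed)
        vals.append(max(gap, 0.0))   # [.]_+ clamps negative gaps at zero
    return 0.5 * (1.0 + gamma) * max(vals)
```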

3. Complexity of Reward Forcing: Adaptive vs. Non-adaptive Attacks

  • Non-adaptive reward forcing is limited by the coverage time $L$ of the agent's exploration strategy. The worst-case expected cost (number of steps needed to force the policy) scales as $O(L^5)$, which is $\exp(\Theta(|S|))$ in the worst case, due to reliance on random visitation of all state-action pairs.
  • Adaptive reward forcing leverages online access to the agent’s Q-table to direct exploration toward key states, reducing cost to polynomial in the state-space size under mild assumptions.

The improvement is due to the attacker’s capacity to sequentially “teach” target states by adaptively amplifying or discouraging actions based on the agent’s learning progress (Zhang et al., 2020).

4. Constructive Algorithms: Fast Adaptive Attack and Surrogate Policy Teaching

Fast Adaptive Attack (FAA)

FAA operationalizes adaptive reward forcing by organizing the attack into $k = |S^\dagger|$ phases, each aimed at aligning the agent's greedy action in a specific target state:

  • At each step $t$, identify the first untaught target state $s^\dagger_{(i)}$, i.e., the first target state where $\arg\max_a Q_t(s^\dagger_{(i)}, a) \notin \pi^\dagger(s^\dagger_{(i)})$.
  • Construct a temporary policy $\nu_i$ steering the agent toward $s^\dagger_{(i)}$ (while respecting previously aligned targets).
  • Apply greedy Q-shaping: choose the minimal $\delta_t \in [-\epsilon, \epsilon]$ such that the action preferred under $\nu_i$ becomes locally optimal.

FAA achieves attack cost $J_\infty(\mathrm{FAA})$ bounded by a polynomial in $|S|$, $|A|$, and $k$ (as well as the maximum episode length and the $\epsilon$-diameter of the MDP), provided $k = O(\log |S|)$.
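
The schematic sketch below conveys the per-step logic of phase selection and greedy Q-shaping. It is not the paper's code: the steering policies `nu_policies` are assumed precomputed, and inverting the victim's Q-update to obtain the desired reward is a simplification of the minimal-perturbation choice.

```python
import numpy as np

def faa_delta(s, a, r, s_next, Q_t, target_states, target_policy,
              nu_policies, epsilon, gamma, alpha, margin=1e-3):
    """One attacker decision in the spirit of FAA (schematic sketch).

    Q_t           : victim's current Q-table, |S| x |A|
    target_states : ordered target states s_(1), ..., s_(k)
    target_policy : dict state -> set of desired actions (pi^dagger)
    nu_policies   : list of steering policies nu_i (state -> action), one per phase
    """
    # 1. Phase selection: first target state whose greedy action is not yet aligned.
    untaught = [si for si in target_states
                if int(np.argmax(Q_t[si])) not in target_policy[si]]
    if not untaught:
        return 0.0                                   # all targets already taught
    nu = nu_policies[target_states.index(untaught[0])]

    # 2. Greedy Q-shaping on the current transition (s, a, s_next).
    preferred = nu[s]
    bootstrap = gamma * np.max(Q_t[s_next])
    if a == preferred:
        # lift Q(s, a) just above the best competing action
        desired_q = np.max(np.delete(Q_t[s], a)) + margin
    else:
        # push Q(s, a) just below the value of the preferred action
        desired_q = Q_t[s, preferred] - margin
    # invert the Q-learning update Q <- (1-alpha)*Q + alpha*(r_tilde + bootstrap)
    desired_reward = (desired_q - (1 - alpha) * Q_t[s, a]) / alpha - bootstrap
    # attacker can only add a bounded perturbation to the true reward
    return float(np.clip(desired_reward - r, -epsilon, epsilon))
```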

Admissible Policy Teaching via Reward Design

Complementary to adversarial reward forcing, admissible policy teaching considers constructive reward design to enforce policy constraints. The designer seeks to minimally modify the original MDP reward $\overline R$, producing a designed reward $R$, so that every near-optimal policy under $R$ avoids inadmissible actions, while minimizing both the design cost $\|\overline R - R\|_2$ and the performance degradation measured under the original reward.

The exact optimization problem (P1–APT) is

$$\min_{R}\;\Bigl\{\;\max_{\pi\in\mathit{OPT}(R)} \bigl[\, \|\overline R - R\|_2 - \lambda\,\rho^{\pi,\overline R} \,\bigr]\Bigr\} \quad \text{s.t.}\quad \mathit{OPT}(R) \subseteq \Pi^{\mathrm{adm}},$$

where $\mathit{OPT}(R)$ is the set of $\epsilon$-optimal policies under $R$ and $\rho^{\pi,\overline R}$ is the return of policy $\pi$ under the original reward. This problem is NP-hard (via reduction from X3C), and it remains intractable to approximate within any polynomial factor (Banihashem et al., 2022). A surrogate formulation (P4–APT) enables joint selection of an admissible policy and a reward modification, yielding provable additive approximation guarantees. Local-search (Constrain) algorithms operating over policies and admissible sets make the surrogate practically solvable, with empirical performance validated in grid-world domains.
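
Although solving P1–APT exactly is intractable, the objective and constraint for a fixed candidate reward can be evaluated with standard dynamic programming. The sketch below is illustrative only: it scores just the greedy policy rather than the worst-case near-optimal policy, uses a simplified per-state $\epsilon$-optimality test, and assumes the transition tensor `P` and initial-state distribution `mu0` are given.

```python
import numpy as np

def evaluate_candidate(R_new, R_orig, P, gamma, admissible, lam, eps, mu0):
    """Check the admissibility constraint and score a candidate designed reward.

    R_new, R_orig : |S| x |A| arrays (designed reward R and original reward R-bar)
    P             : |S| x |A| x |S| transition tensor
    admissible    : boolean |S| x |A| mask of admissible state-action pairs
    """
    def q_values(R):
        # Q-value iteration under reward R
        Q = np.zeros_like(R)
        for _ in range(2000):
            Q_next = R + gamma * (P @ Q.max(axis=1))
            if np.max(np.abs(Q_next - Q)) < 1e-8:
                break
            Q = Q_next
        return Q

    Q_new = q_values(R_new)
    # simplified eps-optimality test: actions within eps of the best value per state
    near_optimal = Q_new >= Q_new.max(axis=1, keepdims=True) - eps
    feasible = bool(np.all(admissible[near_optimal]))   # OPT(R) within admissible set?

    # evaluate the induced greedy policy under the ORIGINAL reward (rho^{pi, R-bar})
    n_states = R_orig.shape[0]
    pi = Q_new.argmax(axis=1)
    P_pi = P[np.arange(n_states), pi]                   # |S| x |S| dynamics under pi
    r_pi = R_orig[np.arange(n_states), pi]              # per-state reward under pi
    V_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    rho = float(mu0 @ V_pi)

    cost = float(np.linalg.norm(R_new - R_orig))        # design cost ||R-bar - R||_2
    return feasible, cost - lam * rho
```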

5. Reward Forcing in Efficient Video Generation

Reward Forcing also describes a framework for reward-augmented distillation in streaming video generation (Lu et al., 4 Dec 2025). Here, the objective is to distill a multi-step bidirectional diffusion video model into a low-latency, few-step autoregressive student suitable for streaming. Two innovations are central:

  • EMA-Sink: Static “attention sink” tokens representing the initial frames are replaced with an exponential moving average of evicted key–value pairs as frames exit the sliding window. At each generation step $i$, the sink memory is fused as

$$\mathbf S^i_K = \alpha\, \mathbf S^{i-1}_K + (1-\alpha)\,\mathbf K^{i-w},$$

and similarly for $\mathbf S^i_V$. This mechanism preserves long-horizon context and avoids over-dependence on initial frames, mitigating the copying artifact that degrades motion quality.
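
A minimal sketch of the sink update, assuming per-frame key/value tensors with the sequence axis second to last; the class name and cache interface are illustrative, not the released implementation:

```python
import torch

class EMASink:
    """Maintain exponential-moving-average 'sink' key/value tokens built from
    frames evicted out of a sliding attention window (illustrative sketch)."""

    def __init__(self, alpha: float = 0.9):
        self.alpha = alpha
        self.sink_k = None   # S_K
        self.sink_v = None   # S_V

    def update(self, evicted_k: torch.Tensor, evicted_v: torch.Tensor):
        """Fuse the evicted frame's keys/values:
        S^i = alpha * S^{i-1} + (1 - alpha) * K^{i-w} (and likewise for V)."""
        if self.sink_k is None:
            self.sink_k, self.sink_v = evicted_k.clone(), evicted_v.clone()
        else:
            self.sink_k = self.alpha * self.sink_k + (1 - self.alpha) * evicted_k
            self.sink_v = self.alpha * self.sink_v + (1 - self.alpha) * evicted_v

    def extend_cache(self, window_k: torch.Tensor, window_v: torch.Tensor):
        """Prepend the sink tokens to the in-window KV cache before attention."""
        if self.sink_k is None:
            return window_k, window_v
        return (torch.cat([self.sink_k, window_k], dim=-2),
                torch.cat([self.sink_v, window_v], dim=-2))
```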

  • Rewarded Distribution Matching Distillation (Re-DMD): Standard distribution-matching distillation minimizes $\mathrm{D_{KL}}(p_{\rm fake}\Vert p_{\rm real})$ uniformly across all samples, insufficiently penalizing static predicted sequences. Re-DMD introduces sample weighting with a vision-language “motion quality” reward $r(x)$:

$$L_{\rm ReDMD} = \mathrm{D_{KL}}(p_{\rm fake}\Vert p_{\rm real}) - \beta\,\mathbb{E}_{x\sim p_{\rm fake}}\bigl[r(x)\,\log p_{\rm fake}(x)\bigr],$$

where $r(x)$ quantifies dynamic content and $\beta$ trades off fidelity and motion richness. Samples with higher dynamic motion (as scored by the reward model) have up-weighted gradient contributions during distillation. This prioritization biases the student toward high-dynamics regions of the teacher's output space.
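
A minimal sketch of the reward-weighted term, assuming an existing DMD loss value, per-sample log-densities (or a tractable surrogate) for the student's samples, and motion scores from a frozen reward model; identifiers are illustrative and the actual Re-DMD implementation may differ:

```python
import torch

def re_dmd_loss(dmd_loss: torch.Tensor,
                log_p_fake: torch.Tensor,
                rewards: torch.Tensor,
                beta: float = 0.1) -> torch.Tensor:
    """Reward-weighted distribution-matching loss (illustrative sketch).

    dmd_loss   : scalar distribution-matching term D_KL(p_fake || p_real)
    log_p_fake : per-sample log-density surrogate for the student's samples
    rewards    : per-sample motion-quality scores r(x) from a frozen reward model
    """
    # no gradient flows through the reward scores; they only re-weight samples
    weighted = (rewards.detach() * log_p_fake).mean()
    # higher-reward (more dynamic) samples get larger gradient contributions
    return dmd_loss - beta * weighted
```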

In combination, EMA-Sink and Re-DMD significantly increase motion dynamics and consistency in long-horizon video synthesis without additional runtime cost. Empirically, this approach sets new benchmarks in both short and long video generation (e.g., a VBench-Long dynamic score of 66.95 vs. 35.54 for the prior best), achieving efficient (23.1 FPS on a single H100) and high-fidelity streaming synthesis (Lu et al., 4 Dec 2025).

6. Empirical Evidence and Observed Limitations

Across adversarial RL, policy teaching, and generative modeling, reward forcing demonstrates that direct reward modification, whether for attack, safety, or sample prioritization, can strongly influence learning outcomes.

Table 1. Comparative Summary of Reward Forcing Paradigms

| Paradigm | Objective | Tractability |
|---|---|---|
| Adversarial RL (Zhang et al., 2020) | Force a target (possibly undesirable) policy | Polynomial (adaptive); exponential (non-adaptive) |
| Policy Teaching (Banihashem et al., 2022) | Force an admissible (safe) policy | NP-hard (exact and approximate) |
| Video Generation (Lu et al., 4 Dec 2025) | Enforce dynamic, rich content | Efficient, scalable |

Empirical studies confirm:

  • In adversarial RL, adaptive attacks rapidly force policies; e.g., on a chain MDP with $|S| = 12$, adaptive attack costs grow linearly versus exponentially for non-adaptive attacks (Zhang et al., 2020).
  • In video generation, motion quality sharply increases with Re-DMD and EMA-Sink; ablation shows without Re-DMD, measured dynamics drop from 64.06 to 43.75, and without EMA-Sink to 35.15 (Lu et al., 4 Dec 2025).
  • In admissible policy teaching, local search algorithms outperform baselines in design cost-performance tradeoff in grid-world safety domains (Banihashem et al., 2022).

7. Connections, Contrast, and Implications

Reward Forcing unifies disparate research lines centered on reward modification to govern learning behavior. Theoretical results highlight both the power and the risk of reward manipulation: adaptive attacks can efficiently subvert RL training, while constructive design can, albeit with difficulty, guarantee safety. In model distillation, reward-weighted objectives address the deficiencies of uniformly weighted losses by focusing optimization on functionally important behaviors.

The intractability results for policy teaching suggest limits to what can be guaranteed via reward design, inviting further investigation into efficient (approximate) surrogates or domain-specific relaxations. The success of reward-aware distillation frameworks in generative modeling indicates that auxiliary reward models can guide sample efficiency and output quality beyond what is possible with vanilla empirical risk minimization.

A plausible implication is that in any setting where policy robustness or behavioral alignment is critical, explicit reward forcing—either in defense (design) or prioritized training (distillation)—is likely to become a standard component of the reinforcement learning pipeline.
