
Policy Reconstruction in RL

Updated 31 January 2026
  • Policy Reconstruction is a set of adversarial techniques that use inverse RL to infer unknown reward functions from deployed RL policies.
  • It quantifies reconstruction error using metrics like L1, L2, L∞ distances, demonstrating that standard DP methods may not effectively protect sensitive rewards.
  • Empirical studies reveal that DP-enhanced RL methods leak reward details even under tight privacy budgets, underscoring the need for new privacy-centric mechanisms.

Policy reconstruction refers to adversarial techniques for extracting or estimating characteristics of an unknown reward function underlying a published or deployed reinforcement learning (RL) policy. In privacy-sensitive settings such as autonomous driving or recommendation systems, policies trained by RL may implicitly encode private preferences or objectives. "How Private Is Your RL Policy? An Inverse RL Based Analysis Framework" (Prakash et al., 2021) introduces a systematic framework—Privacy-Aware Inverse RL (PRIL)—that formalizes and operationalizes reward-reconstruction attacks against privacy-preserving RL policies. By quantifying the fidelity of reward recovery via inverse RL, the framework exposes substantial gaps between standard differential privacy guarantees applied to policy-training mechanisms and the effective protection of sensitive reward functions.

1. Formalization of the Reward-Reconstruction Attack

The PRIL framework defines the reward-reconstruction attack as an adversarial process: given a deployed RL policy $\pi$ trained on an unknown reward function $R : S \to \mathbb{R}$, an adversary applies inverse RL to infer a reconstructed reward $\hat R$. The framework evaluates both non-private policies ($\pi'$) and differentially private policies ($\pi''$), measuring privacy via a set of reward-distance metrics between $R$ and the recovered $\hat R$:

  • Non-private baseline: $\pi'$ trained directly on $R$.
  • Private policy: $\pi''$ trained on $R$ with a chosen differential privacy (DP) mechanism.

The attack pipeline is:

  1. Apply inverse RL to $\pi'$ to obtain $\hat R'$.
  2. Apply inverse RL to $\pi''$ to obtain $\hat R''$.
  3. Compute distances $d'(R, \hat R')$ and $d''(R, \hat R'')$ using several metrics.
  4. If $d''$ is large, the private policy $\pi''$ offers strong reward privacy. Otherwise, $R$ is vulnerable to reconstruction.
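The steps above can be sketched as a small attack harness; `irl_solver` and the metric callables are hypothetical stand-ins for the inverse-RL solver and distance metrics detailed in later sections:

```python
import numpy as np

def reconstruction_attack(R_true, policy, irl_solver, metrics):
    """Sketch of one pass of the pipeline (hypothetical helper names).

    irl_solver: callable mapping a policy to a reconstructed reward vector R_hat.
    metrics:    dict of name -> distance d(R, R_hat); small values mean the
                policy leaks the reward, large values mean strong reward privacy."""
    R_hat = irl_solver(policy)                                    # inverse RL step
    return {name: d(R_true, R_hat) for name, d in metrics.items()}

# Toy usage with stub components (illustration only):
R = np.array([0.0, 0.0, 1.0])
stub_irl = lambda pi: R + 0.1      # pretend IRL recovers R up to a constant shift
errors = reconstruction_attack(R, policy=None, irl_solver=stub_irl,
                               metrics={"L1": lambda a, b: np.abs(a - b).sum(),
                                        "Linf": lambda a, b: np.abs(a - b).max()})
print(errors)
```

Running the same harness on both $\pi'$ and $\pi''$ and comparing the resulting distances reproduces the comparison in steps 3 and 4.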

2. Inverse RL Algorithm: Finite-State LP Formulation

Reward reconstruction leverages the classical finite-state inverse RL algorithm of Ng and Russell (2000), instantiated as a linear program (LP) that finds the minimum-norm reward vector $r \in \mathbb{R}^{|S|}$ making the input policy $\pi$ uniquely optimal by a margin of at least 1. The optimization is:

  • Objective:

$\min_{r} \|r\|_{1}$

  • Optimality-by-margin constraints (for every state $s \in S$ and every action $a \ne \pi(s)$):

$Q_{r}(s,\pi(s)) - Q_{r}(s,a) \ge 1$

where

$Q_{r}(s,a) = r(s) + \gamma\sum_{s'}P(s,a,s')\,V_{r}(s')$

$V_{r} = (I - \gamma P_{\pi})^{-1} r$

The full LP can be written as:

$$\begin{aligned} &\min_{r \in \mathbb{R}^{|S|}} && \sum_{s \in S} |r(s)| \\ &\text{s.t.} && \gamma\, e_s^\top \left(P_{\pi} - P_{a}\right)(I - \gamma P_{\pi})^{-1} r \ge 1, \;\forall s \in S,\ \forall a \ne \pi(s) \end{aligned}$$

The LP returns $\hat R$ as the reconstructed reward vector consistent with the policy $\pi$.
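As a concrete illustration, the LP can be instantiated with `scipy.optimize.linprog` on a hypothetical 3-state chain MDP (not one of the paper's environments); the margin constraints follow directly from the $Q_r$ and $V_r$ definitions above:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical 3-state chain MDP: action 0 moves left, action 1 moves right.
n, gamma = 3, 0.9
P = np.zeros((2, n, n))                      # P[a, s, s'] per-action transition matrices
for s in range(n):
    P[0, s, max(s - 1, 0)] = 1.0             # left
    P[1, s, min(s + 1, n - 1)] = 1.0         # right
pi = np.ones(n, dtype=int)                   # deployed policy: always move right

P_pi = P[pi, np.arange(n), :]                # transition rows selected by the policy
M = np.linalg.inv(np.eye(n) - gamma * P_pi)  # (I - gamma * P_pi)^{-1}

# Margin constraints gamma * e_s^T (P_pi - P_a) M r >= 1 for all a != pi(s),
# written as A_ub @ x <= b_ub with x = [r, u], where u bounds |r| for the L1 objective.
A_rows, b_rows = [], []
for s in range(n):
    for a in range(P.shape[0]):
        if a != pi[s]:
            row = -gamma * (P_pi[s] - P[a, s]) @ M   # flip sign for <= form
            A_rows.append(np.concatenate([row, np.zeros(n)]))
            b_rows.append(-1.0)
for i in range(n):                           # |r_i| <= u_i
    e = np.eye(n)[i]
    A_rows.append(np.concatenate([e, -e]))   # r_i - u_i <= 0
    b_rows.append(0.0)
    A_rows.append(np.concatenate([-e, -e]))  # -r_i - u_i <= 0
    b_rows.append(0.0)

c = np.concatenate([np.zeros(n), np.ones(n)])        # objective: sum(u) = ||r||_1
res = linprog(c, A_ub=np.array(A_rows), b_ub=np.array(b_rows),
              bounds=[(None, None)] * n + [(0, None)] * n)
r_hat = res.x[:n]
print("reconstructed reward:", r_hat)
```

The recovered $\hat R$ places its largest value on the terminal state, matching the intuition that an always-move-right policy is driven by reward there.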

3. Differential Privacy in RL Algorithms

Three major classes of DP mechanisms are evaluated in PRIL:

  • Value Iteration + DP-Bellman (VI-DP-Bellman):
    • Gaussian noise $\mathcal{N}(0,\sigma^2)$ is added to each Bellman update:

    $V(s) \leftarrow \max_a \sum_{s'} P(s,a,s') \left[r(s) + \gamma V(s')\right] + \mathcal{N}(0, \sigma^2)$

    • The sensitivity $\Delta_2 f$ of the update is $\frac{|S|}{|S|-1}$.
    • The Gaussian mechanism's standard calibration then requires:

    $\sigma \ge \frac{\Delta_2 f \sqrt{2 \log(1.25/\delta)}}{\epsilon}$

  • Deep Q-Network (DQN) with DP-SGD, DP-Adam, DP-Shoe, and DP-FN:
    • DP-SGD/DP-Adam/DP-Shoe: per-example gradients are clipped to norm $C = 1.0$ and Gaussian noise $\mathcal{N}(0,\,\sigma^2 C^2 I)$ is added.
    • DP-Shoe uses SGD with tanh activations; DP-SGD and DP-Adam use ReLU.
    • DP-FN: functional Gaussian-process noise is added directly to the Q-value estimates, ensuring $(\epsilon, \delta)$-DP.
  • Proximal Policy Optimization (PPO):
    • Only the actor network updates are privatized using DP-SGD, DP-Adam, or DP-Shoe.
    • Critic updates remain non-private.
    • The same gradient clipping and Gaussian noise are applied as in DQN.
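The VI-DP-Bellman update above can be sketched in a few lines (a minimal illustration on a hypothetical toy tabular MDP, not the paper's implementation):

```python
import numpy as np

def dp_bellman_vi(P, r, gamma, sigma, iters=500, seed=0):
    """Value iteration with Gaussian noise N(0, sigma^2) injected into every
    Bellman backup, per the VI-DP-Bellman update rule (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    V = np.zeros(P.shape[1])
    for _ in range(iters):
        # Q[a, s] = r(s) + gamma * sum_{s'} P[a, s, s'] * V(s')
        Q = r[None, :] + gamma * (P @ V)
        V = Q.max(axis=0) + rng.normal(0.0, sigma, size=V.shape)
    return V

# Hypothetical 3-state chain MDP: action 0 = left, action 1 = right,
# reward only at the last state.
n = 3
P = np.zeros((2, n, n))
for s in range(n):
    P[0, s, max(s - 1, 0)] = 1.0
    P[1, s, min(s + 1, n - 1)] = 1.0
r = np.array([0.0, 0.0, 1.0])
V_exact = dp_bellman_vi(P, r, gamma=0.9, sigma=0.0)   # sigma = 0 recovers plain VI
V_noisy = dp_bellman_vi(P, r, gamma=0.9, sigma=0.1)   # privatized backups
print(V_exact, V_noisy)
```

Setting $\sigma = 0$ recovers standard value iteration, which makes the effect of the injected noise easy to inspect side by side.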

Privacy parameters in experiments:

  • $\epsilon \in \{0.1,\ 0.105,\ 0.2,\ 0.5,\ 1.0,\ 2.0,\ 5.0,\ 10.0,\ \infty\}$
  • $\delta = 10^{-5}$
  • $\sigma$ is computed using TensorFlow-Privacy’s RDP-to-DP conversion.
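For intuition, the closed-form Gaussian-mechanism calibration stated above can be computed directly (the experiments themselves rely on TensorFlow-Privacy's accountant, so this is only a back-of-the-envelope sketch):

```python
import math

def gaussian_sigma(epsilon, delta, sensitivity):
    """Noise scale from the classical Gaussian-mechanism bound:
    sigma >= sensitivity * sqrt(2 * ln(1.25 / delta)) / epsilon."""
    return sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / epsilon

n_states = 25                            # e.g. a 5x5 grid
delta2_f = n_states / (n_states - 1)     # sensitivity |S| / (|S| - 1) from above
for eps in (0.1, 1.0, 10.0):
    print(f"epsilon={eps}: sigma >= {gaussian_sigma(eps, 1e-5, delta2_f):.3f}")
```

The inverse scaling in $\epsilon$ is visible immediately: tightening the budget tenfold raises the required noise scale tenfold.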

4. Quantitative Reward-Distance Metrics

PRIL assesses reconstruction error with four metrics, operating on (normalized) reward vectors $R$ and $\hat R$:

  • $L_1$ distance: $d_1(R, \hat R) = \sum_{s \in S} |R(s)-\hat R(s)|$ (total absolute deviation)
  • $L_2$ distance: $d_2(R, \hat R) = \left(\sum_{s \in S}(R(s)-\hat R(s))^2\right)^{1/2}$ (Euclidean error)
  • $L_\infty$ distance: $d_\infty(R, \hat R) = \max_{s \in S}|R(s)-\hat R(s)|$ (maximum deviation)
  • Sign-change count: $d_{\mathrm{sign}}(R, \hat R) = |\{s\in S: \mathrm{sign}(R(s))\ne\mathrm{sign}(\hat R(s))\}|$ (reward sign flips across states)

These metrics directly quantify the adversary’s ability to reconstruct the true underlying reward.
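The four metrics are straightforward to implement on reward vectors; a minimal sketch:

```python
import numpy as np

def reward_distances(R, R_hat):
    """The four PRIL reconstruction-error metrics on (normalized) reward vectors."""
    diff = R - R_hat
    return {
        "L1":   np.abs(diff).sum(),                        # total absolute deviation
        "L2":   np.sqrt((diff ** 2).sum()),                # Euclidean error
        "Linf": np.abs(diff).max(),                        # maximum deviation
        "sign": int((np.sign(R) != np.sign(R_hat)).sum()), # sign flips across states
    }

# Illustrative reward vectors (hypothetical values):
R     = np.array([1.0, -1.0,  0.5, 0.0])
R_hat = np.array([0.5, -0.5, -0.5, 0.0])
print(reward_distances(R, R_hat))
```

Here the third state flips sign in the reconstruction, so the sign-change count is 1 even though the $L_\infty$ deviation is moderate.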

5. Empirical Evaluation: FrozenLake Benchmarks and Results

Experiments are carried out on 24 customized FrozenLake domains (12 grids of $5 \times 5$ and 12 of $10 \times 10$), featuring cell types Safe (S), Frozen (F), Hole (H), High-reward (A), and Goal (G), with near-deterministic transitions (slip factor $= 0.0001$).

  • Algorithms Evaluated: VI-DP-Bellman, DQN-DP-SGD, DQN-DP-Shoe, DQN-DP-Adam, DQN-DP-FN, PPO-DP-SGD, PPO-DP-Shoe, PPO-DP-Adam, plus non-private baselines.
  • Training Protocols: DQN/PPO use 15 epochs, 200 iterations, batch size 50, 5 micro-batches, learning rate 0.15, and discount $\gamma = 0.99$. VI runs Bellman updates until convergence.

Policy extraction and reward reconstruction are performed for each $\epsilon$ and algorithm over 10 random seeds. The distances $d_1$, $d_2$, $d_\infty$, and $d_{\mathrm{sign}}$ are computed.

Key findings:

  • All four reconstruction-error metrics are flat as a function of $\epsilon$: tightening the privacy budget (smaller $\epsilon$) does not substantially increase $d(R, \hat R'')$.
  • Aggregated $L_2$ error ranks: DQN variants $>$ PPO variants $>$ VI-DP-Bellman; e.g., $d_2 \approx 20$ (DQN), $d_2 \approx 10$ (PPO), and $d_2 \approx 5$ (VI-DP-Bellman) for the $5\times5$ environments.
  • The utility-privacy trade-off is present for policy returns (lower $\epsilon$ reduces return), but reward-reconstruction error remains insensitive to $\epsilon$.

6. Privacy-Gap Analysis and Implications

A fundamental insight is that differential privacy mechanisms protecting policy updates, such as per-iteration gradient noise or DP-Bellman, do not guarantee privacy for the underlying reward function. Inverse RL attacks in PRIL can reconstruct $R$ from $\pi''$ with a small, roughly constant error that is largely independent of $\epsilon$. Classical non-deep methods (VI-DP-Bellman) provide less reward privacy than deep RL approaches (DQN/PPO), yet the improvement is insufficient for privacy-critical domains.

This suggests a significant mismatch between the theoretical privacy guarantees (in terms of $(\epsilon,\delta)$-DP applied during training) and the protection actually required to conceal the true reward. A plausible implication is that future DP-RL research must either develop reward-centric privacy mechanisms, directly privatizing $R$, or redefine DP guarantees to explicitly bound reward-reconstruction error under adversarial inverse RL.

7. Summary and Directions

PRIL provides a rigorous framework for adversarial assessment of reward privacy in RL. It leverages finite-state LP inverse RL to reconstruct the reward from policies trained under various DP mechanisms and quantifies the privacy leakage via multiple reward-distance metrics. Empirical results clearly indicate a gap: prevailing DP-RL techniques do not prevent effective reward reconstruction, exposing the need for fundamentally stronger privacy definitions and mechanisms in RL policy deployment (Prakash et al., 2021).

References

  1. Prakash et al. (2021). "How Private Is Your RL Policy? An Inverse RL Based Analysis Framework."
