Policy Reconstruction in RL
- Policy reconstruction is a set of adversarial techniques that use inverse RL to infer unknown reward functions from deployed RL policies.
- It quantifies reconstruction error using metrics such as the $\ell_1$, $\ell_2$, and $\ell_\infty$ distances, demonstrating that standard DP methods may not effectively protect sensitive rewards.
- Empirical studies reveal that even under tight privacy budgets, DP-enhanced RL methods can leak reward details, underscoring the need for new privacy-centric mechanisms.
Policy reconstruction refers to adversarial techniques for extracting or estimating characteristics of an unknown reward function underlying a published or deployed reinforcement learning (RL) policy. In privacy-sensitive settings such as autonomous driving or recommendation systems, policies trained by RL may implicitly encode private preferences or objectives. "How Private Is Your RL Policy? An Inverse RL Based Analysis Framework" (Prakash et al., 2021) introduces a systematic framework—Privacy-Aware Inverse RL (PRIL)—that formalizes and operationalizes reward-reconstruction attacks against privacy-preserving RL policies. By quantifying the fidelity of reward recovery via inverse RL, the framework exposes substantial gaps between standard differential privacy guarantees applied to policy-training mechanisms and the effective protection of sensitive reward functions.
1. Formalization of the Reward-Reconstruction Attack
The PRIL framework defines the reward-reconstruction attack as an adversarial process: given a deployed RL policy trained on an unknown reward function $R$, an adversary applies inverse RL to infer a reconstructed reward $\hat{R}$. The framework evaluates both non-private policies ($\pi$) and differentially private policies ($\pi_p$), measuring privacy via a set of reward-distance metrics between the true $R$ and the recovered $\hat{R}$:
- Non-private baseline: $\pi$ trained directly on $R$.
- Private policy: $\pi_p$ trained on $R$ with a chosen differential-privacy (DP) mechanism.
The attack pipeline is:
- Apply inverse RL to $\pi$ to obtain $\hat{R}$.
- Apply inverse RL to $\pi_p$ to obtain $\hat{R}_p$.
- Compute the distances $d(R, \hat{R})$ and $d(R, \hat{R}_p)$ under several metrics.
- If $d(R, \hat{R}_p)$ is large, the private policy offers strong reward privacy; otherwise, $\pi_p$ is vulnerable to reconstruction.
2. Inverse RL Algorithm: Finite-State LP Formulation
Reward reconstruction leverages the classical finite-state inverse RL algorithm of Ng and Russell (2000), instantiated as a linear program (LP) that finds the minimum-norm reward vector $R$ making the input policy $\pi$ uniquely optimal by a margin of at least 1. The optimization is:
- Objective: minimize $\|R\|_1$.
- Optimality-by-margin constraints (for every state $s$ and every action $a \ne \pi(s)$): $(\mathbf{P}_{\pi(s)}(s) - \mathbf{P}_a(s))\,(I - \gamma \mathbf{P}_\pi)^{-1} R \ge 1$,
where $\mathbf{P}_a(s)$ is the row of transition probabilities out of state $s$ under action $a$, $\mathbf{P}_\pi$ is the state-transition matrix induced by $\pi$, and $\gamma$ is the discount factor.
The full LP can be written as:
$$\min_{R} \; \|R\|_1 \quad \text{s.t.} \quad (\mathbf{P}_{\pi(s)}(s) - \mathbf{P}_a(s))\,(I - \gamma \mathbf{P}_\pi)^{-1} R \ge 1 \quad \forall s,\; \forall a \ne \pi(s).$$
The LP returns the optimizer $\hat{R}$ as the reconstructed reward vector consistent with policy $\pi$.
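Under these definitions, the LP can be sketched with `scipy.optimize.linprog`. The two-state, two-action MDP, the `irl_lp` name, and the auxiliary-variable encoding of the $\ell_1$ objective below are illustrative assumptions, not the paper's code:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy MDP (2 states, 2 actions) to illustrate the LP:
# action 0 stays in place, action 1 jumps to the other state.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],   # P[a=0][s, s']
    [[0.0, 1.0], [1.0, 0.0]],   # P[a=1][s, s']
])
gamma = 0.9
policy = np.array([1, 0])       # observed policy: go to state 1, then stay

def irl_lp(P, policy, gamma, r_max=1.0):
    """Ng & Russell (2000) finite-state IRL: minimum-l1-norm reward
    under which `policy` is optimal with a margin of at least 1."""
    n_actions, n_states, _ = P.shape
    P_pi = np.array([P[policy[s], s] for s in range(n_states)])
    M = np.linalg.inv(np.eye(n_states) - gamma * P_pi)

    # Variables x = [R (n), u (n)] with u >= |R| encoding the l1 norm.
    A_ub, b_ub = [], []
    for s in range(n_states):
        for a in range(n_actions):
            if a == policy[s]:
                continue
            row = (P[policy[s], s] - P[a, s]) @ M
            # margin constraint: row @ R >= 1  ->  -row @ R <= -1
            A_ub.append(np.concatenate([-row, np.zeros(n_states)]))
            b_ub.append(-1.0)
    for i in range(n_states):                  # encode |R_i| <= u_i
        e = np.zeros(n_states)
        e[i] = 1.0
        A_ub.append(np.concatenate([e, -e]));  b_ub.append(0.0)
        A_ub.append(np.concatenate([-e, -e])); b_ub.append(0.0)

    c = np.concatenate([np.zeros(n_states), np.ones(n_states)])  # min sum(u)
    bounds = [(-r_max, r_max)] * n_states + [(0, None)] * n_states
    res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub, bounds=bounds)
    return res.x[:n_states]

R_hat = irl_lp(P, policy, gamma)
print(R_hat)  # a reward vector ranking state 1 above state 0
```

The recovered reward is only identified up to the LP's degeneracies (many rewards satisfy the margin constraints with equal norm), which is why PRIL compares $\hat{R}$ to $R$ via distance metrics rather than expecting exact recovery.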
3. Differential Privacy in RL Algorithms
Three major classes of DP mechanisms are evaluated in PRIL:
- Value Iteration + DP-Bellman (VI-DP-Bellman):
  - Gaussian noise is added to each Bellman update:
  $$Q_{t+1}(s,a) = R(s) + \gamma \sum_{s'} P(s' \mid s, a)\, \max_{a'} Q_t(s', a') + \mathcal{N}(0, \sigma^2).$$
  - The sensitivity of the update bounds its dependence on the private reward; Rényi-DP accounting then relates the noise scale $\sigma$ to the overall privacy budget $(\varepsilon, \delta)$.
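The noisy Bellman update can be sketched in NumPy. The toy MDP, the noise scale `sigma`, and the `dp_value_iteration` name below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def dp_value_iteration(P, R, gamma, sigma, n_iters=200, seed=0):
    """Value iteration with Gaussian noise added to each Bellman
    update (a sketch of VI-DP-Bellman; `sigma` stands in for the
    noise scale implied by the desired Renyi-DP budget)."""
    rng = np.random.default_rng(seed)
    n_actions, n_states, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        V = Q.max(axis=1)
        for a in range(n_actions):
            # noisy update: Q(s,a) = R(s) + gamma * E[V(s')] + N(0, sigma^2)
            Q[:, a] = R + gamma * P[a] @ V + rng.normal(0.0, sigma, n_states)
    return Q.argmax(axis=1)   # greedy policy from the noisy Q-table

# toy 2-state chain: action 0 stays, action 1 swaps states;
# state 1 carries the reward, so the optimal policy is [1, 0]
P = np.array([[[1., 0.], [0., 1.]],
              [[0., 1.], [1., 0.]]])
policy = dp_value_iteration(P, R=np.array([0.0, 1.0]), gamma=0.9, sigma=0.01)
print(policy)
```

With small `sigma` the noisy Q-table still induces the optimal policy; larger `sigma` (tighter budgets) increasingly randomizes the greedy choice.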
- Deep Q-Network (DQN) with DP-SGD, DP-Adam, DP-Shoe, or DP-FN:
  - Per-sample gradients are clipped and Gaussian noise is added before each optimizer step.
- Proximal Policy Optimization (PPO):
- Only the actor network updates are privatized using DP-SGD, DP-Adam, or DP-Shoe.
- Critic updates remain non-private.
- The same gradient clipping and Gaussian noise are applied as in DQN.
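The privatization step shared by DQN and the PPO actor can be sketched as a clipped, noised gradient update. The function name, clipping norm, and toy gradients below are assumptions for illustration:

```python
import numpy as np

def dp_gradient_step(params, per_example_grads, lr, clip_norm, noise_mult, rng):
    """One DP-SGD-style update: clip each per-example gradient to
    `clip_norm`, average, then add Gaussian noise scaled by the
    noise multiplier before applying the step."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    g_bar = np.mean(clipped, axis=0)
    noise = rng.normal(0.0, noise_mult * clip_norm / len(per_example_grads),
                       size=g_bar.shape)
    return params - lr * (g_bar + noise)

rng = np.random.default_rng(0)
params = np.zeros(3)
grads = [np.array([3.0, 4.0, 0.0]), np.array([0.0, 0.0, 2.0])]  # norms 5, 2
new_params = dp_gradient_step(params, grads, lr=0.1, clip_norm=1.0,
                              noise_mult=1.0, rng=rng)
print(new_params)
```

Clipping bounds each sample's influence on the step (the sensitivity), which is what lets the Gaussian noise translate into a DP guarantee under composition.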
Privacy parameters in experiments:
- $\varepsilon$ is computed using TensorFlow-Privacy's RDP-to-DP conversion.
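As an illustration of the RDP-to-DP conversion, a minimal sketch for a single Gaussian-mechanism release (the full TensorFlow-Privacy accountant additionally handles subsampling amplification and composition across steps, which this sketch omits):

```python
import numpy as np

def gaussian_rdp_to_dp(sigma, sensitivity, delta, alphas=np.arange(2, 128)):
    """Renyi-DP of the Gaussian mechanism, eps(alpha) = alpha*Delta^2/(2*sigma^2),
    converted to (eps, delta)-DP via eps = min_alpha [eps(alpha) + log(1/delta)/(alpha-1)]."""
    rdp = alphas * sensitivity**2 / (2.0 * sigma**2)
    eps = rdp + np.log(1.0 / delta) / (alphas - 1.0)
    return eps.min()

# e.g. noise scale 4x the sensitivity at delta = 1e-5
print(gaussian_rdp_to_dp(sigma=4.0, sensitivity=1.0, delta=1e-5))
```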
4. Quantitative Reward-Distance Metrics
PRIL assesses reconstruction error with four metrics, operating on normalized reward vectors $R$ and $\hat{R}$:

| Metric | Formula | Description |
|---|---|---|
| $\ell_1$ distance | $\sum_s \lvert R(s) - \hat{R}(s) \rvert$ | Total variation |
| $\ell_2$ distance | $\big(\sum_s (R(s) - \hat{R}(s))^2\big)^{1/2}$ | Euclidean error |
| $\ell_\infty$ distance | $\max_s \lvert R(s) - \hat{R}(s) \rvert$ | Maximum deviation |
| Sign-change count | $\sum_s \mathbb{1}[\operatorname{sign} R(s) \ne \operatorname{sign} \hat{R}(s)]$ | Reward sign flips across states |
These metrics directly quantify the adversary’s ability to reconstruct the true underlying reward.
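The four metrics are straightforward to compute in NumPy; the max-absolute-value normalization below is an assumption about the normalization scheme:

```python
import numpy as np

def reward_distances(R, R_hat):
    """The four reconstruction-error metrics on reward vectors,
    normalized (by assumption) to unit max absolute value."""
    R = R / max(np.abs(R).max(), 1e-12)
    R_hat = R_hat / max(np.abs(R_hat).max(), 1e-12)
    diff = R - R_hat
    return {
        "l1": np.abs(diff).sum(),                # total variation
        "l2": np.sqrt((diff ** 2).sum()),        # Euclidean error
        "linf": np.abs(diff).max(),              # maximum deviation
        "sign_changes": int(np.sum(np.sign(R) != np.sign(R_hat))),
    }

d = reward_distances(np.array([0.0, 1.0, -1.0]),
                     np.array([0.1, 0.9, -0.8]))
print(d)
```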
5. Empirical Evaluation: FrozenLake Benchmarks and Results
Experiments are carried out on 24 customized FrozenLake domains (12 grids at each of two sizes), featuring cell types Safe (S), Frozen (F), Hole (H), High-reward (A), and Goal (G), with near-deterministic transitions (a small slip factor).
- Algorithms Evaluated: VI-DP-Bellman, DQN-DP-SGD, DQN-DP-Shoe, DQN-DP-Adam, DQN-DP-FN, PPO-DP-SGD, PPO-DP-Shoe, PPO-DP-Adam, plus non-private baselines.
- Training Protocols: DQN/PPO use 15 epochs, 200 iterations, batch size 50, micro-batches of 5, learning rate 0.15, and a fixed discount factor $\gamma$. VI runs Bellman updates until convergence.
Policy extraction and reward reconstruction are performed for each privacy budget $\varepsilon$ and algorithm over 10 random seeds, and the $\ell_1$, $\ell_2$, $\ell_\infty$, and sign-change distances are computed.
Key findings:
- All four reconstruction-error metrics are flat as a function of $\varepsilon$: varying the privacy budget does not substantially change $d(R, \hat{R}_p)$.
- Aggregated reconstruction error ranks the methods as DQN variants > PPO variants > VI-DP-Bellman, with the same ordering holding across environments.
- The utility–privacy trade-off is present for policy returns (lower $\varepsilon$ reduces return), but reward-reconstruction error remains insensitive to $\varepsilon$.
6. Privacy-Gap Analysis and Implications
A fundamental insight is that differential-privacy mechanisms protecting policy updates (such as per-iteration gradient noise or DP-Bellman) do not guarantee privacy for the underlying reward function. Inverse RL attacks in PRIL can reconstruct $\hat{R}$ from $\pi_p$ with a small, near-constant error largely independent of $\varepsilon$. Classical non-deep methods (VI-DP-Bellman) provide less reward privacy than deep RL approaches (DQN/PPO), yet even the deep methods' improvement is insufficient for privacy-critical domains.
This suggests a significant mismatch between the theoretical privacy guarantees (in terms of $(\varepsilon, \delta)$-DP applied during training) and the protection actually required to conceal the true reward. A plausible implication is that future DP-RL research must either develop reward-centric privacy mechanisms that directly privatize the reward $R$, or redefine DP guarantees to explicitly bound reward-reconstruction error under adversarial inverse RL.
7. Summary and Directions
PRIL provides a rigorous framework for adversarial assessment of reward privacy in RL. It leverages finite-state LP inverse RL to reconstruct the reward from policies trained under various DP mechanisms and quantifies the privacy leakage via multiple reward-distance metrics. Empirical results clearly indicate a gap: prevailing DP-RL techniques do not prevent effective reward reconstruction, exposing the need for fundamentally stronger privacy definitions and mechanisms in RL policy deployment (Prakash et al., 2021).