Policy Reconstruction in RL
- Policy reconstruction is a set of adversarial techniques that use inverse RL to infer unknown reward functions from deployed RL policies.
- It quantifies reconstruction error using metrics such as the $\ell_1$, $\ell_2$, and $\ell_\infty$ distances, demonstrating that standard DP methods may not effectively protect sensitive rewards.
- Empirical studies reveal that even under tight privacy budgets, DP-enhanced RL methods can leak reward details, underscoring the need for new privacy-centric mechanisms.
Policy reconstruction refers to adversarial techniques for extracting or estimating characteristics of an unknown reward function underlying a published or deployed reinforcement learning (RL) policy. In privacy-sensitive settings such as autonomous driving or recommendation systems, policies trained by RL may implicitly encode private preferences or objectives. "How Private Is Your RL Policy? An Inverse RL Based Analysis Framework" (Prakash et al., 2021) introduces a systematic framework—Privacy-Aware Inverse RL (PRIL)—that formalizes and operationalizes reward-reconstruction attacks against privacy-preserving RL policies. By quantifying the fidelity of reward recovery via inverse RL, the framework exposes substantial gaps between standard differential privacy guarantees applied to policy-training mechanisms and the effective protection of sensitive reward functions.
1. Formalization of the Reward-Reconstruction Attack
The PRIL framework defines the reward-reconstruction attack as an adversarial process: given a deployed RL policy trained on an unknown reward function $R$, an adversary applies inverse RL to infer a reconstructed reward $\hat{R}$. The framework evaluates both non-private policies ($\pi$) and differentially private policies ($\pi_p$), measuring privacy via a set of reward-distance metrics between the true $R$ and the recovered $\hat{R}$:
- Non-private baseline: $\pi$ trained directly on $R$.
- Private policy: $\pi_p$ trained on $R$ with a chosen differential-privacy (DP) mechanism.
The attack pipeline is:
- Apply inverse RL to $\pi$ to obtain $\hat{R}$.
- Apply inverse RL to $\pi_p$ to obtain $\hat{R}_p$.
- Compute the distances $d(R, \hat{R})$ and $d(R, \hat{R}_p)$ under several metrics.
- If $d(R, \hat{R}_p)$ is large, the private policy offers strong reward privacy; otherwise, $\pi_p$ is vulnerable to reconstruction.
2. Inverse RL Algorithm: Finite-State LP Formulation
Reward reconstruction leverages the classical finite-state inverse RL algorithm of Ng and Russell (2000), instantiated as a linear program (LP) that finds the minimum-norm reward vector $R$ making the input policy $\pi$ uniquely optimal by a margin of at least 1. The optimization is:
- Objective: minimize $\|R\|_1$.
- Optimality-by-margin constraints (for every state $s$ and every action $a \ne \pi(s)$): $(\mathbf{P}_{\pi(s)}(s) - \mathbf{P}_a(s))\,(I - \gamma \mathbf{P}_\pi)^{-1} R \ge 1$,
where $\mathbf{P}_a(s)$ is the row of transition probabilities out of state $s$ under action $a$, $\mathbf{P}_\pi$ is the state-transition matrix induced by $\pi$, and $\gamma$ is the discount factor.
The full LP can be written as:
$$\min_{R} \; \|R\|_1 \quad \text{s.t.} \quad (\mathbf{P}_{\pi(s)}(s) - \mathbf{P}_a(s))\,(I - \gamma \mathbf{P}_\pi)^{-1} R \ge 1 \quad \forall s,\; \forall a \ne \pi(s).$$
The LP returns the optimizer $\hat{R}$ as the reconstructed reward vector consistent with policy $\pi$.
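Under these definitions, the LP can be sketched with `scipy.optimize.linprog`. The two-state, two-action MDP, the `irl_lp` name, and the auxiliary-variable encoding of the $\ell_1$ objective below are illustrative assumptions, not the paper's code:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy MDP (2 states, 2 actions) to illustrate the LP:
# action 0 stays in place, action 1 jumps to the other state.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],   # P[a=0][s, s']
    [[0.0, 1.0], [1.0, 0.0]],   # P[a=1][s, s']
])
gamma = 0.9
policy = np.array([1, 0])       # observed policy: go to state 1, then stay

def irl_lp(P, policy, gamma, r_max=1.0):
    """Ng & Russell (2000) finite-state IRL: minimum-l1-norm reward
    under which `policy` is optimal with a margin of at least 1."""
    n_actions, n_states, _ = P.shape
    P_pi = np.array([P[policy[s], s] for s in range(n_states)])
    M = np.linalg.inv(np.eye(n_states) - gamma * P_pi)

    # Variables x = [R (n), u (n)] with u >= |R| encoding the l1 norm.
    A_ub, b_ub = [], []
    for s in range(n_states):
        for a in range(n_actions):
            if a == policy[s]:
                continue
            row = (P[policy[s], s] - P[a, s]) @ M
            # margin constraint: row @ R >= 1  ->  -row @ R <= -1
            A_ub.append(np.concatenate([-row, np.zeros(n_states)]))
            b_ub.append(-1.0)
    for i in range(n_states):                  # encode |R_i| <= u_i
        e = np.zeros(n_states)
        e[i] = 1.0
        A_ub.append(np.concatenate([e, -e]));  b_ub.append(0.0)
        A_ub.append(np.concatenate([-e, -e])); b_ub.append(0.0)

    c = np.concatenate([np.zeros(n_states), np.ones(n_states)])  # min sum(u)
    bounds = [(-r_max, r_max)] * n_states + [(0, None)] * n_states
    res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub, bounds=bounds)
    return res.x[:n_states]

R_hat = irl_lp(P, policy, gamma)
print(R_hat)  # a reward vector ranking state 1 above state 0
```

The recovered reward is only identified up to the LP's degeneracies (many rewards satisfy the margin constraints with equal norm), which is why PRIL compares $\hat{R}$ to $R$ via distance metrics rather than expecting exact recovery.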
3. Differential Privacy in RL Algorithms
Three major classes of DP mechanisms are evaluated in PRIL:
- Value Iteration + DP-Bellman (VI-DP-Bellman):
  - Gaussian noise is added to each Bellman update:
  $$Q_{t+1}(s,a) = R(s) + \gamma \sum_{s'} P(s' \mid s, a)\, \max_{a'} Q_t(s', a') + \mathcal{N}(0, \sigma^2).$$
  - The sensitivity of the update bounds its dependence on the private reward; Rényi-DP accounting then relates the noise scale $\sigma$ to the overall privacy budget $(\varepsilon, \delta)$.
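The noisy Bellman update can be sketched in NumPy. The toy MDP, the noise scale `sigma`, and the `dp_value_iteration` name below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def dp_value_iteration(P, R, gamma, sigma, n_iters=200, seed=0):
    """Value iteration with Gaussian noise added to each Bellman
    update (a sketch of VI-DP-Bellman; `sigma` stands in for the
    noise scale implied by the desired Renyi-DP budget)."""
    rng = np.random.default_rng(seed)
    n_actions, n_states, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        V = Q.max(axis=1)
        for a in range(n_actions):
            # noisy update: Q(s,a) = R(s) + gamma * E[V(s')] + N(0, sigma^2)
            Q[:, a] = R + gamma * P[a] @ V + rng.normal(0.0, sigma, n_states)
    return Q.argmax(axis=1)   # greedy policy from the noisy Q-table

# toy 2-state chain: action 0 stays, action 1 swaps states;
# state 1 carries the reward, so the optimal policy is [1, 0]
P = np.array([[[1., 0.], [0., 1.]],
              [[0., 1.], [1., 0.]]])
policy = dp_value_iteration(P, R=np.array([0.0, 1.0]), gamma=0.9, sigma=0.01)
print(policy)
```

With small `sigma` the noisy Q-table still induces the optimal policy; larger `sigma` (tighter budgets) increasingly randomizes the greedy choice.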
- Deep Q-Network (DQN) with DP-SGD, DP-Adam, DP-Shoe, or DP-FN:
  - Per-sample gradients are clipped and Gaussian noise is added before each optimizer step.
- Proximal Policy Optimization (PPO):
- Only the actor network updates are privatized using DP-SGD, DP-Adam, or DP-Shoe.
- Critic updates remain non-private.
- The same gradient clipping and Gaussian noise are applied as in DQN.
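The privatization step shared by DQN and the PPO actor can be sketched as a clipped, noised gradient update. The function name, clipping norm, and toy gradients below are assumptions for illustration:

```python
import numpy as np

def dp_gradient_step(params, per_example_grads, lr, clip_norm, noise_mult, rng):
    """One DP-SGD-style update: clip each per-example gradient to
    `clip_norm`, average, then add Gaussian noise scaled by the
    noise multiplier before applying the step."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    g_bar = np.mean(clipped, axis=0)
    noise = rng.normal(0.0, noise_mult * clip_norm / len(per_example_grads),
                       size=g_bar.shape)
    return params - lr * (g_bar + noise)

rng = np.random.default_rng(0)
params = np.zeros(3)
grads = [np.array([3.0, 4.0, 0.0]), np.array([0.0, 0.0, 2.0])]  # norms 5, 2
new_params = dp_gradient_step(params, grads, lr=0.1, clip_norm=1.0,
                              noise_mult=1.0, rng=rng)
print(new_params)
```

Clipping bounds each sample's influence on the step (the sensitivity), which is what lets the Gaussian noise translate into a DP guarantee under composition.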
Privacy parameters in experiments:
- $\varepsilon$ is computed using TensorFlow-Privacy's RDP-to-DP conversion.
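As an illustration of the RDP-to-DP conversion, a minimal sketch for a single Gaussian-mechanism release (the full TensorFlow-Privacy accountant additionally handles subsampling amplification and composition across steps, which this sketch omits):

```python
import numpy as np

def gaussian_rdp_to_dp(sigma, sensitivity, delta, alphas=np.arange(2, 128)):
    """Renyi-DP of the Gaussian mechanism, eps(alpha) = alpha*Delta^2/(2*sigma^2),
    converted to (eps, delta)-DP via eps = min_alpha [eps(alpha) + log(1/delta)/(alpha-1)]."""
    rdp = alphas * sensitivity**2 / (2.0 * sigma**2)
    eps = rdp + np.log(1.0 / delta) / (alphas - 1.0)
    return eps.min()

# e.g. noise scale 4x the sensitivity at delta = 1e-5
print(gaussian_rdp_to_dp(sigma=4.0, sensitivity=1.0, delta=1e-5))
```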
4. Quantitative Reward-Distance Metrics
PRIL assesses reconstruction error with four metrics, operating on normalized reward vectors $R$ and $\hat{R}$:

| Metric | Formula | Description |
|---|---|---|
| $\ell_1$ distance | $\sum_s \lvert R(s) - \hat{R}(s) \rvert$ | Total variation |
| $\ell_2$ distance | $\big(\sum_s (R(s) - \hat{R}(s))^2\big)^{1/2}$ | Euclidean error |
| $\ell_\infty$ distance | $\max_s \lvert R(s) - \hat{R}(s) \rvert$ | Maximum deviation |
| Sign-change count | $\sum_s \mathbb{1}[\operatorname{sign} R(s) \ne \operatorname{sign} \hat{R}(s)]$ | Reward sign flips across states |
These metrics directly quantify the adversary’s ability to reconstruct the true underlying reward.
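The four metrics are straightforward to compute in NumPy; the max-absolute-value normalization below is an assumption about the normalization scheme:

```python
import numpy as np

def reward_distances(R, R_hat):
    """The four reconstruction-error metrics on reward vectors,
    normalized (by assumption) to unit max absolute value."""
    R = R / max(np.abs(R).max(), 1e-12)
    R_hat = R_hat / max(np.abs(R_hat).max(), 1e-12)
    diff = R - R_hat
    return {
        "l1": np.abs(diff).sum(),                # total variation
        "l2": np.sqrt((diff ** 2).sum()),        # Euclidean error
        "linf": np.abs(diff).max(),              # maximum deviation
        "sign_changes": int(np.sum(np.sign(R) != np.sign(R_hat))),
    }

d = reward_distances(np.array([0.0, 1.0, -1.0]),
                     np.array([0.1, 0.9, -0.8]))
print(d)
```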
5. Empirical Evaluation: FrozenLake Benchmarks and Results
Experiments are carried out on 24 customized FrozenLake domains (12 grids at each of two sizes), featuring cell types Safe (S), Frozen (F), Hole (H), High-reward (A), and Goal (G), with near-deterministic transitions (a small slip factor).
- Algorithms Evaluated: VI-DP-Bellman, DQN-DP-SGD, DQN-DP-Shoe, DQN-DP-Adam, DQN-DP-FN, PPO-DP-SGD, PPO-DP-Shoe, PPO-DP-Adam, plus non-private baselines.
- Training Protocols: DQN/PPO use 15 epochs, 200 iterations, batch size 50, micro-batches of 5, learning rate 0.15, and a fixed discount factor $\gamma$. VI runs Bellman updates until convergence.
Policy extraction and reward reconstruction are performed for each privacy budget $\varepsilon$ and algorithm over 10 random seeds, and the $\ell_1$, $\ell_2$, $\ell_\infty$, and sign-change distances are computed.
Key findings:
- All four reconstruction-error metrics are flat as a function of $\varepsilon$: varying the privacy budget does not substantially change $d(R, \hat{R}_p)$.
- Aggregated reconstruction error ranks the methods as DQN variants > PPO variants > VI-DP-Bellman, with the same ordering holding across environments.
- The utility–privacy trade-off is present for policy returns (lower $\varepsilon$ reduces return), but reward-reconstruction error remains insensitive to $\varepsilon$.
6. Privacy-Gap Analysis and Implications
A fundamental insight is that differential-privacy mechanisms protecting policy updates (such as per-iteration gradient noise or DP-Bellman) do not guarantee privacy for the underlying reward function. Inverse RL attacks in PRIL can reconstruct $\hat{R}$ from $\pi_p$ with a small, near-constant error largely independent of $\varepsilon$. Classical non-deep methods (VI-DP-Bellman) provide less reward privacy than deep RL approaches (DQN/PPO), yet even the deep methods' improvement is insufficient for privacy-critical domains.
This suggests a significant mismatch between the theoretical privacy guarantees (in terms of $(\varepsilon, \delta)$-DP applied during training) and the protection actually required to conceal the true reward. A plausible implication is that future DP-RL research must either develop reward-centric privacy mechanisms that directly privatize the reward $R$, or redefine DP guarantees to explicitly bound reward-reconstruction error under adversarial inverse RL.
7. Summary and Directions
PRIL provides a rigorous framework for adversarial assessment of reward privacy in RL. It leverages finite-state LP inverse RL to reconstruct the reward from policies trained under various DP mechanisms and quantifies the privacy leakage via multiple reward-distance metrics. Empirical results clearly indicate a gap: prevailing DP-RL techniques do not prevent effective reward reconstruction, exposing the need for fundamentally stronger privacy definitions and mechanisms in RL policy deployment (Prakash et al., 2021).