Critic Regularized Regression (CRR)
- CRR is an offline RL framework that uses a learned Q-function to filter actions based on advantage estimates for robust policy learning.
- It mitigates extrapolation errors by emphasizing high-value actions and ignoring suboptimal, out-of-distribution actions via value-filtered regression.
- Empirical evaluations show CRR achieves competitive performance against state-of-the-art methods in high-dimensional and complex RL tasks.
Critic Regularized Regression (CRR) is a framework for offline (batch) reinforcement learning (RL) that uses critic-driven regularization during policy learning to address extrapolation error and stabilize policy optimization from static datasets. It combines policy evaluation on a fixed dataset with a value-filtered regression objective for the policy, using a learned Q-function (the critic) to decide which dataset actions are imitated. CRR belongs to a broader family of critic-regularized methods, with theoretical and empirical links to conservative Q-learning (CQL) and one-step advantage-weighted regression. Its design offers a practical and robust approach for learning policies from offline data, especially in high-dimensional and real-world RL tasks (Wang et al., 2020; Eysenbach et al., 2023).
1. Mathematical Formulation
CRR proceeds by alternately updating:
- The critic (Q-function), via distributional temporal-difference (TD) loss,
- The policy (actor), via a value-filtered regression objective that emphasizes actions supported by high Q-values.
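For a fixed dataset $\mathcal{B}$ of transitions $(s, a, r, s')$ sampled under an unknown behavioral policy $\mu_B$, CRR employs:
Critic update: $$\mathcal{L}_Q(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{B}}\Big[ D\big(Q_\theta(s,a),\; r + \gamma\, \mathbb{E}_{a' \sim \pi_\phi(\cdot|s')}[Q_{\theta'}(s',a')]\big) \Big]$$ where $D$ is typically a cross-entropy or MSE, and $\theta'$ are target network parameters.
Policy (actor) update: $$\max_\phi\; \mathbb{E}_{(s,a) \sim \mathcal{B}}\big[ f(Q_\theta, \pi_\phi, s, a)\, \log \pi_\phi(a|s) \big]$$ where $f$ is a non-negative filter computed from the advantage estimate $$\hat{A}(s,a) = Q_\theta(s,a) - \frac{1}{m}\sum_{j=1}^{m} Q_\theta(s, a^j)$$ with actions $a^j$ sampled from $\pi_\phi(\cdot|s)$. The filter $f$ is crucial and may be binary ($f = \mathbb{1}[\hat{A}(s,a) > 0]$) or exponential ($f = \exp(\hat{A}(s,a)/\beta)$). This weighting restricts imitation to actions judged superior by the critic (Wang et al., 2020).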
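The two filters can be illustrated with a small standard-library Python sketch. It is an illustration under stated assumptions rather than the authors' implementation: the function names, the toy numbers, and the clipping of the exponential weight (`clip=20.0`) are all hypothetical choices made for this example.

```python
import math

def advantage_mean(q_sa, q_policy_samples):
    """A_hat(s, a) = Q(s, a) - mean_j Q(s, a^j), with a^j drawn from the current policy."""
    return q_sa - sum(q_policy_samples) / len(q_policy_samples)

def binary_filter(adv):
    """Keep a dataset action only if the critic rates it above the policy average."""
    return 1.0 if adv > 0 else 0.0

def exp_filter(adv, beta=1.0, clip=20.0):
    """Soft weight exp(A_hat / beta); the clip is an assumption for numerical safety."""
    return min(math.exp(adv / beta), clip)

def crr_actor_loss(log_probs, q_data, q_policy_samples, filt):
    """Negative filtered log-likelihood, averaged over a minibatch."""
    total = 0.0
    for logp, q_sa, q_samp in zip(log_probs, q_data, q_policy_samples):
        weight = filt(advantage_mean(q_sa, q_samp))
        total += -weight * logp
    return total / len(log_probs)

# Toy minibatch of three dataset transitions (all numbers are made up).
log_probs = [-0.5, -1.2, -0.3]                            # log pi(a|s) of dataset actions
q_data = [1.0, 0.2, 0.6]                                  # Q(s, a) of dataset actions
q_policy_samples = [[0.5, 0.7], [0.4, 0.6], [0.6, 0.6]]   # Q(s, a^j), m = 2 samples

loss_bin = crr_actor_loss(log_probs, q_data, q_policy_samples, binary_filter)
loss_exp = crr_actor_loss(log_probs, q_data, q_policy_samples, exp_filter)
```

With these numbers the advantages are 0.4, -0.3, and 0.0, so the binary filter keeps only the first transition, while the exponential filter keeps all three with different weights.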
2. Regularization Mechanism and Stability
CRR's core innovation is using the critic as a regularizer in the actor update, selectively copying actions from the dataset based on the critic's advantage estimate. This prevents the policy from imitating sub-optimal or out-of-distribution actions and thereby mitigates extrapolation error. In offline RL this error arises when the policy proposes actions that the data does not support, which leads to uncontrolled and often over-optimistic Q-value estimates. Compared to standard policy gradients or behavioral cloning, CRR's value-weighted regression targets high-value actions while ignoring (or downweighting) poor ones, which naturally constrains the learned policy to regions well supported by the dataset (Wang et al., 2020).
3. Policy Update Derivation and Algorithm
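The exponential-filtered policy update in CRR represents a regularized policy optimization step: $$\pi^*(\cdot|s) = \arg\max_{\pi}\; \mathbb{E}_{a \sim \pi(\cdot|s)}[Q_\theta(s,a)] - \beta\, \mathrm{KL}\big(\pi(\cdot|s)\,\|\,\mu_B(\cdot|s)\big)$$ where $\mu_B$ is the empirical behavior policy and $\beta > 0$ is the temperature. The analytic solution is
$$\pi^*(a|s) = \frac{1}{Z(s)}\, \mu_B(a|s)\, \exp\big(Q_\theta(s,a)/\beta\big),$$
with $Z(s)$ the normalizing constant, and the parametric policy $\pi_\phi$ is optimized to minimize the cross-entropy between $\pi^*$ and $\pi_\phi$, implemented via dataset samples weighted by $\exp(\hat{A}(s,a)/\beta)$ or $\mathbb{1}[\hat{A}(s,a) > 0]$ as described above (Wang et al., 2020). Subtracting a state-dependent baseline from $Q_\theta$ only rescales $Z(s)$, which is why the advantage estimate $\hat{A}$ can stand in for $Q_\theta$ in the exponent. No additional regularization penalty is applied; critic-driven filtering acts as the only regularizer.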
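For a single state with a handful of discrete actions, the closed-form solution of a KL-regularized step, $\pi^*(a|s) \propto \mu_B(a|s)\exp(Q(s,a)/\beta)$, can be computed directly. The sketch below is only a tabular illustration; the function name and the numbers are hypothetical, and CRR itself fits a parametric policy to samples rather than evaluating this expression.

```python
import math

def kl_regularized_policy(mu, q, beta):
    """Return pi*(a|s) proportional to mu(a|s) * exp(Q(s, a) / beta) for one state."""
    q_max = max(q)  # subtracting a per-state constant leaves pi* unchanged
    unnorm = [p * math.exp((qa - q_max) / beta) for p, qa in zip(mu, q)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

mu = [0.5, 0.3, 0.2]   # hypothetical behavior policy over three actions
q = [0.0, 1.0, 2.0]    # hypothetical critic values

pi_soft = kl_regularized_policy(mu, q, beta=1.0)    # moderate shift toward high Q
pi_sharp = kl_regularized_policy(mu, q, beta=0.1)   # nearly greedy within the data support
pi_flat = kl_regularized_policy(mu, q, beta=1e6)    # stays essentially at mu
```

A small temperature concentrates mass on the best supported action, a large one recovers behavioral cloning, and any action with zero probability under $\mu_B$ keeps zero probability under $\pi^*$.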
Algorithmic structure:
Each iteration consists of (1) sampling a minibatch, (2) one gradient step on the actor loss via filtered regression, (3) critic TD update, and (4) periodic target network synchronization.
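The four steps can be sketched on a tiny tabular problem. This is a simplified illustration, not the published algorithm: it assumes discrete states and actions, a softmax policy over a table of logits, a scalar (non-distributional) critic, an exact expectation over actions in place of $m$ sampled actions, the exponential filter with a hypothetical clip at 20, and per-transition updates that approximate one minibatch gradient step.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    e = [math.exp(x - m) for x in logits]
    z = sum(e)
    return [x / z for x in e]

def crr_tabular(dataset, n_states, n_actions, steps=2000, batch_size=8,
                gamma=0.9, beta=1.0, lr_actor=0.1, lr_critic=0.1,
                target_period=100, seed=0):
    rng = random.Random(seed)
    logits = [[0.0] * n_actions for _ in range(n_states)]   # actor parameters
    q = [[0.0] * n_actions for _ in range(n_states)]        # critic
    q_target = [row[:] for row in q]                        # target critic
    for step in range(1, steps + 1):
        # (1) sample a minibatch of (s, a, r, s_next, done) transitions
        batch = [rng.choice(dataset) for _ in range(batch_size)]
        for s, a, r, s_next, done in batch:
            # (2) actor: ascend the filtered log-likelihood of the dataset action
            pi = softmax(logits[s])
            v = sum(p * qa for p, qa in zip(pi, q[s]))       # exact E_pi[Q], an assumption
            w = min(math.exp((q[s][a] - v) / beta), 20.0)    # exponential filter, clipped
            for b in range(n_actions):
                grad = (1.0 if b == a else 0.0) - pi[b]      # d log pi(a|s) / d logit_b
                logits[s][b] += lr_actor * w * grad / batch_size
            # (3) critic: TD update toward r + gamma * E_{a' ~ pi}[Q_target(s', a')]
            pi_next = softmax(logits[s_next])
            v_next = 0.0 if done else sum(
                p * qa for p, qa in zip(pi_next, q_target[s_next]))
            q[s][a] += lr_critic * (r + gamma * v_next - q[s][a])
        # (4) periodic target network synchronization
        if step % target_period == 0:
            q_target = [row[:] for row in q]
    return logits, q

# One-state toy dataset: action 1 yields reward 1, action 0 yields reward 0.
dataset = [(0, 0, 0.0, 0, True), (0, 1, 1.0, 0, True)]
logits, q = crr_tabular(dataset, n_states=1, n_actions=2)
pi = softmax(logits[0])
```

On this toy dataset the behavior policy is uniform, so the learned policy should settle near the KL-regularized solution proportional to $\exp(Q/\beta)$, preferring action 1 without becoming fully greedy.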
4. Comparison with Other Critic Regularization Methods
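CRR formalizes one approach within the broader family of critic-regularized algorithms, with CQL as a prototypical example (Eysenbach et al., 2023). In CQL, the critic update includes an explicit penalty: $$\mathcal{L}_{\text{CQL}}(\theta) = \mathcal{L}_{\text{TD}}(\theta) + \lambda\, \mathbb{E}_{s \sim \mathcal{B}}\Big[ \mathbb{E}_{a \sim \pi(\cdot|s)}[Q_\theta(s,a)] - \mathbb{E}_{a \sim \mu_B(\cdot|s)}[Q_\theta(s,a)] \Big]$$ where $\mathcal{L}_{\text{TD}}$ is the TD loss, $\mu_B$ is the behavior policy, and $\lambda$ controls the regularization strength. The CRR and CQL schemes are tightly linked: for $\lambda = 1$, CQL's solution reduces to a one-step reverse-KL-like regularization equivalent to CRR's policy objective. For intermediate $\lambda$, the resulting policies are nearly indistinguishable from those produced by one-step RL (argmax match >95% in discrete tabular domains over the range of $\lambda$ considered).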
A summary of the connections:
| Method | Regularization Mechanism | Relation to CRR |
|---|---|---|
| CQL | Critic penalized for OOD actions | Equivalent to CRR for $\lambda = 1$ |
| One-step RL | Single policy update, strong regularization | Special case of CQL/CRR, recovers similar policy |
| CRR | Value-filtered regression | General framework |
CRR avoids direct Q maximization via policy gradients and, through filtering, can be interpreted as a regularized policy improvement (with either hard or soft filters) (Wang et al., 2020, Eysenbach et al., 2023).
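To make the critic-side penalty concrete, the sketch below evaluates a CQL-style term for one state with discrete actions. It assumes the penalty takes the form $\lambda\,(\mathbb{E}_{a\sim\pi}[Q(s,a)] - \mathbb{E}_{a\sim\mu_B}[Q(s,a)])$; the function name and numbers are hypothetical.

```python
def cql_style_penalty(q_row, pi_row, mu_row, lam):
    """lam * (E_{a~pi}[Q(s, a)] - E_{a~mu}[Q(s, a)]) for one state."""
    e_pi = sum(p * q for p, q in zip(pi_row, q_row))
    e_mu = sum(p * q for p, q in zip(mu_row, q_row))
    return lam * (e_pi - e_mu)

q_row = [1.0, 3.0]    # hypothetical critic values for two actions
mu_row = [0.9, 0.1]   # behavior policy rarely takes action 1
pi_row = [0.1, 0.9]   # learned policy strongly prefers action 1

penalty = cql_style_penalty(q_row, pi_row, mu_row, lam=1.0)
no_shift = cql_style_penalty(q_row, mu_row, mu_row, lam=1.0)
```

The penalty is positive when the policy puts more mass than the data on actions the critic rates highly, and it vanishes when the policy matches the behavior policy. CRR reaches a comparable effect from the actor side by filtering which dataset actions are imitated.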
5. Network Architectures and Hyperparameters
CRR is implemented in deep RL settings using flexible architectures suited to the high-dimensional and partially observable tasks found in common benchmarks:
- Vision/proprioceptive stack: Small ResNet processing the camera views, concatenated with proprioceptive data, followed by a 4-block residual MLP (hidden size 1024, layernorm, ReLU)
- Value (critic) head: Linear output producing distributional $Q$-values (21 atoms over a fixed value range)
- Policy head: Mixture of Gaussians (5 components, mean/diagonal covariance); mean used at evaluation
- Recurrence: LSTM layers (size 1024) for egocentric/partially observable settings
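A categorical critic head of this kind outputs one logit per atom, and a scalar Q-value is recovered as the expectation over the atom locations. The sketch below shows that readout only; the support bounds `v_min` and `v_max` are hypothetical placeholders, since the value range is not given here.

```python
import math

def make_support(v_min, v_max, n_atoms=21):
    """Evenly spaced atom locations between v_min and v_max."""
    step = (v_max - v_min) / (n_atoms - 1)
    return [v_min + i * step for i in range(n_atoms)]

def expected_q(atom_logits, support):
    """Softmax over atom logits, then the expectation of the atom values."""
    m = max(atom_logits)
    e = [math.exp(x - m) for x in atom_logits]
    z = sum(e)
    return sum((x / z) * v for x, v in zip(e, support))

support = make_support(v_min=0.0, v_max=100.0)    # hypothetical range
uniform_q = expected_q([0.0] * 21, support)       # uniform distribution over atoms
peaked_q = expected_q([0.0] * 20 + [50.0], support)  # mass concentrated on the last atom
```

The scalar value obtained this way is what the advantage estimate and the filter consume; the distributional form matters only for the critic's own loss.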
Key hyperparameters:
- Adam optimizer, with a learning rate set for both actor and critic
- Batch size: 1024 (feed-forward), 128 (recurrent)
- Filter temperature $\beta$ (exponential filter)
- Advantage sampler: $m$ policy samples per state
- Critic atoms: 21 (CRR); target update every 100 steps
Filter selection is task-dependent: binary and binary-max excel on simple tasks, while the exponential filter is advantageous on complex, high-dimensional environments (Wang et al., 2020).
6. Empirical Performance and Ablations
CRR outperforms several state-of-the-art offline RL algorithms in both low- and high-dimensional domains:
- DeepMind Control Suite: CRR achieves normalized returns competitive with D4PG and ABM, surpassing BCQ and BC on control, manipulation, and locomotion tasks.
- Locomotion and Manipulation: CRR-variant returns (exp and bin) show significant gains, with mean success counts substantially higher than D4PG and BCQ, especially on vision-based and high-DoF manipulation.
- Qualitative ablations:
  - Binary filters aggressively remove suboptimal actions, excelling on simple tasks with good coverage
  - The exponential filter is more permissive and is preferable with abundant, high-quality data in large state/action spaces
  - Turning off policy noise at evaluation consistently improves final returns, especially on termination-sensitive tasks
  - Critic-Weighted Policy (CWP): At test time, resampling candidate actions from the policy in proportion to $\exp(Q_\theta(s,a)/\beta)$ yields 2–5% further performance boosts
  - K-step returns for advantage estimation or noise injection during evaluation can degrade offline policy quality
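The critic-weighted policy mentioned above can be sketched as a resampling step: draw several candidate actions from the policy, score them with the critic, and pick one with probability proportional to the exponentiated Q-value. The function name, the temperature, and the candidate values below are assumptions for illustration.

```python
import math
import random

def critic_weighted_action(candidates, q_values, beta=1.0, rng=random):
    """Pick one candidate with probability proportional to exp(Q / beta)."""
    q_max = max(q_values)  # shift for numerical stability; probabilities are unchanged
    weights = [math.exp((q - q_max) / beta) for q in q_values]
    return rng.choices(candidates, weights=weights, k=1)[0]

rng = random.Random(0)
candidates = [-0.4, 0.1, 0.7]   # hypothetical actions sampled from the policy
q_values = [0.0, 1.0, 5.0]      # hypothetical critic scores for those actions

picks = [critic_weighted_action(candidates, q_values, beta=0.5, rng=rng)
         for _ in range(1000)]
frac_best = picks.count(0.7) / len(picks)
```

Because candidates come from the learned policy, the resampled action stays within what the policy already considers plausible; the critic only reorders preferences among them.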
7. Lower Bounds and Theoretical Guarantees
CRR enjoys pessimistic value estimation guarantees, similar to those proven for CQL (Eysenbach et al., 2023): $$\mathbb{E}_{a \sim \pi(\cdot|s)}\big[\hat{Q}(s,a)\big] \le V^{\pi}(s)$$ given suitable realizability and convergence assumptions, meaning that the learned Q-function $\hat{Q}$ does not overestimate the value of the current policy $\pi$. For a regularization coefficient $\lambda = 1$, CRR recovers a behavior-regularized Q (i.e., one anchored to the behavior policy's value $Q^{\mu_B}$), which is a lower bound for any policy $\pi$ that remains close to the behavior policy $\mu_B$. This property is essential to ensuring safe and robust deployment in offline settings prone to value overestimation. Empirically, in tasks demanding strong regularization or where extrapolation errors are prevalent, CRR and CQL produce similar and competitive performance (Eysenbach et al., 2023).
References:
- "Critic Regularized Regression" (Wang et al., 2020)
- "A Connection between One-Step Regularization and Critic Regularization in Reinforcement Learning" (Eysenbach et al., 2023)