
Critic Regularized Regression (CRR)

Updated 28 April 2026
  • CRR is an offline RL framework that uses a learned Q-function to filter actions based on advantage estimates for robust policy learning.
  • It mitigates extrapolation errors by emphasizing high-value actions and ignoring suboptimal, out-of-distribution actions via value-filtered regression.
  • Empirical evaluations show CRR achieves competitive performance against state-of-the-art methods in high-dimensional and complex RL tasks.

Critic Regularized Regression (CRR) is a framework for offline (batch) reinforcement learning (RL) that uses critic-driven regularization during policy learning to address extrapolation error and stabilize policy optimization from static datasets. It combines policy evaluation on a fixed dataset with a value-filtered regression objective for the policy, using a learned Q-function (the critic) to decide which dataset actions are imitated. CRR belongs to a broader family of critic-regularized methods, with theoretical and empirical links to conservative Q-learning (CQL) and one-step advantage-weighted regression. Its design offers a practical and robust approach to learning policies from offline data, especially in high-dimensional and real-world RL tasks (Wang et al., 2020; Eysenbach et al., 2023).

1. Mathematical Formulation

CRR proceeds by alternately updating:

  • The critic (Q-function), via a distributional temporal-difference (TD) loss;
  • The policy (actor), via a value-filtered regression objective that emphasizes dataset actions with high Q-values.

For a fixed dataset $\mathcal{B}$ of transitions $(s, a, r, s')$ sampled under an unknown behavioral policy $\beta(a|s)$, CRR employs:

Critic update:

$$\mathcal{L}_{\rm critic}(\theta) = \mathbb{E}_{(s,a,r,s')\sim\mathcal{B}} \left[ D\big(Q_\theta(s,a),~ r + \gamma\, \mathbb{E}_{a'\sim\pi_{\phi'}(s')}\left[Q_{\theta'}(s',a')\right]\big) \right]$$

where $D(\cdot,\cdot)$ is typically a cross-entropy or MSE, and $(\theta', \phi')$ are target network parameters.
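
The critic target can be sketched in a few lines. This is a minimal illustration, assuming $D$ is the squared error and the inner expectation over $a'$ is approximated by averaging a few actions sampled from the target policy. The names `td_target` and `critic_loss`, and the callables `q`, `q_target`, and `sample_target_action`, are hypothetical stand-ins rather than part of any published CRR code.

```python
from typing import Callable, Sequence, Tuple

# (s, a, r, s_next); scalars are used purely for illustration.
Transition = Tuple[float, float, float, float]
QFn = Callable[[float, float], float]


def td_target(r: float, s_next: float, q_target: QFn,
              sample_target_action: Callable[[float], float],
              gamma: float = 0.99, n_samples: int = 4) -> float:
    """r + gamma * mean_j Q_target(s', a'_j), with a'_j drawn from the target policy."""
    next_q = [q_target(s_next, sample_target_action(s_next)) for _ in range(n_samples)]
    return r + gamma * sum(next_q) / n_samples


def critic_loss(batch: Sequence[Transition], q: QFn, q_target: QFn,
                sample_target_action: Callable[[float], float],
                gamma: float = 0.99) -> float:
    """Mean squared TD error over the batch (MSE chosen as D; an assumption)."""
    errors = [
        (q(s, a) - td_target(r, s_next, q_target, sample_target_action, gamma)) ** 2
        for (s, a, r, s_next) in batch
    ]
    return sum(errors) / len(errors)
```

A distributional critic would replace the squared error with a cross-entropy between categorical value distributions; the structure of the target stays the same.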

Policy (actor) update:

$$\max_{\phi}\; \mathbb{E}_{(s,a)\sim\mathcal{B}}\left[ f\left(\hat A_\theta(s,a)\right)\, \log\pi_\phi(a|s) \right]$$

where

$$\hat A_\theta(s,a) = Q_\theta(s,a) - \frac{1}{m}\sum_{j=1}^m Q_\theta(s, \tilde a_j)$$

with actions $\tilde a_j$ sampled from $\pi_\phi(\cdot|s)$. The filter $f$ is crucial and may be binary ($f = \mathbb{1}[\hat A_\theta(s,a) > 0]$) or exponential ($f = \exp(\hat A_\theta(s,a)/\tau)$, with temperature $\tau > 0$). This weighting restricts imitation to actions judged superior by the critic (Wang et al., 2020).
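
The advantage estimate and the two filters can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the temperature default is arbitrary, and the cap on the exponential weight (`max_weight`) is an assumption added here for numerical safety.

```python
import math
from typing import Callable, Sequence, Tuple

QFn = Callable[[float, float], float]


def advantage(q: QFn, s: float, a: float, policy_actions: Sequence[float]) -> float:
    """Q(s, a) minus the mean Q over m actions sampled from the current policy."""
    baseline = sum(q(s, a_j) for a_j in policy_actions) / len(policy_actions)
    return q(s, a) - baseline


def binary_filter(adv: float) -> float:
    """Indicator filter: imitate only actions with strictly positive advantage."""
    return 1.0 if adv > 0.0 else 0.0


def exp_filter(adv: float, temperature: float = 1.0, max_weight: float = 20.0) -> float:
    """Exponential filter exp(adv / temperature), capped at max_weight (assumed cap)."""
    return math.exp(min(adv / temperature, math.log(max_weight)))


def actor_loss(batch: Sequence[Tuple[float, float]], q: QFn,
               log_prob: Callable[[float, float], float],
               sample_policy_actions: Callable[[float], Sequence[float]],
               filter_fn: Callable[[float], float]) -> float:
    """Negative filtered log-likelihood of dataset actions.

    The filter weight is treated as a constant: no gradient flows through it.
    """
    total = 0.0
    for s, a in batch:
        weight = filter_fn(advantage(q, s, a, sample_policy_actions(s)))
        total += weight * log_prob(s, a)
    return -total / len(batch)
```

With the binary filter, dataset actions whose estimated advantage is not positive contribute nothing to the loss; with the exponential filter, every dataset action contributes, with a weight that grows with its advantage.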

2. Regularization Mechanism and Stability

CRR's core innovation is using the critic as a regularizer in the actor update: the policy copies dataset actions selectively, according to the critic's advantage estimate. This keeps the policy from imitating suboptimal actions and from proposing actions the data does not support, which is the source of extrapolation error in offline RL, where Q-values for unsupported actions are uncontrolled and often over-optimistic. Compared to standard policy gradients or behavioral cloning, CRR's value-weighted regression targets high-value actions while ignoring (or downweighting) poor ones, which naturally constrains the learned policy to regions well supported by the dataset (Wang et al., 2020).

3. Policy Update Derivation and Algorithm

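The exponential-filtered policy update in CRR represents a regularized policy optimization step:

$$\pi^*(\cdot|s) = \arg\max_{\pi}\; \mathbb{E}_{a\sim\pi(\cdot|s)}\left[Q_\theta(s,a)\right] - \tau\, \mathrm{KL}\big(\pi(\cdot|s)\,\|\,\beta(\cdot|s)\big)$$

where $\beta(a|s)$ is the empirical behavior policy and $\tau > 0$ is the temperature. The analytic solution is

$$\pi^*(a|s) \propto \beta(a|s)\, \exp\big(Q_\theta(s,a)/\tau\big)$$

and the parametric policy $\pi_\phi$ is optimized to minimize the cross-entropy between $\pi^*$ and $\pi_\phi$, implemented via samples weighted by $\exp(\hat A_\theta(s,a)/\tau)$ or $\mathbb{1}[\hat A_\theta(s,a) > 0]$ as described above (Wang et al., 2020). Replacing $Q_\theta$ with $\hat A_\theta$ in the exponential weight only rescales it by a state-dependent factor, which leaves the per-state optimum unchanged. No additional regularization penalty is applied; critic-driven filtering acts as the only regularizer.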
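
The link between the KL-regularized optimum and exponentially weighted regression can be checked numerically in a single-state, discrete-action setting. The sketch below is an illustration under assumptions made here: a KL penalty with temperature `tau`, a known behavior distribution `beta`, and hypothetical function names. It compares the closed-form optimum, proportional to `beta(a) * exp(q(a) / tau)`, with the maximizer of the weighted log-likelihood.

```python
import math
from typing import List, Sequence


def kl_regularized_optimum(beta: Sequence[float], q: Sequence[float], tau: float) -> List[float]:
    """Closed-form maximizer of E_pi[q] - tau * KL(pi || beta) over the simplex."""
    unnorm = [b * math.exp(qa / tau) for b, qa in zip(beta, q)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def weighted_regression_optimum(beta: Sequence[float], q: Sequence[float],
                                tau: float, baseline: float) -> List[float]:
    """Maximizer of sum_a beta(a) * exp((q(a) - baseline) / tau) * log pi(a).

    For a tabular policy the maximizer is the normalized weight vector, so the
    state-dependent baseline cancels out.
    """
    weights = [b * math.exp((qa - baseline) / tau) for b, qa in zip(beta, q)]
    z = sum(weights)
    return [w / z for w in weights]


def objective(pi: Sequence[float], beta: Sequence[float], q: Sequence[float], tau: float) -> float:
    """E_pi[q] - tau * KL(pi || beta); terms with pi(a) = 0 contribute zero."""
    expected_q = sum(p * qa for p, qa in zip(pi, q))
    kl = sum(p * math.log(p / b) for p, b in zip(pi, beta) if p > 0.0)
    return expected_q - tau * kl


beta = [0.5, 0.3, 0.2]
q = [1.0, 2.0, 0.0]
tau = 1.0
pi_star = kl_regularized_optimum(beta, q, tau)
pi_regression = weighted_regression_optimum(beta, q, tau, baseline=1.3)
```

Here `pi_star` and `pi_regression` agree up to floating-point error for any choice of `baseline`, which is why subtracting the policy's mean Q-value in the advantage does not change the target of the regression.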

Algorithmic structure:

Each iteration consists of (1) sampling a minibatch, (2) one gradient step on the actor loss via filtered regression, (3) critic TD update, and (4) periodic target network synchronization.
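
The four steps can be put together in a toy tabular loop. This is a sketch under assumptions, not a reproduction of the published algorithm: the two-state dataset is invented for illustration, the critic is a table with a plain TD step rather than a distributional network, the advantage baseline is computed exactly instead of from $m$ sampled actions, and all names and step sizes are hypothetical.

```python
import math
import random

# Toy (s, a, r, s_next, done) transitions from a hypothetical 2-state, 2-action chain.
DATASET = [
    (0, 0, 0.2, 0, True),
    (0, 1, 0.0, 1, False),
    (1, 0, 0.0, 1, True),
    (1, 1, 1.0, 1, True),
]


def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_q(q_row, logits_row):
    """Exact expectation of Q under the softmax policy for one state."""
    return sum(p * qv for p, qv in zip(softmax(logits_row), q_row))


def train_crr(num_steps=2000, batch_size=4, gamma=0.9, lr_q=0.1, lr_pi=0.1,
              target_period=100, filter_name="binary", temperature=1.0, seed=0):
    rng = random.Random(seed)
    n_states, n_actions = 2, 2
    q = [[0.0] * n_actions for _ in range(n_states)]
    logits = [[0.0] * n_actions for _ in range(n_states)]
    q_tgt = [row[:] for row in q]
    logits_tgt = [row[:] for row in logits]

    for step in range(num_steps):
        # (1) Sample a minibatch from the fixed dataset.
        batch = [rng.choice(DATASET) for _ in range(batch_size)]

        # (2) Actor step: filtered log-likelihood ascent on dataset actions
        #     (applied per sample for brevity; the weight is a constant).
        for s, a, _, _, _ in batch:
            probs = softmax(logits[s])
            adv = q[s][a] - expected_q(q[s], logits[s])
            if filter_name == "binary":
                weight = 1.0 if adv > 0.0 else 0.0
            else:  # exponential filter with an assumed cap of 20
                weight = math.exp(min(adv / temperature, math.log(20.0)))
            for k in range(n_actions):
                grad_log_prob = (1.0 if k == a else 0.0) - probs[k]
                logits[s][k] += lr_pi * weight * grad_log_prob / batch_size

        # (3) Critic step: move Q(s, a) toward the TD target built from target tables.
        for s, a, r, s_next, done in batch:
            bootstrap = 0.0 if done else expected_q(q_tgt[s_next], logits_tgt[s_next])
            q[s][a] += lr_q * (r + gamma * bootstrap - q[s][a])

        # (4) Periodically synchronize the target critic and target policy.
        if (step + 1) % target_period == 0:
            q_tgt = [row[:] for row in q]
            logits_tgt = [row[:] for row in logits]

    return q, logits
```

Calling `train_crr()` returns the Q-table and policy logits; in this toy problem the learned policy should prefer action 1 in both states, since reaching state 1 and taking action 1 there yields the largest return.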

4. Comparison with Other Critic Regularization Methods

CRR formalizes one approach within the broader family of critic-regularized algorithms, with CQL as a prototypical example (Eysenbach et al., 2023). In CQL, the critic update includes an explicit penalty on the Q-values of actions not drawn from the dataset, with a coefficient that controls the regularization strength. The CRR and CQL schemes are tightly linked: at a particular setting of this coefficient, CQL’s solution reduces to a one-step reverse-KL–like regularization equivalent to CRR’s policy objective. For intermediate values of the coefficient, the resulting policies are nearly indistinguishable from those produced by one-step RL (argmax match >95% in discrete tabular domains).

A summary of the connections:

| Method | Regularization Mechanism | Relation to CRR |
|---|---|---|
| CQL | Critic penalized for OOD actions | Equivalent to CRR at a specific regularization strength |
| One-step RL | Single policy update, strong regularization | Special case of CQL/CRR, recovers similar policy |
| CRR | Value-filtered regression | General framework |

CRR avoids direct Q maximization via policy gradients and, through filtering, can be interpreted as a regularized policy improvement step with either hard (binary) or soft (exponential) filters (Wang et al., 2020; Eysenbach et al., 2023).

5. Network Architectures and Hyperparameters

CRR is implemented in deep RL settings using flexible architectures suited to the high-dimensional and partially observable tasks found in common benchmarks:

  • Vision/proprioceptive stack: Small ResNet processing camera views, concatenated with proprioceptive data, followed by a 4-block residual MLP (hidden size 1024, layernorm, ReLU)
  • Value (critic) head: Linear output producing distributional $Q$-values (21 atoms over a fixed value range)
  • Policy head: Mixture of Gaussians (5 components, mean/diagonal covariance); mean used at evaluation
  • Recurrence: LSTM layers (size 1024) for egocentric/partially observable settings
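
To illustrate how a distributional value head with 21 atoms yields a scalar Q-value, the sketch below converts a vector of atom logits into an expected value. The support bounds `v_min` and `v_max` are assumptions passed in by the caller, and the function names are hypothetical; the page does not specify how the head's output is reduced to a scalar.

```python
import math
from typing import List, Sequence


def atom_values(v_min: float, v_max: float, num_atoms: int = 21) -> List[float]:
    """Evenly spaced support points between v_min and v_max (assumed bounds)."""
    step = (v_max - v_min) / (num_atoms - 1)
    return [v_min + i * step for i in range(num_atoms)]


def expected_q_from_logits(logits: Sequence[float], v_min: float, v_max: float) -> float:
    """Softmax over atom logits, then the probability-weighted mean of the atoms."""
    atoms = atom_values(v_min, v_max, len(logits))
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return sum((e / total) * z for e, z in zip(exps, atoms))
```

The advantage used by the filter can then be computed from these scalar values, while the critic loss itself compares full categorical distributions.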

Key hyperparameters:

  • Adam optimizer, with the same learning rate for actor and critic
  • Batch size: 1024 (feed-forward), 128 (recurrent)
  • Filter temperature $\tau$ (exponential filter)
  • Advantage sampler: $m$ policy samples per state for the advantage baseline
  • Critic atoms: 21 (CRR); target update every 100 steps

Filter selection is task-dependent: binary and binary-max excel on simple tasks, while the exponential filter is advantageous on complex, high-dimensional environments (Wang et al., 2020).

6. Empirical Performance and Ablations

CRR outperforms several state-of-the-art offline RL algorithms in both low- and high-dimensional domains:

  • DeepMind Control Suite: CRR achieves normalized returns competitive with D4PG and ABM, surpassing BCQ and BC on control, manipulation, and locomotion tasks.
  • Locomotion and Manipulation: The CRR variants (exp and bin) show significant gains, with mean success counts substantially higher than those of D4PG and BCQ, especially on vision-based and high-DoF manipulation.
  • Qualitative ablations:
    • Binary filters aggressively remove suboptimal actions, excelling on simple tasks with good coverage
    • Exponential filter is more permissive; preferable with abundant, high-quality data in large state/action spaces
    • Turning off policy noise at evaluation consistently improves final returns, especially on termination-sensitive tasks
    • Critic-Weighted Policy (CWP): At test time, sampling actions from the policy reweighted by the critic's Q-values yields 2–5% further performance boosts
    • K-step returns for advantage estimation or noise injection during evaluation can degrade offline policy quality
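
The critic-weighted policy mentioned above can be sketched as a resampling step at action-selection time. This is a hedged illustration: it assumes that candidate actions are drawn from the learned policy and resampled with weights proportional to `exp(Q / temperature)`, and the function name, the number of candidates, and the temperature are hypothetical choices rather than values taken from the source.

```python
import math
import random
from typing import Callable, Optional, TypeVar

Action = TypeVar("Action")


def critic_weighted_action(state,
                           sample_action: Callable[[object, random.Random], Action],
                           q: Callable[[object, Action], float],
                           num_candidates: int = 10,
                           temperature: float = 1.0,
                           rng: Optional[random.Random] = None) -> Action:
    """Draw candidates from the policy, then resample one in proportion to exp(Q / temperature)."""
    rng = rng or random.Random()
    candidates = [sample_action(state, rng) for _ in range(num_candidates)]
    scores = [q(state, a) / temperature for a in candidates]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(sc - top) for sc in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

A low temperature makes the selection nearly greedy with respect to the critic among the sampled candidates, while a high temperature approaches plain sampling from the policy.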

7. Lower Bounds and Theoretical Guarantees

CRR enjoys pessimistic value estimation guarantees, similar to those proven for CQL (Eysenbach et al., 2023):

$$Q_\theta(s,a) \le Q^{\pi_\phi}(s,a)$$

given suitable realizability and convergence assumptions, meaning that the learned Q-function does not overestimate the value of the current policy. In a limiting regime of the regularization strength, CRR recovers a behavior-regularized Q-function, which is a lower bound for any policy $\pi$ that remains close to the behavior policy $\beta$. This property is essential to safe and robust deployment in offline settings prone to value overestimation. Empirically, in tasks that demand strong regularization or where extrapolation errors are prevalent, CRR and CQL achieve similar, competitive performance (Eysenbach et al., 2023).

