Offline RL via Generative Critics

Updated 11 May 2026

The paper demonstrates that generative critics improve offline RL by modeling data distributions and mitigating bias in value estimation.
It integrates various architectures, including distributional, adversarial, latent-variable, and flow-based critics, to enforce support constraints and calibrate uncertainty.
Empirical results show significant performance gains on challenging benchmarks through improved policy guidance and dense intermediate supervision.

Offline reinforcement learning (RL) via generative critics refers to a family of algorithms that integrate generative models (probabilistic sequence models, adversarial networks, or flow-based representations) into value-function estimation (the “critic”) and policy improvement. The aim is to address the distributional mismatch, statistical bias, and sample inefficiency endemic to offline RL, where no further interaction with the environment is permitted once the dataset has been collected. Modern instantiations span Return-Conditioned Supervised Learning (RCSL) with value-guided critics, generative adversarial critics enforcing support constraints, flow-matching critics that scale iterative compute, and generative latent-variable critics underpinning joint trajectory-return models.

1. Foundations and Motivation

Offline RL seeks to learn policies solely from a fixed batch of transition data $\mathcal{D}$, collected by a possibly suboptimal or unknown behavior policy $\pi_\beta$, in an MDP $(\mathcal{S},\mathcal{A},P,r,T)$. The central challenge is reliable policy improvement without interacting with the environment and without querying the “true” return (or value) of rarely or never-seen state–action pairs. Generative critics address this by modeling the data distribution or the return-generating process in a manner that is robust to distributional shift and enables principled support or uncertainty estimation.

RCSL approaches—exemplified by Decision Transformers (DT)—focus on modeling the conditional distribution of actions given state–return-to-go pairs, but vanilla DT and relatives suffer when high returns seen in the dataset arise from stochastic “lucky” transitions or rare events, leading to misalignment between the conditioned return and the true expected return under the policy. Generative critics enable explicit modeling (or regularization) of the mapping from trajectories or actions to expected returns, thus bridging sequence modeling and value-based learning (Wang et al., 2023).

2. Generative Critic Architectures

Generative critics in offline RL fall into several principal categories:

a. Distributional Critics in RCSL Frameworks

“Critic-Guided Decision Transformer” (CGDT) augments the DT with a distributional critic $Q_\phi(R_t \mid \tau_{0:t-1}, s_t, a_t)$, parameterized as a Gaussian over future returns. Instead of using Bellman backups, the critic is trained by minimizing an asymmetric negative log-likelihood that emphasizes high-return data, with the degree of asymmetry set by the bias parameter $\tau_c$ (Wang et al., 2023). The policy’s supervised loss is coupled with an expectile regression penalty that uses the critic to steer the policy toward actions whose estimated expected return matches or exceeds the specified return-to-go.
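As a rough illustration of an asymmetric Gaussian negative log-likelihood, the sketch below weights each sample's NLL by `tau_c` when the observed return lies above the predicted mean and by `1 - tau_c` otherwise. The weighting rule, the function names, and the scalar inputs are assumptions chosen for illustration; the exact CGDT objective may differ.

```python
import math


def gaussian_nll(ret, mean, std):
    """Negative log-likelihood of a scalar return under N(mean, std^2)."""
    return 0.5 * math.log(2.0 * math.pi * std ** 2) + 0.5 * ((ret - mean) / std) ** 2


def asymmetric_nll(ret, mean, std, tau_c):
    """Weight the NLL by tau_c for returns above the mean, 1 - tau_c otherwise.

    Assumed form: with tau_c > 0.5, under-predicted (high-return) samples
    dominate the loss, which pulls the critic toward high-return data.
    """
    weight = tau_c if ret > mean else 1.0 - tau_c
    return weight * gaussian_nll(ret, mean, std)


if __name__ == "__main__":
    # Two returns equally far from the predicted mean of 1.0.
    high = asymmetric_nll(2.0, mean=1.0, std=1.0, tau_c=0.8)
    low = asymmetric_nll(0.0, mean=1.0, std=1.0, tau_c=0.8)
    print(high > low)
```

With `tau_c = 0.5` the two sides are weighted equally and the loss is proportional to the ordinary Gaussian NLL.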

b. Generative Adversarial Critics with Support Constraints

DASCO (Dual-Generator Adversarial Support Constrained Offline RL) deploys two generators, a policy $G$ and an auxiliary generator $G_{\text{aux}}$, whose mixture matches the dataset distribution. Adversarial training with a discriminator ensures that the policy cannot allocate probability outside the support of the data. In effect, this yields in-support maximization of Q-values, allowing the policy to specialize on the “best” slice of the offline data without having to cover all suboptimal actions (Vuong et al., 2022).
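The goal of in-support maximization can be shown on a toy discrete problem. The sketch below is not DASCO's adversarial mechanism; it only contrasts an unconstrained argmax over Q-values with one restricted to actions the behavior data actually contains. The action names, Q-values, and probabilities are invented for illustration.

```python
def in_support_argmax(q_values, behavior_probs, eps=0.0):
    """Return the highest-Q action among actions with behavior probability > eps."""
    candidates = [a for a, p in behavior_probs.items() if p > eps]
    return max(candidates, key=lambda a: q_values[a])


if __name__ == "__main__":
    # "jump" never appears in the data, so its Q-value is an unreliable extrapolation.
    q_values = {"left": 1.0, "right": 2.0, "jump": 9.0}
    behavior_probs = {"left": 0.7, "right": 0.3, "jump": 0.0}

    print(max(q_values, key=q_values.get))              # unconstrained choice
    print(in_support_argmax(q_values, behavior_probs))  # in-support choice
```

The in-support choice ignores the unseen action even though its extrapolated Q-value is the largest, and it still prefers the better of the two observed actions over the more frequent one.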

c. Latent-Variable Joint Generative Critics

Generative Actor Critic (GAC) eschews explicit value function estimation, instead fitting a joint latent-variable generative model $p_\theta(\tau, y)$ over trajectories and returns (yields), using an autoregressive or variational inference framework. Policy improvement then reduces to inference queries—optimizing in latent space for exploitation, or posterior sampling for exploration under shifted return targets. This structure supports both robust offline optimization and smooth transfer to online fine-tuning contexts (Qin et al., 25 Dec 2025).

d. Flow-Matching Value Critics

floq parameterizes the Q-function as the solution of an explicit ODE (flow) in latent space, with a neural velocity field trained via flow-matching. Dense supervision at intermediate integration steps improves value learning, capacity scaling, and generalization properties relative to monolithic critics (Agrawalla et al., 8 Sep 2025).

3. Core Algorithms and Optimization Schemes

The following table contrasts representative generative critic approaches in offline RL.

Method	Critic Formulation	Policy Improvement/Query
CGDT (Wang et al., 2023)	Distributional critic (Gaussian) over returns	Expectile penalty, guided DT
DASCO (Vuong et al., 2022)	GAN with dual generators, TD Q-function	Discriminator-weighted Q maximization
GAC (Qin et al., 25 Dec 2025)	Latent-variable joint $p(\tau, y)$	Latent inference (exploitation/exploration)
floq (Agrawalla et al., 8 Sep 2025)	ODE flow-matching Q-function	Max Q via flow integration

Distributional Critics and Guided Conditioning (CGDT):

The distributional value function $Q_\phi$ is trained via NLL with a high-return bias (controlled by $\tau_c$).
The policy samples actions, then the critic’s predicted mean $\mu$ and variance $\sigma^2$ guide an expectile regression penalty.
Policy loss: the DT supervised loss plus the expectile regression penalty, weighted by the guidance coefficient.
Training alternates critic updates (maximum likelihood) and policy updates with an increasing guidance coefficient.
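A minimal sketch of how the supervised loss, the expectile penalty, and the guidance coefficient might fit together is given below. The symbols `tau_p` and `alpha`, the normalization by the critic's standard deviation, and the linear schedule are all assumptions made for illustration rather than the published CGDT formulation.

```python
def expectile_penalty(return_to_go, critic_mean, critic_std, tau_p):
    """Asymmetric squared error between the target return and the critic's mean.

    Assumed form: shortfalls (critic mean below the target) are weighted by
    tau_p, surpluses by 1 - tau_p, so tau_p > 0.5 penalizes shortfalls more.
    """
    u = (return_to_go - critic_mean) / critic_std
    weight = tau_p if u > 0 else 1.0 - tau_p
    return weight * u * u


def guidance_coefficient(step, total_steps, alpha_max):
    """Hypothetical schedule: grow linearly from 0 to alpha_max, then hold."""
    return alpha_max * min(1.0, step / total_steps)


def policy_loss(supervised_loss, penalty, alpha):
    """Supervised (behavior-cloning) loss plus the weighted critic guidance term."""
    return supervised_loss + alpha * penalty


if __name__ == "__main__":
    shortfall = expectile_penalty(10.0, critic_mean=8.0, critic_std=2.0, tau_p=0.9)
    surplus = expectile_penalty(10.0, critic_mean=12.0, critic_std=2.0, tau_p=0.9)
    alpha = guidance_coefficient(step=50, total_steps=100, alpha_max=0.2)
    print(shortfall > surplus, policy_loss(0.5, shortfall, alpha))
```

Starting with a small coefficient lets the policy first fit the data through the supervised term before the critic's guidance carries much weight, which is one plausible reading of the increasing schedule described above.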

Adversarial Generative Critics (DASCO):

The mixture of the policy and the auxiliary generator is matched to the data distribution by a GAN discriminator.
The policy is updated to maximize discriminator-weighted Q-values plus an in-support log-probability regularizer.
The critic's Q-function is trained by standard TD learning.
The dual-generator architecture achieves a theoretically optimal support constraint, preventing policy mass outside the dataset support.
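One way to see why mixture matching yields a support constraint is to solve for the auxiliary distribution that would make the mixture equal the data distribution. The sketch below does this for discrete actions, assuming an equal mixture weight of 0.5 and invented probabilities. If the policy places mass on an action the data never contains, the required auxiliary probability becomes negative, so no valid auxiliary generator exists.

```python
def auxiliary_for(data, policy, weight=0.5):
    """Solve weight * policy + (1 - weight) * aux = data for aux.

    Returns the auxiliary distribution as a dict, or None when some entry
    would be negative (the policy cannot be completed into a matching mixture).
    """
    aux = {a: (data[a] - weight * policy.get(a, 0.0)) / (1.0 - weight) for a in data}
    if any(p < -1e-12 for p in aux.values()):
        return None
    return aux


if __name__ == "__main__":
    data = {"left": 0.6, "right": 0.4, "jump": 0.0}

    # In-support policy that favors "right": the auxiliary generator covers the rest.
    print(auxiliary_for(data, {"left": 0.2, "right": 0.8, "jump": 0.0}))

    # Policy with mass on the unseen action "jump": no valid auxiliary exists.
    print(auxiliary_for(data, {"left": 0.0, "right": 0.5, "jump": 0.5}))
```

The first policy is more concentrated than the data yet still feasible, which matches the idea that the policy may specialize within the support while the auxiliary generator absorbs the remaining probability mass.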

Latent-Variable Joint Critics (GAC):

Full ELBO-based generative modeling approximates $p_\theta(\tau, y)$.
Policy improvement (exploitation) is performed in latent space by maximizing expected return under the critic’s return predictor, regularized by KL to the prior.
Exploration proceeds via posterior sampling conditioned on optimistic return targets, encouraging higher-return trajectory generation.
The same model serves both offline pretraining and online fine-tuning.
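The exploitation step can be sketched as gradient ascent on a latent vector. The toy below assumes a linear return predictor `w · z` and a unit-variance Gaussian posterior centered at `z` with a standard normal prior, for which the KL term equals `0.5 * ||z||^2`. The predictor, the weight `beta`, and the step size are illustrative assumptions, not the GAC model.

```python
def objective(z, weights, beta):
    """Predicted return minus beta * KL(N(z, I) || N(0, I)) = beta * 0.5 * ||z||^2."""
    predicted_return = sum(w * zi for w, zi in zip(weights, z))
    kl = 0.5 * sum(zi * zi for zi in z)
    return predicted_return - beta * kl


def optimize_latent(weights, beta, lr=0.1, steps=200):
    """Gradient ascent on the objective; its gradient with respect to z is w - beta * z."""
    z = [0.0] * len(weights)
    for _ in range(steps):
        z = [zi + lr * (w - beta * zi) for zi, w in zip(z, weights)]
    return z


if __name__ == "__main__":
    weights = [1.0, -2.0]
    z_star = optimize_latent(weights, beta=2.0)
    # For this toy objective the maximizer is w / beta in closed form.
    print(z_star, objective(z_star, weights, beta=2.0))
```

A larger `beta` keeps the optimized latent closer to the prior, which limits how far the query can move away from behavior represented in the data.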

Flow-Matching Value Critics (floq):

The Q-function is defined as the result of integrating a learned velocity field.
The training objective supervises every intermediate integration step, which regularizes learning and makes fuller use of the critic's capacity.
The number of integration steps governs a tradeoff between expressivity and stability, enabling scalable critic capacity.
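The sketch below illustrates the two ingredients in scalar form: Euler integration of a velocity field to produce a value, and a flow-matching loss evaluated at several intermediate times. It assumes a straight-line interpolant between a starting point `z0` and a stand-in target value, and it uses a closed-form velocity in place of a neural network. Conditioning on state and action is omitted, and all names are hypothetical.

```python
def integrate_q(velocity, z0, num_steps):
    """Euler-integrate dz/dt = velocity(z, t) from t = 0 to t = 1; return every iterate."""
    h = 1.0 / num_steps
    z, path = z0, [z0]
    for k in range(num_steps):
        z = z + h * velocity(z, k * h)
        path.append(z)
    return path


def flow_matching_loss(velocity, z0, target, times):
    """Mean squared error against the straight-line velocity (target - z0) at each time."""
    total = 0.0
    for t in times:
        z_t = (1.0 - t) * z0 + t * target  # point on the straight path at time t
        total += (velocity(z_t, t) - (target - z0)) ** 2
    return total / len(times)


TARGET_Q = 3.0  # stand-in for a TD target; illustrative only


def ideal_velocity(z, t):
    """Velocity that transports any z at time t < 1 to TARGET_Q by time 1."""
    return (TARGET_Q - z) / (1.0 - t)


if __name__ == "__main__":
    path = integrate_q(ideal_velocity, z0=-1.0, num_steps=4)
    loss = flow_matching_loss(ideal_velocity, -1.0, TARGET_Q, times=[0.0, 0.25, 0.5, 0.75])
    print(path[-1], loss)
```

Because the loss is evaluated at several times along the path rather than only at the endpoint, each intermediate iterate receives its own training signal, and `num_steps` controls how much computation is spent per value estimate.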

4. Empirical Results and Comparative Performance

All methods are benchmarked on challenging offline RL suites, particularly D4RL (locomotion, AntMaze, Maze2D), with an emphasis on suboptimal data and high-dimensional environments.

CGDT (Wang et al., 2023): Outperforms standard RCSL methods and competitive value-based methods (IQL, CQL) on both medium and umaze datasets, especially when reward feedback is sparse or data quality is poor. Demonstrates robust return tracking as a function of return-to-go conditioning.
DASCO (Vuong et al., 2022): Achieves superior performance on AntMaze with noisy and biased data, and matches or exceeds other methods on MuJoCo locomotion. Ablations confirm the necessity of the auxiliary generator for stability and in-support optimization.
GAC (Qin et al., 25 Dec 2025): Surpasses strong sequence model and actor-critic baselines in both offline pretraining and offline-to-online fine-tuning, including tasks with only total return feedback. Structured latent spaces correspond to semantically meaningful behavior clusters.
floq (Agrawalla et al., 8 Sep 2025): Yields up to 1.8× success rate gains over monolithic or ensemble critics on extremely long-horizon/sparse-reward tasks, highlighting the value of dense intermediate supervision and scalable computation.

5. Theoretical Insights and Support Constraints

A common thread among generative critic methods is the explicit or implicit enforcement of support constraints and the calibration of uncertainty in the face of limited data:

DASCO (Vuong et al., 2022): The dual-generator structure guarantees that the learned policy cannot allocate mass outside the support of the dataset $\mathcal{D}$, ensuring reliable value estimation and preventing over-optimistic Q extrapolation.
CGDT (Wang et al., 2023): The distributional critic aligns action selection with the expected value, mitigating the bias induced by treating rare, stochastic high returns as ground truth during imitation training.
GAC (Qin et al., 25 Dec 2025): By modeling the entire trajectory–return joint distribution, the critic supports robust inference under novel control objectives, and the structured latent bottleneck enables both exploitation and exploration.
floq (Agrawalla et al., 8 Sep 2025): Flow-matching imposes regularity on the Q-function, enabling improved generalization and stability via dense stepwise supervision.

6. Limitations and Open Directions

Generative critics in offline RL, while advancing state-of-the-art performance and reliability, present several open challenges:

Hyperparameter Sensitivity: Guidance coefficients, expectile and critic bias parameters (such as $\tau_c$), mixture weights, and latent space priors require tuning and may be dataset/task-dependent (Wang et al., 2023, Vuong et al., 2022, Qin et al., 25 Dec 2025).
Out-of-Distribution Action Risk: Overestimation via the critic can persist if value prediction is inaccurate or the mixture constraint is insufficiently tight, demanding better critic ensembles or calibration.
Scalability: While flow-matching and latent-variable models offer capacity scaling, excessive complexity or length can induce instability or intractability (Agrawalla et al., 8 Sep 2025).
Generalization Beyond Return-Level Supervision: GAC and similar frameworks excel with episodic returns, but dense, stepwise feedback or model-based planning incorporation remains a frontier (Qin et al., 25 Dec 2025).

Future work includes adaptive regularization, critic ensembles, hybrid model-based extensions, and deeper integration of sequence models or diffusion-style flows for uncertainty quantification and policy robustness.

7. Relationships to Broader Offline RL Literature

Generative critic approaches differ from classic distributional regularization (e.g., KL penalties or the BCQ/BRAC family) and from conservative/pessimistic Q-learning in that they explicitly model and enforce support and return structure (Vuong et al., 2022, Kostrikov et al., 2021). They are orthogonal to the energy-based or score-matching regularization used in Fisher-BRC, which controls the gradient of the value function to stabilize learning around the behavior-cloned policy (Kostrikov et al., 2021). The move toward generative (flow, adversarial, and latent-variable) architectures reflects a convergence of generative modeling and RL, leveraging advances in deep sequence modeling and probabilistic inference for more reliable offline policy optimization.