
Counterfactual Conservative Q-Learning (CFCQL)

Updated 8 February 2026
  • The paper introduces counterfactual conservative regularization to mitigate exponential over-pessimism in offline MARL by penalizing per-agent OOD actions.
  • It integrates counterfactual Q-penalties into a TD learning framework, ensuring conservative value estimates and stable performance across varying agent counts.
  • Empirical evaluations show CFCQL outperforms baselines in diverse discrete and continuous multi-agent environments by effectively managing distribution shift.

Counterfactual Conservative Q-Learning (CFCQL) is an algorithm for offline multi-agent reinforcement learning (MARL) designed to address the severe extrapolation error and pessimism issues that arise due to distribution shift and the high-dimensionality of the joint action space. It provides a tractable and theoretically sound mechanism for conservative value estimation in multi-agent environments, building on the principles of single-agent Conservative Q-Learning (CQL) but adapting them with counterfactual, per-agent penalties to avoid the exponential over-pessimism characteristic of naïve extensions.

1. Problem Setting and Motivation

CFCQL operates within the decentralized partially observable Markov decision process (Dec-POMDP) formalism, characterized by a tuple $G = (S, A^n, P, r, O, Z, \gamma)$ with $n$ agents. At each timestep, each agent $i$ selects an action $a^i$ based on its observation $o^i$, forming a joint action $a = (a^1, \ldots, a^n)$ that transitions the system according to $P$ and yields a shared reward $r(s,a)$. The algorithm is tailored to the offline RL regime: learning occurs entirely from a static dataset $D = \{(s, a, r, s')\}$ generated by an unknown behavior policy $\beta$, with no further environment interaction.

Key challenges in this setting stem from distribution shift: policies $\pi$ encountered during evaluation can easily visit state-action pairs outside the support of $D$, resulting in extrapolation-driven Q-value overestimation. This issue is magnified in the multi-agent case, where the joint action space $A^n$ grows exponentially with $n$, so that nearly all joint actions are out-of-distribution (OOD) and prone to overestimation. Furthermore, value functions must respect the centralized training, decentralized execution (CTDE) paradigm, complicating the structure of conservative regularization.
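To make the scaling argument concrete, the toy sketch below (an illustration, not from the paper) shows how the fraction of the joint action space covered by a fixed-size dataset collapses as the number of agents grows:

```python
import random

# Toy illustration (not from the paper): with |A| actions per agent, the
# joint action space has |A|**n elements, so a fixed-size dataset covers
# an exponentially shrinking fraction of joint actions as n grows.
def joint_coverage(num_actions, n_agents, dataset_size, seed=0):
    rng = random.Random(seed)
    joint_space = num_actions ** n_agents
    # Distinct joint actions seen in a uniformly sampled behavior dataset.
    seen = {tuple(rng.randrange(num_actions) for _ in range(n_agents))
            for _ in range(dataset_size)}
    return len(seen) / joint_space

for n in (1, 2, 4, 8):
    print(n, joint_coverage(num_actions=5, n_agents=n, dataset_size=1000))
```

With 5 actions per agent and 1000 transitions, coverage falls from complete at $n=1$ to well under 1% at $n=8$: almost every joint action the learned policy might propose is unseen.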

2. Counterfactual Conservative Regularization

CFCQL generalizes the single-agent CQL paradigm, which adds a regularization term to the Q-function loss that penalizes Q-values on OOD actions and pushes up Q-values on in-dataset actions, yielding an underestimation property. In the straightforward extension to multi-agent settings (denoted “MACQL”), one would penalize the Q-function on all OOD joint actions. However, the resulting joint regularization term scales exponentially with the number of agents,

$$D_{\mathrm{CQL}}(\pi, \beta)(s) = \mathbb{E}_{a\sim\pi}\left[\frac{\pi(a|s)}{\beta(a|s)} - 1\right],$$

resulting in excessive pessimism and degraded performance as $n$ increases.

CFCQL introduces a counterfactual mechanism: for a fixed agent $i$, the actions of all other agents are held at their dataset distribution $\beta^{-i}(a^{-i}|s)$, and only agent $i$'s action is allowed to be OOD. For agent $i$, the regularization divergence is

$$D^{\mathrm{cf}}_{\mathrm{CQL},i}(\pi, \beta)(s) = \mathbb{E}_{a^{-i}\sim \beta^{-i}}\left[\mathbb{E}_{a^i\sim\pi^i}\left[\frac{\pi^i(a^i|s)}{\beta^i(a^i|s)} - 1\right]\right].$$

The overall penalty is a weighted sum $\sum_{i=1}^n \lambda_i D^{\mathrm{cf}}_{\mathrm{CQL},i}(\pi, \beta)(s)$, with nonnegative $\lambda_i$ summing to one. Practically, this decomposition yields a regularizer whose magnitude is independent of $n$ and avoids the over-conservatism of MACQL.
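The gap between the joint and per-agent penalties can be checked numerically. The sketch below (an illustrative calculation, not the paper's code) uses factorized policies, for which the joint divergence factorizes as $\prod_i (1+\chi^2_i) - 1$ while the equally weighted counterfactual penalty is just the average of the per-agent $\chi^2_i$:

```python
import numpy as np

# Illustrative calculation (not the paper's code): for factorized policies,
# the joint MACQL divergence E_{a~pi}[pi/beta - 1] equals prod_i(1+chi2_i)-1,
# while the counterfactual penalty with equal weights lambda_i = 1/n is the
# average of the chi2_i, so it does not grow with n.
def chi2(pi_i, beta_i):
    # Per-agent divergence E_{a^i~pi^i}[pi^i/beta^i] - 1.
    return float(np.sum(pi_i ** 2 / beta_i) - 1.0)

pi_i = np.array([0.7, 0.2, 0.1])      # every agent's current policy
beta_i = np.ones(3) / 3               # uniform behavior policy
c = chi2(pi_i, beta_i)

for n in (1, 2, 5, 10):
    joint_penalty = (1.0 + c) ** n - 1.0   # MACQL-style joint divergence
    cf_penalty = c                         # sum_i (1/n) * chi2_i = chi2
    print(n, round(joint_penalty, 2), round(cf_penalty, 2))
```

Even this mild per-agent divergence ($\chi^2 = 0.62$) yields a joint penalty over 100 at $n=10$, while the counterfactual penalty stays fixed.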

For entropy-regularized Q-functions, the per-agent penalty is operationalized as

$$R^{\mathrm{cf},i}(Q; s) = \mathbb{E}_{a^{-i}\sim\beta^{-i}}\left[\log \sum_{a^i} \exp Q(s, a^i, a^{-i})\right] - \mathbb{E}_{a\sim\beta}\left[Q(s, a)\right].$$
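A Monte-Carlo estimator of this penalty for discrete actions might look as follows; the critic interface `q(s, joint_action)` and all argument names are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

# Sketch of a Monte-Carlo estimate of the per-agent penalty R^{cf,i} for
# discrete actions. The critic interface q(s, joint_action) -> float and
# all names here are assumptions for illustration, not the authors' code.
def counterfactual_penalty(q, s, agent_i, beta, dataset_actions,
                           n_agents, n_actions, K=10, rng=None):
    rng = rng or np.random.default_rng(0)
    # First term: E_{a^{-i} ~ beta^{-i}}[ logsumexp over a^i of Q(s, a^i, a^{-i}) ].
    lse_term = 0.0
    for _ in range(K):
        # Sample the other agents' actions from their behavior policies.
        a = [int(rng.choice(n_actions, p=beta[j])) for j in range(n_agents)]
        q_vals = np.array([q(s, a[:agent_i] + [ai] + a[agent_i + 1:])
                           for ai in range(n_actions)])
        m = q_vals.max()
        lse_term += m + np.log(np.sum(np.exp(q_vals - m)))  # stable log-sum-exp
    lse_term /= K
    # Second term: E_{a ~ beta}[Q(s, a)], estimated from dataset joint actions.
    data_term = np.mean([q(s, list(joint)) for joint in dataset_actions])
    return lse_term - data_term

# Sanity check: with a constant critic the penalty reduces to log(n_actions).
const_q = lambda s, a: 0.0
uniform = [np.ones(3) / 3, np.ones(3) / 3]
pen = counterfactual_penalty(const_q, s=0, agent_i=0, beta=uniform,
                             dataset_actions=[(0, 1), (2, 0)],
                             n_agents=2, n_actions=3, K=5)
```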

3. Algorithmic Objective and Implementation Details

The CFCQL objective integrates the counterfactual regularization into a standard temporal-difference (TD) learning framework:

$$\text{Loss}(\theta) = \mathbb{E}_{(s,a,r,s')\sim D}\left[\left(Q^\theta(s,a) - y(s,a,r,s')\right)^2\right] + \alpha\, \mathbb{E}_{s\sim D}\left[\sum_{i=1}^n \lambda_i\, R^{\mathrm{cf},i}(Q^\theta; s)\right],$$

where $y(s, a, r, s') = r + \gamma Q^{\theta^-}(s', a')$ is the target value and $\theta^-$ denotes delayed (target-network) parameters for stabilization (e.g., via double Q-learning or MADDPG-style critics).

The weighting coefficients $\lambda_i$ are determined adaptively as a softmax over each agent's KL divergence between its current policy $\pi^i$ and the behavior policy $\beta^i$:

$$\lambda_i(s) \propto \exp\left(-\tau\, D_{\mathrm{KL}}\left(\pi^i(\cdot|s)\,\|\,\beta^i(\cdot|s)\right)\right),$$

with $\tau$ a temperature parameter.
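A minimal sketch of this weighting, assuming per-agent policies are available as explicit probability tables (the shapes and names are assumptions):

```python
import numpy as np

# Sketch of the adaptive weights lambda_i(s): a softmax over the negative,
# temperature-scaled KL divergence between each agent's current policy and
# its estimated behavior policy. Shapes and names here are assumptions.
def adaptive_lambdas(pi, beta, tau=1.0, eps=1e-8):
    # pi, beta: (n_agents, n_actions) arrays; each row is a distribution.
    kl = np.sum(pi * (np.log(pi + eps) - np.log(beta + eps)), axis=1)
    logits = -tau * kl
    w = np.exp(logits - logits.max())   # numerically stable softmax
    return w / w.sum()                  # nonnegative, sums to one

pi = np.array([[0.9, 0.1],    # agent 0 deviates from its behavior policy
               [0.5, 0.5]])   # agent 1 matches it exactly
beta = np.full((2, 2), 0.5)
lam = adaptive_lambdas(pi, beta, tau=2.0)
```

The more divergent agent receives the smaller weight, so its OOD actions are penalized relatively less than those of agents that stay near the data.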

The procedure encompasses sampling minibatches, computing TD targets, evaluating per-agent penalties using $K$ samples from $a^{-i}\sim\beta^{-i}$, updating critic parameters, synchronizing target networks, and, for continuous action settings, applying a counterfactual policy improvement (PI) step per agent.

4. Theoretical Guarantees

CFCQL provides several theoretical assurances:

  • Underestimation Property (Thm 4.1): The value function produced by CFCQL is provably conservative, i.e.

$$\hat{V}^\pi(s) \le V^\pi(s) - \alpha\left[(I-\gamma P^\pi)^{-1} D^{\mathrm{cf}}_{\mathrm{CQL}}(\pi, \beta)\right](s)$$

up to small sampling and estimation errors. Thus, increasing $\alpha$ increases the degree of pessimism, ensuring value underestimation for sufficiently large $\alpha$.

  • Conservativeness Comparison (Thm 4.2): The counterfactual penalty

$$0 \le D^{\mathrm{cf}}_{\mathrm{CQL}}(\pi, \beta)(s) \le D_{\mathrm{CQL}}(\pi, \beta)(s),$$

and the ratio between the full joint and counterfactual penalties grows exponentially in $n$ under policy divergence, demonstrating that CFCQL avoids MACQL's over-pessimism.

  • Tight Safe Policy Improvement (Thms 4.3 & 4.4): For the solution $\pi_{\mathrm{CF}}^*$ of the penalized empirical objective, the return on the true MDP satisfies

$$J(\pi_{\mathrm{CF}}^*, M) \ge J(\beta, M) - \zeta^{\mathrm{CF}},$$

where $\zeta^{\mathrm{CF}} = O\!\left(\frac{\alpha}{1-\gamma}\left(\frac{1}{\epsilon} - 1\right) + \text{sampling error}\right)$ and $\epsilon$ is a lower bound on the behavior policy's support probability. In contrast, MACQL's bound scales like $1/\epsilon^n - 1$, rapidly worsening for large $n$.
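A quick numeric comparison of the two support-dependent factors (an illustration of the stated scaling, not a result from the paper):

```python
# Illustrative comparison of the support-dependent factors in the
# safe-improvement bounds: CFCQL's (1/eps - 1) is constant in the number
# of agents n, while the MACQL-style (1/eps**n - 1) explodes with n.
eps = 0.2  # assumed lower bound on per-agent behavior-policy support
for n in (1, 2, 5, 10):
    cfcql_factor = 1 / eps - 1
    macql_factor = 1 / eps ** n - 1
    print(n, round(cfcql_factor, 2), round(macql_factor, 2))
```

At $\epsilon = 0.2$ the CFCQL factor stays at 4 for any $n$, while the MACQL-style factor already exceeds $10^6$ by $n = 10$.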

5. Pseudocode and Training Pipeline

Below is the summarized high-level workflow for CFCQL (discrete and continuous action spaces):

  1. Initialization: Set up the central Q-network and target network. Estimate each per-agent $\beta^i$ from $D$ using behavior cloning or a VAE.
  2. Minibatch Sampling: Draw transitions from $D$.
  3. TD Target Computation: For discrete actions, sample $a'\sim\pi(\cdot|s')$; for continuous actions, use the current policies to select $a'$.
  4. TD Loss: Compute the squared Bellman error.
  5. Counterfactual Penalty: For each $(s,a)$ and agent $i$, sample $K$ actions $a^{-i}$ from $\beta^{-i}$ and compute the log-sum-exp penalty.
  6. Critic Update: Apply stochastic gradient descent on the combined TD and penalty loss.
  7. Target Update: Periodically or softly update the target parameters.
  8. Policy Improvement (continuous actions): Apply counterfactual policy gradients for each agent.
  9. Adaptive Penalty Weights: Optionally update $\lambda_i(s)$ according to each agent's divergence.
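The steps above can be sketched as a toy tabular update for two agents with discrete actions (an illustrative instantiation, not the paper's implementation; the 1-sample penalty estimate and equal $\lambda_i$ are simplifying assumptions):

```python
import numpy as np

# Toy tabular instantiation of one CFCQL critic pass for two agents with
# discrete actions (an illustrative sketch, not the paper's implementation).
rng = np.random.default_rng(0)
n_states, n_agents, n_actions = 4, 2, 3
gamma, alpha, lr = 0.99, 1.0, 0.1
Q = np.zeros((n_states,) + (n_actions,) * n_agents)     # central critic
Q_target = Q.copy()                                     # delayed parameters

# Step 2: minibatch of offline transitions (s, a, r, s').
batch = [(int(rng.integers(n_states)),
          tuple(int(x) for x in rng.integers(n_actions, size=n_agents)),
          float(rng.normal()),
          int(rng.integers(n_states))) for _ in range(32)]

lam = 1.0 / n_agents                                    # equal weights here
for s, a, r, s2 in batch:
    # Steps 3-4: TD target on the target net and squared-error gradient step.
    y = r + gamma * Q_target[s2].max()
    Q[(s,) + a] -= lr * (Q[(s,) + a] - y)
    # Steps 5-6: counterfactual penalty gradient per agent. The other agents
    # are fixed at the dataset action, a 1-sample stand-in for a^{-i}~beta^{-i}.
    for i in range(n_agents):
        q_i = np.array([Q[(s,) + a[:i] + (ai,) + a[i + 1:]]
                        for ai in range(n_actions)])
        soft = np.exp(q_i - q_i.max())
        soft /= soft.sum()                              # d(logsumexp)/dQ
        for ai in range(n_actions):
            Q[(s,) + a[:i] + (ai,) + a[i + 1:]] -= lr * alpha * lam * soft[ai]
        Q[(s,) + a] += lr * alpha * lam                 # push up the data action
# Step 7: a periodic or soft target sync would follow, e.g. Q_target = Q.copy().
```

The penalty gradient pushes down each agent's counterfactual actions in proportion to their softmax weight and pushes the in-dataset joint action back up, mirroring the log-sum-exp penalty's derivative.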

6. Empirical Evaluation and Comparative Analysis

CFCQL was evaluated in diverse settings: Equal_Line (discrete), Multi-Agent Particle Environment (continuous), Multi-Agent MuJoCo (continuous), and StarCraft II micromanagement (discrete, CTDE/QMIX backbone). Datasets included Random, Medium, Expert, Medium-Replay, and Mixed. Baselines included independent offline RL methods (IQL, TD3+BC, AWAC) and CTDE offline MARL approaches (MACQL, MAICQ, OMAR, MADTKD, BC).

Key findings:

  • In Equal_Line, CFCQL’s policy return remains close to the behavior baseline as $n$ increases, while MACQL collapses.
  • On Multi-Agent Particle Environment and Multi-Agent MuJoCo, CFCQL outperforms baselines in 11/12 and 3/4 dataset splits, with especially strong results in low-quality (‘random’) data regimes.
  • On StarCraft II, CFCQL achieves highest win-rates in 14/16 combinations.

Ablations demonstrated that moderate values of $\tau$ in the $\lambda$ weighting are consistently optimal, that increasing $\alpha$ improves performance on narrow (Expert) datasets, and that the counterfactual policy improvement step is critical in continuous action settings. Performance scales smoothly and robustly as $|D|$ grows, supporting large-scale deployment.

7. Significance, Limitations, and Extensions

CFCQL delivers an approach to offline MARL that fuses the CTDE paradigm with principled, per-agent counterfactual regularization, yielding value estimates that are consistently pessimistic, avoid exponential scaling with agent count, and admit safe policy improvement bounds that remain tight for large $n$. This combination addresses the exacerbated distribution-shift and overestimation problems of offline MARL, as empirically validated on a suite of discrete and continuous multi-agent benchmarks (Shao et al., 2023).

Potential limitations include reliance on accurate per-agent behavior policy estimation and the computational overhead of repeated sampling for log-sum-exp evaluation. A plausible implication is that future work could explore scaling these principles to broader forms of agent interaction or partially observable dynamics, as well as more advanced behavior modeling strategies.
