Counterfactual Conservative Q-Learning (CFCQL)

Updated 8 February 2026

The paper introduces counterfactual conservative regularization to mitigate exponential over-pessimism in offline MARL by penalizing per-agent OOD actions.
It integrates counterfactual Q-penalties into a TD learning framework, ensuring conservative value estimates and stable performance across varying agent counts.
Empirical evaluations show CFCQL outperforms baselines in diverse discrete and continuous multi-agent environments by effectively managing distribution shift.

Counterfactual Conservative Q-Learning (CFCQL) is an algorithm for offline multi-agent reinforcement learning (MARL) designed to address the severe extrapolation error and pessimism issues that arise due to distribution shift and the high-dimensionality of the joint action space. It provides a tractable and theoretically sound mechanism for conservative value estimation in multi-agent environments, building on the principles of single-agent Conservative Q-Learning (CQL) but adapting them with counterfactual, per-agent penalties to avoid the exponential over-pessimism characteristic of naïve extensions.

1. Problem Setting and Motivation

CFCQL operates within the decentralized partially observable Markov decision process (Dec-POMDP) formalism, characterized by a tuple $G = (S, A^n, P, r, O, Z, \gamma)$ with $n$ agents. At each timestep, each agent $i$ selects action $a^i$ based on its observation $o^i$ , forming a joint action $a = (a^1, ..., a^n)$ which transitions the system according to $P$ and yields a shared reward $r(s,a)$ . The algorithm is tailored for the offline RL regime: learning occurs entirely from a static dataset $D = \{(s,a,r,s')\}$ generated by an unknown behavior policy $\beta$ , with no further environment interactions.

Key challenges in this setting stem from distribution shift—policies $\pi$ encountered during evaluation can easily visit state-action pairs outside the support of $D$ , resulting in extrapolation-driven Q-value overestimation. This issue is magnified in the multi-agent case, where the joint action space $A^n$ grows exponentially with $n$ , meaning that nearly all joint actions are out-of-distribution (OOD) and prone to overestimation. Furthermore, value functions must respect the centralized training, decentralized execution (CTDE) paradigm, complicating the structure of conservative regularization.

2. Counterfactual Conservative Regularization

CFCQL generalizes the single-agent CQL paradigm, which adds a regularization term to the Q-function loss to penalize Q-values on OOD actions and reward on-dataset actions, resulting in an underestimation property. In the straightforward extension to multi-agent settings (denoted “MACQL”), one would penalize the Q-function on all OOD joint actions. However, this joint regularization term scales exponentially with the number of agents,

$D_{\mathrm{CQL}}(\pi, \beta)(s) = \mathbb{E}_{a\sim\pi}[\,\pi(a|s)/\beta(a|s) - 1\,],$

resulting in excessive pessimism and degraded performance as $n$ increases.

CFCQL introduces a counterfactual mechanism: for a fixed agent $i$ , the actions of all other agents are held at their dataset distribution $\beta^{-i}(a^{-i}|s)$ , and only agent $i$ ’s action is considered OOD. For agent $i$ , the regularization divergence is

$D^{\mathrm{cf}}_{\mathrm{CQL},i}(\pi, \beta)(s) = \mathbb{E}_{a^{-i}\sim \beta^{-i}}\left[\,\mathbb{E}_{a^i\sim\pi^i} \left[\,\pi^i(a^i|s)/\beta^i(a^i|s) -1\,\right]\,\right].$

The overall penalty is a weighted sum $\sum_{i=1}^n \lambda_i D^{\mathrm{cf}}_{\mathrm{CQL},i}(\pi, \beta)(s)$ , with nonnegative $\lambda_i$ summing to one. Practically, this decomposition yields regularization whose magnitude is independent of $n$ and avoids the over-conservatism of MACQL.

For entropy-regularized Q-functions, the per-agent penalty is operationalized as

$R^{\mathrm{cf},i}(Q; s) = \mathbb{E}_{a^{-i}\sim\beta^{-i}} \left[\log \sum_{a^i} \exp Q(s, a^i, a^{-i})\right] - \mathbb{E}_{a\sim\beta}[Q(s, a)].$

3. Algorithmic Objective and Implementation Details

The CFCQL objective integrates the counterfactual regularization into a standard temporal-difference (TD) learning framework:

$\text{Loss}(\theta) = \mathbb{E}_{(s,a,r,s')\sim D}\left[(Q^\theta(s,a) - y(s,a,r,s'))^2\right] + \alpha\ \mathbb{E}_{s\sim D}\left[\sum_{i=1}^n \lambda_i\, R^{\mathrm{cf},i}(Q^\theta; s)\right],$

where $y(s, a, r, s') = r + \gamma Q^{\theta^-}(s', a')$ is the target value, and $\theta^-$ denotes delayed parameters for stabilization (e.g., double Q-learning or MADDPG approaches).

The weighting coefficients $\lambda_i$ are determined adaptively as a softmax over each agent’s KL divergence between the current policy $\pi^i$ and the behavior policy $\beta^i$ :

$\lambda_i(s) \propto \exp\left(-\tau\, D_{\mathrm{KL}}(\pi^i(\cdot|s)\|\beta^i(\cdot|s))\right)$

with $\tau$ as a temperature parameter.

The procedure encompasses sampling minibatches, computing TD targets, evaluating per-agent penalties using $K$ samples from $a^{-i}\sim\beta^{-i}$ , updating critic parameters, synchronizing target networks, and—for continuous action settings—applying counterfactual policy improvement (PI) steps per agent.

4. Theoretical Guarantees

CFCQL provides several theoretical assurances:

Underestimation Property (Thm 4.1): The value function produced by CFCQL is provably conservative, i.e.

$\hat{V}^\pi(s) \le V^\pi(s) - \alpha\,\left[(I-\gamma P^\pi)^{-1} D^{\mathrm{cf}}_{\mathrm{CQL}}(\pi, \beta)\right](s)$

up to small sampling and estimation errors. Thus, increasing $\alpha$ increases the degree of pessimism, ensuring value underestimation for sufficiently large $\alpha$ .

Conservativeness Comparison (Thm 4.2): The counterfactual penalty

$0 \le D^{\mathrm{cf}}_{\mathrm{CQL}}(\pi, \beta)(s) \le D_{\mathrm{CQL}}(\pi, \beta)(s)$

and the ratio between full joint and counterfactual penalties grows exponentially in $n$ under policy divergence, demonstrating that CFCQL avoids MACQL’s over-pessimism.

Tight Safe Policy Improvement (Thms 4.3 & 4.4): For the solution $\pi_{\mathrm{CF}}^*$ to the penalized empirical objective, the return on the true MDP satisfies

$J(\pi_{\mathrm{CF}}^*, M) \ge J(\beta, M) - \zeta^{\mathrm{CF}},$

where $\zeta^{\mathrm{CF}} = O(\alpha/(1-\gamma)(1/\epsilon - 1) + \text{sampling error})$ and $\epsilon$ is a lower bound on behavior policy support. In contrast, MACQL’s bound scales like $1/\epsilon^n - 1$ , rapidly worsening for large $n$ .

5. Pseudocode and Training Pipeline

Below is the summarized high-level workflow for CFCQL (discrete and continuous action spaces):

Initialization: Set up central Q-network and target network. Estimate per-agent $\beta^i$ from $D$ using behavior cloning or VAE.
Minibatch Sampling: Draw transitions from $D$ .
TD Target Computation: For discrete, sample $a'\sim\pi(\cdot|s')$ ; for continuous, use current policies to select $a'$ .
TD Loss: Compute squared Bellman error.
Counterfactual Penalty: For each $(s,a)$ and agent $i$ , sample $K$ actions $a^{-i}$ from $\beta^{-i}$ and compute the log-sum-exp penalty.
Critic Update: Apply stochastic gradient descent on the combined TD and penalty loss.
Target Update: Periodic or soft update of target parameters.
Policy Improvement (continuous actions): Counterfactual policy gradients for each agent.
Adaptive Penalty Weights: Optionally update $\lambda_i(s)$ according to agent divergence.

6. Empirical Evaluation and Comparative Analysis

CFCQL was evaluated in diverse settings: Equal_Line (discrete), Multi-Agent Particle Environment (continuous), Multi-Agent MuJoCo (continuous), and StarCraft II micromanagement (discrete, CTDE/QMIX backbone). Datasets included Random, Medium, Expert, Medium-Replay, and Mixed. Competitor baselines encompassed independent offline RL methods (IQL, TD3+BC, AWAC) and CTDE offline MARL approaches (MACQL, MAICQ, OMAR, MADTKD, BC).

Key findings:

In Equal_Line, CFCQL’s policy return remains close to the behavior baseline as $n$ increases, while MACQL collapses.
On Multi-Agent Particle Environment and Multi-Agent MuJoCo, CFCQL outperforms baselines in 11/12 and 3/4 dataset splits, with especially strong results in low-quality (‘random’) data regimes.
On StarCraft II, CFCQL achieves highest win-rates in 14/16 combinations.

Ablations demonstrated that moderate values of $\tau$ in the $\lambda$ weighting are consistently optimal, increasing $\alpha$ improves performance in narrow (Expert) datasets, and the counterfactual policy improvement step is critical in continuous action cases. Performance scales smoothly and robustly as $|D|$ grows, supporting large-scale deployment.

7. Significance, Limitations, and Extensions

CFCQL delivers an approach to offline MARL that fuses the CTDE paradigm with principled, per-agent counterfactual regularization, yielding value estimates that are consistently pessimistic, avoid exponential scaling with agent count, and enable safe policy improvement bounds that remain tight for large $n$ . This distinctive blend provides a robust answer to the exacerbated distribution-shift and overestimation problems of offline MARL, as empirically validated on a suite of discrete and continuous multi-agent benchmarks (Shao et al., 2023).

Potential limitations include reliance on accurate per-agent behavior policy estimation and the computational overhead of repeated sampling for log-sum-exp evaluation. A plausible implication is that future work could explore scaling these principles to broader forms of agent interaction or partially observable dynamics, as well as more advanced behavior modeling strategies.

Markdown Report Issue Upgrade to Chat

References (1)

Counterfactual Conservative Q Learning for Offline Multi-agent Reinforcement Learning (2023)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to Counterfactual Conservative Q-Learning (CFCQL).