Soft-Partial Conservative Q-Learning
- SPaCQL is a novel offline multi-agent RL algorithm that uses Partial Action Replacement (PAR) to control distribution shift in the exponentially large joint action space.
- It integrates a conservative Q-learning objective with dynamic, uncertainty-aware weighting to balance policy deviations and stabilize value estimates.
- Empirical results on MPE and MaMujoco benchmarks show superior performance, especially in low-coordination settings, complementing the algorithm's tight theoretical value-error bounds.
Soft-Partial Conservative Q-Learning (SPaCQL) is a principled offline multi-agent reinforcement learning (MARL) algorithm designed to address challenges arising from distributional shift and out-of-distribution (OOD) joint actions in factorized multi-agent datasets. By leveraging the concept of Partial Action Replacement (PAR) and integrating conservative value estimation, SPaCQL yields theoretical guarantees and improved empirical performance in environments where agent policies are partly or fully independent during data collection (Jin et al., 10 Nov 2025).
1. Foundations: Distribution Shift and Partial Action Replacement in MARL
Offline MARL faces severe extrapolation error due to the exponential size of the joint action space: when evaluating a new policy, standard Bellman backups sample actions for all agents, producing full-joint target actions that are rarely covered by the offline data, especially under independent or weakly coordinated data collection. PAR provides a sharp mitigation: instead of replacing the entire joint action in the target, it updates only a subset of agents' actions (the deviators) with current-policy samples, while keeping the remaining agents' actions fixed to the behavioral data.
Formally, for an $n$-agent Dec-MDP and dataset $\D=\{(s,\a,r,s')\}$ collected under a factorized behavior policy $\mu(\a\mid s)=\prod_{i=1}^n\mu_i(a_i\mid s)$, the $k$-PAR backup operator $\T^{(k)}$ replaces $k$ agents' actions by policy-sampled values: $\T^{(k)}Q(s,\a) := \E_{(s',\a')\sim\D} \E_{\{\sigma_j\}\sim\Unif\left(\binom{[n]}{k}\right)} \E_{a'_{\sigma_j}\sim\pi_{\sigma_j}(\cdot|s')} \Bigl[ r + \gamma\,Q\left(s',\,\a'^{(k)}\right) \Bigr]$ where $\a'^{(k)}$ agrees with the dataset action $\a'$ except on the replaced agent indices $\{\sigma_j\}_{j=1}^k$.
A soft-partial Bellman operator $\T^{\rm SP}$ is defined as a convex mixture: $\T^{\rm SP} Q = \sum_{k=1}^n w_k\,\T^{(k)}Q, \quad w_k\geq 0,\,\sum_k w_k=1$ Convexity and the $\gamma$-contraction property of each $\T^{(k)}$ ensure that $\T^{\rm SP}$ is itself a $\gamma$-contraction.
The key implication is that, under a factorized data distribution, PAR controls OOD propagation: when only $k$ agents deviate, the distributional shift induced by the policy scales *linearly* with $k$ rather than exponentially with $n$ (Jin et al., 10 Nov 2025, Lemma 3.1).
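The replacement step at the core of PAR is straightforward to express in code. The following minimal NumPy sketch (array shapes, function names, and the toy usage are illustrative assumptions, not the authors' implementation) forms the partially replaced joint action $\a'^{(k)}$ that enters the $k$-PAR backup:

```python
import numpy as np

def k_par_action(a_data, a_policy, k, rng):
    """Form a k-PAR joint action: keep the dataset next-action a' for all
    agents except a uniformly chosen subset of k deviators, whose components
    are replaced by current-policy samples a'_i ~ pi_i(.|s').

    a_data:   (n, act_dim) joint action a' stored in the dataset
    a_policy: (n, act_dim) policy samples for every agent at s'
    """
    n = a_data.shape[0]
    deviators = rng.choice(n, size=k, replace=False)  # indices sigma_1..sigma_k
    a_k = a_data.copy()
    a_k[deviators] = a_policy[deviators]              # partial replacement
    return a_k

# Toy usage: 4 agents with 2-D continuous actions, k = 2 deviators.
rng = np.random.default_rng(0)
a_k = k_par_action(rng.normal(size=(4, 2)), rng.normal(size=(4, 2)), k=2, rng=rng)
```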
2. SPaCQL Objective: Bellman Backup and Conservative Regularization
SPaCQL extends the Conservative Q-Learning (CQL) framework (Kumar et al., 2020) to the multi-agent setting using the soft-partial operator in its temporal-difference loss. Let $\{Q_{\theta_j}\}_{j=1}^M$ denote an ensemble of $M$ Q-functions with corresponding target networks $\{\bar Q_{\bar\theta_j}\}_{j=1}^M$:
TD Loss with SPaCQL Target: $\L_{\rm TD}(\theta) = \E_{(s,\a,r,s')\sim\D}\left[\left(Q_\theta(s,\a)-Y_{\rm SP}(s,\a,r,s')\right)^2\right],$ where
$Y_{\rm SP} = r + \gamma\,\sum_{k=1}^n w_k\,\min_j \bar Q_{\bar\theta_j}(s',\a'^{(k)})$
A CQL-style conservative regularizer penalizes large Q-values on actions not present in the data: $\xi_c = \alpha\sum_{i=1}^n\bigg(\E_{s\sim\D,\,\a_{-i}\sim\D,\,a_i\sim\pi_i}\!\left[Q_\theta(s, a_i, a_{-i})\right] - \E_{(s,\a)\sim\D}\!\left[Q_\theta(s,\a)\right]\bigg)$ The full loss is: $\L(\theta) = \L_{\rm TD}(\theta) + \xi_c$
This approach preserves the conservative lower-bounding property of CQL while explicitly constraining OOD propagation through partial updates.
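For concreteness, the sketch below evaluates the full loss $\L(\theta)$ on a single transition, assuming callable critics, target critics, and per-agent policies; it is a minimal, non-differentiable illustration of the loss structure (names, shapes, and the toy usage are assumptions), not a faithful reimplementation.

```python
import numpy as np

def spacql_loss(Q, Q_bar, transition, policies, w, alpha, gamma, rng):
    """Illustrative (non-differentiable) SPaCQL objective on one transition.
    Q, Q_bar: lists of callables q(s, joint_a) -> float (ensemble and targets).
    policies: list of callables pi_i(s) -> a_i. w: soft-partial weights w_k."""
    s, a, r, s_next, a_next = transition
    n = len(policies)

    # Soft-partial target: convex mixture of k-PAR backups (fixed weights here).
    y = 0.0
    for k in range(1, n + 1):
        deviators = rng.choice(n, size=k, replace=False)
        a_k = a_next.copy()
        for i in deviators:
            a_k[i] = policies[i](s_next)
        y += w[k - 1] * min(q(s_next, a_k) for q in Q_bar)
    y_sp = r + gamma * y

    # TD term over the ensemble plus the CQL-style penalty xi_c.
    td = sum((q(s, a) - y_sp) ** 2 for q in Q) / len(Q)
    xi_c = 0.0
    for i in range(n):
        a_ood = a.copy()
        a_ood[i] = policies[i](s)          # a_i ~ pi_i, a_{-i} kept from the data
        xi_c += alpha * (np.mean([q(s, a_ood) for q in Q])
                         - np.mean([q(s, a) for q in Q]))
    return td + xi_c

# Toy usage: 2 agents, scalar actions, random linear critics and Gaussian policies.
rng = np.random.default_rng(0)
make_q = lambda: (lambda s, a, th=rng.normal(size=3): float(th @ np.concatenate(([s], a))))
Q, Q_bar = [make_q() for _ in range(2)], [make_q() for _ in range(2)]
policies = [lambda s: rng.normal() for _ in range(2)]
transition = (0.5, np.array([0.1, -0.2]), 1.0, 0.6, np.array([0.0, 0.3]))
loss = spacql_loss(Q, Q_bar, transition, policies, w=[0.5, 0.5], alpha=1.0, gamma=0.99, rng=rng)
```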
3. Theoretical Guarantees: Value-Error Bounds and Contraction Properties
SPaCQL provides a tight upper bound on the value-estimation error induced by deviation from the behavior policy under the factorization assumption. For a policy $\pi$ and behavior policy $\mu$, define the suboptimality error $\varepsilon_{\rm Subopt}$ and the fitted Q-iteration error $\varepsilon_{\rm FQI}$.
The **Linear-Divergence Bound**: $W_1\bigl(d^{(S)}, d^\mu\bigr) \leq \frac{\gamma}{1-\gamma}\sum_{i\in S}\TV\bigl(\pi_i,\mu_i\bigr)$ where $S$ is the set of deviating agents.
The **Tight Value-Error Bound**: $|V^\pi-\widehat V^\pi| \leq \varepsilon_{\rm Subopt} + \varepsilon_{\rm FQI} + \frac{4\gamma}{(1-\gamma)^2} \sum_{i=1}^n \TV(\pi_i, \mu_i)$ SPaCQL generalizes this bound via the state-dependent effective deviation
$k_{\rm eff}(s) = \sum_{k=1}^n w_k(s)k, \quad \overline{\TV} = \frac{1}{n}\sum_i\TV(\pi_i, \mu_i)$
yielding
$|V^\pi-\widehat V^\pi| \leq \varepsilon_{\rm Subopt} + \varepsilon_{\rm FQI} + \frac{4\gamma}{(1-\gamma)^2} \E_{s'\sim d^\pi}\left[k_{\rm eff}(s')\right]\, n\,\overline{\TV}$
Thus, the error grows *linearly* with the number of deviating agents, a provable improvement over the exponential scaling incurred by full-joint deviation.
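As an illustrative instantiation of this bound (numbers chosen for arithmetic only, not taken from the paper): with $\gamma = 0.9$, $n = 4$ agents, and $\overline{\TV} = 0.05$, the coefficient is $\frac{4\gamma}{(1-\gamma)^2} = 360$, so the deviation term equals $360 \cdot \E_{s'}[k_{\rm eff}(s')] \cdot 4 \cdot 0.05 = 72\,\E_{s'}[k_{\rm eff}(s')]$; it is about $72$ when updates are mostly single-agent ($k_{\rm eff}\approx 1$) and grows linearly to $288$ at full-joint replacement ($k_{\rm eff}=n$).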
4. Dynamic Weighting: Uncertainty-Aware Partial Updates
SPaCQL adaptively selects the weights $w_k$ for mixing the $k$-PAR operators using ensemble-based uncertainty quantification. For each $k$, the ensemble standard deviation of the Q-values is computed: $u_k(s') = \sqrt{\Var_{j=1..M}\left[Q_{\theta_j}(s', \a'^{(k)})\right]}$ A softmax-inverse mapping assigns lower weight to more uncertain (i.e., likely OOD) partial updates, effectively setting $w_k(s')\propto 1/u_k(s')$. The soft-partial target thus integrates the ensemble-minimum Q-value with uncertainty scaling: $y_k = \frac{1}{u_k} \min_j \bar Q_{\bar\theta_j}(s', \a'^{(k)}), \qquad Y_{\rm SP} = r + \gamma \frac{\sum_k y_k}{\sum_k 1/u_k}$ This mechanism enables SPaCQL to interpolate between single-agent and joint updates as a function of data coverage and Q-value uncertainty.
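In code, the weighting amounts to an inverse-uncertainty average of the per-$k$ ensemble-minimum targets, i.e., $w_k \propto 1/u_k$. A minimal NumPy sketch (array shapes, names, and the small `eps` safeguard are assumptions):

```python
import numpy as np

def soft_partial_target(r, q_online, q_target, gamma, eps=1e-6):
    """Uncertainty-weighted soft-partial target Y_SP (illustrative sketch).

    q_online: (n_k, M) online-ensemble values Q_{theta_j}(s', a'^{(k)}), used
              only to measure disagreement u_k for each candidate k-PAR action.
    q_target: (n_k, M) target-network values used for the min_j backup.
    """
    u = q_online.std(axis=1) + eps        # u_k: ensemble std per k-PAR action
    q_min = q_target.min(axis=1)          # min_j Qbar(s', a'^{(k)})
    w = (1.0 / u) / np.sum(1.0 / u)       # w_k proportional to 1/u_k
    return r + gamma * np.sum(w * q_min)  # = r + gamma * (sum_k y_k) / (sum_k 1/u_k)

# Toy usage: n = 3 candidate k-PAR actions scored by an ensemble of M = 4 critics.
rng = np.random.default_rng(1)
y_sp = soft_partial_target(r=1.0, q_online=rng.normal(size=(3, 4)),
                           q_target=rng.normal(size=(3, 4)), gamma=0.99)
```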
5. Algorithmic Procedure
An outline of SPaCQL's training loop is as follows:
```
Algorithm SPaCQL
Input: offline buffer D, discount γ, reg α, ensemble size M
Initialize {Q_{θ_j}} and targets {Q̄_{θ̄_j}}, policies {π_i}
for each training iteration do
    Sample batch B = {(s, a, r, s', a')} from D
    Initialize loss L ← 0
    for each (s, a, r, s', a') in B do
        for k = 1..n do
            Sample k agent indices Σ = (σ₁..σ_k)
            Sample new actions {a^π_σ ∼ π_σ(·|s')}
            Form a'^{(k)} by replacing those components in a'
            Compute u_k ← stddev_j[ Q_{θ_j}(s', a'^{(k)}) ]
            Compute ỹ_k ← (1/u_k) · min_j Q̄_{θ̄_j}(s', a'^{(k)})
        end for
        Y_SP ← r + γ · ( ∑_k ỹ_k ) / ( ∑_k 1/u_k )
        for j = 1..M do
            L += ( Q_{θ_j}(s,a) − Y_SP )²
        end for
        L += CQL regularizer ξ_c(s,a)
    end for
    θ ← θ − η ∇_θ L
    Polyak-update targets θ̄ ← τ θ + (1−τ) θ̄
    Update policies π_i to maximize Q_{θ₁}
end for
Output: joint policy π = (π₁, …, π_n)
```
Critical hyperparameters include the regularization strength $\alpha$, the ensemble size $M$, the batch size, learning rates, and the number of Q-updates per step; their roles are analogous to those in CQL and ensemble deep RL approaches.
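For orientation only, a configuration for an implementation of this kind might resemble the sketch below; every value is a placeholder in the range commonly used for CQL-style ensemble critics, not a setting reported for SPaCQL.

```python
# Placeholder hyperparameters (typical CQL/ensemble ranges, NOT the paper's values).
spacql_config = dict(
    cql_alpha=1.0,         # conservative-regularizer strength alpha
    ensemble_size=4,       # number of Q-functions M
    batch_size=256,
    critic_lr=3e-4,
    actor_lr=3e-4,
    gamma=0.99,
    polyak_tau=0.005,      # target-network update rate tau
    q_updates_per_step=1,  # number of Q-updates per training step
)
```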
6. Empirical Evaluation and Results
SPaCQL was evaluated on Multi-Agent Particle Environment (MPE) benchmarks—Cooperative Navigation, Predator–Prey, World—and Multi-Agent MuJoCo (MaMujoco) Half-Cheetah, using datasets (Expert, Medium, Medium-Replay, Random) chosen to probe varying degrees of agent coordination. Baselines covered state-of-the-art offline MARL algorithms, including OMAR, MACQL (CFCQL), IQL, MA-TD3+BC, and DoF.
Representative results are summarized in the following table (mean ± std, normalized score; CN = Cooperative Navigation, Half-C = MaMujoco Half-Cheetah, Rand = Random dataset):
| Dataset | CFCQL | ICQL-QS | SPaCQL (best) |
|---|---|---|---|
| CN–Rand | 62.2±8.1 | 77.7±? | 78.2±14.0 |
| World–Rand | 68.0±20.8 | 89.9±? | 94.3±7.4 |
| Half-C–Rand | 39.7±4.0 | ? | 43.8±4.9 |
| ... | ... | ... | ... |
- SPaCQL exhibited superior performance on Random and Medium-Replay datasets, which have low agent coordination and thus more severe OOD shifts. The advantage is attributed to the use of PAR and uncertainty-based weights, which prevent excessive extrapolation and stabilize learning.
- On high-coordination Expert datasets, SPaCQL matched the best algorithms (DoF, CFCQL), indicating its adaptivity across coordination regimes.
- Ablation studies indicate that SPaCQL automatically adjusts the soft-partial weights $w_k$, emphasizing small-$k$ PAR (near-behavioral) on uncoordinated data and larger-$k$ PAR (greater policy deviation) on coordinated data. This dynamic trade-off directly reflects the effective deviation $k_{\rm eff}$ in the theoretical error bound.
- A plausible implication is that the interpolation between single and joint updates provides both stability and flexibility absent in prior methods.
SPaCQL thereby achieves a provably tighter value–error bound and demonstrates empirical gains by aligning update structure to the independence pattern in offline datasets (Jin et al., 10 Nov 2025).