
Soft-Partial Conservative Q-Learning

Updated 17 November 2025
  • SPaCQL is a novel offline multi-agent RL algorithm that uses Partial Action Replacement to control exponential distribution shifts in joint action spaces.
  • It integrates a conservative Q-learning objective with dynamic, uncertainty-aware weighting to balance policy deviations and stabilize value estimates.
  • Empirical results on MPE and MaMujoco benchmarks demonstrate SPaCQL’s tight value-error bounds and superior performance, especially in low coordination settings.

Soft-Partial Conservative Q-Learning (SPaCQL) is a principled offline multi-agent reinforcement learning (MARL) algorithm designed to address challenges arising from distributional shift and out-of-distribution (OOD) joint actions in factorized multi-agent datasets. By leveraging the concept of Partial Action Replacement (PAR) and integrating conservative value estimation, SPaCQL yields theoretical guarantees and improved empirical performance in environments where agent policies are partly or fully independent during data collection (Jin et al., 10 Nov 2025).

1. Foundations: Distribution Shift and Partial Action Replacement in MARL

Offline MARL faces severe extrapolation error due to the exponential size of the joint action space: when evaluating a new policy, standard Bellman backups sample actions for all agents, which requires extrapolating to full joint actions that are rarely covered by the offline data, especially under independent or weakly coordinated data collection. PAR provides a sharp mitigation: instead of replacing the entire joint action in the target, it updates only a subset of agents' actions (the deviators) with current-policy samples, while keeping the remaining agents' actions fixed to their behavioral (dataset) values.

Formally, for an $n$-agent Dec-MDP and a dataset $\mathcal{D}=\{(s,\mathbf{a},r,s')\}$ collected under a factorized behavior policy $\mu(\mathbf{a}\mid s)=\prod_{i=1}^n\mu_i(a_i\mid s)$, the $k$-PAR backup operator $\mathcal{T}^{(k)}$ replaces $k$ agents' actions by policy-sampled values: $\mathcal{T}^{(k)}Q(s,\mathbf{a}) := \mathbb{E}_{(s',\mathbf{a}')\sim\mathcal{D}}\, \mathbb{E}_{\{\sigma_j\}\sim\mathrm{Unif}\bigl(\binom{[n]}{k}\bigr)}\, \mathbb{E}_{a'_{\sigma_j}\sim\pi_{\sigma_j}(\cdot\mid s')} \Bigl[ r + \gamma\,Q\bigl(s',\,\mathbf{a}'^{(k)}\bigr) \Bigr],$ where $\mathbf{a}'^{(k)}$ agrees with the dataset action $\mathbf{a}'$ except on the $k$ replaced agent indices $\{\sigma_j\}$.

A soft-partial Bellman operator $\mathcal{T}^{\rm SP}$ is defined as a convex mixture: $\mathcal{T}^{\rm SP} Q = \sum_{k=1}^n w_k\,\mathcal{T}^{(k)}Q, \quad w_k\geq 0,\ \sum_k w_k=1.$ Convexity and the $\gamma$-contraction property of each $\mathcal{T}^{(k)}$ ensure that $\mathcal{T}^{\rm SP}$ is also a $\gamma$-contraction.

The key implication is that, under a factorized data distribution, PAR controls OOD propagation: when only $k$ agents deviate, the distributional shift induced by the policy scales *linearly* with $k$ rather than exponentially with $n$ ((Jin et al., 10 Nov 2025), Lemma 3.1).
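To make the construction concrete, the following NumPy sketch forms a $k$-PAR joint action and a convex mixture of $k$-PAR targets. It is only an illustration of the operator definitions above: the `q_target` stub, the array shapes, and the uniform mixture weights are placeholder assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, act_dim = 4, 2

def q_target(s_next, joint_action):
    """Stand-in for a target Q-network; returns a scalar value."""
    return float(np.sum(s_next) - 0.1 * np.sum(joint_action ** 2))

def k_par_action(a_data, a_policy, k, rng):
    """Replace k of the n dataset actions with policy-sampled actions (PAR)."""
    idx = rng.choice(n_agents, size=k, replace=False)  # deviating agents sigma_1..sigma_k
    a_mixed = a_data.copy()
    a_mixed[idx] = a_policy[idx]                        # other agents stay at dataset values
    return a_mixed

def soft_partial_target(r, s_next, a_data, a_policy, weights, gamma, rng):
    """Convex mixture of k-PAR backups: T^SP = sum_k w_k T^(k)."""
    target = 0.0
    for k, w_k in enumerate(weights, start=1):
        a_k = k_par_action(a_data, a_policy, k, rng)
        target += w_k * q_target(s_next, a_k)
    return r + gamma * target

s_next = rng.normal(size=3)
a_data = rng.normal(size=(n_agents, act_dim))    # behavioral joint action from D
a_policy = rng.normal(size=(n_agents, act_dim))  # current-policy samples
w = np.ones(n_agents) / n_agents                 # uniform weights as a placeholder
print(soft_partial_target(r=1.0, s_next=s_next, a_data=a_data,
                          a_policy=a_policy, weights=w, gamma=0.99, rng=rng))
```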

2. SPaCQL Objective: Bellman Backup and Conservative Regularization

SPaCQL extends the Conservative Q-Learning (CQL) framework (Kumar et al., 2020) to the multi-agent setting by using the soft-partial operator in its temporal-difference loss. Let $\{Q_{\theta_j}\}_{j=1}^M$ denote an ensemble of Q-functions with corresponding target networks $\{\bar Q_{\bar\theta_j}\}$:

TD Loss with SPaCQL Target: $\mathcal{L}_{\rm TD}(\theta) = \mathbb{E}_{(s,\mathbf{a},r,s')\sim\mathcal{D}}\left[\left(Q_\theta(s,\mathbf{a})-Y_{\rm SP}(s,\mathbf{a},r,s')\right)^2\right],$ where

$Y_{\rm SP} = r + \gamma\,\sum_{k=1}^n w_k\,\min_j \bar Q_{\bar\theta_j}(s',\mathbf{a}'^{(k)}).$

A CQL-style conservative regularizer penalizes large Q-values on actions not present in the data: $\xi_c = \alpha\sum_{i=1}^n\Bigl(\mathbb{E}_{s\sim\mathcal{D},\,\mathbf{a}_{-i}\sim\mathcal{D},\,a_i\sim\pi_i}\!\left[Q_\theta(s, a_i, \mathbf{a}_{-i})\right] - \mathbb{E}_{(s,\mathbf{a})\sim\mathcal{D}}\!\left[Q_\theta(s,\mathbf{a})\right]\Bigr).$ The full loss is $\mathcal{L}(\theta) = \mathcal{L}_{\rm TD}(\theta) + \xi_c$.

This approach preserves the conservative lower-bounding property of CQL while explicitly constraining OOD propagation through partial updates.
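To illustrate how the TD term and the conservative regularizer combine, here is a minimal NumPy sketch assuming a toy scalar Q-function, randomly drawn placeholder data, a precomputed soft-partial target, and a single policy sample per agent; none of these choices come from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n_agents, batch = 3, 8

def q_value(s, joint_a):
    """Toy stand-in for Q_theta(s, a); returns one scalar per batch element."""
    return s.sum(axis=1) - 0.05 * (joint_a ** 2).sum(axis=(1, 2))

# Hypothetical batch from the offline buffer D.
s = rng.normal(size=(batch, 4))
a_data = rng.normal(size=(batch, n_agents, 1))   # behavioral joint actions
y_sp = rng.normal(size=batch)                    # soft-partial targets Y_SP (precomputed)

# TD loss against the soft-partial target.
td_loss = np.mean((q_value(s, a_data) - y_sp) ** 2)

# CQL-style regularizer: for each agent i, swap only a_i with a policy sample
# while keeping a_{-i} at its dataset values, then penalize the Q gap.
alpha = 5.0
xi_c = 0.0
for i in range(n_agents):
    a_pi = a_data.copy()
    a_pi[:, i, :] = rng.normal(size=(batch, 1))  # a_i ~ pi_i(.|s), placeholder sample
    xi_c += np.mean(q_value(s, a_pi)) - np.mean(q_value(s, a_data))
xi_c *= alpha

total_loss = td_loss + xi_c
print(td_loss, xi_c, total_loss)
```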

3. Theoretical Guarantees: Value-Error Bounds and Contraction Properties

SPaCQL provides a tight upper bound on the value-estimation error induced by deviation from the behavior policy under the factorization assumption. For a policy $\pi=(\pi_1,\ldots,\pi_n)$ and a behavior policy $\mu=(\mu_1,\ldots,\mu_n)$, define the suboptimality error $\varepsilon_{\rm Subopt} = \|Q^\pi-Q^*\|_\infty$ and the FQI error $\varepsilon_{\rm FQI} = \|Q^*-\hat Q\|_\infty$.

The **Linear-Divergence Bound**: $W_1\bigl(d^{(S)}, d^\mu\bigr) \leq \frac{\gamma}{1-\gamma}\sum_{i\in S}\mathrm{TV}\bigl(\pi_i,\mu_i\bigr),$ where $S$ is the set of deviating agents.

The **Tight Value-Error Bound**: $|V^\pi-\widehat V^\pi| \leq \varepsilon_{\rm Subopt} + \varepsilon_{\rm FQI} + \frac{4\gamma}{(1-\gamma)^2} \sum_{i=1}^n \mathrm{TV}(\pi_i, \mu_i).$ SPaCQL generalizes this with the effective deviation

$k_{\rm eff}(s) = \sum_{k=1}^n w_k(s)\,k, \quad \overline{\mathrm{TV}} = \frac{1}{n}\sum_i \mathrm{TV}(\pi_i, \mu_i),$

yielding

$|V^\pi-\widehat V^\pi| \leq \varepsilon_{\rm Subopt} + \varepsilon_{\rm FQI} + \frac{4\gamma}{(1-\gamma)^2}\, \mathbb{E}_{s'\sim d^\pi}\!\left[k_{\rm eff}(s')\right]\, n\,\overline{\mathrm{TV}}.$

Thus, the error grows *linearly* with the number of deviating agents, a provable improvement over the exponential scaling incurred by full-joint deviation.
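For a rough sense of the scaling, the short check below evaluates the deviation term of the bound, $\frac{4\gamma}{(1-\gamma)^2}\sum_{i\in S}\mathrm{TV}(\pi_i,\mu_i)$, for made-up values $\gamma=0.99$ and a per-agent divergence of $0.05$ (these numbers are illustrative, not from the paper); the penalty grows linearly in the number of deviating agents.

```python
gamma, tv_per_agent = 0.99, 0.05  # illustrative values, not reported results

def deviation_penalty(num_deviating):
    """Deviation term of the value-error bound: 4*gamma/(1-gamma)^2 * sum_i TV_i."""
    return 4 * gamma / (1 - gamma) ** 2 * num_deviating * tv_per_agent

for k in (1, 2, 4, 8):
    print(k, round(deviation_penalty(k), 1))  # doubles when the number of deviators doubles
```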

4. Dynamic Weighting: Uncertainty-Aware Partial Updates

SPaCQL adaptively selects the weights $w_k$ for mixing the $k$-PAR operators using ensemble-based uncertainty quantification. For each $k$, compute the standard deviation across the Q-ensemble: $u_k(s') = \sqrt{\mathrm{Var}_{j=1..M}\left[Q_{\theta_j}(s', \mathbf{a}'^{(k)})\right]}$. An inverse-uncertainty normalization assigns lower weight to more uncertain (i.e., likely OOD) partial updates: $w_k(s') = \frac{1/u_k(s')}{\sum_{\ell=1}^n 1/u_\ell(s')}, \quad u_k > 0$. The soft-partial target thus integrates the ensemble-minimum Q-value with uncertainty scaling: $y_k = \frac{1}{u_k} \min_j \bar Q_{\bar\theta_j}(s', \mathbf{a}'^{(k)}), \qquad Y_{\rm SP} = r + \gamma\, \frac{\sum_k y_k}{\sum_k 1/u_k}$. This mechanism enables SPaCQL to interpolate between single-agent and joint updates as a function of data coverage and Q-value uncertainty.
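A minimal NumPy sketch of the uncertainty-weighted target follows; the toy ensemble, the shapes, and the pre-formed $\mathbf{a}'^{(k)}$ actions are illustrative assumptions rather than the paper's code.

```python
import numpy as np

rng = np.random.default_rng(2)
n_agents, M = 4, 5
r, gamma = 1.0, 0.99

def target_q(j, s_next, a_k):
    """Stand-in for the j-th target critic evaluated at (s', a'^{(k)}); returns a scalar."""
    return float(np.sum(s_next) - 0.1 * (j + 1) * np.sum(a_k ** 2))

s_next = rng.normal(size=3)
a_par = [rng.normal(size=(n_agents, 1)) for _ in range(n_agents)]  # one a'^{(k)} per k

u, y = [], []
for a_k in a_par:
    q_vals = np.array([target_q(j, s_next, a_k) for j in range(M)])
    u_k = q_vals.std() + 1e-8      # ensemble uncertainty u_k(s'), kept strictly positive
    u.append(u_k)
    y.append(q_vals.min() / u_k)   # y_k = (1/u_k) * min_j Q_target_j

# Y_SP = r + gamma * (sum_k y_k) / (sum_k 1/u_k), i.e. inverse-uncertainty weighting.
y_sp = r + gamma * sum(y) / sum(1.0 / np.array(u))
print(y_sp)
```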

5. Algorithmic Procedure

An outline of SPaCQL's training loop is as follows:

Algorithm SPaCQL
Input: offline buffer D, discount γ, reg α, ensemble size M

Initialize {Q_{θ_j}} and targets {Q̄_{θ̄_j}}, policies {π_i}
for each training iteration do
  Sample batch B={(s, a, r, s', a')} from D
  Initialize loss L←0
  for each (s,a,r,s',a') in B do
    for k=1..n do
      Sample k agent‐indices Σ=(σ₁..σ_k)
      Sample new actions {a^π_{σ}∼π_σ(·|s')}
      Form a'^{(k)} by replacing those components in a'
      Compute u_k←stddev_j[ Q_{θ_j}(s', a'^{(k)}) ]
      Compute ỹ_k ← (1/u_k) * min_j Q̄_{θ̄_j}(s',a'^{(k)})
    end for
    Y_SP ← r + γ * ( ∑_k ỹ_k ) / ( ∑_k 1/u_k )
    for j=1..M do
      L += ( Q_{θ_j}(s,a) - Y_SP )²
    end for
    L += CQL‐regularizer ξ_c(s,a)
  end for
  θ ← θ - η ∇_θ L
  Polyak‐update targets θ̄ ← τ θ + (1−τ) θ̄
  Update policies π_i to maximize Q_{θ₁}
end for
Output: joint policy π=(π₁,…,π_n)

Critical hyperparameters include the regularization strength $\alpha$, the ensemble size $M$, the batch size, the learning rates, and the number of Q-updates per step; these play roles analogous to their counterparts in CQL and ensemble deep RL approaches.
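Purely as an illustration of the knobs involved (every value below is a placeholder, not a setting reported for SPaCQL), a configuration might look like:

```python
# Hypothetical SPaCQL configuration; all values are placeholders.
spacql_config = {
    "alpha": 5.0,           # CQL regularization strength
    "ensemble_size": 5,     # number of Q-functions M
    "batch_size": 256,
    "critic_lr": 3e-4,
    "actor_lr": 3e-4,
    "gamma": 0.99,
    "tau": 0.005,           # Polyak averaging coefficient for target networks
    "q_updates_per_step": 1,
}
```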

6. Empirical Evaluation and Results

SPaCQL was evaluated on Multi-Agent Particle Environment (MPE) benchmarks—Cooperative Navigation, Predator–Prey, World—and Multi-Agent MuJoCo (MaMujoco) Half-Cheetah, using datasets (Expert, Medium, Medium-Replay, Random) chosen to probe varying degrees of agent coordination. Baselines covered state-of-the-art offline MARL algorithms, including OMAR, MACQL (CFCQL), IQL, MA-TD3+BC, and DoF.

Results are summarized in the following table (mean ± std, normalized score):

| Dataset | CFCQL | ICQL-QS | SPaCQL (best) |
|---|---|---|---|
| CN–Rand | 62.2 ± 8.1 | 77.7 ± ? | 78.2 ± 14.0 |
| World–Rand | 68.0 ± 20.8 | 89.9 ± ? | 94.3 ± 7.4 |
| Half-C–Rand | 39.7 ± 4.0 | ? | 43.8 ± 4.9 |
| ... | ... | ... | ... |
  • SPaCQL exhibited superior performance on Random and Medium-Replay datasets, which have low agent coordination and thus more severe OOD shifts. The advantage is attributed to the use of PAR and uncertainty-based weights, which prevent excessive extrapolation and stabilize learning.
  • On high-coordination Expert datasets, SPaCQL matched the best algorithms (DoF, CFCQL), indicating its adaptivity across coordination regimes.
  • Ablation studies indicate that SPaCQL automatically adjusts the soft-partial weights $w_k$, emphasizing small-$k$ PAR (near-behavioral updates) on uncoordinated data and larger-$k$ PAR (greater policy deviation) on coordinated data. This dynamic trade-off directly reflects the effective deviation $k_{\rm eff}(s)$ in the theoretical error bound.
  • A plausible implication is that the interpolation between single and joint updates provides both stability and flexibility absent in prior methods.

SPaCQL thereby achieves a provably tighter value–error bound and demonstrates empirical gains by aligning update structure to the independence pattern in offline datasets (Jin et al., 10 Nov 2025).
