Bayesian Strong Stackelberg Q-learning

Updated 18 June 2026

Bayesian Strong Stackelberg Q-learning (BSS-Q) is a multi-agent reinforcement learning method that computes optimal leader policies in games under uncertainty.
It iteratively approximates Strong Stackelberg Equilibria by updating type-conditional Q-tables and beliefs using online Bayesian updates.
BSS-Q ensures convergence to equilibrium strategies in sequential adversarial settings and offers scalable solutions for cybersecurity Moving Target Defense.

Bayesian Strong Stackelberg Q-learning (BSS-Q) is a multi-agent reinforcement learning (MARL) method for learning optimal leader policies in Bayesian Stackelberg Markov Games (BSMGs). In this framework, leader-follower (defender-attacker) interactions unfold under uncertainty about the follower’s (attacker’s) type, as arises in Moving Target Defense (MTD) for cybersecurity. Sengupta et al. (2020) present BSS-Q as the first scalable, provably convergent approach for learning Strong Stackelberg Equilibrium (SSE) policies in sequential incomplete-information settings; it requires no prior knowledge of the reward or transition model.

1. Bayesian Stackelberg Markov Games: Preliminaries

A Bayesian Stackelberg Markov Game (BSMG) is a two-player, turn-based stochastic game formalized as the tuple

$\bigl\langle S,\; A_L,\; A_F,\; \Theta,\; T,\; R_L,\; R_F,\; P_0,\;\gamma\bigr\rangle$

with the following elements:

$S$ : Finite set of states $\{s_1,\ldots,s_{|S|}\}$ .
$\Theta$ : Finite set of attacker (follower) types $\{\theta_1,\ldots,\theta_t\}$ ; defender maintains a state-conditional belief $\theta(s)\in\Delta(\Theta)$ with $\theta_i(s)=\Pr(\text{attacker}=i|s)$ .
$A_L(s)$ , $A_F^i(s)$ : Defender (leader) and attacker type- $i$ action sets at state $s$ .
$T$ : Stochastic transition kernel, giving the distribution of the next state $s'$ from $(s, a_L, a_F)$ .
$R_L$ , $R_F$ : Immediate rewards for the leader and for follower type $i$ .
$P_0$ : Prior on start state and follower type.
$\gamma$ : Discount factor.

The leader commits to a mixed strategy $x(s)\in\Delta(A_L(s))$ ; the follower observes this commitment and selects a pure best response, and the next state is drawn from $T$ . Payoffs for both agents are discounted sums of rewards.
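To make the tuple concrete, the following is a minimal sketch of how a BSMG could be held in code. The class name, field names, the toy two-state game, and the requirement that the discount factor lie in [0, 1) are all illustrative assumptions, not part of the original formulation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class BSMG:
    """Illustrative container for the BSMG tuple (all field names are hypothetical)."""
    states: List[str]                                        # S
    leader_actions: Dict[str, List[str]]                     # A_L(s), keyed by state
    follower_actions: Dict[Tuple[str, str], List[str]]       # A_F^i(s), keyed by (type, state)
    types: List[str]                                         # Theta
    transition: Callable[[str, str, str], Dict[str, float]]  # T: (s, a_L, a_F) -> dist over s'
    reward_leader: Callable[[str, str, str, str], float]     # (type, s, a_L, a_F) -> reward
    reward_follower: Callable[[str, str, str, str], float]   # (type, s, a_L, a_F) -> reward
    prior: Dict[Tuple[str, str], float]                      # P_0 over (state, type)
    gamma: float                                             # discount factor

    def validate(self, tol: float = 1e-9) -> None:
        """Check basic well-formedness; raises ValueError on the first problem found."""
        if not 0.0 <= self.gamma < 1.0:
            raise ValueError("gamma is assumed to lie in [0, 1)")
        if abs(sum(self.prior.values()) - 1.0) > tol:
            raise ValueError("prior must sum to 1")
        for s in self.states:
            for i in self.types:
                for a_l in self.leader_actions[s]:
                    for a_f in self.follower_actions[(i, s)]:
                        dist = self.transition(s, a_l, a_f)
                        if abs(sum(dist.values()) - 1.0) > tol:
                            raise ValueError(f"T({s}, {a_l}, {a_f}) is not a distribution")


def toggle_on_switch(s: str, a_l: str, a_f: str) -> Dict[str, float]:
    """Toy dynamics: 'switch' deterministically moves to the other configuration."""
    if a_l == "switch":
        return {"s1" if s == "s0" else "s0": 1.0}
    return {s: 1.0}


TYPES = ["novice", "expert"]
STATES = ["s0", "s1"]

toy = BSMG(
    states=STATES,
    leader_actions={s: ["stay", "switch"] for s in STATES},
    follower_actions={(i, s): ["attack", "wait"] for i in TYPES for s in STATES},
    types=TYPES,
    transition=toggle_on_switch,
    reward_leader=lambda i, s, a_l, a_f: -1.0 if a_f == "attack" else 0.0,
    reward_follower=lambda i, s, a_l, a_f: 1.0 if a_f == "attack" else 0.0,
    prior={("s0", "novice"): 0.7, ("s0", "expert"): 0.3},
    gamma=0.9,
)
toy.validate()  # passes silently for this toy game
```

Keeping the transition and reward functions as callables reflects the model-free setting: a learner would only sample from them, never inspect them.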

2. Strong Stackelberg Equilibrium in BSMGs

The Strong Stackelberg Equilibrium (SSE) for BSMGs comprises stationary Markovian policies:

$x$ , with $x(s)\in\Delta(A_L(s))$ : leader’s mixed strategies.
$q^i$ , with $q^i(s)\in A_F^i(s)$ : type- $i$ follower’s pure strategies.

The SSE conditions at each state $s$ are:

a) Best Response: For each type $i$ , $q^i(s)\in B^i(s)$ , where $B^i(s)\subseteq A_F^i(s)$ is the set of actions that maximize the type- $i$ follower’s expected discounted return against $x(s)$ , i.e., optimal for follower type- $i$ given leader’s $x$ .

b) Tie-Breaking Favoring Leader: If multiple best responses exist ( $|B^i(s)|>1$ ), select the $q^i(s)\in B^i(s)$ that maximizes the leader’s expected discounted return.

Existence and uniqueness of the leader’s equilibrium value are guaranteed for finite state, action, and type sets. The SSE characterization ensures strategic leader commitment accounting for best-responding followers under type uncertainty.
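The two SSE conditions can be illustrated for a single follower type in a one-shot stage game. This is a minimal sketch: the function names, the nested-dict payoff layout, and the payoff numbers are assumptions made for illustration, and the code only evaluates the follower's response to a given leader mix; it does not search for the leader's optimal commitment.

```python
from typing import Dict, List

Payoff = Dict[str, Dict[str, float]]  # payoff[leader_action][follower_action]


def expected_payoff(payoff: Payoff, x: Dict[str, float], a_f: str) -> float:
    """Expected payoff of follower action a_f when the leader mixes according to x."""
    return sum(prob * payoff[a_l][a_f] for a_l, prob in x.items())


def strong_best_response(
    x: Dict[str, float],
    follower_payoff: Payoff,
    leader_payoff: Payoff,
    follower_actions: List[str],
    tol: float = 1e-9,
) -> str:
    """Condition (a): maximize the follower's payoff against x.
    Condition (b): among ties, pick the action that is best for the leader."""
    follower_values = {a: expected_payoff(follower_payoff, x, a) for a in follower_actions}
    best = max(follower_values.values())
    tied = [a for a in follower_actions if best - follower_values[a] <= tol]
    return max(tied, key=lambda a: expected_payoff(leader_payoff, x, a))


# Made-up 2x2 stage game: the follower is indifferent under the uniform mix.
follower_payoff = {"A": {"l": 1.0, "r": 0.0}, "B": {"l": 0.0, "r": 1.0}}
leader_payoff = {"A": {"l": 2.0, "r": 4.0}, "B": {"l": 1.0, "r": 0.0}}

tie_case = strong_best_response({"A": 0.5, "B": 0.5}, follower_payoff, leader_payoff, ["l", "r"])
strict_case = strong_best_response({"A": 0.8, "B": 0.2}, follower_payoff, leader_payoff, ["l", "r"])
print(tie_case, strict_case)
```

Under the uniform mix both follower actions yield 0.5, so the tie is broken toward `r`, which gives the leader 2.0 rather than 1.5; under the 0.8/0.2 mix the follower strictly prefers `l` and no tie-breaking is needed.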

3. The Bayesian Strong Stackelberg Q-learning Algorithm

BSS-Q is a joint learning procedure that iteratively approximates SSE policies and value functions through the interaction loop. Main components:

Type-Conditional Q-Tables for Leader: $Q_L^i(s,a_L,a_F)$ , estimating discounted rewards when the leader plays $a_L$ and the type- $i$ follower plays $a_F$ .
Follower Q-Tables: $Q_F^i(s,a_L,a_F)$ for each attacker type.
Belief Tracking: $\theta(s)$ , updated online via Bayes’ rule if required.
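The belief-tracking component can be sketched as a plain Bayes update over follower types. The likelihood model used here, in which each type plays its current best response with probability 1 - eps and explores uniformly otherwise, is an assumption chosen for illustration; the source does not specify one, and all function and type names are hypothetical.

```python
from typing import Dict, List


def action_likelihood(observed: str, best_response: str, actions: List[str], eps: float) -> float:
    """Pr(observed action | type), assuming an eps-greedy follower (illustrative model)."""
    if observed not in actions:
        return 0.0  # this type cannot take the observed action at all
    greedy = 1.0 - eps if observed == best_response else 0.0
    return greedy + eps / len(actions)


def bayes_update(belief: Dict[str, float], likelihood: Dict[str, float]) -> Dict[str, float]:
    """Posterior over types, proportional to prior times likelihood."""
    unnormalized = {i: belief[i] * likelihood[i] for i in belief}
    total = sum(unnormalized.values())
    if total == 0.0:
        return dict(belief)  # observation impossible under every type: keep the prior
    return {i: w / total for i, w in unnormalized.items()}


belief = {"t1": 0.5, "t2": 0.5}
actions = {"t1": ["a", "b"], "t2": ["a", "b", "c", "d"]}
best = {"t1": "a", "t2": "b"}

likelihood_a = {i: action_likelihood("a", best[i], actions[i], eps=0.2) for i in belief}
posterior_a = bayes_update(belief, likelihood_a)

# Action "c" is only available to t2, so observing it identifies the type.
likelihood_c = {i: action_likelihood("c", best[i], actions[i], eps=0.2) for i in belief}
posterior_c = bayes_update(belief, likelihood_c)
print(posterior_a, posterior_c)
```

Observing `a` shifts most of the mass to `t1`, whose best response it is, while an action outside a type's action set rules that type out entirely.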

The learning proceeds as follows:

Observe $s$ , sample attacker type $i\sim\theta(s)$ .
Actions: $a_L$ via $\epsilon$ -greedy w.r.t. leader strategy $x(s)$ ; $a_F$ via $\epsilon$ -greedy w.r.t. follower policy $q^i(s)$ .
Execute $(a_L,a_F)$ , observe rewards $(r_L,r_F)$ and next state $s'$ .
Compute the stage-game SSE at $s'$ using current Q-tables as payoff matrices; obtain $(x(s'),q(s'))$ and equilibrium values $V_L^i(s'),V_F^i(s')$ .
Q-Table Updates:

$Q_L^i(s,a_L,a_F)\leftarrow(1-\alpha)\,Q_L^i(s,a_L,a_F)+\alpha\bigl[r_L+\gamma\,V_L^i(s')\bigr]$

$Q_F^i(s,a_L,a_F)\leftarrow(1-\alpha)\,Q_F^i(s,a_L,a_F)+\alpha\bigl[r_F+\gamma\,V_F^i(s')\bigr]$

Optionally update belief $\theta(s)$ by Bayes’ rule using observed $a_F$ .
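The action-selection and Q-update steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names, the dictionary-backed Q-tables keyed by (type, state, leader action, follower action), and all numeric values are hypothetical, and the next-state equilibrium values are passed in as plain numbers because the stage-game SSE solver is not shown.

```python
import random
from collections import defaultdict


def sample_leader_action(x, eps, rng):
    """Epsilon-greedy around a mixed strategy x (dict: action -> probability)."""
    actions = list(x)
    if rng.random() < eps:
        return rng.choice(actions)  # explore uniformly
    weights = [x[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


def sample_follower_action(best_response, actions, eps, rng):
    """Epsilon-greedy around the follower's pure best response."""
    if rng.random() < eps:
        return rng.choice(actions)
    return best_response


def td_update(q_table, key, reward, next_value, alpha, gamma):
    """Q <- (1 - alpha) * Q + alpha * (reward + gamma * next_value)."""
    q_table[key] = (1.0 - alpha) * q_table[key] + alpha * (reward + gamma * next_value)
    return q_table[key]


rng = random.Random(0)
q_leader = defaultdict(float)    # one logical table per type, flattened into the key
q_follower = defaultdict(float)

follower_type, s = "t1", "s0"
# eps=0.0 disables exploration so the demo is deterministic.
a_l = sample_leader_action({"stay": 1.0, "switch": 0.0}, eps=0.0, rng=rng)
a_f = sample_follower_action("attack", ["attack", "wait"], eps=0.0, rng=rng)
key = (follower_type, s, a_l, a_f)

# Rewards and next-state SSE values below are made-up numbers; in BSS-Q the
# values would come from solving the stage game at the next state.
first = td_update(q_leader, key, reward=-2.0, next_value=10.0, alpha=0.5, gamma=0.9)
second = td_update(q_leader, key, reward=-2.0, next_value=10.0, alpha=0.5, gamma=0.9)
td_update(q_follower, key, reward=3.0, next_value=5.0, alpha=0.5, gamma=0.9)
print(key, first, second, q_follower[key])
```

The only structural difference from tabular single-agent Q-learning is the bootstrap target: the next-state value is an equilibrium value of the stage game rather than a maximum over the learner's own actions.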

This iterative process enables principled policy improvement in adversarial partially observable settings.

4. Convergence Properties

Under standard stochastic-approximation conditions (step-sizes $\alpha_t$ with $\sum_t\alpha_t=\infty$ , $\sum_t\alpha_t^2<\infty$ ; infinite visitation to each $(s,a_L,a_F,i)$ -tuple; $0\le\gamma<1$ ), BSS-Q's Q-tables converge almost surely to the unique fixed point of the Bellman operator defined by the stage-game SSE. The policies thus induced converge to a stationary SSE of the underlying BSMG.
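Standard stochastic-approximation step-size conditions are usually the Robbins-Monro conditions: the step sizes sum to infinity while their squares sum to a finite value. Assuming that is what is meant here, one common way to satisfy them is a per-tuple visit-count schedule, sketched below. The class name and the `power` parameter are hypothetical.

```python
from collections import defaultdict


class VisitCountStepSize:
    """alpha_n = 1 / n**power, where n counts visits to a given Q-table entry.

    For 0.5 < power <= 1 the series sum(alpha_n) diverges while
    sum(alpha_n**2) converges, which is the Robbins-Monro pair of conditions.
    """

    def __init__(self, power: float = 1.0):
        if not 0.5 < power <= 1.0:
            raise ValueError("power must lie in (0.5, 1]")
        self.power = power
        self.counts = defaultdict(int)

    def __call__(self, key) -> float:
        """Register one more visit to `key` and return the step size to use."""
        self.counts[key] += 1
        return 1.0 / self.counts[key] ** self.power


step_size = VisitCountStepSize()
key = ("t1", "s0", "stay", "attack")
alphas = [step_size(key) for _ in range(4)]
other = step_size(("t2", "s0", "stay", "attack"))  # counts are tracked per entry
print(alphas, other)
```

Tracking counts per entry rather than globally keeps rarely visited tuples learning at a useful rate, which matters because the convergence argument also requires every tuple to be visited infinitely often.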

The underlying Bellman operator for the leader’s type- $i$ Q-table is:

$(\mathcal{T}Q_L^i)(s,a_L,a_F)=R_L(s,a_L,a_F)+\gamma\sum_{s'}T(s,a_L,a_F,s')\,V_L^i(s')$

where $R_L$ is the leader’s immediate reward against follower type $i$ and $V_L^i(s')$ is derived from the Q-tables via SSE computation. The operator $\mathcal{T}$ is a $\gamma$ -contraction in the sup-norm due to discounting and leader-favoring tie-breaking. Standard stochastic approximation guarantees convergence to fixed-point policies $(x^*,q^*)$ .

5. Computational Complexity and Scalability

BSS-Q's storage and computation are dominated by:

Q-table Storage: one leader and one follower Q-table per attacker type ( $2|\Theta|$ tables), the type- $i$ tables having size $|S|\times|A_L|\times|A_F^i|$ .
Per-Step SSE Computation: At each transition, a compact Bayesian Stackelberg normal-form (stage-)game of size $|A_L(s)|\times\sum_i|A_F^i(s)|$ must be solved. Exact SSE computation is NP-hard in the number of types and follower actions, but a practical encoding as a MILP is tractable up to hundreds of actions, yielding solutions in milliseconds to seconds.

Compared to single-agent RL, with $O(|S||A|)$ storage and $O(|A|)$ policy improvement per step, BSS-Q adds game-solving overhead at every step. Empirically, BSS-Q is feasible for cybersecurity MTD domains with a handful of states and a few hundred total attacker actions.
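As a rough sizing exercise, the arithmetic below counts Q-table entries for the web-application scenario described in the evaluation section (4 states, 4 defender actions, attacker types with 269, 34, and 48 actions). It assumes one leader and one follower table per type, each indexed by (state, leader action, follower action), and that every action is available in every state, so the result is an upper bound; the function name is hypothetical.

```python
def q_table_entries(n_states, n_leader_actions, follower_actions_by_type):
    """Entries per single Q-table for each type, and the total over leader+follower tables."""
    per_table = {
        name: n_states * n_leader_actions * n_follower
        for name, n_follower in follower_actions_by_type.items()
    }
    total = 2 * sum(per_table.values())  # leader table + follower table per type
    return per_table, total


web_app = {"DB-expert": 269, "ScriptKiddie": 34, "BlackHat": 48}
per_table, total = q_table_entries(4, 4, web_app)
stage_game_columns = sum(web_app.values())  # follower actions across all types

print(per_table)           # {'DB-expert': 4304, 'ScriptKiddie': 544, 'BlackHat': 768}
print(total)               # 11232
print(stage_game_columns)  # 351
```

At this scale storage is negligible; the dominant cost is solving a stage game with 4 leader actions and 351 type-specific follower actions at every step.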

6. Empirical Evaluation

BSS-Q was evaluated in two MTD scenarios within an OpenAI Gym-style environment, without prior knowledge of the transition or reward model.

A. Web-Application Stack Defense

States: 4 configurations ( $|S|=4$ ).
3 attacker types: DB-expert (269 actions), ScriptKiddie (34), BlackHat (48), action sets derived from CVEs.
Defender actions: 4 configurations; switching and attack costs from empirical latency and CVSS scores.

B. Cloud-Network IDS Placement

States correspond to attack graph levels ( $|S|=4$ ).
Single attacker type (the number of available actions varies by state).
Defender actions: host-based or network-based IDS placement/removal.
Rewards from CVSS; transitions from attack graph structure.

Baselines

Uniform Random Strategy (URS).
Bayesian EXP-Q (bandit approach ignoring strategic best responses).
Nash-Q (computes Nash equilibria at each stage).
State-agnostic “optimal” (S-OPT) from a two-stage BSG formulation.

Metrics & Results

Scenario	States	# Attacker Types	BSS-Q Performance	Baselines Performance	Planning Time
Web-Application Stack	4	3	Converged near FI-SSE; eliminated cycles	URS, B-EXP-Q, S-OPT outperformed in at least 2 states; SA-RL cycles	BSS-Q: about 150 s/episode; B-EXP-Q: about 220 s/episode
Cloud-Network IDS	4	1	Outperformed URS, EXP-Q in 2/3 non-terminal states; matched Nash-Q	Nash-Q matched/slightly exceeded; Nash-Q sensitive to equiv.	URS: about 100 s/episode; Nash-Q failed to scale

BSS-Q converged rapidly to SSE-like rewards, improved on prior state-of-the-art MTD performance, and eliminated the exploitable cycles characteristic of single-agent RL and myopic bandit-style baselines.

7. Significance and Application Scope

BSS-Q provides a theoretically justified, scalable solution for adaptive decision-making in sequential adversarial environments with incomplete information, specifically suited for cybersecurity MTD domains. It bridges MARL and game-theoretic learning for environments where the defender must account for type uncertainty and commit to strategies robust against strategic, best-responding attackers. The framework extends to settings lacking prior distributional knowledge over state transitions or rewards, addressing key challenges in practical security defense planning (Sengupta et al., 2020).

References (1)

Sengupta et al. (2020). Multi-agent Reinforcement Learning in Bayesian Stackelberg Markov Games for Adaptive Moving Target Defense.
