
Bayesian Strong Stackelberg Q-learning

Updated 18 June 2026
  • Bayesian Strong Stackelberg Q-learning (BSS-Q) is a multi-agent reinforcement learning method that computes optimal leader policies in games under uncertainty.
  • It iteratively approximates Strong Stackelberg Equilibria by updating type-conditional Q-tables and beliefs using online Bayesian updates.
  • BSS-Q ensures convergence to equilibrium strategies in sequential adversarial settings and offers scalable solutions for cybersecurity Moving Target Defense.

Bayesian Strong Stackelberg Q-learning (BSS-Q) is a multi-agent reinforcement learning (MARL) methodology designed to learn optimal leader policies in the setting of Bayesian Stackelberg Markov Games (BSMGs). Within this framework, strategic leader-follower (defender-attacker) interactions occur under uncertainty regarding the follower’s (attacker’s) type, reflecting critical requirements in domains such as Moving Target Defense (MTD) for cybersecurity. BSS-Q provides the first scalable, provably convergent approach for learning Strong Stackelberg Equilibrium (SSE) policies in sequential incomplete-information settings, operating without prior knowledge of model rewards or transitions (Sengupta et al., 2020).

1. Bayesian Stackelberg Markov Games: Preliminaries

A Bayesian Stackelberg Markov Game (BSMG) is a two-player, turn-based stochastic game formalized as the tuple

$$\bigl\langle S,\; A_L,\; A_F,\; \Theta,\; T,\; R_L,\; R_F,\; P_0,\;\gamma\bigr\rangle$$

with the following elements:

  • $S$: Finite set of states $\{s_1,\ldots,s_{|S|}\}$.
  • $\Theta$: Finite set of attacker (follower) types $\{\theta_1,\ldots,\theta_t\}$; the defender maintains a state-conditional belief $\theta(s)\in\Delta(\Theta)$ with $\theta_i(s)=\Pr(\text{attacker}=i \mid s)$.
  • $A_L(s)$, $A_F^i(s)$: Defender (leader) and attacker type-$i$ action sets at state $s$.
  • $T$: Stochastic transition kernel.
  • $R_L$, $R_F^i$: Immediate rewards for leader and follower type $i$.
  • $P_0$: Prior on start state and follower type.
  • $\gamma$: Discount factor.

The leader selects a mixed strategy over $A_L(s)$; the follower observes this commitment, selects a pure best response, and transitions follow $T$. Payoffs for both agents are discounted sum utilities.

2. Strong Stackelberg Equilibrium in BSMGs

The Strong Stackelberg Equilibrium (SSE) for BSMGs comprises stationary Markovian policies:

  • $\pi_L : S \to \Delta(A_L)$: leader’s mixed strategies.
  • $\pi_F^i : S \to A_F^i$: type-$i$ follower’s pure strategies.

The SSE conditions at each state $s$ are:

a) Best Response: For each type $i$, $\pi_F^i(s) \in \arg\max_{a_F \in A_F^i(s)} Q_F^i\bigl(s, \pi_L(s), a_F\bigr)$, i.e., optimal for follower type-$i$ given leader’s $\pi_L(s)$.

b) Tie-Breaking Favoring Leader: If multiple best responses exist ($|\mathrm{BR}^i(s)| > 1$), select $\pi_F^i(s) \in \arg\max_{a_F \in \mathrm{BR}^i(s)} Q_L^i\bigl(s, \pi_L(s), a_F\bigr)$. Here $\mathrm{BR}^i(s)$ is the set of maximizers in (a), and $Q_L^i$, $Q_F^i$ are the leader’s and type-$i$ follower’s expected discounted returns under the leader’s mixture.

Existence and uniqueness of the leader’s equilibrium value are guaranteed for finite state, action, and type sets. The SSE characterization ensures strategic leader commitment accounting for best-responding followers under type uncertainty.
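Conditions (a) and (b) can be illustrated for a single follower type at a single state. The sketch below is illustrative only: the function name and the nested-list payoff layout `payoff[a_L][a_F]` are assumptions, not notation from Sengupta et al. (2020), and it handles only the follower's response to a fixed leader mixture, not the search for the leader's optimal commitment.

```python
from typing import Sequence


def follower_best_response(
    leader_mix: Sequence[float],
    follower_payoff: Sequence[Sequence[float]],
    leader_payoff: Sequence[Sequence[float]],
    tol: float = 1e-9,
) -> int:
    """Pick a follower action for one type at one state.

    follower_payoff[a_L][a_F] and leader_payoff[a_L][a_F] are stage-game
    payoffs (hypothetical layout chosen for this sketch).
    """
    n_follower = len(follower_payoff[0])

    def expected(payoff: Sequence[Sequence[float]], a_f: int) -> float:
        # Expected payoff of follower action a_f under the leader's mixture.
        return sum(p * payoff[a_l][a_f] for a_l, p in enumerate(leader_mix))

    follower_vals = [expected(follower_payoff, a_f) for a_f in range(n_follower)]
    best = max(follower_vals)
    # (a) Best response: all actions within tol of the follower's optimum.
    best_set = [a_f for a_f, v in enumerate(follower_vals) if best - v <= tol]
    # (b) Tie-breaking: among best responses, favor the leader.
    return max(best_set, key=lambda a_f: expected(leader_payoff, a_f))


# Both follower actions yield 0.5 here, so the leader-preferred one is chosen.
choice = follower_best_response([0.5, 0.5], [[1, 0], [0, 1]], [[0, 2], [0, 2]])
```

The tolerance `tol` is a practical choice for floating-point ties; an exact comparison would rarely detect a tie between learned Q-values.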

3. The Bayesian Strong Stackelberg Q-learning Algorithm

BSS-Q is a joint learning procedure that iteratively approximates SSE policies and value functions through the interaction loop. Main components:

  • Type-Conditional Q-Tables for Leader: $Q_L^i(s, a_L, a_F)$, estimating discounted rewards when the leader plays $a_L$ and the type-$i$ follower plays $a_F$.
  • Follower Q-Tables: $Q_F^i(s, a_L, a_F)$ for each attacker type.
  • Belief Tracking: $\theta(s)$, updated online via Bayes’ rule if required.
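The belief-tracking component can be sketched as a plain Bayes update over attacker types. This is a minimal sketch under stated assumptions: the likelihood model (each type follows an epsilon-greedy policy around one greedy action) and both function names are hypothetical, introduced here only for illustration.

```python
from typing import List


def epsilon_greedy_likelihood(
    observed: int, greedy_action: int, n_actions: int, epsilon: float
) -> float:
    """Probability of `observed` under an epsilon-greedy policy (assumed model)."""
    explore = epsilon / n_actions
    if observed == greedy_action:
        return (1.0 - epsilon) + explore
    return explore


def bayes_belief_update(prior: List[float], likelihoods: List[float]) -> List[float]:
    """posterior_i is proportional to prior_i * Pr(observed action | type i)."""
    unnormalized = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(unnormalized)
    if total == 0.0:
        # Observation impossible under every type: keep the prior unchanged.
        return list(prior)
    return [u / total for u in unnormalized]


# Two types with a uniform prior; the observed action is far more likely
# under type 0, so the posterior shifts toward type 0.
posterior = bayes_belief_update([0.5, 0.5], [0.9, 0.1])
```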

The learning proceeds as follows:

  1. Observe state $s$, sample attacker type $i \sim \theta(s)$.
  2. Actions: $a_L$ via $\epsilon$-greedy w.r.t. leader strategy $\pi_L(s)$; $a_F$ via $\epsilon$-greedy w.r.t. follower policy $\pi_F^i(s)$.
  3. Execute $(a_L, a_F)$, observe rewards $(r_L, r_F)$ and next state $s'$.
  4. Compute the stage-game SSE at $s'$ using current Q-tables as payoff matrices; obtain $\bigl(\pi_L(s'), \pi_F^i(s')\bigr)$ and equilibrium values $V_L^i(s')$, $V_F^i(s')$.
  5. Q-Table Updates:

$$Q_L^i(s, a_L, a_F) \leftarrow (1-\alpha)\, Q_L^i(s, a_L, a_F) + \alpha \bigl[ r_L + \gamma\, V_L^i(s') \bigr]$$

$$Q_F^i(s, a_L, a_F) \leftarrow (1-\alpha)\, Q_F^i(s, a_L, a_F) + \alpha \bigl[ r_F + \gamma\, V_F^i(s') \bigr]$$

  6. Optionally update belief $\theta(s)$ by Bayes’ rule using observed $a_F$.

This iterative process enables principled policy improvement in adversarial partially observable settings.
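The Q-table update in the loop above has the familiar temporal-difference form, with the next-state value taken from the stage-game SSE rather than from a max over actions. The sketch below assumes a dictionary-backed table keyed by `(s, a_L, a_F)` and takes the SSE value of the next state as an input computed elsewhere; the function name and the table layout are illustrative assumptions.

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, Tuple

QTable = DefaultDict[Tuple[Hashable, int, int], float]


def bssq_update(
    q_table: QTable,
    s: Hashable,
    a_l: int,
    a_f: int,
    reward: float,
    next_value: float,
    alpha: float,
    gamma: float,
) -> float:
    """Blend the old estimate with reward + gamma * (SSE value of next state).

    `next_value` is assumed to come from a separate stage-game SSE solver.
    """
    key = (s, a_l, a_f)
    target = reward + gamma * next_value
    q_table[key] = (1.0 - alpha) * q_table[key] + alpha * target
    return q_table[key]


# One table per (player, follower type); unseen entries default to 0.0.
q_leader: QTable = defaultdict(float)
updated = bssq_update(q_leader, s=0, a_l=1, a_f=2,
                      reward=1.0, next_value=2.0, alpha=0.5, gamma=0.9)
```

The same function serves the leader and follower tables; only the reward and the equilibrium value passed in differ.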

4. Convergence Properties

Under standard stochastic-approximation conditions (step sizes $\alpha_t$ with $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$; infinite visitation of each $(s, a_L, a_F)$-tuple; $\gamma \in [0,1)$), BSS-Q's Q-tables converge almost surely to the unique fixed point of the Bellman operator defined by the stage-game SSE. The policies thus induced converge to a stationary SSE of the underlying BSMG.
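As a concrete instance of the usual (Robbins-Monro) step-size requirements in stochastic approximation, the harmonic schedule satisfies both sums, while a constant step size does not. This is a standard textbook example, not a schedule reported in the source.

```latex
\alpha_t = \frac{1}{t}:\qquad
\sum_{t=1}^{\infty} \frac{1}{t} = \infty,
\qquad
\sum_{t=1}^{\infty} \frac{1}{t^{2}} = \frac{\pi^{2}}{6} < \infty .
\\[6pt]
\alpha_t = c > 0 \ \text{(constant)}:\qquad
\sum_{t=1}^{\infty} c^{2} = \infty
\quad\Longrightarrow\quad \text{the second condition fails.}
```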

The underlying Bellman operator for leader-type $i$ is:

$$(\mathcal{T} Q_L^i)(s, a_L, a_F) = R_L(s, a_L, a_F) + \gamma \sum_{s'} T(s' \mid s, a_L, a_F)\, V_L^i(s')$$

where $V_L^i(s')$ is derived from the Q-tables via SSE computation. The operator $\mathcal{T}$ is a $\gamma$-contraction in the sup-norm due to discounting and leader-favoring tie-breaking. Standard stochastic approximation guarantees convergence to fixed-point policies $(\pi_L^*, \pi_F^{i*})$.

5. Computational Complexity and Scalability

BSS-Q's storage and computation are dominated by:

  • Q-table Storage: $O(|\Theta|)$ Q-tables (one leader and one follower table per attacker type), each of size $|S| \times |A_L| \times |A_F^i|$.
  • Per-Step SSE Computation: At each transition, a compact Bayesian Stackelberg normal-form (stage) game over the leader's actions and every follower type's actions at the current state must be solved. Exact SSE computation is NP-hard in the number of types and follower actions, but a practical encoding as a mixed-integer linear program (MILP) is tractable up to hundreds of actions, yielding solutions in milliseconds to seconds.

Compared to single-agent RL, with $O(|S||A|)$ storage and $O(|A|)$ policy improvement per step, BSS-Q adds game-solving overhead per step. Empirically, BSS-Q is feasible for cybersecurity MTD domains with a handful of states and total attacker actions in the hundreds.
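To get a feel for the storage cost, the entry count can be worked out for the web-application scenario described in this article (4 states, 4 defender configurations, attacker types with 269, 34, and 48 actions). The sketch assumes dense tables indexed by `(s, a_L, a_F)`, one leader and one follower table per type, and every action available in every state, so it is an upper bound rather than a figure from the source.

```python
from typing import Dict


def q_table_entries(
    n_states: int,
    n_leader_actions: int,
    follower_actions_by_type: Dict[str, int],
) -> int:
    """Total entries across leader and follower Q-tables (dense upper bound)."""
    per_player = sum(
        n_states * n_leader_actions * n_follower
        for n_follower in follower_actions_by_type.values()
    )
    # One set of type-conditional tables for the leader, one for the follower.
    return 2 * per_player


web_app_types = {"DB-expert": 269, "ScriptKiddie": 34, "BlackHat": 48}
total_entries = q_table_entries(4, 4, web_app_types)  # 2 * 4 * 4 * 351
```

At this scale the tables are tiny; the per-step equilibrium computation, not memory, is the dominant cost.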

6. Empirical Evaluation

BSS-Q was evaluated in two MTD scenarios within an OpenAI Gym-style environment, without prior knowledge of the transition or reward model.

A. Web-Application Stack Defense

  • States: 4 configurations ($|S| = 4$).
  • 3 attacker types: DB-expert (269 actions), ScriptKiddie (34), BlackHat (48), action sets derived from CVEs.
  • Defender actions: 4 configurations; switching and attack costs from empirical latency and CVSS scores.

B. Cloud-Network IDS Placement

  • States correspond to attack graph levels ($|S| = 4$).
  • Single attacker type (the number of available actions varies by state).
  • Defender actions: host-based or network-based IDS placement/removal.
  • Rewards from CVSS; transitions from attack graph structure.

Baselines

  • Uniform Random Strategy (URS).
  • Bayesian EXP-Q (bandit approach ignoring strategic best responses).
  • Nash-Q (computes Nash equilibria at each stage).
  • State-agnostic “optimal” (S-OPT) from a two-stage BSG formulation.

Metrics & Results

| Scenario | States | # Attacker Types | BSS-Q Performance | Baselines Performance | Planning Time |
|---|---|---|---|---|---|
| Web-Application Stack | 4 | 3 | Converged near FI-SSE, eliminated cycles | URS, B-EXP-Q, S-OPT outperformed in at least 2 states; SA-RL cycles | BSS-Q: ~150 s/episode; B-EXP-Q: ~220 s/episode |
| Cloud-Network IDS | 4 | 1 | Outperformed URS, EXP-Q in 2/3 non-terminal states, matched Nash-Q | Nash-Q matched/slightly exceeded, Nash-Q sensitive to equiv. | URS: ~100 s/episode; Nash-Q failed to scale |

BSS-Q converged rapidly to SSE-like rewards, improved on state-of-the-art MTD performance, and eliminated the exploitable cycles characteristic of single-agent RL and myopic bandit-style baselines.

7. Significance and Application Scope

BSS-Q provides a theoretically justified, scalable solution for adaptive decision-making in sequential adversarial environments with incomplete information, specifically suited for cybersecurity MTD domains. It bridges MARL and game-theoretic learning for environments where the defender must account for type uncertainty and commit to strategies robust against strategic, best-responding attackers. The framework extends to settings lacking prior distributional knowledge over state transitions or rewards, addressing key challenges in practical security defense planning (Sengupta et al., 2020).
