Bayesian Strong Stackelberg Q-learning
- Bayesian Strong Stackelberg Q-learning (BSS-Q) is a multi-agent reinforcement learning method that computes optimal leader policies in games under uncertainty.
- It iteratively approximates Strong Stackelberg Equilibria by updating type-conditional Q-tables and beliefs using online Bayesian updates.
- BSS-Q ensures convergence to equilibrium strategies in sequential adversarial settings and offers scalable solutions for cybersecurity Moving Target Defense.
Bayesian Strong Stackelberg Q-learning (BSS-Q) is a multi-agent reinforcement learning (MARL) methodology designed to learn optimal leader policies in the setting of Bayesian Stackelberg Markov Games (BSMGs). Within this framework, strategic leader-follower (defender-attacker) interactions occur under uncertainty regarding the follower’s (attacker’s) type, reflecting critical requirements in domains such as Moving Target Defense (MTD) for cybersecurity. BSS-Q provides the first scalable, provably convergent approach for learning Strong Stackelberg Equilibrium (SSE) policies in sequential incomplete-information settings, operating without prior knowledge of model rewards or transitions (Sengupta et al., 2020).
1. Bayesian Stackelberg Markov Games: Preliminaries
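A Bayesian Stackelberg Markov Game (BSMG) is a two-player, turn-based stochastic game formalized as the tuple

$$\langle S, \Theta, \{A^D, A^\theta\}, T, \{R^D, R^\theta\}, \mu_0, \gamma \rangle$$

with the following elements:
- $S$: Finite set of states $s \in S$.
- $\Theta$: Finite set of attacker (follower) types $\theta \in \Theta$; the defender maintains a state-conditional belief $b_s(\theta)$ with $\sum_{\theta \in \Theta} b_s(\theta) = 1$.
- $A^D(s)$, $A^\theta(s)$: Defender (leader) and attacker type-$\theta$ action sets at state $s$; individual actions are written $a^D \in A^D(s)$ and $a^\theta \in A^\theta(s)$.
- $T(s' \mid s, a^D, a^\theta)$: Stochastic transition kernel.
- $R^D(s, a^D, a^\theta)$, $R^\theta(s, a^D, a^\theta)$: Immediate rewards for leader and follower type $\theta$.
- $\mu_0$: Prior on start state and follower type.
- $\gamma \in [0, 1)$: Discount factor.
The leader selects a mixed strategy $x(s) \in \Delta(A^D(s))$; the follower observes this commitment, selects a pure best response, and transitions follow $T$. Payoffs for both agents are discounted sum utilities.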
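To make the discounted-sum utility concrete, the following minimal sketch computes the return of a finite reward sequence. The helper name `discounted_return` and the sample rewards are illustrative assumptions, not part of the original formulation.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma**t * rewards[t] over a finite trajectory (illustrative helper)."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    total = 0.0
    # Accumulate from the last reward backwards: one multiply and one add per step.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Three unit rewards with gamma = 0.5: 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
```

In a BSMG each player accumulates its own reward stream this way, so the same trajectory generally yields different returns for the leader and for each follower type.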
2. Strong Stackelberg Equilibrium in BSMGs
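The Strong Stackelberg Equilibrium (SSE) for BSMGs comprises stationary Markovian policies:
- $x = \{x(s)\}_{s \in S}$: leader’s mixed strategies.
- $q^\theta = \{q^\theta(s)\}_{s \in S}$: type-$\theta$ follower’s pure strategies.
Writing $V^D_{x,q}(s)$ and $V^\theta_{x,q}(s)$ for the expected discounted returns of the leader and of the type-$\theta$ follower from state $s$ under $(x, q)$, the SSE conditions at each state $s$ are:
a) Best Response: For each type $\theta$, $q^\theta \in \mathrm{BR}^\theta(x) := \arg\max_{q'} V^\theta_{x,q'}(s)$, i.e., optimal for follower type-$\theta$ given leader’s $x$.
b) Tie-Breaking Favoring Leader: If multiple best responses exist ($|\mathrm{BR}^\theta(x)| > 1$), select $q^\theta \in \arg\max_{q' \in \mathrm{BR}^\theta(x)} V^D_{x,q'}(s)$.
Existence and uniqueness of the leader’s equilibrium value are guaranteed for finite $S$, $\Theta$, and action sets $A^D$, $A^\theta$. The SSE characterization ensures strategic leader commitment accounting for best-responding followers under type uncertainty.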
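The two SSE conditions can be illustrated on a single stage game. The sketch below assumes the payoffs are given as plain matrices indexed `[leader_action][follower_action]` (in BSS-Q these would be Q-values); the function names and the example matrices are hypothetical.

```python
from typing import List, Sequence


def expected_payoffs(x: Sequence[float], payoff: Sequence[Sequence[float]]) -> List[float]:
    """Expected payoff of each follower action when the leader mixes with x."""
    n_follower = len(payoff[0])
    return [sum(x[i] * payoff[i][j] for i in range(len(x))) for j in range(n_follower)]


def sse_best_response(
    x: Sequence[float],
    follower_payoff: Sequence[Sequence[float]],
    leader_payoff: Sequence[Sequence[float]],
    tol: float = 1e-9,
) -> int:
    """Follower's pure best response to x, with ties broken in the leader's favor."""
    f_vals = expected_payoffs(x, follower_payoff)
    l_vals = expected_payoffs(x, leader_payoff)
    best = max(f_vals)
    # Condition (a): keep every action within tol of the follower's optimum.
    ties = [j for j, v in enumerate(f_vals) if v >= best - tol]
    # Condition (b): among those, pick the action the leader prefers.
    return max(ties, key=lambda j: l_vals[j])


follower = [[1.0, 0.0], [0.0, 1.0]]
leader = [[2.0, 4.0], [1.0, 3.0]]
# The follower is indifferent under a 50/50 mix, so the leader's preference decides.
print(sse_best_response([0.5, 0.5], follower, leader))  # 1
```

With several follower types, the same routine would be applied once per type using that type's payoff matrices.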
3. The Bayesian Strong Stackelberg Q-learning Algorithm
BSS-Q is a joint learning procedure that iteratively approximates SSE policies and value functions through the interaction loop. Main components:
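- Type-Conditional Q-Tables for Leader: $Q^{D,\theta}(s, a^D, a^\theta)$, estimating discounted rewards when the leader plays $a^D$ and the type-$\theta$ follower plays $a^\theta$.
- Follower Q-Tables: $Q^\theta(s, a^D, a^\theta)$ for each attacker type.
- Belief Tracking: $b_s(\theta)$, updated online via Bayes’ rule if required.
The learning proceeds as follows:
- Observe $s$, sample attacker type $\theta \sim b_s(\cdot)$.
- Actions: $a^D$ via $\epsilon$-greedy w.r.t. leader strategy $x(s)$; $a^\theta$ via $\epsilon$-greedy w.r.t. follower policy $q^\theta(s)$.
- Execute $(a^D, a^\theta)$, observe rewards $(r^D, r^\theta)$ and next state $s'$.
- Compute the stage-game SSE at $s'$ using current Q-tables as payoff matrices; obtain $(x(s'), q^\theta(s'))$ and equilibrium values $V^{D,\theta}(s')$, $V^\theta(s')$.
- Q-Table Updates:

$$Q^{D,\theta}(s, a^D, a^\theta) \leftarrow (1 - \alpha)\, Q^{D,\theta}(s, a^D, a^\theta) + \alpha \left[ r^D + \gamma V^{D,\theta}(s') \right]$$

$$Q^{\theta}(s, a^D, a^\theta) \leftarrow (1 - \alpha)\, Q^{\theta}(s, a^D, a^\theta) + \alpha \left[ r^\theta + \gamma V^{\theta}(s') \right]$$

- Optionally update belief $b_{s'}(\theta)$ by Bayes’ rule using observed $a^\theta$.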
This iterative process enables principled policy improvement in adversarial settings with incomplete information about the follower’s type.
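The Q-table and belief updates in the loop above can be sketched as follows. This is a minimal illustration under stated assumptions: the stage-game SSE values at the next state are taken as inputs (the solver itself is not shown), tables are dictionaries keyed by `(state, type, leader_action, follower_action)`, and the likelihood of the observed attacker action under each type is supplied by the caller. All names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Tuple

Key = Tuple[Hashable, Hashable, Hashable, Hashable]  # (state, type, a_D, a_theta)


def bssq_update(
    q_leader: Dict[Key, float],
    q_follower: Dict[Key, float],
    key: Key,
    r_leader: float,
    r_follower: float,
    v_leader_next: float,    # leader's SSE value at the next state (assumed given)
    v_follower_next: float,  # follower's SSE value at the next state (assumed given)
    alpha: float,
    gamma: float,
) -> None:
    """Move both Q-entries toward reward plus discounted next-state SSE value."""
    q_leader[key] = (1.0 - alpha) * q_leader[key] + alpha * (r_leader + gamma * v_leader_next)
    q_follower[key] = (1.0 - alpha) * q_follower[key] + alpha * (r_follower + gamma * v_follower_next)


def bayes_belief_update(belief: Dict[Hashable, float], likelihood: Dict[Hashable, float]) -> Dict[Hashable, float]:
    """Posterior over types, given P(observed action | type) for each type."""
    unnormalized = {t: belief[t] * likelihood[t] for t in belief}
    z = sum(unnormalized.values())
    if z == 0.0:
        return dict(belief)  # observation impossible under every type: keep the prior
    return {t: p / z for t, p in unnormalized.items()}


q_d, q_f = defaultdict(float), defaultdict(float)
step = ("s0", "typeA", "config1", "exploit7")
bssq_update(q_d, q_f, step, r_leader=1.0, r_follower=-1.0,
            v_leader_next=2.0, v_follower_next=0.0, alpha=0.5, gamma=0.9)
posterior = bayes_belief_update({"typeA": 0.5, "typeB": 0.5}, {"typeA": 0.9, "typeB": 0.1})
```

Passing the SSE values in as arguments keeps the update rule separate from the choice of stage-game solver, which is where most of the per-step cost lies.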
4. Convergence Properties
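Under standard stochastic-approximation conditions (step-sizes $\alpha_t$ with $\sum_t \alpha_t = \infty$, $\sum_t \alpha_t^2 < \infty$; infinite visitation to each $(s, \theta, a^D, a^\theta)$-tuple; $\gamma \in [0, 1)$), BSS-Q's Q-tables converge almost surely to the unique fixed point of the Bellman operator defined by the stage-game SSE. The policies thus induced converge to a stationary SSE of the underlying BSMG.
The underlying Bellman operator for the leader against follower type $\theta$ is:

$$(\mathcal{T} Q^{D,\theta})(s, a^D, a^\theta) = R^D(s, a^D, a^\theta) + \gamma \sum_{s'} T(s' \mid s, a^D, a^\theta)\, V^{D,\theta}(s')$$

where $V^{D,\theta}(s')$ is derived from the Q-tables via SSE computation. The operator $\mathcal{T}$ is a $\gamma$-contraction in the sup-norm due to discounting and leader-favoring tie-breaking. Standard stochastic approximation guarantees convergence to fixed-point policies $(x^*, q^*)$.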
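As a worked example of the step-size requirement, assuming the usual Robbins-Monro form (divergent sum, finite sum of squares), the schedule $\alpha_t = 1/t$ qualifies:

$$
\sum_{t=1}^{\infty} \frac{1}{t} = \infty, \qquad \sum_{t=1}^{\infty} \frac{1}{t^2} = \frac{\pi^2}{6} < \infty .
$$

A constant step-size $\alpha_t = c > 0$ does not, since $\sum_{t} c^2 = \infty$; such a schedule can still be useful in practice but falls outside the stated guarantee.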
5. Computational Complexity and Scalability
BSS-Q's storage and computation are dominated by:
- Q-table Storage: $2|\Theta|$ Q-tables (a leader table and a follower table for each type), each of size $|S| \times |A^D| \times |A^\theta|$.
- Per-Step SSE Computation: At each transition, a compact Bayesian Stackelberg normal-form (stage-)game with $|A^D(s)|$ leader actions and $|A^\theta(s)|$ actions for each of the $|\Theta|$ follower types must be solved. Exact SSE computation is NP-hard in the number of types and follower actions, but a practical encoding as a MILP is tractable up to hundreds of actions, yielding solutions in milliseconds to seconds.
Compared to single-agent RL (with $O(|S||A|)$ storage and $O(|A|)$ policy improvement per step), BSS-Q adds game-solving overhead at every step. Empirically, BSS-Q is feasible for cybersecurity MTD domains where the number of attacker types is small and the total number of attacker actions is in the hundreds.
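For a rough sense of the storage involved, the sketch below counts Q-table entries using the sizes reported for the web-application scenario in Section 6 (4 states, 4 defender actions, and 269, 34, and 48 attacker actions for the three types). It assumes one leader and one follower table per type and that every action is available in every state; both are simplifying assumptions, and the function name is hypothetical.

```python
from typing import Dict


def q_table_entries(
    n_states: int,
    n_defender_actions: int,
    attacker_actions_by_type: Dict[str, int],
    tables_per_type: int = 2,  # assumption: one leader and one follower table per type
) -> int:
    """Total number of stored Q-values across all type-conditional tables."""
    return sum(
        tables_per_type * n_states * n_defender_actions * n_attacker
        for n_attacker in attacker_actions_by_type.values()
    )


web_app = {"DB-expert": 269, "ScriptKiddie": 34, "BlackHat": 48}
print(q_table_entries(n_states=4, n_defender_actions=4, attacker_actions_by_type=web_app))  # 11232
```

At this scale storage is negligible, which is consistent with the per-step SSE computation being the dominant cost.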
6. Empirical Evaluation
BSS-Q was evaluated in two MTD scenarios within an OpenAI Gym-style environment, without prior knowledge of the transition or reward model.
A. Web-Application Stack Defense
- States: 4 configurations ($|S| = 4$).
- 3 attacker types: DB-expert (269 actions), ScriptKiddie (34), BlackHat (48), action sets derived from CVEs.
- Defender actions: 4 configurations; switching and attack costs from empirical latency and CVSS scores.
B. Cloud-Network IDS Placement
- States correspond to attack graph levels ($|S| = 4$).
- Single attacker type, with a state-dependent number of actions.
- Defender actions: host-based or network-based IDS placement/removal.
- Rewards from CVSS; transitions from attack graph structure.
Baselines
- Uniform Random Strategy (URS).
- Bayesian EXP-Q (bandit approach ignoring strategic best responses).
- Nash-Q (computes Nash equilibria at each stage).
- State-agnostic “optimal” (S-OPT) from a two-stage BSG formulation.
Metrics & Results
| Scenario | States | # Attacker Types | BSS-Q Performance | Baselines Performance | Planning Time |
|---|---|---|---|---|---|
| Web-Application Stack | 4 | 3 | Converged near FI-SSE, eliminated cycles | URS, B-EXP-Q, S-OPT were outperformed in at least 2 states; SA-RL cycles | BSS-Q: ~150 s/episode; B-EXP-Q: ~220 s/episode |
| Cloud-Network IDS | 4 | 1 | Outperformed URS, EXP-Q in 2/3 non-terminal states, matched Nash-Q | Nash-Q matched/slightly exceeded, Nash-Q sensitive to equiv. | URS: ~100 s/episode; Nash-Q failed to scale |
BSS-Q converged rapidly to SSE-like rewards, improved on state-of-the-art MTD performance, and eliminated the exploitable cycles characteristic of single-agent RL and myopic bandit-style baselines.
7. Significance and Application Scope
BSS-Q provides a theoretically justified, scalable solution for adaptive decision-making in sequential adversarial environments with incomplete information, specifically suited for cybersecurity MTD domains. It bridges MARL and game-theoretic learning for environments where the defender must account for type uncertainty and commit to strategies robust against strategic, best-responding attackers. The framework extends to settings lacking prior distributional knowledge over state transitions or rewards, addressing key challenges in practical security defense planning (Sengupta et al., 2020).