
Multi-Agent Semi-Markov Decision Process (MSMDP)

Updated 30 January 2026
  • MSMDP is a formal framework for modeling trajectory-level decision-making among agents using temporally extended macro-actions.
  • It integrates hierarchical reinforcement learning with mean-field actor-critic methods to enable scalable strategy selection in complex environments like robotic soccer.
  • The framework employs high-level and low-level policies to optimize team rewards through spatial awareness and temporal abstraction, promoting advanced cooperation.

A Multi-Agent Semi-Markov Decision Process (MSMDP) is a formal framework for modeling trajectory-level decision-making among multiple agents where macro-actions (options) are temporally extended. In the hierarchical reinforcement learning (HRL) architecture proposed by Taourirte & Mia (2024), the MSMDP captures high-level strategy selection for teams of homogeneous agents, specifically applied to robotic soccer in adversarial, real-time, multi-agent virtual environments (Taourirte et al., 2 Dec 2025).

1. Mathematical Formulation of MSMDP

The MSMDP is structured as a tuple $(S, \{O^n\}_{n=1}^N, P^{(0)}, R^{(0)}, \gamma)$, defining the joint state space, option sets, transition kernel, reward function, and discount factor. At each high-level decision index $i$, the global state $s_i^T$ aggregates agents' and ball attributes:

  • $P \in \mathbb{R}^{N \times 2}$: teammates' $(x, y)$ positions
  • $P_{\text{goal}} \in \mathbb{R}^2$: ball position
  • $if_c \in \{0,1\}^N$: indicator vector for ball control per agent
  • $P^{oppo} \in \mathbb{R}^{N \times 2}$: opponents' positions

Each agent $n$ selects a high-level option from $O^n = \{o^n_k \mid k = 1, \ldots, 8\}$, corresponding to eight coarse trajectory directions over the next $\Delta$ low-level steps. Each option $o$ is defined by a triple $(I^o, \pi^o, \beta^o)$:

  • Initiation set $I^o = S$ (all states permit any directional option)
  • Intra-option policy $\pi^o(a_t \mid s_t)$, implemented via PPO at the low level
  • Termination $\beta^o(s_{i+\Delta}^T) = 1$ (fixed $\Delta$-step duration)

The semi-Markov kernel is $P(s' \mid s, o) = \Pr[s_{i+1}^T = s' \mid s_i^T = s, a_i^T = o]$, with $s'$ determined after $\tau = \Delta$ steps of low-level interaction, reflecting the options' temporal extension.
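The option structure above can be sketched in code. This is an illustrative rendering, not the authors' implementation: the class name, the fixed value of $\Delta$, and the mapping from direction index to a unit heading vector are all assumptions.

```python
import numpy as np

DELTA = 10        # fixed option duration in low-level steps (assumed value)
N_DIRECTIONS = 8  # eight coarse trajectory directions per agent (from the text)

class Option:
    """An option o = (I^o, pi^o, beta^o) with a fixed Delta-step duration."""

    def __init__(self, direction_index):
        self.direction = direction_index
        # Unit heading for the k-th of eight evenly spaced directions
        # (an assumed geometric interpretation of "coarse trajectory direction").
        angle = 2 * np.pi * direction_index / N_DIRECTIONS
        self.heading = np.array([np.cos(angle), np.sin(angle)])

    def initiation(self, state):
        # I^o = S: every state permits every directional option.
        return True

    def terminate(self, step_in_option):
        # beta^o(s_{i+Delta}^T) = 1: terminate exactly after Delta steps.
        return step_in_option >= DELTA
```

The intra-option policy $\pi^o$ itself is a learned PPO policy and is omitted here; the class captures only the initiation and termination structure of the triple.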

2. High-Level Reward Design

The high-level reward $R^{(0)}$ aggregates the instantaneous low-level team rewards $r_t$ accrued over the $\Delta$ steps of each option:

$r_t = r_t^{goal} + r_t^{close} + r_t^{control}$

Where:

  • $r_t^{goal} = 1_{\text{goal scored at } t}$
  • $r_t^{close} = 0.1 \sum_{n=1}^N 1/d_n(t)$, with $d_n$ the robot-to-ball distance
  • $r_t^{control} = 0.4 \cdot 1_{\text{robot holds ball}} + 0.1 \sum_{k \in \mathcal{N}(n)} 1/d_k^{goal}(t)$, where $d_k^{goal}$ denotes a non-holder's distance to the opponent goal and $\mathcal{N}(n)$ the neighborhood of agent $n$

Hence, for an option $o$ taken at state $s_i^T$ and ending at $s_{i+1}^T$,

$R^{(0)}(s_i^T, o) = \sum_{t=i\Delta}^{i\Delta+\Delta-1} r_t$

This design incentivizes scoring, ball proximity, and control, with explicit spatial awareness embedded.
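The three-part reward can be sketched as a single function. The coefficients (1, 0.1, 0.4, 0.1) follow the text; the function signature and variable names are illustrative assumptions.

```python
import numpy as np

def team_reward(goal_scored, robot_ball_dists, holder_index, nonholder_goal_dists):
    """One low-level step reward: r_t = r^goal + r^close + r^control."""
    # Scoring term: indicator of a goal at step t.
    r_goal = 1.0 if goal_scored else 0.0
    # Proximity term: 0.1 * sum_n 1/d_n(t), d_n = robot-to-ball distance.
    r_close = 0.1 * np.sum(1.0 / np.asarray(robot_ball_dists))
    # Control term: bonus if some robot holds the ball, plus a spacing
    # bonus for non-holders close to the opponent goal.
    r_control = 0.0
    if holder_index is not None:
        r_control = 0.4 + 0.1 * np.sum(1.0 / np.asarray(nonholder_goal_dists))
    return r_goal + r_close + r_control
```

For example, with no goal, two robots at distances 1 and 2 from the ball, and no holder, the reward is just the proximity term 0.1 · (1 + 0.5) = 0.15.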

3. Value Functions and Hierarchical Policy Objective

Let $\pi^T(o \mid s)$ denote the high-level stochastic policy over options. The HRL scheme aims to maximize the expected discounted sum of rewards over high-level steps:

$J(\theta^T) = \mathbb{E}_{o_i \sim \pi^T, s_i} \left[ \sum_{i=0}^\infty \gamma^i R^{(0)}(s_i, o_i) \right]$

This is equivalently expressed in low-level timesteps:

$\mathbb{E} \left[ \sum_{i=0}^\infty \gamma^{i\Delta} R^{(0)}(s_i, o_i) \right]$

Bellman equations for SMDP option and state value functions are:

$Q^T(s, o) = \mathbb{E}\left[ R^{(0)}(s, o) + \gamma^\Delta V^T(s') \mid s, o \right]$

$V^T(s) = \sum_{o} \pi^T(o \mid s) \, Q^T(s, o)$

These express the modular temporal abstraction underpinning high-level policy learning.
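A minimal tabular sketch of these Bellman equations, assuming small discrete state and option sets and illustrative values for $\gamma$ and $\Delta$ (the actual method uses function approximation, not tables):

```python
import numpy as np

GAMMA, DELTA = 0.99, 10  # assumed values for illustration

def smdp_backup(Q, pi, R, P):
    """One synchronous sweep of the SMDP Bellman equations:
        V^T(s)   = sum_o pi(o|s) Q(s,o)
        Q^T(s,o) = R(s,o) + gamma^Delta * sum_{s'} P(s'|s,o) V^T(s')
    Shapes: Q, pi, R -> (S, O); P -> (S, O, S).
    """
    # State value from the current option-value estimates.
    V = np.sum(pi * Q, axis=1)  # (S,)
    # Expectation over the semi-Markov kernel, discounted by gamma^Delta
    # because each option spans Delta low-level steps.
    return R + (GAMMA ** DELTA) * np.einsum('sot,t->so', P, V)
```

Iterating `smdp_backup` to a fixed point under a fixed $\pi^T$ yields $Q^T$; the $\gamma^\Delta$ factor is what distinguishes the SMDP backup from a standard one-step MDP backup.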

4. Mean-Field Actor-Critic Integration for Scalability

To scale the architecture to large $N$, agent $n$ approximates the collective impact of its peers via the mean option $\bar a_i^n = \frac{1}{|\mathcal{N}(n)|} \sum_{k \in \mathcal{N}(n)} a_i^k$. This simplifies multi-agent interactions to agent-vs-population averages and permits stable learning, as shown in robotic soccer experiments (Taourirte et al., 2 Dec 2025).

The high-level mean-field Q-update:

$Q^n(s, a^n, \bar a^n) \leftarrow (1-\alpha) \, Q^n(s, a^n, \bar a^n) + \alpha \left[ R^{(0)}(s, a^n) + \gamma^{\Delta} V^n(s') \right]$

where

$V^n(s') = \mathbb{E}_{o' \sim \pi^n(s', \cdot, \bar a')} \left[ Q^n(s', o', \bar a') \right]$

The mean-field policy gradient update:

$\nabla_{\theta^n} J^n(\theta^n) \approx \mathbb{E}_{s, a^n, \bar a^n} \left[ \nabla_{\theta^n} \log \pi^n(a^n \mid s, \bar a^n; \theta^n) \, Q^n(s, a^n, \bar a^n; w^n) \right]$

Concurrently, PPO optimizes each intra-option policy $\pi^o$ at the low level.
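The mean-option computation and the mean-field Q-update can be sketched as follows. The dict-based table, the one-hot averaging of neighbours' options, and the values of $\alpha$, $\gamma$, $\Delta$ are illustrative assumptions, not details from the paper.

```python
import numpy as np

ALPHA, GAMMA, DELTA = 0.1, 0.99, 10  # assumed hyperparameters

def mean_option(neighbor_options, n_options=8):
    """bar a^n: average of one-hot encodings of neighbours' chosen options."""
    onehots = np.eye(n_options)[neighbor_options]
    return onehots.mean(axis=0)

def mf_q_update(Q, s, a, a_bar_key, reward, V_next):
    """Q^n(s, a^n, bar a^n) <- (1-alpha) Q^n(...) + alpha [R + gamma^Delta V^n(s')].

    Q is a dict keyed by (state, option, discretized-mean-option) tuples,
    defaulting unseen entries to 0.
    """
    key = (s, a, a_bar_key)
    Q[key] = (1 - ALPHA) * Q.get(key, 0.0) + ALPHA * (reward + GAMMA ** DELTA * V_next)
    return Q[key]
```

Because each agent conditions only on its own option and the neighbourhood mean, the table grows with the option-set size rather than with the exponential joint action space, which is the point of the mean-field approximation.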

5. Learning and Execution Procedure

Training and execution of the MSMDP follow a hierarchical protocol. Each episode begins with environment reset and state observation. For each high-level time index $i$, all agents sample options conditioned on both the state and the neighborhood mean. These options execute for $\Delta$ steps, during which low-level actions are decided via PPO and the cumulative team reward is collected.

After execution, agents update high-level value and policy parameters:

  • High-level TD error:

$\delta_{n,i}^T = R_i + \gamma^\Delta V_n^T(s_{i+1}^T) - V_n^T(s_i^T)$

  • Critic gradient:

$w_n^T \leftarrow w_n^T + \beta \, \delta_{n,i}^T \, \nabla_{w_n^T} V_n^T(s_i^T)$

  • Actor gradient:

$\theta_n^T \leftarrow \theta_n^T + \alpha \, \delta_{n,i}^T \, \nabla_{\theta_n^T} \log \pi_n^T(a_{n,i}^T \mid s_i^T, \bar a_{n,i-1})$

  • Mean-field Q-table (or network) update:

$Q^n(s_i^T, a_i^n, \bar a_{i-1}^n) \leftarrow (1-\alpha) \, Q^n(\cdots) + \alpha \left[ R_{\text{sum}} + \gamma^\Delta V^n(s_{i+1}^T) \right]$
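The TD-error, critic, and actor steps above can be sketched with a linear critic and a softmax actor over options. The feature maps, learning rates, and the softmax parameterization are assumptions for illustration only.

```python
import numpy as np

ALPHA, BETA, GAMMA, DELTA = 0.01, 0.05, 0.99, 10  # assumed hyperparameters

def high_level_update(w, theta, phi_s, phi_s_next, R_i, option, phi_sa):
    """One high-level TD(0) step for agent n:
        delta = R_i + gamma^Delta V(s') - V(s)
        w     += beta  * delta * grad_w V(s)
        theta += alpha * delta * grad_theta log pi(option | s)
    phi_s, phi_s_next: state features; phi_sa: per-option features (O, d).
    """
    V, V_next = w @ phi_s, w @ phi_s_next
    delta = R_i + (GAMMA ** DELTA) * V_next - V   # high-level TD error
    w = w + BETA * delta * phi_s                  # critic: grad_w V(s) = phi(s)
    # Softmax actor: log-gradient is the chosen option's features minus
    # the probability-weighted average of all options' features.
    logits = phi_sa @ theta
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    grad_logpi = phi_sa[option] - probs @ phi_sa
    theta = theta + ALPHA * delta * grad_logpi
    return w, theta, delta
```

Each agent runs this update once per high-level index $i$, alongside the mean-field Q-update; note the $\gamma^\Delta$ discount, matching the semi-Markov structure of the options.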

The algorithm cycles through episodes, yielding performance metrics superior to non-hierarchical MARL and non-mean-field approaches (e.g., 5.93 avg. goals, 89.1% ball control, 92.3% passing accuracy in 4v4 Webots simulations).

6. Significance in Multi-Agent Reinforcement Learning

The MSMDP architecture directly addresses the curse of dimensionality in multi-agent RL, supports multi-granular temporal abstraction (macro-strategy vs. micro-execution), and is validated in adversarial stochastic domains. The mean-field actor-critic module provides a scalable trajectory planning solution, enabling robust cooperation and strategic behavior amongst many agents. A plausible implication is that this MSMDP-HRL paradigm generalizes to other domains characterized by frequent interaction, real-time constraints, and large homogeneous cohorts.

7. Summary Table: MSMDP Components (as formulated by Taourirte & Mia, 2024)

| Component | Symbol/Structure | Description |
| --- | --- | --- |
| Joint state space | $s_i^T = [P, P_{\text{goal}}, if_c, P^{oppo}]$ | Concatenated agent, ball, and opponent data |
| Option set | $O^n = \{o_k^n\}_{k=1}^8$ | Eight spatial directions per agent |
| Option definition | $(I^o, \pi^o, \beta^o)$ | Initiation, intra-option PPO, fixed duration |
| High-level reward | $R^{(0)}(s_i^T, o)$ | Sum of low-level team rewards over $\Delta$ steps |
| Mean-field update | $Q^n(s, a^n, \bar a^n)$ | Aggregated population impact in TD learning |

All components are instantiated to facilitate hierarchical, mean-field multi-agent RL as validated in robotic soccer simulation research (Taourirte et al., 2 Dec 2025).

References (1)
