
Multi-Agent Semi-Markov Decision Process (MSMDP)

Updated 30 January 2026
  • MSMDP is a formal framework for modeling trajectory-level decision-making among agents using temporally extended macro-actions.
  • It integrates hierarchical reinforcement learning with mean-field actor-critic methods to enable scalable strategy selection in complex environments like robotic soccer.
  • The framework employs high-level and low-level policies to optimize team rewards through spatial awareness and temporal abstraction, promoting advanced cooperation.

A Multi-Agent Semi-Markov Decision Process (MSMDP) is a formal framework for modeling trajectory-level decision-making among multiple agents where macro-actions (options) are temporally extended. In the hierarchical reinforcement learning (HRL) architecture proposed by Taourirte & Mia (2024), the MSMDP captures high-level strategy selection for teams of homogeneous agents, specifically applied to robotic soccer in adversarial, real-time, multi-agent virtual environments (Taourirte et al., 2 Dec 2025).

1. Mathematical Formulation of MSMDP

The MSMDP is structured as a tuple $(S, \{O^n\}_{n=1}^N, P^{(0)}, R^{(0)}, \gamma)$, defining the joint state space, option sets, transition kernel, reward function, and discount factor. At each high-level decision index $i$, the global state $s_i^T$ aggregates agents' and ball attributes:

  • $P \in \mathbb{R}^{N \times 2}$: teammates' $(x, y)$ positions
  • $P_{\text{goal}} \in \mathbb{R}^2$: ball position
  • $if_c \in \{0,1\}^N$: indicator vector for ball control per agent
  • $P^{oppo} \in \mathbb{R}^{N \times 2}$: opponents' positions

Each agent $n$ selects a high-level option from $O^n = \{o^n_k \mid k = 1, \ldots, 8\}$, corresponding to eight coarse trajectory directions over the next $\Delta$ low-level steps. Each option $o$ is defined by a triple $(I^o, \pi^o, \beta^o)$:

  • Initiation set $I^o = S$ (all states permit any directional option)
  • Intra-option policy $\pi^o(a_t \mid s_t)$, implemented via PPO at the low level
  • Termination $\beta^o(s_{i+\Delta}^T) = 1$ (fixed $\Delta$-step duration)

The semi-Markov kernel is $P(s' \mid s, o) = \Pr[s_{i+1}^T = s' \mid s_i^T = s, a_i^T = o]$, with $s'$ determined after $\tau = \Delta$ steps of low-level interaction, reflecting the options' temporal extension.
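The option structure above can be sketched in code. This is an illustrative rendering, not the authors' implementation: the class name, the fixed value of $\Delta$, and the mapping from direction index to a unit heading vector are all assumptions.

```python
import numpy as np

DELTA = 10        # fixed option duration in low-level steps (assumed value)
N_DIRECTIONS = 8  # eight coarse trajectory directions per agent (from the text)

class Option:
    """An option o = (I^o, pi^o, beta^o) with a fixed Delta-step duration."""

    def __init__(self, direction_index):
        self.direction = direction_index
        # Unit heading for the k-th of eight evenly spaced directions
        # (an assumed geometric interpretation of "coarse trajectory direction").
        angle = 2 * np.pi * direction_index / N_DIRECTIONS
        self.heading = np.array([np.cos(angle), np.sin(angle)])

    def initiation(self, state):
        # I^o = S: every state permits every directional option.
        return True

    def terminate(self, step_in_option):
        # beta^o(s_{i+Delta}^T) = 1: terminate exactly after Delta steps.
        return step_in_option >= DELTA
```

The intra-option policy $\pi^o$ itself is a learned PPO policy and is omitted here; the class captures only the initiation and termination structure of the triple.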

2. High-Level Reward Design

The high-level reward $R^{(0)}$ aggregates the instantaneous low-level team rewards $r_t$ accrued over the $\Delta$ steps of each option:

$r_t = r_t^{goal} + r_t^{close} + r_t^{control}$

Where:

  • $r_t^{goal} = 1_{\text{goal scored at } t}$
  • $r_t^{close} = 0.1 \sum_{n=1}^N 1/d_n(t)$, with $d_n$ the robot-to-ball distance
  • $r_t^{control} = 0.4 \cdot 1_{\text{robot holds ball}} + 0.1 \sum_{k \in \mathcal{N}(n)} 1/d_k^{goal}(t)$, where $d_k^{goal}$ denotes a non-holder's distance to the opponent goal and $\mathcal{N}(n)$ the neighborhood of agent $n$

Hence, for an option $o$ taken at state $s_i^T$ and ending at $s_{i+1}^T$,

$R^{(0)}(s_i^T, o) = \sum_{t=i\Delta}^{i\Delta+\Delta-1} r_t$

This design incentivizes scoring, ball proximity, and control, with explicit spatial awareness embedded.
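The three-part reward can be sketched as a single function. The coefficients (1, 0.1, 0.4, 0.1) follow the text; the function signature and variable names are illustrative assumptions.

```python
import numpy as np

def team_reward(goal_scored, robot_ball_dists, holder_index, nonholder_goal_dists):
    """One low-level step reward: r_t = r^goal + r^close + r^control."""
    # Scoring term: indicator of a goal at step t.
    r_goal = 1.0 if goal_scored else 0.0
    # Proximity term: 0.1 * sum_n 1/d_n(t), d_n = robot-to-ball distance.
    r_close = 0.1 * np.sum(1.0 / np.asarray(robot_ball_dists))
    # Control term: bonus if some robot holds the ball, plus a spacing
    # bonus for non-holders close to the opponent goal.
    r_control = 0.0
    if holder_index is not None:
        r_control = 0.4 + 0.1 * np.sum(1.0 / np.asarray(nonholder_goal_dists))
    return r_goal + r_close + r_control
```

For example, with no goal, two robots at distances 1 and 2 from the ball, and no holder, the reward is just the proximity term 0.1 · (1 + 0.5) = 0.15.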

3. Value Functions and Hierarchical Policy Objective

Let $\pi^T(o \mid s)$ denote the high-level stochastic policy over options. The HRL scheme aims to maximize the expected discounted sum of rewards over high-level steps:

$J(\theta^T) = \mathbb{E}_{o_i \sim \pi^T, s_i} \left[ \sum_{i=0}^\infty \gamma^i R^{(0)}(s_i, o_i) \right]$

This is equivalently expressed in low-level timesteps:

$\mathbb{E} \left[ \sum_{i=0}^\infty \gamma^{i\Delta} R^{(0)}(s_i, o_i) \right]$

Bellman equations for SMDP option and state value functions are:

$Q^T(s, o) = \mathbb{E}\left[ R^{(0)}(s, o) + \gamma^\Delta V^T(s') \mid s, o \right]$

$V^T(s) = \sum_{o} \pi^T(o \mid s) \, Q^T(s, o)$

These express the modular temporal abstraction underpinning high-level policy learning.
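A minimal tabular sketch of these Bellman equations, assuming small discrete state and option sets and illustrative values for $\gamma$ and $\Delta$ (the actual method uses function approximation, not tables):

```python
import numpy as np

GAMMA, DELTA = 0.99, 10  # assumed values for illustration

def smdp_backup(Q, pi, R, P):
    """One synchronous sweep of the SMDP Bellman equations:
        V^T(s)   = sum_o pi(o|s) Q(s,o)
        Q^T(s,o) = R(s,o) + gamma^Delta * sum_{s'} P(s'|s,o) V^T(s')
    Shapes: Q, pi, R -> (S, O); P -> (S, O, S).
    """
    # State value from the current option-value estimates.
    V = np.sum(pi * Q, axis=1)  # (S,)
    # Expectation over the semi-Markov kernel, discounted by gamma^Delta
    # because each option spans Delta low-level steps.
    return R + (GAMMA ** DELTA) * np.einsum('sot,t->so', P, V)
```

Iterating `smdp_backup` to a fixed point under a fixed $\pi^T$ yields $Q^T$; the $\gamma^\Delta$ factor is what distinguishes the SMDP backup from a standard one-step MDP backup.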

4. Mean-Field Actor-Critic Integration for Scalability

To scale the architecture to large $N$, agent $n$ approximates the collective impact of its peers via the mean option $\bar a_i^n = \frac{1}{|\mathcal{N}(n)|} \sum_{k \in \mathcal{N}(n)} a_i^k$. This simplifies multi-agent interactions to agent-vs-population averages and permits stable learning, as shown in robotic soccer experiments (Taourirte et al., 2 Dec 2025).

The high-level mean-field Q-update:

$Q^n(s, a^n, \bar a^n) \leftarrow (1-\alpha) \, Q^n(s, a^n, \bar a^n) + \alpha \left[ R^{(0)}(s, a^n) + \gamma^{\Delta} V^n(s') \right]$

where

$V^n(s') = \mathbb{E}_{o' \sim \pi^n(s', \cdot, \bar a')} \left[ Q^n(s', o', \bar a') \right]$

The mean-field policy gradient update:

$\nabla_{\theta^n} J^n(\theta^n) \approx \mathbb{E}_{s, a^n, \bar a^n} \left[ \nabla_{\theta^n} \log \pi^n(a^n \mid s, \bar a^n; \theta^n) \, Q^n(s, a^n, \bar a^n; w^n) \right]$

Concurrently, PPO optimizes each intra-option policy $\pi^o$ at the low level.
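The mean-option computation and the mean-field Q-update can be sketched as follows. The dict-based table, the one-hot averaging of neighbours' options, and the values of $\alpha$, $\gamma$, $\Delta$ are illustrative assumptions, not details from the paper.

```python
import numpy as np

ALPHA, GAMMA, DELTA = 0.1, 0.99, 10  # assumed hyperparameters

def mean_option(neighbor_options, n_options=8):
    """bar a^n: average of one-hot encodings of neighbours' chosen options."""
    onehots = np.eye(n_options)[neighbor_options]
    return onehots.mean(axis=0)

def mf_q_update(Q, s, a, a_bar_key, reward, V_next):
    """Q^n(s, a^n, bar a^n) <- (1-alpha) Q^n(...) + alpha [R + gamma^Delta V^n(s')].

    Q is a dict keyed by (state, option, discretized-mean-option) tuples,
    defaulting unseen entries to 0.
    """
    key = (s, a, a_bar_key)
    Q[key] = (1 - ALPHA) * Q.get(key, 0.0) + ALPHA * (reward + GAMMA ** DELTA * V_next)
    return Q[key]
```

Because each agent conditions only on its own option and the neighbourhood mean, the table grows with the option-set size rather than with the exponential joint action space, which is the point of the mean-field approximation.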

5. Learning and Execution Procedure

Training and execution of the MSMDP follow a hierarchical protocol. Each episode begins with environment reset and state observation. For each high-level time index $i$, all agents sample options conditioned on both the state and the neighborhood mean. These options execute for $\Delta$ steps, during which low-level actions are decided via PPO and the cumulative team reward is collected.

After execution, agents update high-level value and policy parameters:

  • High-level TD error:

$\delta_{n,i}^T = R_i + \gamma^\Delta V_n^T(s_{i+1}^T) - V_n^T(s_i^T)$

  • Critic gradient:

$w_n^T \leftarrow w_n^T + \beta \, \delta_{n,i}^T \, \nabla_{w_n^T} V_n^T(s_i^T)$

  • Actor gradient:

$\theta_n^T \leftarrow \theta_n^T + \alpha \, \delta_{n,i}^T \, \nabla_{\theta_n^T} \log \pi_n^T(a_{n,i}^T \mid s_i^T, \bar a_{n,i-1})$

  • Mean-field Q-table (or network) update:

$Q^n(s_i^T, a_i^n, \bar a_{i-1}^n) \leftarrow (1-\alpha) \, Q^n(\cdots) + \alpha \left[ R_{\text{sum}} + \gamma^\Delta V^n(s_{i+1}^T) \right]$
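The TD-error, critic, and actor steps above can be sketched with a linear critic and a softmax actor over options. The feature maps, learning rates, and the softmax parameterization are assumptions for illustration only.

```python
import numpy as np

ALPHA, BETA, GAMMA, DELTA = 0.01, 0.05, 0.99, 10  # assumed hyperparameters

def high_level_update(w, theta, phi_s, phi_s_next, R_i, option, phi_sa):
    """One high-level TD(0) step for agent n:
        delta = R_i + gamma^Delta V(s') - V(s)
        w     += beta  * delta * grad_w V(s)
        theta += alpha * delta * grad_theta log pi(option | s)
    phi_s, phi_s_next: state features; phi_sa: per-option features (O, d).
    """
    V, V_next = w @ phi_s, w @ phi_s_next
    delta = R_i + (GAMMA ** DELTA) * V_next - V   # high-level TD error
    w = w + BETA * delta * phi_s                  # critic: grad_w V(s) = phi(s)
    # Softmax actor: log-gradient is the chosen option's features minus
    # the probability-weighted average of all options' features.
    logits = phi_sa @ theta
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    grad_logpi = phi_sa[option] - probs @ phi_sa
    theta = theta + ALPHA * delta * grad_logpi
    return w, theta, delta
```

Each agent runs this update once per high-level index $i$, alongside the mean-field Q-update; note the $\gamma^\Delta$ discount, matching the semi-Markov structure of the options.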

The algorithm cycles through episodes, yielding performance metrics superior to non-hierarchical MARL and non-mean-field approaches (e.g., 5.93 avg. goals, 89.1% ball control, 92.3% passing accuracy in 4v4 Webots simulations).

6. Significance in Multi-Agent Reinforcement Learning

The MSMDP architecture directly addresses the curse of dimensionality in multi-agent RL, supports multi-granular temporal abstraction (macro-strategy vs. micro-execution), and is validated in adversarial stochastic domains. The mean-field actor-critic module provides a scalable trajectory planning solution, enabling robust cooperation and strategic behavior amongst many agents. A plausible implication is that this MSMDP-HRL paradigm generalizes to other domains characterized by frequent interaction, real-time constraints, and large homogeneous cohorts.

7. Summary Table: MSMDP Components (as formulated by Taourirte & Mia, 2024)

| Component | Symbol/Structure | Description |
| --- | --- | --- |
| Joint state space | $s_i^T = [P, P_{\text{goal}}, if_c, P^{oppo}]$ | Concatenated agent, ball, and opponent data |
| Option set | $O^n = \{o_k^n\}_{k=1}^8$ | Eight spatial directions per agent |
| Option definition | $(I^o, \pi^o, \beta^o)$ | Initiation, intra-option PPO, fixed duration |
| High-level reward | $R^{(0)}(s_i^T, o)$ | Sum of low-level team rewards over $\Delta$ steps |
| Mean-field update | $Q^n(s, a^n, \bar a^n)$ | Aggregated population impact in TD learning |

All components are instantiated to facilitate hierarchical, mean-field multi-agent RL as validated in robotic soccer simulation research (Taourirte et al., 2 Dec 2025).

References (1)
