Multi-Agent Semi-Markov Decision Process (MSMDP)
- MSMDP is a formal framework for modeling trajectory-level decision-making among agents using temporally extended macro-actions.
- It integrates hierarchical reinforcement learning with mean-field actor-critic methods to enable scalable strategy selection in complex environments like robotic soccer.
- The framework employs high-level and low-level policies to optimize team rewards through spatial awareness and temporal abstraction, promoting advanced cooperation.
A Multi-Agent Semi-Markov Decision Process (MSMDP) is a formal framework for modeling trajectory-level decision-making among multiple agents where macro-actions (options) are temporally extended. In the hierarchical reinforcement learning (HRL) architecture proposed by Taourirte & Mia (2024), the MSMDP captures high-level strategy selection for teams of homogeneous agents, specifically applied to robotic soccer in adversarial, real-time, multi-agent virtual environments (Taourirte et al., 2 Dec 2025).
1. Mathematical Formulation of MSMDP
The MSMDP is structured as a tuple $\langle \mathcal{S}, \mathcal{O}, P, R, \gamma \rangle$, defining the joint state space, option set, transition kernel, reward function, and discount factor. At each high-level decision index $t$, the global state $s_t$ aggregates agents' and ball attributes:
- $x_t^{1:N}$: Teammates' positions
- $x_t^{ball}$: Ball position
- $c_t \in \{0,1\}^N$: Indicator vector for ball control per agent
- $x_t^{opp}$: Opponents' positions
Each agent $j$ selects a high-level option $o_t^j$ from $\mathcal{O}$, with $|\mathcal{O}| = 8$ options corresponding to eight coarse trajectory directions over the next $k$ low-level steps. Each option is defined by a triple $(\mathcal{I}_o, \pi_o, \beta_o)$:
- Initiation set $\mathcal{I}_o = \mathcal{S}$ (all states permit any directional option)
- Intra-option policy $\pi_o$, implemented via PPO at the low level
- Termination condition $\beta_o$ (fixed $k$-step duration)
The semi-Markov kernel is $P(s_{t+1} \mid s_t, o_t)$, with $s_{t+1}$ determined after $k$ steps of low-level interaction, reflecting the options' temporal extension.
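As a minimal sketch of this temporal extension (the environment interface, toy policy, and discount value below are illustrative assumptions, not the paper's implementation), a fixed-duration directional option runs $k$ low-level steps before control returns to the high level:

```python
def execute_option(state, direction, k, low_level_policy, env_step, gamma=0.99):
    """Run one temporally extended option: k low-level steps toward a
    coarse direction, accumulating the discounted team reward."""
    total_reward = 0.0
    for tau in range(k):
        action = low_level_policy(state, direction)   # intra-option policy pi_o
        state, reward = env_step(state, action)       # one low-level transition
        total_reward += (gamma ** tau) * reward       # discounted accumulation
    return state, total_reward                        # s_{t+1} after k steps

# Toy 1-D environment: the state is a position, actions shift it.
def toy_env_step(state, action):
    new_state = state + action
    return new_state, -abs(new_state)                 # reward: stay near the origin

def toy_policy(state, direction):
    return direction                                  # always step in the chosen direction

s_next, R = execute_option(0.0, 0.1, k=5, low_level_policy=toy_policy,
                           env_step=toy_env_step)
```

The key structural point is that the caller only sees the post-option state `s_next` and the aggregate reward `R`, matching the semi-Markov kernel's $k$-step granularity.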
2. High-Level Reward Design
The high-level reward $R(s_t, o_t)$ aggregates the instantaneous low-level team rewards $r_\tau$ accrued over the $k$ steps of each option, where the low-level team reward combines a scoring term with two spatial shaping terms:
- $d^{ball}_j$, the robot-to-ball distance of agent $j$
- $d^{goal}_j$, a non-holder's distance to the opponent goal, evaluated over the neighborhood $\mathcal{N}(j)$ of agent $j$

Hence, for an option taken at state $s_t$ and ending at $s_{t+1}$,

$$R(s_t, o_t) = \sum_{\tau=0}^{k-1} \gamma^{\tau}\, r_{tk+\tau}.$$
This design incentivizes scoring, ball proximity, and control, with explicit spatial awareness embedded.
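A distance-based shaping of this kind can be sketched as follows; the weights `w_ball`, `w_goal`, and `goal_bonus` are illustrative placeholders, since the exact coefficients are not given in the text:

```python
import math

def team_reward(positions, ball, goal, holder,
                w_ball=0.1, w_goal=0.1, goal_bonus=10.0, scored=False):
    """Illustrative shaped team reward: proximity to the ball for all agents,
    non-holders' proximity to the opponent goal, and a large scoring bonus."""
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])
    r = goal_bonus if scored else 0.0
    for j, p in enumerate(positions):
        r -= w_ball * dist(p, ball)        # ball-proximity term for every agent
        if j != holder:
            r -= w_goal * dist(p, goal)    # non-holders push toward the goal
    return r

r_step = team_reward([(0.0, 0.0), (1.0, 0.0)],
                     ball=(0.0, 0.0), goal=(2.0, 0.0), holder=0)
```

Negating distances turns "get close" objectives into rewards, so the shaped signal is dense even between goals.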
3. Value Functions and Hierarchical Policy Objective
Let $\mu(o \mid s)$ denote the high-level stochastic policy over options. The HRL scheme aims to maximize the expected discounted sum of high-level rewards collected every $k$ low-level steps:

$$J(\mu) = \mathbb{E}_{\mu}\!\left[\sum_{t=0}^{\infty} (\gamma^{k})^{t}\, R(s_t, o_t)\right].$$

This is equivalently expressed in low-level timesteps:

$$J(\mu) = \mathbb{E}\!\left[\sum_{\tau=0}^{\infty} \gamma^{\tau}\, r_\tau\right].$$

Bellman equations for SMDP option and state value functions are:

$$Q^{\mu}(s, o) = R(s, o) + \gamma^{k} \sum_{s'} P(s' \mid s, o)\, V^{\mu}(s'), \qquad V^{\mu}(s) = \sum_{o} \mu(o \mid s)\, Q^{\mu}(s, o).$$
These express the modular temporal abstraction underpinning high-level policy learning.
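A tabular sketch of the corresponding one-step SMDP backup (assuming discrete states and options; the step size and discount below are illustrative) makes the $\gamma^{k}$ bootstrap explicit:

```python
import numpy as np

def smdp_q_backup(Q, mu, s, o, R, s_next, k, gamma=0.99, alpha=0.1):
    """One SMDP TD(0) backup: the bootstrap term is discounted by gamma**k
    because the option spans k low-level steps, not one."""
    v_next = np.dot(mu[s_next], Q[s_next])      # V(s') = sum_o mu(o|s') Q(s',o)
    td_target = R + (gamma ** k) * v_next
    Q[s, o] += alpha * (td_target - Q[s, o])
    return Q

Q = np.zeros((2, 2))
mu = np.full((2, 2), 0.5)                       # uniform high-level policy
Q = smdp_q_backup(Q, mu, s=0, o=1, R=1.0, s_next=1, k=5)
```

With a zero-initialized table the bootstrap vanishes, so the first backup moves $Q(s,o)$ by $\alpha R$ only.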
4. Mean-Field Actor-Critic Integration for Scalability
To scale the architecture to large team sizes $N$, each agent $j$ approximates the collective impact of its peers via the mean option $\bar{o}^{\,j} = \frac{1}{|\mathcal{N}(j)|}\sum_{i \in \mathcal{N}(j)} o^{i}$, with options encoded as one-hot vectors. This simplifies multi-agent interactions to agent-vs-population averages and permits stable learning, as shown in robotic soccer experiments (Taourirte et al., 2 Dec 2025).
The high-level mean-field Q-update:

$$Q^{j}(s, o^{j}, \bar{o}^{\,j}) \leftarrow Q^{j}(s, o^{j}, \bar{o}^{\,j}) + \alpha\left[R^{j} + \gamma^{k}\, v^{j}(s') - Q^{j}(s, o^{j}, \bar{o}^{\,j})\right],$$

where the mean-field value is

$$v^{j}(s') = \sum_{o^{j}} \mu^{j}(o^{j} \mid s', \bar{o}^{\,j})\, Q^{j}(s', o^{j}, \bar{o}^{\,j}).$$

The mean-field policy gradient update:

$$\nabla_{\theta^{j}} J(\theta^{j}) = \mathbb{E}\left[\nabla_{\theta^{j}} \log \mu^{j}_{\theta^{j}}(o^{j} \mid s, \bar{o}^{\,j})\, Q^{j}(s, o^{j}, \bar{o}^{\,j})\right].$$
Concurrently, PPO optimizes each intra-option policy at the low level.
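The mean option itself is just an average of neighbors' one-hot option encodings. A minimal sketch, assuming the neighborhood is all other teammates (the function name and interface are illustrative):

```python
import numpy as np

def mean_option(options, j, num_options):
    """Mean-field summary for agent j: the average of its neighbors'
    one-hot option encodings, excluding agent j itself."""
    onehots = np.eye(num_options)[options]      # (N, |O|) one-hot matrix
    mask = np.ones(len(options), dtype=bool)
    mask[j] = False                             # drop agent j from the average
    return onehots[mask].mean(axis=0)

# Four agents over eight directional options; agent 0's neighbors chose 1, 1, 3.
bar_o = mean_option(np.array([0, 1, 1, 3]), j=0, num_options=8)
```

The result is a probability vector over $\mathcal{O}$, so the Q-function's population input has fixed size regardless of $N$, which is the source of the scalability claim.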
5. Learning and Execution Procedure
Training and execution of the MSMDP follow a hierarchical protocol. Each episode begins with an environment reset and state observation. At each high-level time index $t$, every agent $j$ samples an option $o_t^{j} \sim \mu^{j}(\cdot \mid s_t, \bar{o}^{\,j})$, conditioned on both the state and the neighborhood mean option. These options execute for $k$ steps, during which low-level actions are decided via PPO, and the cumulative team reward is collected.
After execution, agents update high-level value and policy parameters:
- High-level TD error: $\delta^{j} = R^{j} + \gamma^{k}\, v^{j}(s_{t+1}) - Q^{j}(s_t, o_t^{j}, \bar{o}_t^{\,j})$
- Critic gradient: $\nabla_{\phi^{j}} \tfrac{1}{2}(\delta^{j})^{2} = -\,\delta^{j}\, \nabla_{\phi^{j}} Q^{j}(s_t, o_t^{j}, \bar{o}_t^{\,j})$
- Actor gradient: $\nabla_{\theta^{j}} \log \mu^{j}(o_t^{j} \mid s_t, \bar{o}_t^{\,j})\, \delta^{j}$
- Mean-field Q-table (or network) update: recompute $\bar{o}_t^{\,j}$ from the neighbors' latest options and apply the TD update above
The algorithm cycles through episodes, yielding performance metrics superior to non-hierarchical MARL and non-mean-field approaches (e.g., 5.93 avg. goals, 89.1% ball control, 92.3% passing accuracy in 4v4 Webots simulations).
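The episode protocol above can be sketched as a training skeleton; the environment and agent interfaces below are placeholder stubs standing in for the Webots simulation and the actor-critic networks, not the paper's implementation:

```python
def train_episode(env, agents, k, horizon):
    """One hierarchical episode: every k low-level steps, agents pick options
    conditioned on the state, execute them, and update critic/actor parameters."""
    state = env.reset()
    rewards = []
    for t in range(horizon):
        options = [ag.sample_option(state) for ag in agents]        # high-level choice
        next_state, R = env.run_options(state, options, k)          # k low-level steps
        for ag in agents:
            ag.update(state, options, R, next_state)                # TD + actor step
        rewards.append(R)
        state = next_state
    return rewards

# Deterministic stubs so the control flow is checkable.
class ToyEnv:
    def reset(self):
        return 0
    def run_options(self, state, options, k):
        return state + 1, float(sum(options))

class ToyAgent:
    def sample_option(self, state):
        return 1
    def update(self, s, opts, r, s_next):
        pass

log = train_episode(ToyEnv(), [ToyAgent(), ToyAgent()], k=5, horizon=3)
```

The loop structure mirrors the protocol: option sampling, $k$-step execution, then per-agent updates, repeated until the episode horizon.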
6. Significance in Multi-Agent Reinforcement Learning
The MSMDP architecture directly addresses the curse of dimensionality in multi-agent RL, supports multi-granular temporal abstraction (macro-strategy vs. micro-execution), and is validated in adversarial stochastic domains. The mean-field actor-critic module provides a scalable trajectory planning solution, enabling robust cooperation and strategic behavior amongst many agents. A plausible implication is that this MSMDP-HRL paradigm generalizes to other domains characterized by frequent interaction, real-time constraints, and large homogeneous cohorts.
7. Summary Table: MSMDP Components (as formulated by Taourirte & Mia, 2024)
| Component | Symbol/Structure | Description |
|---|---|---|
| Joint State Space | $\mathcal{S}$ | Concatenated agent, ball, and opponent data |
| Option Set | $\mathcal{O}$ | Eight spatial directions per agent |
| Option Definition | $(\mathcal{I}_o, \pi_o, \beta_o)$ | Initiation, intra-option PPO, fixed duration |
| High-Level Reward | $R(s_t, o_t)$ | Sum of low-level team rewards over $k$ steps |
| Mean-Field Update | $Q^{j}(s, o^{j}, \bar{o}^{\,j})$ | Aggregated population impact in TD learning |
All components are instantiated to facilitate hierarchical, mean-field multi-agent RL as validated in robotic soccer simulation research (Taourirte et al., 2 Dec 2025).