- The paper introduces the GMAH algorithm, a novel hierarchical RL structure that autonomously generates dynamic subgoals to enhance multi-agent collaboration.
- The approach integrates QMIX networks with adaptive goal generation using auto-encoders, leading to improved sample efficiency and stable convergence.
- Experimental results in Mini-Grid and Trash-Grid environments demonstrate superior coordination and performance compared to state-of-the-art baselines.
Subgoal-based Hierarchical Reinforcement Learning for Multi-Agent Collaboration
Reinforcement learning (RL) has recently shown significant potential across various domains; however, it often struggles in complex multi-agent environments. This paper introduces a novel hierarchical architecture that improves training stability and efficiency by autonomously generating effective subgoals, thereby advancing multi-agent collaboration.
Introduction
The research community has achieved numerous successes in RL, merging Q-learning with neural networks to develop potent models capable of tackling complex domains such as autonomous driving and robotic navigation. Despite these achievements, issues such as algorithm instability and low sample efficiency remain largely unresolved in multi-agent systems due to the well-known curse of dimensionality.
Figure 1: The overall GMAH structure.
Hierarchical reinforcement learning (HRL) addresses these challenges by leveraging abstract models to decompose tasks into simpler subtasks. This paper enhances HRL by integrating it with a QMIX network to address the critical issue of credit assignment in multi-agent systems while proposing a dynamic subgoal generation structure to adapt to environmental fluctuations.
Subgoal-based HRL Approach
The architecture (Figure 1) introduces a novel subgoal-based methodology for hierarchical learning in a multi-agent framework, termed the GMAH algorithm, to decompose tasks effectively.
General Architecture
The proposal segments an agent's policy into high-level and low-level functions. The high-level policy uses prior task knowledge to set strategic subgoals, whereas the low-level policy is incentivized through intrinsic rewards to achieve these subgoals. This hierarchical structure fosters enhanced task execution and coordination among agents.
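The paper does not spell out the exact intrinsic reward, so the sketch below is only a minimal illustration of the idea: the low-level policy is rewarded for making progress toward the subgoal issued by the high-level policy. The grid-world coordinates, distance-based shaping, and bonus term are assumptions, not the paper's formulation.

```python
import numpy as np

def intrinsic_reward(state, next_state, subgoal, bonus=1.0):
    """Hypothetical intrinsic reward for the low-level policy.

    Rewards progress toward the subgoal set by the high-level policy
    (distance-based shaping) and pays a bonus when the subgoal is reached.
    Assumes states and subgoals are 2-D grid coordinates.
    """
    prev_dist = np.linalg.norm(np.asarray(state) - np.asarray(subgoal))
    new_dist = np.linalg.norm(np.asarray(next_state) - np.asarray(subgoal))
    reward = prev_dist - new_dist          # positive when the agent moves closer
    if new_dist < 1e-6:                    # subgoal reached
        reward += bonus
    return reward
```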
Task-Tree Style Subgoal Generation
Task decomposition is operationalized through a task-tree structure that constrains the subgoal space and thereby simplifies learning. This approach significantly reduces the problem's dimensionality by generating explicit, attainable subgoals rather than abstract, cumbersome representations (Figures 2 and 3).
Figure 2: A typical diagram of a task tree.

Figure 3: (a) High-Policy Network Model. (b) Hierarchical architecture applied by single agent in GMAH algorithm. (c) Network Model of GMAH.
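As a rough illustration of the task-tree idea, the sketch below models a tree whose leaves are the explicit, attainable subgoals the high-level policy can choose among. The node names and the door-key example are hypothetical and only loosely inspired by the Mini-Grid task, not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TaskNode:
    """Hypothetical node in a task tree: a task and its explicit subtasks."""
    name: str
    children: List["TaskNode"] = field(default_factory=list)

    def leaves(self) -> List["TaskNode"]:
        """Leaf nodes are the concrete, attainable subgoals the
        high-level policy selects from."""
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]

# Example: a door-key style task decomposed into explicit subgoals.
root = TaskNode("open_door", [
    TaskNode("get_key", [TaskNode("reach_key"), TaskNode("pick_up_key")]),
    TaskNode("reach_door"),
    TaskNode("use_key"),
])
print([leaf.name for leaf in root.leaves()])
# -> ['reach_key', 'pick_up_key', 'reach_door', 'use_key']
```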
Adaptive Goal Generation Strategy
Going further, GMAH adopts a flexible goal-update strategy that discards the traditional fixed c-step interval in favor of adaptive adjustments driven by environmental interactions (Figure 4). An auto-encoder refines these subgoals so that updates respond to significant environmental shifts, improving adaptability and efficiency in communication-limited systems.


Figure 4: (a) Example trajectory of an agent's c-step interaction. (b) Adjacency constraint on the goal space of HRAC. Goal relabeling on the c-step interaction trajectory: (c) relabeling the abstract subgoal, and (d) relabeling the subgoal in GMAH.
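The paper's exact update rule is not reproduced here; the following sketch shows one plausible adaptive trigger that replaces the fixed c-step schedule. The learned encoder (for instance, the encoder half of the auto-encoder mentioned above), the drift threshold, and the fallback interval are all illustrative assumptions.

```python
import numpy as np

def should_update_goal(encoder, prev_obs, obs, steps_since_update,
                       max_interval=16, shift_threshold=0.5):
    """Hypothetical adaptive goal-update rule.

    Instead of re-issuing a subgoal every fixed c steps, the high-level
    policy intervenes when the encoded observation drifts far from the
    one seen at the last update, or after a maximum interval as a fallback.
    `encoder` is assumed to map raw observations to a latent vector.
    """
    drift = np.linalg.norm(encoder(obs) - encoder(prev_obs))
    return drift > shift_threshold or steps_since_update >= max_interval
```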
Fine-Tuning and Scaling Considerations
GMAH uses a goal mixing network to integrate the agents' individual contributions (Figure 5). By combining per-agent values into a joint goal function, this component improves credit assignment and fosters cohesive agent strategies.
Figure 5: (a) Goal Mixing Network. (b) Hyper-Net and Weight-Network.
Figure 6: Auto-Encoder with Successor Feature Correction.
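For readers unfamiliar with QMIX-style mixing, the sketch below shows a minimal monotonic mixing network in PyTorch whose weights are produced by hyper-networks conditioned on the global state. The layer sizes and names are illustrative assumptions rather than the paper's exact configuration of the goal mixing network.

```python
import torch
import torch.nn as nn

class GoalMixingNetwork(nn.Module):
    """Minimal QMIX-style mixer (sketch): combines per-agent values into a
    joint value, with mixing weights generated by hyper-networks conditioned
    on the global state. Keeping the weights non-negative preserves
    monotonicity, so maximizing each agent's individual value also
    maximizes the joint value, which is what enables clean credit assignment.
    """
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_values, state):
        # agent_values: (batch, n_agents), state: (batch, state_dim)
        b = agent_values.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(b, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_values.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(b, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(b, 1)   # joint value
```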
Experimental Analysis
The GMAH algorithm was evaluated in two controlled environments: Mini-Grid for the single-agent setting and Trash-Grid for multi-agent collaboration, revealing several insights into its empirical advantages.
Mini-Grid Environment
In the Mini-Grid setup, GMAH demonstrated superior sample efficiency and convergence speed compared to conventional RL algorithms such as PPO and A2C. The results in Figures 7 and 9 highlight GMAH's ability to maintain high performance from early in training, consistently outperforming the baselines.

Figure 7: Results of GMAH-adapt, GMAH-no, PPO and A2C during training with T=256.
Figure 8: Mini-Grid Door Key environment diagram: (a) in the same room, and (b) in different rooms.
Figure 9: Minimum reward of GMAH-adapt and GMAH-no during training with T=256.
Trash-Grid Environment
In the multi-agent Trash-Grid environment (Figure 10), the hierarchical architecture proved advantageous for dynamic task allocation among agents, improving collective efficiency in task execution compared to MAPPO, QMIX, and COMA. Enhanced exploration and data-utilization efficiency are evidenced by the trajectory heat maps (Figures 11 and 13).
Figure 10: Trash-Grid environment diagram.
Figure 11: During the training of MAPPO, the heat map counts the movement trajectories of the agents (the number of visits to each grid) over several episodes; the panels show the three agents from left to right.
Figure 12: Training results of GMAH, MAPPO, COMA and QMIX in the Trash-Grid environment with T=128. (a) Results of training for 1,000,000 steps with the four methods. (b) Results of 100,000 steps of fine-tuning GMAH using the goal mixing network.
Figure 13: During the training of GMAH, the heat map counts the movement trajectories of the agents (the number of visits to each grid) over several episodes; the panels show the three agents from left to right.
Conclusion
By establishing a robust hierarchical learning architecture, the GMAH algorithm markedly enhances multi-agent collaboration through sophisticated goal abstraction and dynamic adaptability. This method's efficiency in decomposing complex tasks into manageable subtasks is substantiated across various experimental setups, positioning GMAH as a promising approach for future RL research in dynamic, multi-agent systems. The findings invite further investigation into improving low-level policy efficacy and refining goal adaptation mechanisms.