Diffusion-Reinforcement Learning Hierarchical Motion Planning in Adversarial Multi-agent Games (2403.10794v1)

Published 16 Mar 2024 in cs.RO, cs.LG, and cs.MA

Abstract: Reinforcement Learning (RL)-based motion planning has recently shown the potential to outperform traditional approaches from autonomous navigation to robot manipulation. In this work, we focus on a motion planning task for an evasive target in a partially observable multi-agent adversarial pursuit-evasion game (PEG). These pursuit-evasion problems are relevant to various applications, such as search and rescue operations and surveillance robots, where robots must effectively plan their actions to gather intelligence or accomplish mission tasks while avoiding detection or capture themselves. We propose a hierarchical architecture that integrates a high-level diffusion model to plan global paths responsive to environment data, while a low-level RL algorithm reasons about evasive versus global path-following behavior. Our approach outperforms baselines by 51.2% by leveraging the diffusion model to guide the RL algorithm for more efficient exploration, and improves explainability and predictability.

Citations (5)

Summary

  • The paper introduces a novel hierarchical framework that integrates diffusion models for global path planning with reinforcement learning for reactive local control in adversarial, partially observable environments.
  • It employs a diffusion-based high-level planner to generate diverse, obstacle-avoiding paths and a Soft Actor-Critic agent to navigate evasive maneuvers against dynamic pursuers.
  • Experimental results in custom simulation domains and on a physical testbed demonstrate superior performance in goal-reaching and reduced detection rates compared to standard RL and heuristic methods.

This paper introduces a hierarchical motion planning framework combining diffusion models and reinforcement learning (RL) to control an evader agent in complex, partially observable, multi-agent pursuit-evasion games (PEGs). The goal is for the evader to reach one of several hideouts while avoiding detection by a team of pursuers (static cameras, dynamic search parties, helicopters) in large environments.

Problem:

Standard RL methods struggle in these large-scale, partially observable environments due to inefficient exploration and the difficulty of long-horizon planning. Traditional path planners like A* or RRT* don't inherently handle dynamic, adversarial pursuers or partial observability effectively. Diffusion models alone can generate paths but lack the reactive, evasive capabilities needed.

Proposed Solution: Diffusion-RL Hierarchy

The core idea is to use a diffusion model as a high-level global planner and an RL agent (specifically Soft Actor-Critic, SAC) as a low-level local controller.

  1. High-Level Global Planner (Diffusion Model):
    • Training: A diffusion model is trained on a dataset of successful paths generated by RRT*. RRT* is chosen because it naturally produces diverse paths even for the same start/goal points. The paths are downsampled to sparse waypoints before training. This makes the diffusion model learn the distribution of feasible, obstacle-avoiding paths between start locations and potential goals.
    • Function: During training and inference, the diffusion model generates candidate global paths (sequences of waypoints) that satisfy map constraints (start, goal, obstacles). It uses constraint guidance during sampling to ensure validity. Generating sparse waypoints makes sampling fast.
    • Benefit: Provides diverse, long-horizon guidance to the RL agent, constraining its exploration to promising regions of the state space and overcoming the exploration challenges of pure RL.

# Pseudocode: Diffusion Model Training
import torch
import torch.nn.functional as F

T = 1000              # number of diffusion timesteps (example value)
batch_size = 64       # example value
num_epochs = 100      # example value
dataset_size = 10000  # example value
num_waypoints = 10    # sparse waypoints per path (example value)
# noise_scheduler: DDPM-style scheduler exposing add_noise(...) and step(...).prev_sample (placeholder)

# Build a dataset of sparse waypoint paths from RRT* (environment routines are placeholders)
dataset = []
while len(dataset) < dataset_size:
    start_pos, goal_pos, obstacles = sample_environment_constraints()
    rrt_path = RRT_star_plan(start_pos, goal_pos, obstacles)
    waypoints = downsample(rrt_path)          # keep only sparse waypoints
    dataset.append((waypoints, start_pos, goal_pos, obstacles))

diffusion_model = DiffusionModel()
optimizer = torch.optim.Adam(diffusion_model.parameters())

for epoch in range(num_epochs):
    waypoints, start, goal, obstacles = sample_batch(dataset)
    noise = torch.randn_like(waypoints)
    timestep = torch.randint(0, T, (batch_size,))
    # Forward diffusion: corrupt the clean waypoints
    noisy_waypoints = noise_scheduler.add_noise(waypoints, noise, timestep)

    # Predict the noise that was added, conditioned on start, goal, and obstacles
    predicted_noise = diffusion_model(noisy_waypoints, timestep, start, goal, obstacles)
    loss = F.mse_loss(predicted_noise, noise)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Pseudocode: Diffusion Path Generation (Inference)
def generate_path(diffusion_model, start, goal, obstacles, num_steps=T):
    # Start from pure Gaussian noise over the waypoint sequence
    current_path = torch.randn(1, num_waypoints, 2)   # (batch, waypoints, xy)
    for t in reversed(range(num_steps)):
        predicted_noise = diffusion_model(current_path, t, start, goal, obstacles)
        # Reverse diffusion: denoise one step
        current_path = noise_scheduler.step(predicted_noise, t, current_path).prev_sample
        # Constraint guidance: pin start/goal and project waypoints out of obstacles
        current_path = apply_constraints(current_path, start, goal, obstacles)
    return current_path
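
The downsample and apply_constraints helpers above are left abstract. Below is a minimal sketch of both under stated assumptions: waypoints are obtained by uniform sub-sampling of the dense RRT* path, and constraint guidance pins the start/goal endpoints and projects waypoints out of circular obstacles given as (center_x, center_y, radius). The waypoint count and obstacle parameterization are illustrative, not the paper's exact choices.

# Sketch: helper functions assumed by the diffusion pseudocode above
import torch

def downsample(rrt_path, num_waypoints=10):
    # Keep a fixed number of evenly spaced waypoints from the dense RRT* path
    # (num_waypoints=10 is an illustrative value, not the paper's setting).
    path = torch.as_tensor(rrt_path, dtype=torch.float32)       # (path_len, 2)
    idx = torch.linspace(0, len(path) - 1, num_waypoints).long()
    return path[idx]                                             # (num_waypoints, 2)

def apply_constraints(current_path, start, goal, obstacles):
    # Constraint guidance: pin the endpoints and push waypoints out of obstacles.
    # Obstacles are assumed to be circles given as (center_x, center_y, radius).
    path = current_path.clone()                                  # (batch, num_waypoints, 2)
    path[:, 0] = torch.as_tensor(start, dtype=torch.float32)
    path[:, -1] = torch.as_tensor(goal, dtype=torch.float32)
    for cx, cy, radius in obstacles:
        center = torch.tensor([cx, cy], dtype=torch.float32)
        offset = path - center
        dist = offset.norm(dim=-1, keepdim=True).clamp(min=1e-6)
        inside = dist < radius
        # Project any waypoint inside a circle onto its boundary
        projected = center + offset / dist * radius
        path = torch.where(inside, projected, path)
    return path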

  2. Low-Level Local Controller (SAC):
    • Training: An SAC agent is trained to follow the waypoints provided by the high-level planner while simultaneously learning evasive maneuvers to avoid nearby pursuers.
    • Observation: The agent's observation includes its own state, pursuer states (if detected), hideout locations, and importantly, the current target waypoint from the diffusion plan.
    • Reward: The reward function encourages reaching the current waypoint (r_g), penalizes distance to it (r_d), and heavily penalizes being detected by pursuers (r_adv): r = r_g + r_d + r_adv.
    • Benefit: Handles the dynamic, adversarial aspect of the problem by learning reactive behaviors locally, guided by the global plan.

# Pseudocode: RL Training Loop Step
import numpy as np

state = env.reset()
# Query the diffusion planner once per episode for a sparse global waypoint plan
waypoints = generate_path(diffusion_model, state.pos, goal_pos, obstacles)[0]  # (num_waypoints, 2)
current_waypoint_index = 0
done = False

while not done:
    target_waypoint = waypoints[current_waypoint_index]
    # Augment the local observation with the current target waypoint from the global plan
    augmented_observation = np.concatenate([state.observation, np.asarray(target_waypoint)])

    action = sac_policy.select_action(augmented_observation)
    next_state, reward, done, info = env.step(action)
    detection_status = info.get("detected", False)  # detection flag from the simulator (assumed field name)

    # Custom reward based on waypoint progress and detection (r = r_g + r_d + r_adv)
    custom_reward = calculate_reward(state, action, next_state, target_waypoint, detection_status)

    replay_buffer.add(state.observation, action, custom_reward, next_state.observation, done, target_waypoint)

    # Advance to the next waypoint once the current one is reached
    if reached_waypoint(next_state.pos, target_waypoint):
        current_waypoint_index += 1
        if current_waypoint_index >= len(waypoints):
            done = True  # Reached final waypoint

    state = next_state

    if len(replay_buffer) > batch_size:
        sac_policy.update(replay_buffer, batch_size)  # Train SAC networks
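
The calculate_reward helper called above is where the reward r = r_g + r_d + r_adv described earlier enters the loop. A minimal sketch follows; the bonus/penalty magnitudes and the reach threshold are illustrative assumptions rather than the paper's tuned values.

# Sketch: reward combining waypoint progress and detection penalty (r = r_g + r_d + r_adv)
import numpy as np

def calculate_reward(state, action, next_state, target_waypoint, detection_status,
                     reach_radius=1.0, reach_bonus=10.0, detection_penalty=-50.0):
    dist = np.linalg.norm(np.asarray(next_state.pos) - np.asarray(target_waypoint))

    # r_g: sparse bonus for reaching the current waypoint (reach_radius is an assumed threshold)
    r_g = reach_bonus if dist < reach_radius else 0.0

    # r_d: dense penalty proportional to the remaining distance to the waypoint
    r_d = -dist

    # r_adv: large penalty whenever the evader is detected by a pursuer
    r_adv = detection_penalty if detection_status else 0.0

    return r_g + r_d + r_adv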

  3. Costmap Construction and Inference:
    • After training the hierarchy, rollouts are performed in the environment. The evader's location and its proximity to the nearest pursuer are recorded.
    • A 2D costmap is built by accumulating Gaussian distributions centered at locations where the evader was close to a pursuer (high detection risk).
    • During inference, the diffusion model generates multiple candidate global paths. The path integral cost is calculated for each path using the generated costmap.
    • The path with the lowest cost (representing the predicted safest path) is selected for the RL agent to follow.
    • This costmap implicitly captures information about unobservable threats (like camera locations) based on where detections tend to occur. It also allows adding ad-hoc danger zones.

# Pseudocode: Costmap Update and Path Selection
import numpy as np

costmap = np.zeros(map_dimensions)
risk_threshold = 5.0   # Example distance threshold for "close to a pursuer"
gaussian_std_dev = 1.0

# During post-training rollouts: mark locations where the evader came close to a pursuer
# (run_rollout is a placeholder yielding (evader_position, distance_to_nearest_pursuer) per step)
for _ in range(num_rollouts):
    for evader_position, distance_to_nearest_pursuer in run_rollout(env, sac_policy):
        if distance_to_nearest_pursuer < risk_threshold:
            add_gaussian(costmap, evader_position, gaussian_std_dev)

# Normalize the costmap after all rollouts
costmap = costmap / np.max(costmap)

# During inference: sample several candidate global paths and pick the predicted safest one
candidate_paths = [generate_path(diffusion_model, start, goal, obstacles)
                   for _ in range(num_candidate_paths)]
best_path = None
min_cost = float('inf')

for path in candidate_paths:
    path_cost = integrate_cost_along_path(costmap, path)  # path-integral cost over the costmap
    if path_cost < min_cost:
        min_cost = path_cost
        best_path = path

execute_plan(best_path, sac_policy)
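
The add_gaussian and integrate_cost_along_path helpers above are not spelled out in the summary. A minimal sketch is given below, assuming the costmap is a regular 2D grid indexed by rounded (x, y) cell coordinates and that the path-integral cost is approximated by summing the cost of the cells the waypoints fall in; the grid indexing convention and Gaussian footprint size are illustrative assumptions.

# Sketch: costmap helpers assumed by the pseudocode above
import numpy as np

def add_gaussian(costmap, position, std_dev, footprint=5):
    # Accumulate an (unnormalized) Gaussian bump centered at the risky location;
    # footprint limits how many cells around the center are touched (assumed value).
    cx, cy = int(round(position[0])), int(round(position[1]))
    h, w = costmap.shape
    for x in range(max(0, cx - footprint), min(h, cx + footprint + 1)):
        for y in range(max(0, cy - footprint), min(w, cy + footprint + 1)):
            sq_dist = (x - cx) ** 2 + (y - cy) ** 2
            costmap[x, y] += np.exp(-sq_dist / (2.0 * std_dev ** 2))

def integrate_cost_along_path(costmap, path):
    # Approximate the path-integral cost by summing the cost of the cells
    # that each waypoint of the candidate path falls in.
    path = np.asarray(path).reshape(-1, 2)
    total = 0.0
    for x, y in path:
        i = int(np.clip(round(x), 0, costmap.shape[0] - 1))
        j = int(np.clip(round(y), 0, costmap.shape[1] - 1))
        total += costmap[i, j]
    return total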

Experiments and Results:

  • Evaluated in two custom domains: "Prisoner Escape" and "Narco Interdiction".
  • Metrics: Score (combining goal-reaching and detection avoidance), Detection Rate, Goal-Reaching Rate.
  • Baselines: Heuristics (A*, RRT*, VO), Learning-based (DDPG, SAC, Diffusion Only).
  • Findings:
    • The proposed Diffusion-RL significantly outperformed all baselines on the main score metric.
    • Standalone RL (SAC/DDPG) achieved very low goal-reaching rates, highlighting their struggle with exploration and long-term planning. SAC alone learned to stay safe by moving to map edges but failed to reach goals.
    • Heuristics and Diffusion Only achieved high goal-reaching but suffered from high detection rates, failing to learn effective evasion.
    • Using the costmap for path selection (Diffusion-RL-Map) further reduced detection rates compared to just Diffusion-RL.
    • The costmap visually correlated high-risk areas with hidden camera locations.
    • The diffusion model trained on RRT* paths produced more diverse plans than when trained on A* paths.
    • Diffusion path generation was shown to be significantly faster (by an order of magnitude) than RRT*, especially when generating multiple paths in parallel (a batched-sampling sketch follows this list).
    • The method was successfully demonstrated on a physical multi-robot testbed (Robotarium).
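
The order-of-magnitude speedup noted above comes largely from sampling many candidate paths in one batched reverse-diffusion pass. The sketch below shows one way to batch the earlier generate_path routine, assuming diffusion_model, noise_scheduler, and apply_constraints accept a leading batch dimension; it is an illustrative extension, not the authors' implementation.

# Sketch: batched path generation (parallel sampling of candidate paths)
import torch

def generate_paths_batched(diffusion_model, start, goal, obstacles,
                           num_paths=10, num_waypoints=10, num_steps=T):
    # Start every candidate from independent Gaussian noise: (num_paths, num_waypoints, 2)
    paths = torch.randn(num_paths, num_waypoints, 2)
    for t in reversed(range(num_steps)):
        # One network call denoises all candidates at this timestep
        predicted_noise = diffusion_model(paths, t, start, goal, obstacles)
        paths = noise_scheduler.step(predicted_noise, t, paths).prev_sample
        paths = apply_constraints(paths, start, goal, obstacles)
    return paths   # each row is one candidate global path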

Conclusion:

The paper successfully demonstrates that combining a diffusion-based global planner with an RL-based local controller creates a robust and effective framework for evasion tasks in challenging multi-agent, partially observable environments. The hierarchy leverages the strengths of both approaches: diffusion models for efficient, diverse long-horizon planning and RL for reactive, adaptive local control and evasion. The costmap adds a layer of interpretability and allows for safer path selection based on learned environmental risks.