
CAPE: Diffusion Policy for Collision Avoidance

Updated 4 December 2025
  • The paper demonstrates that CAPE overcomes limited collision diversity in training by iteratively expanding trajectory modes using guided diffusion.
  • It employs a prior-seeded, iterative guided refinement that integrates collision-aware cost functions to ensure safe robot trajectory generation.
  • Experimental results reveal that CAPE significantly improves success rates in both simulated and real-world tasks compared to traditional diffusion methods.

Context-Aware Diffusion Policy via Proximal Mode Expansion (CAPE) is a framework for collision avoidance in imitation learning that leverages diffusion models to generate context-aware, multimodal robot trajectories. Standard diffusion policies often struggle to generalize in test-time environments with novel obstacles, as training data typically lacks sufficient real-world collision diversity. CAPE overcomes this limitation by employing a novel prior-seeded iterative guided refinement procedure, which iteratively expands the support of the learned trajectory distribution in context-relevant (collision-free) directions. This enables the generation and adaptation of trajectories that avoid collisions—even in previously unseen environments—while preserving consistency with the original motion intent (Yang et al., 27 Nov 2025).

1. Problem Setting and Methodological Innovations

CAPE addresses the challenge of learning a trajectory policy $p(\tau\mid\mathbf O)$ that generalizes to tasks such as pick-and-place or reaching while avoiding unforeseen collisions. Standard diffusion methods, while powerful for producing diverse behaviors, tend to concentrate probability mass around high-density "modes" induced by the limited diversity of training demonstrations (typically collected in obstacle-free settings), resulting in poor generalization to cluttered or novel scenarios.

The principal innovation of CAPE is the iterative expansion of trajectory distribution modes through context-aware priors and inference-time guidance. Instead of single-pass inference with potentially brittle guidance, CAPE alternates between planning a trajectory, executing a prefix, re-noising the remaining suffix to form a context-dependent prior at an intermediate noise level, and applying guided denoising to refine this prior in light of observed obstacles. This process preserves task intent and iteratively expands mode support into previously unexplored, collision-free trajectory regions.

2. Mathematical Formulation

CAPE relies on the mathematical machinery of Denoising Diffusion Probabilistic Models (DDPMs) for trajectory generation. A trajectory $\tau = [x_1, \dots, x_N]$ consists of waypoints $x \in \mathbb{R}^d$, each state encapsulating an end-effector position $p\in\mathbb{R}^3$ and a 6D continuous rotation $r\in\mathbb{R}^6$. The context $\mathbf O = (\ell, o)$ comprises a task/language descriptor $\ell = \{s_s, s_g\}$ (start/goal poses) and a recent trajectory history $o\in\mathbb{R}^{H\times d}$.

DDPM Sampling and Guidance

The forward noising is defined as

$$q(\tau_t\mid\tau_{t-1})=\mathcal{N}\!\left(\sqrt{\alpha_t}\,\tau_{t-1},\;\beta_t I\right).$$

The standard (unguided) reverse step computes the mean via

$$\mu_t(\tau_t) = \frac{1}{\sqrt{\alpha_t}}\left(\tau_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(\tau_t, t, \mathbf O)\right).$$

The guided reverse transition augments this with the gradient of a collision loss $L(\tau_t, t, \mathbf O)$ scaled by a guidance strength $\lambda$:

$$\tau_{t-1} = \mu_t(\tau_t) + \lambda\,\nabla_{\tau_t}L(\tau_t, t, \mathbf{O}) + \sigma_t z, \tag{1}$$

where $z\sim\mathcal{N}(0, I)$.
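As a concrete reference, the guided reverse transition of Eq. (1) can be sketched in NumPy. The noise-prediction network and collision-loss gradient are passed as callables and are stand-ins for the paper's trained $\epsilon_\theta$ and analytic $\nabla_{\tau_t}L$; all names and the schedule are illustrative, not the authors' implementation.

```python
import numpy as np

def guided_reverse_step(tau_t, t, eps_model, grad_L, lam,
                        alphas, alpha_bars, sigmas, rng):
    """One guided DDPM reverse step, following Eq. (1).

    tau_t:     (N, d) noisy trajectory at diffusion step t
    eps_model: callable (tau, t) -> predicted noise (stand-in for eps_theta)
    grad_L:    callable (tau, t) -> guidance term evaluated at tau_t
    """
    # Mean of the unguided reverse transition.
    mu = (tau_t - (1 - alphas[t]) / np.sqrt(1 - alpha_bars[t])
          * eps_model(tau_t, t)) / np.sqrt(alphas[t])
    # No stochastic noise on the final step.
    z = rng.standard_normal(tau_t.shape) if t > 1 else np.zeros_like(tau_t)
    return mu + lam * grad_L(tau_t, t) + sigmas[t] * z
```

With zero predicted noise and zero guidance, the step reduces to dividing by $\sqrt{\alpha_t}$, which is a quick sanity check on the schedule indexing.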

Context-Aware Prior and Proximal Mode Expansion

Following execution of a prefix of length $m$, the unexecuted suffix $\widetilde{\tau}^k_0$ is re-noised to a chosen intermediate noise level $\delta$ to construct a context-aware prior:

$$\widetilde{\tau}^k_\delta = \sqrt{\bar\alpha_\delta}\,\widetilde{\tau}^k_0 + \sqrt{1-\bar\alpha_\delta}\,\epsilon, \quad \epsilon\sim\mathcal{N}(0, I).$$

This operation produces a prior that preserves the anisotropic structure of the previously denoised trajectory while injecting controlled stochasticity. A Bayes-ratio analysis formalizes how the procedure anchors the search in previously expanded, collision-avoiding modes:

$$\frac{q_0(\tau\mid \mathbf O, \widetilde\tau_\delta, \lambda)}{q_0(\tau\mid \mathbf O, \lambda)} = \frac{q_\delta(\widetilde\tau_\delta\mid \mathbf O, \tau, \lambda)}{q_\delta(\widetilde\tau_\delta\mid \mathbf O, \lambda)}. \tag{2}$$

Within neighborhoods of the denoised prior this ratio exceeds unity, concentrating the sampling measure in these task-consistent regions.
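The prior construction is exactly the DDPM forward-noising formula applied at level $\delta$; a minimal sketch (function name and arguments are illustrative):

```python
import numpy as np

def renoise_suffix(suffix_0, delta, alpha_bars, rng):
    """Forward-noise an already-denoised suffix to intermediate level delta,
    yielding the context-aware prior tau_delta of the formula above."""
    noise = rng.standard_normal(suffix_0.shape)
    return (np.sqrt(alpha_bars[delta]) * suffix_0
            + np.sqrt(1 - alpha_bars[delta]) * noise)
```

Because denoising then restarts from step $\delta$ rather than $T$, each refinement round needs only $\delta$ guided steps, which is what allows CAPE to replan faster than full re-sampling from Gaussian noise.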

Algorithms

The two core procedures are:

  • GuidedDenoising: Implements a guided, boundary-clamped reverse diffusion pass over the trajectory, with collision-aware guidance and enforcement of fixed start-goal endpoints.
  • CAPE_Planning: Executes the main iterative process, comprising initial plan generation from a Gaussian, prefix execution, suffix prior formation, guided denoising of the prior, successive prefix executions, and repetition until task completion or horizon exhaustion.

Pseudocode (in Python-style notation):

def GuidedDenoising(tau_t, t_start, chi, O, P_obs, lam):
    tau = tau_t
    for t in range(t_start, 0, -1):
        mu = (1/√α[t]) * (tau - ((1-α[t]) / √(1-ᾱ[t])) * εθ(tau, t, O))
        if t <= chi:  # collision guidance only on the last chi steps
            grad = -∇_tau L_collision(tau, P_obs)
            tau = mu + lam * grad + σ[t] * N(0, I)
        else:
            tau = mu + σ[t] * N(0, I)
        tau[0] = O.ℓ.start   # clamp fixed start/goal endpoints at every step
        tau[-1] = O.ℓ.goal
    return tau

def CAPE_Planning(O, P_obs, lam, delta, chi, m):
    first = True
    tau_prev = None
    done = False
    while not done:
        if first:
            tau_T = sample N(0, I), clamped at start/goal
            tau_0 = GuidedDenoising(tau_T, T, chi, O, P_obs, lam)
            first = False
        else:
            suffix = tau_prev[m:]
            suffix = [s'_start] + suffix + [s'_goal]   # re-anchor suffix to updated start/goal
            tau_delta = √ᾱ[delta] * suffix + √(1-ᾱ[delta]) * N(0, I)
            tau_0 = GuidedDenoising(tau_delta, delta, chi, O, P_obs, lam)
        execute(tau_0[:m])   # execute the first m waypoints
        tau_prev = tau_0
        if reached_goal or max_iter:
            done = True
    return success
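To make the loop concrete, the sketch below instantiates it on a 2D toy problem: the "learned" mode is the straight start-goal line (a hand-coded stand-in for the trained $\epsilon_\theta$), the obstacle is a single disk, and the refinement round re-noises the whole current plan rather than a suffix. All names, schedules, and parameters are illustrative simplifications, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 50, 16                                  # diffusion steps, waypoints
betas = np.linspace(1e-4, 0.05, T + 1)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)
sigmas = np.sqrt(betas)

start, goal = np.array([0.0, 0.0]), np.array([1.0, 0.0])
obstacle, margin = np.array([0.5, 0.0]), 0.15   # disk center and radius

def eps_hat(tau, t):
    # Stand-in denoiser: the only "demonstrated" mode is the straight line,
    # so predicted noise is the scaled deviation from that line.
    line = np.linspace(start, goal, N)
    return (tau - np.sqrt(alpha_bars[t]) * line) / np.sqrt(1 - alpha_bars[t])

def repulsion(tau):
    # Descent direction on the collision loss: push waypoints inside the
    # margin radially away from the obstacle center.
    g = np.zeros_like(tau)
    diff = tau - obstacle
    d = np.linalg.norm(diff, axis=1)
    inside = d < margin
    g[inside] = diff[inside] / (d[inside, None] + 1e-9)
    return g

def guided_denoise(tau, t_start, lam=0.05):
    for t in range(t_start, 0, -1):
        mu = (tau - (1 - alphas[t]) / np.sqrt(1 - alpha_bars[t])
              * eps_hat(tau, t)) / np.sqrt(alphas[t])
        z = rng.standard_normal(tau.shape) if t > 1 else np.zeros_like(tau)
        tau = mu + lam * repulsion(tau) + sigmas[t] * z
        tau[0], tau[-1] = start, goal            # boundary clamping
    return tau

# First plan: full guided reverse pass from Gaussian noise.
plan = guided_denoise(rng.standard_normal((N, 2)), T)

# CAPE-style refinement: re-noise the current plan to level delta and
# guided-denoise from there, staying anchored near the previous mode.
delta = 10
prior = (np.sqrt(alpha_bars[delta]) * plan
         + np.sqrt(1 - alpha_bars[delta]) * rng.standard_normal(plan.shape))
plan = guided_denoise(prior, delta)
```

The refinement pass costs $\delta$ rather than $T$ denoising steps, which is the mechanism behind CAPE's higher replanning rate in the baselines table.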

3. Collision-Aware Guidance and Goal Consistency

CAPE enforces collision avoidance through a signed-distance-based collision cost. For the end-effector at position $p$, with signed distance $d(p)$ to the obstacle point cloud $\mathbf P_{\text{obs}}$, the loss is

$$L(p) = \begin{cases} -\,d(p) + (\epsilon + r_{\text{eef}}) & \text{if } d(p) \leq \epsilon + r_{\text{eef}} \\ 0 & \text{otherwise} \end{cases}$$

where $\epsilon$ is a safety margin and $r_{\text{eef}}$ is the end-effector radius. The guidance gradient $\nabla_{\tau_t} L$ repels trajectory waypoints from the collision margin. Enforcing start/goal boundary clamping at each denoising step ensures goal consistency.
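This hinge cost is straightforward to sketch for a single waypoint. Here $d(p)$ is approximated as the distance to the nearest point of the obstacle cloud, a point-cloud proxy for a true signed distance field (an assumption for illustration; a true SDF is also negative inside obstacles):

```python
import numpy as np

def collision_cost(p, P_obs, eps_margin, r_eef):
    """Hinge collision cost for one waypoint position p (shape (3,)),
    against an obstacle cloud P_obs (shape (M, 3))."""
    d = np.min(np.linalg.norm(P_obs - p, axis=1))   # crude SDF proxy
    margin = eps_margin + r_eef
    # Equals -d(p) + (eps + r_eef) inside the margin, 0 otherwise.
    return max(0.0, margin - d)
```

The cost is zero outside the inflated margin, so guidance gradients act only on waypoints that actually approach an obstacle.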

4. Experimental Evaluation

Tasks and Environments

CAPE is assessed in a suite of simulated and real-world tasks:

  • Env 1: Conceptual 2D obstacle environment for visualization.
  • Env 2–4: Simulated tabletop environments (ManiSkill2) of ascending difficulty, from 25 small blocks (Easy) to complex hybrid clutter with partial observability from a wrist camera.
  • Real-World: Pick-and-place (cup in clutter) and tape roll pick, featuring significant perceptual occlusion and clutter.

Dataset and Training

  • 1,000 RRT-generated trajectories in obstacle-free simulation, with randomized start resampling.
  • Trajectories normalized to length $N=32$; history $H=8$.
  • U-Net $\epsilon_\theta$ trained for 80 epochs with learning rate $10^{-4}$ and the standard DDPM $\ell_2$ loss.

Baselines

| Baseline | Description |
| --- | --- |
| DP + Guidance | Single-pass diffusion policy with SDF-based guidance |
| MPD | Motion Planning Diffusion; one-shot guided sampling |
| MPD+Refine | MPD with iterative refinement from Gaussian noise (1–2 Hz) |
| CAPE | Prior-seeded, context-aware refinement (4–5 Hz) |

Metrics

  • Success Rate (SR): Fraction of episodes reaching the goal without collision.
  • Collision Rate (CR): Fraction with any collision.
  • Non-completion Rate (NCR): Fraction failing to reach the goal without collision; $SR + CR + NCR = 1$.
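Since every episode falls into exactly one of the three outcomes, the metrics can be computed directly from per-episode labels (the label strings below are a hypothetical encoding, not from the paper):

```python
def episode_metrics(outcomes):
    """Compute (SR, CR, NCR) from a list of per-episode outcome labels:
    'success', 'collision', or 'non_completion'.
    The three outcomes partition the episodes, so SR + CR + NCR = 1."""
    n = len(outcomes)
    sr = outcomes.count('success') / n
    cr = outcomes.count('collision') / n
    ncr = outcomes.count('non_completion') / n
    return sr, cr, ncr
```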

5. Quantitative Results and Ablations

Simulated Results

  • DP+Guidance: NCR ≥ 79%, reflecting conflict between the learned modes and the guidance signal, and poor completion.
  • MPD: 96% SR on the easy setting, degrading to 36% SR at medium complexity with a partial view.
  • MPD+Refine: Moderate improvement (50% SR on medium with partial view), but re-plans from a Gaussian each time.
  • CAPE: Peaks at 0.82 SR (Env 3, full view) vs. 0.66 for MPD, and 0.76 SR (Env 3, limited view) vs. 0.50 for MPD, indicating +26% and +40% improvements over alternatives.

Real-World Results

  • Pick-and-place cup task: CAPE and MPD+Refine both achieve 1.00 SR; MPD achieves 0.80.
  • Pick tape task: CAPE achieves 0.80 SR, compared to 0.00 for MPD and 0.20 for MPD+Refine, a gain of +0.80 over MPD and +0.60 over MPD+Refine.

Ablation Highlights

  • Guidance Strength ($\lambda$): CAPE shows low sensitivity, maintaining a high SR even under weak guidance, unlike MPD.
  • Prefix Length ($m$) and Noise Level ($\delta$): Best performance at $m=2$, $\delta=2$; frequent replanning with moderate noise yields optimal expansion and stability.
  • Guidance Start Step ($\chi$): Best results at $\chi=5$.

6. Generalization Properties and Limitations

CAPE demonstrates significant generalization beyond the training distribution by iteratively expanding its support set in directions determined by the current collision context, enabling successful sampling of collision-free trajectories for previously unseen obstacle configurations. Training involves only obstacle-free environments, yet generalization is achieved to cluttered, unseen settings.

Limitations include sensitivity to quality of the very first prior—if this is substantially off-distribution, global re-initialization may be required. Additionally, current guidance considers only the end-effector, not full manipulator configurations. Extending guidance to entire robot geometry or to multi-arm systems remains an open direction.

7. Future Directions

Potential avenues for future advancement include automatic prior quality monitoring with periodic global replanning, extension of the collision cost function to encompass full-body or multi-agent scenarios, and the integration of learned guidance functions such as neural signed-distance fields suitable for non-rigid or deformable obstacle representations (Yang et al., 27 Nov 2025).
