
KStar Diffuser: A Kinematic Diffusion Framework

Updated 2 February 2026
  • KStar Diffuser is a kinematics-enhanced conditional diffusion framework that employs dynamic spatial-temporal graph encoding to generate feasible bimanual manipulation trajectories.
  • It leverages a denoising diffusion probabilistic model and a differentiable forward kinematics module to substantially reduce self-collision and inverse kinematics failures.
  • Empirical evaluations demonstrate that KStar Diffuser markedly increases manipulation success rates and safety in both simulation and real-world settings.

KStar Diffuser is a conditional diffusion policy framework for bimanual robotic manipulation that integrates spatial-temporal graph structure encoding with differentiable kinematic constraints. The system addresses the shortcomings of previous end-to-end imitation learning approaches for bimanual robots, which lack explicit modeling of the robot's joint interrelations and kinematic feasibility. KStar Diffuser explicitly incorporates robot structure by constructing a dynamic spatial-temporal graph and enforces kinematics awareness through an optimizable forward kinematics module, resulting in substantially higher manipulation success rates and dramatically fewer self-collisions and inverse kinematics failures in both simulation and real-world evaluation settings (Lv et al., 13 Mar 2025). The name "KStar" refers to Kinematics-enhanced Spatial-TemporAl gRaph Diffuser.

1. Diffusion Policy Architecture

KStar Diffuser formulates robot action generation as conditional score-based sampling using a denoising diffusion probabilistic model (DDPM), conditioned on high-dimensional multimodal observations and graph-structured robot state.

  • Let $x_0 \in \mathbb{R}^d$ represent the clean bimanual keyframe end-effector poses; $o$ denotes the conditioning observation, e.g., multiview RGB-D and language.
  • The forward (noising) process applies a Markov chain with a variance schedule $\{\beta_1, \ldots, \beta_T\}$:

$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{\alpha_t}\, x_{t-1},\ \beta_t I\big)$

where $\alpha_t = 1 - \beta_t$ and $\overline{\alpha}_t = \prod_{s=1}^{t} \alpha_s$.

  • Marginally:

$q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\overline{\alpha}_t}\, x_0,\ (1 - \overline{\alpha}_t) I\big)$

  • The reverse (denoising) process samples $x_{t-1}$ by predicting the added noise with a neural network $\epsilon_\theta$:

$p_\theta(x_{t-1} \mid x_t, o) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t, o),\ \beta_t I\big)$

with

$\mu_\theta(x_t, t, o) = \frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{\beta_t}{\sqrt{1-\overline{\alpha}_t}}\, \epsilon_\theta(x_t, t, o)\Big)$

  • Training uses the simplified DDPM loss:

$\mathcal{L}_{\text{diff}} = \mathbb{E}_{x_0, t, \epsilon}\Big[\big\|\epsilon - \epsilon_\theta(x_t, t, o)\big\|^2\Big]$

This formulation casts action generation as conditional sampling from the learned reverse process; the conditioning embedding that injects structural and kinematic information is detailed below.
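The forward and reverse processes above can be sketched in a few lines of NumPy. The linear variance schedule, the toy 14-dimensional pose vectors, and the use of the true noise in place of a trained, conditioned network $\epsilon_\theta(x_t, t, o)$ are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

# Minimal sketch of the DDPM forward process, simplified loss, and one
# reverse-step mean from Section 1. Shapes and schedule endpoints are
# illustrative; a perfect noise prediction stands in for eps_theta.

rng = np.random.default_rng(0)

T = 100                                   # diffusion steps
betas = np.linspace(1e-4, 0.02, T)        # variance schedule {beta_1..beta_T}
alphas = 1.0 - betas                      # alpha_t = 1 - beta_t
alpha_bars = np.cumprod(alphas)           # bar{alpha}_t = prod_s alpha_s

def q_sample(x0, t, eps):
    """Marginal q(x_t | x_0): sqrt(abar_t) x0 + sqrt(1 - abar_t) eps."""
    ab = alpha_bars[t]
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps

def simplified_loss(eps_pred, eps):
    """L_diff = E ||eps - eps_theta(x_t, t, o)||^2 (Monte Carlo estimate)."""
    return float(np.mean((eps - eps_pred) ** 2))

def reverse_mean(xt, t, eps_hat):
    """mu_theta(x_t, t) given a noise prediction eps_hat (cf. Section 1)."""
    return (xt - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])

x0 = rng.normal(size=(8, 14))             # toy clean bimanual keyframe poses
t = 50
eps = rng.normal(size=x0.shape)
xt = q_sample(x0, t, eps)                 # noised sample at step t
mu = reverse_mean(xt, t, eps)             # reverse-step mean with oracle noise

print(mu.shape)  # same shape as x_t
```

A perfect denoiser would predict `eps` exactly, driving `simplified_loss` to zero; a trained network approximates that prediction from the conditioning observation.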

2. Dynamic Spatial–Temporal Graph Representation

To incorporate the robot's physical structure, KStar Diffuser encodes the system as a dynamic spatial-temporal (ST) graph, capturing interaction patterns at both spatial and temporal levels:

  • The spatial graph $G_S = (V_S, E_S)$ is parsed from the robot's URDF, with a node $v_i$ for each joint and edges for physically linked joints.
  • Node features for each $v_i$ combine:
    • Workspace-normalized joint coordinates $f_i^{JC} = (x_i, y_i, z_i)$
    • Pairwise joint-to-joint distances $f_i^{JD} = (\|v_i - v_j\|)_{j=1,\ldots,m}$
    • A body-side one-hot label $f_i^{BL} \in \{\text{left}, \text{right}\}$
  • The temporal graph $G_{ST} = (V_{ST}, E_{ST})$ stacks $T$ past timesteps, with temporal edges added for each joint across time.
  • The composite graph is processed by $L$ GCN layers to yield per-node embeddings and a pooled global graph embedding $g \in \mathbb{R}^h$.
  • The denoiser $\epsilon_\theta$ then receives visual-language features $H_B$, the spatial-temporal graph feature $H_G$, and the kinematic reference $H_R$ as conditioning, i.e.,

$p_\theta(x_{t-1} \mid x_t, H_B, H_G, H_R)$

This explicit structural encoding enables collision avoidance and models coordination constraints inherent to bimanual systems.
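The node features and a single GCN propagation step can be sketched as follows. The toy four-joint, two-arm chain, the feature layout, and the one-layer symmetric-normalized propagation rule are illustrative stand-ins for the paper's URDF-parsed graph and $L$-layer encoder:

```python
import numpy as np

# Sketch of Section 2's spatial-graph node features and one GCN layer.
# The 4-joint chain (two joints per arm) and weight shapes are assumed.

rng = np.random.default_rng(0)

coords = np.array([[0.0, 0.0, 0.0],      # f^JC: workspace-normalized coords
                   [0.1, 0.0, 0.2],
                   [0.5, 0.0, 0.0],
                   [0.6, 0.0, 0.2]])
m = coords.shape[0]

# f^JD: pairwise joint-to-joint distances (row i -> distances to all joints)
dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

# f^BL: body-side one-hot label (left, right)
side = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)

X = np.concatenate([coords, dists, side], axis=1)   # per-node feature matrix

# Adjacency from physical links (0-1, 1-2, 2-3), plus self-loops
A = np.zeros((m, m))
for i, j in [(0, 1), (1, 2), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
A_hat = A + np.eye(m)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt            # symmetric normalization

W = rng.normal(size=(X.shape[1], 16))               # layer weights (toy width)
H = np.maximum(A_norm @ X @ W, 0.0)                 # one GCN layer with ReLU
g = H.mean(axis=0)                                  # pooled global embedding

print(H.shape, g.shape)
```

The temporal graph would repeat this node set across $T$ past timesteps and add edges linking each joint to its own copy at adjacent steps, enlarging the adjacency matrix block-wise.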

3. Differentiable Kinematic Modeling

KStar Diffuser integrates a differentiable forward kinematics module to regularize the action outputs with respect to physical feasibility:

  • Let $\theta = (\theta_1, \ldots, \theta_n)$ represent the joint angles; the forward kinematics mapping composes the per-joint transforms

$\mathrm{FK}(\theta) = T_1(\theta_1)\, T_2(\theta_2) \cdots T_n(\theta_n) \in SE(3)$

  • The network projects the concatenated backbone and graph features to a predicted joint configuration $\hat{y} = \mathrm{Proj}([H_B, H_G])$ and computes $H_R = \mathrm{FK}(\hat{y})$.
  • The kinematic loss is

$\mathcal{L}_{\text{kin}} = \mathbb{E}_{\theta_0}\Big[\|\theta_0 - \hat{y}\|^2\Big]$

where $\theta_0$ is the ground-truth joint trajectory, enforcing joint-level accuracy and forward-kinematic validity.

  • Gradients propagate through the kinematic Jacobian $J(\theta) = \partial\, \mathrm{FK}(\theta)/\partial \theta \in \mathbb{R}^{6 \times n}$, ensuring the model learns to produce kinematically valid, physically executable trajectories compatible with the robot's structure.
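The transform-chain view of forward kinematics can be made concrete with a planar two-link arm, where $\mathrm{FK}(\theta) = T_1(\theta_1)\, T_2(\theta_2)$ and differentiability is verified by comparing the analytic position Jacobian against finite differences. The link lengths and the planar (2D) simplification are illustrative assumptions, not the paper's robot model:

```python
import numpy as np

# Sketch of a differentiable forward-kinematics chain (Section 3) for a
# planar 2-link arm with assumed link lengths.

LEN1, LEN2 = 0.3, 0.25   # link lengths (assumed)

def T(theta, length):
    """Homogeneous transform: rotate by theta, then translate along the link."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, length * c],
                     [s,  c, length * s],
                     [0,  0, 1.0]])

def fk(theta):
    """End-effector (x, y) of the composed chain T1(theta1) T2(theta2)."""
    M = T(theta[0], LEN1) @ T(theta[1], LEN2)
    return M[:2, 2]

def jacobian(theta):
    """Analytic position Jacobian dFK/dtheta (2 x 2 for the planar case)."""
    t1, t12 = theta[0], theta[0] + theta[1]
    return np.array([
        [-LEN1 * np.sin(t1) - LEN2 * np.sin(t12), -LEN2 * np.sin(t12)],
        [ LEN1 * np.cos(t1) + LEN2 * np.cos(t12),  LEN2 * np.cos(t12)],
    ])

theta = np.array([0.4, -0.7])
J = jacobian(theta)

# Finite-difference check that the chain really is differentiable in theta
eps = 1e-6
J_fd = np.stack([(fk(theta + eps * e) - fk(theta - eps * e)) / (2 * eps)
                 for e in np.eye(2)], axis=1)
print(np.abs(J - J_fd).max())  # near machine precision
```

In the full 3D case the Jacobian becomes the $6 \times n$ map over $SE(3)$ poses noted above, and frameworks with automatic differentiation compute it without hand-derived formulas.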

4. Training and Optimization Strategy

The training objective linearly combines the diffusion and kinematic losses:

$\mathcal{L} = \lambda\, \mathcal{L}_{\text{diff}} + (1-\lambda)\, \mathcal{L}_{\text{kin}}$

where $\lambda = 0.9$ is chosen empirically for maximal imitation performance. The diffusion process uses $T = 100$ steps with a single denoising iteration per batch, and training uses AdamW (learning rate $2 \times 10^{-4}$, weight decay $1 \times 10^{-6}$, batch size 64, 150k steps). The backbone encoders for vision and language features operate jointly with the spatial-temporal GCN and kinematics head, supporting end-to-end learning and inference (Lv et al., 13 Mar 2025).
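The combined objective is a straightforward convex blend; the sketch below uses toy mean-squared losses as placeholders for the actual diffusion and kinematic terms:

```python
import numpy as np

# Sketch of the combined objective from Section 4:
# L = lambda * L_diff + (1 - lambda) * L_kin, with lambda = 0.9 per the
# paper. The loss inputs here are random placeholders, not real model
# predictions.

rng = np.random.default_rng(0)

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def combined_loss(l_diff, l_kin, lam=0.9):
    """Linear combination of diffusion and kinematic losses."""
    return lam * l_diff + (1.0 - lam) * l_kin

eps, eps_pred = rng.normal(size=(8, 14)), rng.normal(size=(8, 14))
theta0, y_hat = rng.normal(size=(8, 14)), rng.normal(size=(8, 14))

total = combined_loss(mse(eps, eps_pred), mse(theta0, y_hat))
print(total)
```

With $\lambda = 0.9$ the diffusion term dominates the gradient signal, while the kinematic term acts as a feasibility regularizer rather than the primary objective.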

5. Empirical Evaluation and Comparative Results

KStar Diffuser demonstrates markedly increased performance and safety over prior methods such as DP-EE and PerAct2:

| Setting | # Demonstrations | KStar Success | Best Baseline (DP-EE) | PerAct2 |
| --- | --- | --- | --- | --- |
| RLBench2 simulation | 20 | 58.0% | ~17.0% | (not reported) |
| RLBench2 simulation | 100 | 68.2% | ~40.5% | <35% |
| Real-world (2 tasks) | 15 trials | 43.1% | N/A | 29.9% |
| Handover_item (sim) | -- | 23.4% | N/A | -- |
| (- kinematics) ablation | -- | 16.8% | N/A | -- |
| (- ST graph + kin.) ablation | -- | 14.8% | N/A | -- |

Additional qualitative outcomes:

  • Self-collisions and inter-arm collisions approach zero with the spatial-temporal graph, compared to approximately 15% collision rate in PerAct2.
  • Kinematic infeasibility (IK solver failure) below 5% for KStar Diffuser, versus around 20% for diffusion models lacking kinematic loss.

These results indicate substantial (>20 percentage point) improvements in manipulation success rate, combined with dramatically improved safety and feasibility characteristics.

6. Implications and Distinctions

KStar Diffuser’s integration of explicit robot spatial-temporal modeling and kinematic regularization addresses the two main pathologies of prior imitation learning systems for bimanual manipulation: neglect of the robot’s joint dependencies (leading to high collision/interference rates), and generation of infeasible action trajectories (yielding high inverse-kinematics solver failure rates) (Lv et al., 13 Mar 2025). A plausible implication is that the framework can generalize to other multi-limbed and high-DOF robotic systems where structural coordination and feasibility are critical to task performance.

KStar Diffuser departs from prior diffusion-based and transformer-based policy architectures for manipulation in its coupling of dynamic graph encodings, physical robot model constraints, and multimodal observation conditioning, yielding both empirical robustness and improved physical realizability in action policy synthesis.
