
cuRoboV2: GPU-Native Motion Framework

Updated 12 March 2026
  • cuRoboV2 is a fully GPU-native motion-generation framework that unifies global planning, reactive control, and whole-body retargeting for high-degree-of-freedom robots.
  • It leverages B-spline trajectory optimization and a block-sparse TSDF-to-dense ESDF perception pipeline, achieving significant speedup and memory efficiency improvements over traditional methods.
  • The framework integrates scalable GPU-native kinematics, dynamics, and self-collision modules, enabling real-time trajectory synthesis and precise robotic manipulation.

cuRoboV2 is a fully GPU-native motion-generation framework designed for safe, feasible, and reactive trajectory synthesis for high-degree-of-freedom (DoF) robots. It unifies global planning, reactive control, and whole-body retargeting, overcoming the fragmentation of traditional pipelines, in which fast planners, reactive controllers, and high-DoF solvers are decoupled or insufficiently scalable. cuRoboV2 integrates B-spline trajectory optimization, a block-sparse TSDF-to-dense ESDF perception pipeline, and scalable, topology-aware GPU-native kinematics, dynamics, and self-collision modules into a single high-performance stack, delivering significant advances in memory efficiency, computation speed, and practical deployment across manipulator arms and full humanoids (Sundaralingam et al., 5 Mar 2026).

1. B-spline Trajectory Optimization

cuRoboV2 represents each joint trajectory $\theta_t \in \mathbb{R}^d$ by a uniform cubic B-spline parameterized over $K$ control points $U \in \mathbb{R}^{d \times K}$. The robot state vector at each time $t$ is given by

$\Theta_t = [\theta_t, \dot\theta_t, \ddot\theta_t, \dddot\theta_t]^T = T \cdot P(\alpha) \cdot C \cdot U_k$

where $T$, $P(\alpha)$, and $C$ are fixed matrices derived from time discretization and spline theory, and $U_k$ stacks the four adjacent control points spanning the current knot interval. This construction yields $C^2$-continuous splines, so only weak regularization on $\ddot\theta_t$ and $\dot\theta_t$ is required.
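As a concrete sketch (not cuRoboV2's kernel), the state evaluation above can be written with the standard uniform cubic B-spline basis matrix; the names `T`, `P`, and `C` mirror the equation, and the numeric values are illustrative:

```python
import numpy as np

# Standard uniform cubic B-spline basis matrix (textbook form, assumed here).
C = (1.0 / 6.0) * np.array([
    [ 1.0,  4.0,  1.0, 0.0],
    [-3.0,  0.0,  3.0, 0.0],
    [ 3.0, -6.0,  3.0, 0.0],
    [-1.0,  3.0, -3.0, 1.0],
])

def spline_state(U_k, alpha, dt):
    """Evaluate position, velocity, acceleration, jerk at phase alpha in [0, 1).

    U_k: (4, d) array of the four control points spanning the knot interval.
    dt:  knot spacing in seconds; T = diag(1, 1/dt, 1/dt^2, 1/dt^3) converts
         phase derivatives to time derivatives, as in Theta_t = T P(alpha) C U_k.
    """
    P = np.array([
        [1.0, alpha, alpha**2, alpha**3],      # position basis
        [0.0, 1.0, 2 * alpha, 3 * alpha**2],   # d/d(alpha)
        [0.0, 0.0, 2.0, 6 * alpha],            # d^2/d(alpha)^2
        [0.0, 0.0, 0.0, 6.0],                  # d^3/d(alpha)^3
    ])
    T = np.diag([1.0, 1.0 / dt, 1.0 / dt**2, 1.0 / dt**3])
    return T @ P @ C @ U_k  # rows: theta, theta_dot, theta_ddot, theta_dddot

# Example: a single joint (d = 1) evaluated mid-span.
state = spline_state(np.array([[0.0], [0.1], [0.3], [0.4]]), alpha=0.5, dt=0.1)
```

Because each $U_k$ enters only the four spans it supports, differentiating this expression gives the sparse gradient structure the optimizer exploits.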

The optimization cost at each time tt is:

$L = \sum_t \gamma_s \|\ddot\theta_t\|^2 + \gamma_\ell \|\dot\theta_t\|^2 + \gamma_e \|(\dot\theta_t \odot \tau_t) \cdot dt\|^2$

subject to B-spline interpolation, joint and torque bounds, collision constraints via ESDF queries, goal accuracy, and early-stop criteria. Gradients $\partial L/\partial U_k$ are computed via the chain rule, exploiting the fact that each $U_k$ influences only four knot spans; this lets GPU threads accumulate contributions in parallel and fuse computation with reduction within warps for memory efficiency.

The optimization proceeds in iterations: parallel time sampling of the B-spline, forward kinematics and inverse dynamics evaluation (RNEA) for link poses, torques, and Jacobians, parallel cost stream evaluation (scene collision, self-collision, bounds, energy), backpropagation, and parameter update via L-BFGS (or LM in IK). Implicit handling of trajectory boundaries ensures proper velocity and acceleration at trajectory endpoints.
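A minimal CPU analogue of this outer loop can be sketched with SciPy's L-BFGS: sample the spline densely, evaluate a cost, and update the control points. The weights and the finite-difference velocity/acceleration terms are illustrative assumptions; the real stack evaluates collision, self-collision, bound, and torque cost streams in parallel on GPU with analytic gradients:

```python
import numpy as np
from scipy.optimize import minimize

def sample_spline(U, n_per_span):
    """Densely sample positions of a uniform cubic B-spline with control points U (K, d)."""
    C = (1.0 / 6.0) * np.array(
        [[1, 4, 1, 0], [-3, 0, 3, 0], [3, -6, 3, 0], [-1, 3, -3, 1]], dtype=float)
    samples = []
    for k in range(len(U) - 3):
        for a in np.linspace(0.0, 1.0, n_per_span, endpoint=False):
            samples.append(np.array([1.0, a, a**2, a**3]) @ C @ U[k:k + 4])
    return np.asarray(samples)

def cost(u_flat, K, d, start, goal, dt):
    # Made-up weights standing in for gamma_s, gamma_l, and a goal penalty.
    gamma_s, gamma_l, gamma_g = 1e-2, 1e-3, 10.0
    theta = sample_spline(u_flat.reshape(K, d), 8)
    vel = np.diff(theta, axis=0) / dt          # finite-difference stand-in for
    acc = np.diff(vel, axis=0) / dt            # the analytic spline derivatives
    goal_err = np.sum((theta[0] - start) ** 2) + np.sum((theta[-1] - goal) ** 2)
    return gamma_s * np.sum(acc**2) + gamma_l * np.sum(vel**2) + gamma_g * goal_err

K, d, dt = 8, 2, 0.05
start, goal = np.zeros(d), np.ones(d)
U0 = np.linspace(start, goal, K)               # straight-line initialization
res = minimize(cost, U0.ravel(), args=(K, d, start, goal, dt), method="L-BFGS-B")
U_opt = res.x.reshape(K, d)
```

The optimizer pulls the control points outward so that the sampled trajectory meets the start and goal while staying smooth, mirroring (in miniature) the iteration structure described above.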

2. GPU-Native TSDF → Dense ESDF Perception

The perception pipeline partitions the workspace into $8^3$-voxel blocks managed in a CAS-protected hash pool, allocating only blocks within a truncation band $\mu$ (e.g., 2 cm) around observed surfaces. Each voxel encodes two float16 channels: (depth_sum, depth_weight) for fusion and a geometry SDF for analytic primitives. Voxel-projective integration consumes depth images: per-pixel threads compute signed distances for affected blocks and update running averages without atomics. Frustum-aware temporal decay manages block lifetimes, keeping active memory under 400 MB for 100k blocks.
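The fusion rule itself is a truncated running average per voxel. The toy sketch below shows one block's update for a single depth frame; the intrinsics, truncation band, and scalar loop are illustrative assumptions, whereas the real kernel shards this per pixel on GPU in float16:

```python
import numpy as np

def integrate_block(centers, depth, fx, fy, cx, cy, mu, dsum, dweight):
    """Fuse one depth frame into per-voxel (depth_sum, depth_weight) channels."""
    for i, (x, y, z) in enumerate(centers):
        if z <= 0:
            continue
        u, v = int(round(fx * x / z + cx)), int(round(fy * y / z + cy))
        if not (0 <= u < depth.shape[1] and 0 <= v < depth.shape[0]):
            continue
        sdf = depth[v, u] - z            # signed distance along the camera ray
        if abs(sdf) > mu:                # skip voxels outside the truncation band
            continue
        dsum[i] += sdf                   # running average: accumulate sum ...
        dweight[i] += 1.0                # ... and weight (1 per observation)
    return dsum / np.maximum(dweight, 1e-6)   # fused TSDF value per voxel

# One voxel 1 m in front of the camera, a flat wall at 1.005 m, mu = 2 cm.
centers = np.array([[0.0, 0.0, 1.0]])
depth = np.full((480, 640), 1.005)
dsum, dweight = np.zeros(1), np.zeros(1)
tsdf = integrate_block(centers, depth, 600.0, 600.0, 320.0, 240.0, 0.02, dsum, dweight)
```

Storing (sum, weight) instead of the average itself is what makes the per-frame update associative, so threads can fuse without atomics.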

To enable whole-workspace signed distance queries, cuRoboV2 synthesizes dense ESDF from the sparse TSDF via an on-demand, three-stage Parallel Banding Algorithm Plus (PBA+):

  1. Gather seeding: 3D grid cells are stenciled to locate "site" cells near TSDF zero-crossings.
  2. Distance propagation: Three sweeps (along the Z, Y, and X axes) perform flood propagation and the exact Euclidean transform via Maurer's parabola intersection, requiring only five kernel launches and $O(N)$ work.
  3. Sign recovery: Unsigned distances are signed via analysis of geometry and TSDF/ESDF consistency.
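A single-axis pass of this separable transform can be sketched with the classic lower-envelope-of-parabolas recurrence (the Felzenszwalb-Huttenlocher formulation, closely related to the Maurer intersections used here). This is an illustrative CPU version, not the five-kernel GPU implementation:

```python
import numpy as np

def edt_1d(f):
    """Exact 1D squared Euclidean distance transform of sampled function f."""
    n = len(f)
    d = np.zeros(n)
    v = np.zeros(n, dtype=int)      # indices of parabola sites on the lower envelope
    z = np.zeros(n + 1)             # boundaries between consecutive parabolas
    k = 0
    v[0], z[0], z[1] = 0, -np.inf, np.inf
    for q in range(1, n):
        # Intersection of the parabola rooted at q with the rightmost envelope parabola.
        s = ((f[q] + q * q) - (f[v[k]] + v[k] * v[k])) / (2 * q - 2 * v[k])
        while s <= z[k]:            # new parabola hides the rightmost one: pop it
            k -= 1
            s = ((f[q] + q * q) - (f[v[k]] + v[k] * v[k])) / (2 * q - 2 * v[k])
        k += 1
        v[k], z[k], z[k + 1] = q, s, np.inf
    k = 0
    for q in range(n):              # read distances off the lower envelope
        while z[k + 1] < q:
            k += 1
        d[q] = (q - v[k]) ** 2 + f[v[k]]
    return d

# Sites (e.g., TSDF zero-crossings) at indices 2 and 7; all other cells "infinite".
INF = 1e20
f = np.full(10, INF)
f[2] = f[7] = 0.0
dist = np.sqrt(edt_1d(f))           # per-cell distance to the nearest site
```

Running this pass along each axis in turn (seeding each sweep with the previous axis's squared distances) yields the exact 3D transform, which is why only a handful of kernel launches are needed.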

This approach yields up to 10× speedup and 8× lower memory consumption over nvblox, with 99% collision recall at millimeter-scale ESDF resolution.

3. Scalable GPU-Native Whole-Body Computation

cuRoboV2 deploys topology-aware kinematic and dynamic computation, efficiently scaling to robots with tens of DoF:

  • Forward Kinematics: Adaptive dispatch runs efficient fused kernels for small robots (≤100 spheres) and decomposes computation on larger robots, separating transform calculation and sphere/Jacobian evaluation across many threads.
  • Gradient Backpropagation: Warps assign threads to links with nonzero gradients, leveraging a topology cache for efficient ancestor lookup and warp synchronization (without global atomics) for joint gradient calculation.
  • Inverse Kinematics (IK): Sparse Jacobian construction exploits mechanical coupling and sub-tree pruning using coarse "affects" masks and link chains, supporting scalability in LM-based optimization.
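To make the kinematic side concrete, here is a minimal tree forward-kinematics sketch driven by a parent-index array of the kind a topology cache stores. The joint model (revolute about Z only) and the example chain are simplifying assumptions, not cuRoboV2's robot model:

```python
import numpy as np

def rot_z(q):
    """4x4 homogeneous rotation about the joint's Z axis."""
    c, s = np.cos(q), np.sin(q)
    T = np.eye(4)
    T[:2, :2] = [[c, -s], [s, c]]
    return T

def trans(xyz):
    """4x4 homogeneous translation by a fixed link offset."""
    T = np.eye(4)
    T[:3, 3] = xyz
    return T

def fk_tree(parents, offsets, q):
    """Compose transforms root-to-tips over a kinematic tree.

    parents[i] < i (topologically sorted), -1 for the root;
    offsets[i] is link i's fixed offset from its parent frame.
    """
    Ts = [None] * len(parents)
    for i, p in enumerate(parents):
        local = trans(offsets[i]) @ rot_z(q[i])
        Ts[i] = local if p < 0 else Ts[p] @ local
    return Ts

# Two links hanging off a common root (e.g., two chains off a torso).
parents = [-1, 0, 0]
offsets = [np.zeros(3), np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
Ts = fk_tree(parents, offsets, np.array([np.pi / 2, 0.0, 0.0]))
```

The same parent array drives the backward pass: a link's gradient is scattered to exactly its ancestors, which is what the warp-level ancestor lookup accelerates.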

The framework's differentiable RNEA is a CUDA-Python kernel that operates at runtime on arbitrary robot models (including mimic joints and runtime payload changes), with spatial transforms and inertias stored in compact factored forms. The forward (base-to-tips) and backward (tips-to-base) passes compute velocities, accelerations, and joint torques, supporting vector-Jacobian-product backward differentiation in $O(n)$ work. For self-collision, a two-stage map-reduce evaluates $N_{\text{pairs}}$ sphere-sphere potential penetrations (up to 162k pairs on a 48-DoF humanoid), partitioned across GPU blocks: each block computes a local maximum, and these are reduced to a global maximum, converting a memory-bound routine into a compute-bound shared-memory reduction.
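The self-collision map stage reduces to a simple per-pair computation. The sketch below shows the arithmetic with a vectorized NumPy max standing in for the two-stage shared-memory reduction; the sphere data and candidate pair list are illustrative:

```python
import numpy as np

def max_self_penetration(centers, radii, pairs):
    """Maximum penetration over candidate sphere pairs.

    centers: (N, 3) collision-sphere centers from forward kinematics;
    radii:   (N,) sphere radii;
    pairs:   (M, 2) precomputed candidate index pairs (adjacent-link pairs
             are pruned offline, so only pairs that can collide remain).
    """
    i, j = pairs[:, 0], pairs[:, 1]
    dist = np.linalg.norm(centers[i] - centers[j], axis=1)
    pen = (radii[i] + radii[j]) - dist   # > 0 means the two spheres overlap
    return float(pen.max())              # GPU version: block-local max, then global max

# Three spheres; the first two overlap by 5 cm worth of radius sum minus distance.
centers = np.array([[0.0, 0.0, 0.0], [0.15, 0.0, 0.0], [1.0, 0.0, 0.0]])
radii = np.array([0.1, 0.1, 0.1])
pairs = np.array([[0, 1], [0, 2], [1, 2]])
pen_max = max_self_penetration(centers, radii, pairs)
```

Because only the maximum survives, each GPU block can keep its partial result in shared memory and write a single value, which is what removes the memory-bandwidth bottleneck.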

4. Implementation Patterns and Codebase Design

The codebase is structured for discoverability and productivity, enabling coding assistants to generate novel CUDA kernels and APIs. Key features include:

  • Strong typing and self-documenting dataclasses, eschewing hidden YAML configs.
  • Single-responsibility files (<300 lines), consistent use of __all__, comprehensive testing (>4k documented tests).
  • Installation simplicity via pip install cuRoboV2, with CUDA kernels compiled at first use using CUDA Python, decoupling installation from CUDA/PyTorch versions.
  • Category-first filenames and CamelCase classes, with documented default constructors.
  • As a result, up to 73% of new modules (including hand-optimized CUDA kernels) were generated by an LLM assistant, illustrating the codebase's compatibility with assisted software creation.

Critical performance kernels include interleaved-warp B-spline gradient computation, atomics-free voxel TSDF integration, separable PBA+ ESDF propagation, shared-memory RNEA execution, and shared-memory map-reduce for self-collision.

5. Empirical Performance and Benchmarks

cuRoboV2 demonstrates both superior accuracy and scalability:

| Task | cuRoboV2 result | Baselines |
| --- | --- | --- |
| Payload-aware planning (3 kg, Franka) | 99.7% success (2,600 problems) | cuRobo v1: 77.1%; samplers: 72-75% |
| Collision-free IK on 48-DoF humanoid | 99.6% success | cuRobo v1 / PyRoki: 0% |
| IK speed (48-DoF, standard) | 100% in 34 ms | Newton: 98.4% in 317 ms; PyRoki: 49.8% in 1123 ms |
| Self-collision checking (48-DoF) | 61× speedup | vs. cuRobo v1 |
| Humanoid retargeting constraint satisfaction | IK: 89.5%; MPC: 96.6% | PyRoki: 61.2%; mink: 40.6% |
| Locomotion (running) | 0 resets, 139 mm MPJPE | PyRoki: 32 resets, 183 mm MPJPE |
| ESDF (Redwood, 10 mm, full workspace) | 1.69 ms, 1.63 GB, 92.5% recall | nvblox: 12.68 ms, 11.87 GB, 92% |

Performance advantages stem from full-coverage ESDF, scalable inverse dynamics (up to 14–18× speedup), topology-aware computation, and whole-body collision avoidance enabling policies with 21% lower tracking error and 12× lower cross-seed variance than prior frameworks.

6. Limitations and Future Directions

Current cuRoboV2 limitations include approximate camera segmentation (potentially addressable via learned RGB masks), single-camera workspace coverage (multi-view fusion is pending), and lack of end-to-end B-spline-based reactive planning via global-to-local MPC warm-starting. LLM-assisted code generation remains contingent on human expert interpretation of hardware feedback (such as register pressure), highlighting a need for improved automated understanding of CUDA performance diagnostics. Extending RNEA for differentiable contact and friction effects (enabling contact-implicit planning) is a proposed pathway for future extension. These limitations circumscribe further unification of perception, planning, and policy stacks for high-DoF systems (Sundaralingam et al., 5 Mar 2026).
