cuRoboV2: GPU-Native Motion Framework
- cuRoboV2 is a fully GPU-native motion-generation framework that unifies global planning, reactive control, and whole-body retargeting for high-degree-of-freedom robots.
- It leverages B-spline trajectory optimization and a block-sparse TSDF-to-dense ESDF perception pipeline, achieving significant speedup and memory efficiency improvements over traditional methods.
- The framework integrates scalable GPU-native kinematics, dynamics, and self-collision modules, enabling real-time trajectory synthesis and precise robotic manipulation.
cuRoboV2 is a fully GPU-native motion-generation framework designed for safe, feasible, and reactive trajectory synthesis for high-degree-of-freedom (DoF) robots. It unifies global planning, reactive control, and whole-body retargeting, overcoming the fragmentation of traditional pipelines in which fast planners, reactive controllers, and high-DoF solvers are decoupled or insufficiently scalable. cuRoboV2 integrates B-spline trajectory optimization, a block-sparse TSDF-to-dense-ESDF perception pipeline, and scalable, topology-aware GPU-native kinematics, dynamics, and self-collision modules into a single high-performance stack, yielding significant gains in memory efficiency, computation speed, and practical deployability across manipulator arms and full humanoids (Sundaralingam et al., 5 Mar 2026).
1. B-spline Trajectory Optimization
cuRoboV2 represents each joint trajectory as a uniform cubic B-spline over a sequence of control points. The robot state vector at time $t$ is given by
$\Theta_t = [\theta_t, \dot\theta_t, \ddot\theta_t, \dddot\theta_t]^T = T \cdot P(\alpha) \cdot C \cdot U_k$
where $T$, $P(\alpha)$, and $C$ are fixed matrices derived from the time discretization and the cubic B-spline basis, and $U_k$ stacks the four adjacent control points spanning the $k$-th knot interval. This construction guarantees $C^2$-continuous splines, requiring only weak regularization on $\ddot\theta_t$ and $\dddot\theta_t$.
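The matrix construction above can be made concrete with a small numerical sketch. The snippet below uses the standard uniform cubic B-spline basis matrix; names like `spline_state` are illustrative, not cuRoboV2's API:

```python
import numpy as np

# Standard uniform cubic B-spline basis matrix (the role of C above);
# its rows are multiplied by the power basis [1, u, u^2, u^3].
C_BASIS = (1.0 / 6.0) * np.array([
    [1.0,  4.0,  1.0, 0.0],
    [-3.0, 0.0,  3.0, 0.0],
    [3.0, -6.0,  3.0, 0.0],
    [-1.0, 3.0, -3.0, 1.0],
])

def spline_state(u, P4, dt=1.0):
    """Position, velocity, acceleration, and jerk of a 1-DoF uniform
    cubic B-spline at local parameter u in [0, 1), given the four
    adjacent control points P4 (shape (4,)) spanning one knot interval
    of duration dt."""
    # Power basis and its derivatives with respect to u (the role of P(alpha)).
    powers = np.array([
        [1.0, u,   u * u,   u ** 3],   # position
        [0.0, 1.0, 2 * u,   3 * u * u],  # d/du
        [0.0, 0.0, 2.0,     6 * u],     # d^2/du^2
        [0.0, 0.0, 0.0,     6.0],       # d^3/du^3
    ])
    # Chain rule from u to time t (the role of the fixed time matrix T).
    scale = np.array([1.0, 1.0 / dt, 1.0 / dt ** 2, 1.0 / dt ** 3])
    return scale * (powers @ C_BASIS @ P4)
```

With identical control points the spline is constant (all derivatives vanish), and with linearly spaced control points it reproduces a constant-velocity ramp, illustrating the $C^2$ continuity that makes only weak higher-derivative regularization necessary.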
At each sampled time, the optimizer evaluates a weighted sum of cost terms subject to B-spline interpolation, joint and torque bounds, collision constraints via ESDF queries, goal accuracy, and early-stop criteria. Gradients are computed with the chain rule, exploiting the fact that each control point influences only four knot spans; GPU threads therefore accumulate gradients in parallel and fuse computation and reduction within warps for memory efficiency.
Each optimization iteration performs: parallel time sampling of the B-spline; forward kinematics and inverse dynamics (RNEA) evaluation for link poses, torques, and Jacobians; parallel evaluation of the cost streams (scene collision, self-collision, bounds, energy); backpropagation; and a parameter update via L-BFGS (or Levenberg-Marquardt in IK). Trajectory boundaries are handled implicitly to ensure correct velocities and accelerations at the endpoints.
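The iteration structure can be sketched on CPU with a toy 1-DoF problem: sample the spline, evaluate a cost stream (here only a goal term, a start term, and a weak acceleration regularizer), and update control points with L-BFGS. This is a stand-in using SciPy's optimizer, not cuRoboV2's GPU solver, and all weights are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

# Uniform cubic B-spline basis matrix.
M = (1.0 / 6.0) * np.array([[1, 4, 1, 0], [-3, 0, 3, 0],
                            [3, -6, 3, 0], [-1, 3, -3, 1]], dtype=float)

def sample(P, n=16):
    """Sample position and acceleration of a 1-DoF spline with control
    points P; each knot span uses four adjacent control points."""
    pos, acc = [], []
    for k in range(len(P) - 3):
        seg = M @ P[k:k + 4]
        for u in np.linspace(0.0, 1.0, n, endpoint=False):
            pos.append(np.array([1, u, u * u, u ** 3]) @ seg)
            acc.append(np.array([0, 0, 2, 6 * u]) @ seg)
    return np.array(pos), np.array(acc)

def cost(P):
    pos, acc = sample(P)
    return ((pos[0] - 0.0) ** 2           # start at joint value 0
            + (pos[-1] - 1.0) ** 2        # goal: reach joint value 1
            + 1e-3 * np.mean(acc ** 2))   # weak acceleration regularizer

# Optimize 8 control points from a zero initialization.
res = minimize(cost, np.zeros(8), method="L-BFGS-B")
```

In the real framework the cost stream additionally includes ESDF collision, self-collision, and bound terms, and the gradient is computed analytically on the GPU rather than by finite differences.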
2. GPU-Native TSDF → Dense ESDF Perception
The perception pipeline partitions the workspace into fixed-size voxel blocks managed in a CAS-protected hash pool, allocating blocks only within a truncation band (e.g., 2 cm) around observed surfaces. Each voxel encodes two float16 channels, (depth_sum, depth_weight), for fusion, plus a geometry SDF for analytic primitives. Voxel-projective integration processes depth images: per-pixel threads compute signed distances for affected blocks and update running averages without atomics. Frustum-aware temporal decay manages block lifetimes, keeping active memory under 400 MB for 100k blocks.
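A CPU sketch of the block-hash bookkeeping may help: a Python dict stands in for the CAS-protected GPU hash pool, and the block size and constants are illustrative, not cuRoboV2's actual values:

```python
import numpy as np

BLOCK = 8      # assumed block side length in voxels (illustrative)
VOXEL = 0.01   # 1 cm voxels (illustrative)
TRUNC = 0.02   # 2 cm truncation band, as in the text

class SparseTSDF:
    """Dict-based stand-in for the sparse voxel-block hash pool."""
    def __init__(self):
        self.blocks = {}  # (bx, by, bz) -> (depth_sum, depth_weight) arrays

    def _block(self, key):
        # Allocate lazily, so only blocks near observed surfaces exist.
        if key not in self.blocks:
            self.blocks[key] = (np.zeros((BLOCK,) * 3, np.float16),
                                np.zeros((BLOCK,) * 3, np.float16))
        return self.blocks[key]

    def integrate(self, point, signed_dist):
        """Fuse one signed-distance observation; skip outside the band."""
        if abs(signed_dist) > TRUNC:
            return
        v = np.floor(point / VOXEL).astype(int)
        key, local = tuple(v // BLOCK), tuple(v % BLOCK)
        s, w = self._block(key)
        s[local] += np.float16(signed_dist)  # running weighted average:
        w[local] += np.float16(1.0)          # tsdf = s / w at query time

    def query(self, point):
        v = np.floor(point / VOXEL).astype(int)
        key, local = tuple(v // BLOCK), tuple(v % BLOCK)
        if key not in self.blocks:
            return None
        s, w = self.blocks[key]
        return float(s[local] / w[local]) if w[local] > 0 else None
```

On the GPU the same update is done by per-pixel threads writing disjoint voxels, which is what makes the atomics-free formulation possible.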
To enable whole-workspace signed distance queries, cuRoboV2 synthesizes dense ESDF from the sparse TSDF via an on-demand, three-stage Parallel Banding Algorithm Plus (PBA+):
- Gather seeding: 3D grid cells are stenciled to locate "site" cells near TSDF zero-crossings.
- Distance propagation: Three sweeps (along the Z, Y, and X axes) perform flood propagation and an exact Euclidean transform via Maurer's parabola intersections, requiring only five kernel launches and work linear in the number of voxels.
- Sign recovery: Unsigned distances are signed via analysis of geometry and TSDF/ESDF consistency.
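The Maurer-style sweeps of PBA+ are beyond a short sketch, but the separable per-axis structure can be illustrated with the closely related Felzenszwalb-Huttenlocher 1D squared-distance transform applied once per axis (NumPy; a large finite constant stands in for "no site on this line"):

```python
import numpy as np

BIG = 1e12  # finite stand-in for "no site on this scan line"

def edt_1d(f):
    """Felzenszwalb-Huttenlocher 1D squared distance transform of a
    sampled cost f (0 at sites, BIG elsewhere)."""
    n = len(f)
    d = np.empty(n)
    v = np.zeros(n, dtype=int)  # parabola vertex positions
    z = np.empty(n + 1)         # boundaries between parabola regions
    z[0], z[1] = -np.inf, np.inf
    k = 0
    for q in range(1, n):
        s = ((f[q] + q * q) - (f[v[k]] + v[k] * v[k])) / (2 * q - 2 * v[k])
        while s <= z[k]:        # pop parabolas hidden by the new one
            k -= 1
            s = ((f[q] + q * q) - (f[v[k]] + v[k] * v[k])) / (2 * q - 2 * v[k])
        k += 1
        v[k] = q
        z[k], z[k + 1] = s, np.inf
    k = 0
    for q in range(n):
        while z[k + 1] < q:
            k += 1
        d[q] = (q - v[k]) ** 2 + f[v[k]]
    return d

def edt_3d(sites):
    """Exact Euclidean distance (in voxels) to True cells, via one
    separable 1D sweep per axis -- the same Z/Y/X pass structure that
    PBA+ uses on the GPU."""
    f = np.where(sites, 0.0, BIG)
    for axis in range(3):
        f = np.apply_along_axis(edt_1d, axis, f)
    return np.sqrt(f)
```

Each axis sweep is embarrassingly parallel across scan lines, which is why the GPU version needs only a handful of kernel launches.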
This approach yields up to 10× speedup and 8× lower memory consumption over nvblox, with 99% collision recall at millimeter-scale ESDF resolution.
3. Scalable GPU-Native Whole-Body Computation
cuRoboV2 deploys topology-aware kinematic and dynamic computation, efficiently scaling to robots with tens of DoF:
- Forward Kinematics: Adaptive dispatch runs efficient fused kernels for small robots (≤100 spheres) and decomposes computation on larger robots, separating transform calculation and sphere/Jacobian evaluation across many threads.
- Gradient Backpropagation: Warps assign threads to links with nonzero gradients, leveraging a topology cache for efficient ancestor lookup and warp synchronization (without global atomics) for joint gradient calculation.
- Inverse Kinematics (IK): Sparse Jacobian construction exploits mechanical coupling and sub-tree pruning using coarse "affects" masks and link chains, supporting scalability in LM-based optimization.
The framework's differentiable RNEA is a CUDA-Python kernel that operates at runtime on arbitrary robot models (including mimic joints and runtime payload changes), with spatial transforms and inertias stored in compact factored forms. The forward (base-to-tips) and backward (tips-to-base) passes compute velocities, accelerations, and joint torques, and support vector-Jacobian-product backward differentiation in work linear in the number of links. For self-collision, a two-stage map-reduce evaluates sphere-sphere potential penetrations (up to 162k pairs on the 48-DoF humanoid), partitioned across GPU blocks: local maxima are computed per block and reduced to a global maximum, converting a memory-bound routine into a shared-memory compute-bound one.
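The two-stage self-collision map-reduce can be sketched on CPU, with array chunks standing in for GPU blocks and their shared-memory reductions (the function name and chunk size are illustrative):

```python
import numpy as np
from itertools import combinations

def max_self_penetration(centers, radii, pairs=None, chunk=1024):
    """Largest sphere-sphere penetration depth over candidate pairs,
    computed as a two-stage map-reduce: per-chunk local maxima (stand-in
    for per-block shared-memory reductions), then one global max."""
    if pairs is None:
        # All pairs; the real system prunes with self-collision masks.
        pairs = np.array(list(combinations(range(len(centers)), 2)))
    local_maxima = []
    for start in range(0, len(pairs), chunk):   # "GPU block" partition
        i, j = pairs[start:start + chunk].T
        gap = (np.linalg.norm(centers[i] - centers[j], axis=1)
               - (radii[i] + radii[j]))
        local_maxima.append(np.max(-gap))       # penetration = -gap
    return max(max(local_maxima), 0.0)          # global reduction
```

Because each chunk reduces to a single scalar before the global step, the traffic to global memory shrinks from one value per pair to one value per block, which is the essence of the compute-bound reformulation described above.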
4. Implementation Patterns and Codebase Design
The codebase is structured for discoverability and productivity, enabling coding assistants to generate novel CUDA kernels and APIs. Key features include:
- Strong typing and self-documenting dataclasses, eschewing hidden YAML configs.
- Single-responsibility files (<300 lines), consistent use of `__all__`, and comprehensive testing (>4k documented tests).
- Installation simplicity via `pip install cuRoboV2`, with CUDA kernels compiled at first use through CUDA Python, decoupling installation from CUDA/PyTorch versions.
- Category-first filenames and CamelCase classes, with documented default constructors.
- As a result, up to 73% of new modules (including hand-optimized CUDA kernels) were generated by an LLM assistant, illustrating the codebase's compatibility with assisted software creation.
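The configuration style the list above describes might look like the following; every field name here is hypothetical, not cuRoboV2's actual API:

```python
from dataclasses import dataclass, field

__all__ = ["SplineOptConfig"]

@dataclass
class SplineOptConfig:
    """Illustrative self-documenting config in the described style:
    typed fields with explicit defaults instead of hidden YAML.
    All names and values are hypothetical."""
    num_control_points: int = 32
    knot_duration_s: float = 0.05
    collision_weight: float = 100.0
    smoothness_weight: float = 1e-3
    joint_limit_margin_rad: float = 0.02
    cost_weights: dict = field(default_factory=lambda: {"goal": 1.0})
```

Typed, defaulted dataclasses of this kind are discoverable by both humans and coding assistants, which is plausibly one reason LLM-generated modules integrate cleanly.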
Critical performance kernels include interleaved-warp B-spline gradient computation, atomics-free voxel TSDF integration, separable PBA+ ESDF propagation, shared-memory RNEA execution, and shared-memory map-reduce for self-collision.
5. Empirical Performance and Benchmarks
cuRoboV2 demonstrates both superior accuracy and scalability:
| Task | cuRoboV2 Result | Baselines |
|---|---|---|
| Payload-aware planning (3 kg, Franka) | 99.7% success (2,600 problems) | cuRobo v1: 77.1%, samplers: 72–75% |
| IK on 48-DoF humanoid (collision-free) | 99.6% success | cuRobo v1/PyRoki: 0% |
| IK speed (48-DoF, standard) | 100% in 34 ms | Newton: 98.4% in 317 ms, PyRoki: 49.8% in 1123 ms |
| Self-collision (48-DoF, speedup) | 61× vs cuRobo v1 | |
| Humanoid retargeting constraint satisfaction | IK: 89.5%, MPC: 96.6% | PyRoki: 61.2%, mink: 40.6% |
| Locomotion tracking (running) | 0 resets, 139 mm MPJPE | PyRoki: 32 resets, 183 mm MPJPE |
| ESDF (Redwood, 10mm, full workspace) | 1.69 ms, 1.63 GB, 92.5% recall | nvblox: 12.68 ms, 11.87 GB, 92% |
Performance advantages stem from full-coverage ESDF, scalable inverse dynamics (up to 14–18× speedup), topology-aware computation, and whole-body collision avoidance enabling policies with 21% lower tracking error and 12× lower cross-seed variance than prior frameworks.
6. Limitations and Future Directions
Current cuRoboV2 limitations include approximate camera segmentation (potentially addressable via learned RGB masks), single-camera workspace coverage (multi-view fusion is pending), and the lack of end-to-end B-spline-based reactive planning via global-to-local MPC warm-starting. LLM-assisted code generation remains contingent on human expert interpretation of hardware feedback (such as register pressure), highlighting the need for better automated understanding of CUDA performance diagnostics. Extending RNEA with differentiable contact and friction effects, enabling contact-implicit planning, is a proposed pathway for future work. Addressing these limitations would further unify the perception, planning, and policy stacks for high-DoF systems (Sundaralingam et al., 5 Mar 2026).