cuRoboV2: GPU-Native Motion Framework
- cuRoboV2 is a fully GPU-native motion-generation framework that unifies global planning, reactive control, and whole-body retargeting for high-degree-of-freedom robots.
- It leverages B-spline trajectory optimization and a block-sparse TSDF-to-dense ESDF perception pipeline, achieving significant speedup and memory efficiency improvements over traditional methods.
- The framework integrates scalable GPU-native kinematics, dynamics, and self-collision modules, enabling real-time trajectory synthesis and precise robotic manipulation.
cuRoboV2 is a fully GPU-native motion-generation framework designed for safe, feasible, and reactive trajectory synthesis for high-degree-of-freedom (DoF) robots. It unifies global planning, reactive control, and whole-body retargeting, overcoming the fragmentation of traditional pipelines in which fast planners, reactive controllers, and high-DoF solvers are decoupled or insufficiently scalable. cuRoboV2 integrates B-spline trajectory optimization, a block-sparse TSDF-to-dense-ESDF perception pipeline, and scalable, topology-aware GPU-native kinematics, dynamics, and self-collision modules into a single high-performance stack, yielding significant gains in memory efficiency, computation speed, and practical deployability across manipulator arms and full humanoids (Sundaralingam et al., 5 Mar 2026).
1. B-spline Trajectory Optimization
cuRoboV2 represents each joint trajectory as a uniform cubic B-spline over a sequence of control points. The robot state vector at time $t$ is given by
$\Theta_t = [\theta_t, \dot\theta_t, \ddot\theta_t, \dddot\theta_t]^T = T \cdot P(\alpha) \cdot C \cdot U_k$
where $T$, $P(\alpha)$, and $C$ are fixed matrices derived from the time discretization and the cubic B-spline basis, and $U_k$ stacks the four adjacent control points spanning the $k$-th knot interval. This construction guarantees $C^2$-continuous splines, requiring only weak regularization on $\ddot\theta_t$ and $\dddot\theta_t$.
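The matrix construction above can be made concrete with a small numerical sketch. The snippet below uses the standard uniform cubic B-spline basis matrix; names like `spline_state` are illustrative, not cuRoboV2's API:

```python
import numpy as np

# Standard uniform cubic B-spline basis matrix (the role of C above);
# its rows are multiplied by the power basis [1, u, u^2, u^3].
C_BASIS = (1.0 / 6.0) * np.array([
    [1.0,  4.0,  1.0, 0.0],
    [-3.0, 0.0,  3.0, 0.0],
    [3.0, -6.0,  3.0, 0.0],
    [-1.0, 3.0, -3.0, 1.0],
])

def spline_state(u, P4, dt=1.0):
    """Position, velocity, acceleration, and jerk of a 1-DoF uniform
    cubic B-spline at local parameter u in [0, 1), given the four
    adjacent control points P4 (shape (4,)) spanning one knot interval
    of duration dt."""
    # Power basis and its derivatives with respect to u (the role of P(alpha)).
    powers = np.array([
        [1.0, u,   u * u,   u ** 3],   # position
        [0.0, 1.0, 2 * u,   3 * u * u],  # d/du
        [0.0, 0.0, 2.0,     6 * u],     # d^2/du^2
        [0.0, 0.0, 0.0,     6.0],       # d^3/du^3
    ])
    # Chain rule from u to time t (the role of the fixed time matrix T).
    scale = np.array([1.0, 1.0 / dt, 1.0 / dt ** 2, 1.0 / dt ** 3])
    return scale * (powers @ C_BASIS @ P4)
```

With identical control points the spline is constant (all derivatives vanish), and with linearly spaced control points it reproduces a constant-velocity ramp, illustrating the $C^2$ continuity that makes only weak higher-derivative regularization necessary.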
At each sampled time, the optimizer evaluates a weighted sum of cost terms subject to B-spline interpolation, joint and torque bounds, collision constraints via ESDF queries, goal accuracy, and early-stop criteria. Gradients are computed with the chain rule, exploiting the fact that each control point influences only four knot spans; GPU threads therefore accumulate gradients in parallel and fuse computation and reduction within warps for memory efficiency.
Each optimization iteration performs: parallel time sampling of the B-spline; forward kinematics and inverse dynamics (RNEA) evaluation for link poses, torques, and Jacobians; parallel evaluation of the cost streams (scene collision, self-collision, bounds, energy); backpropagation; and a parameter update via L-BFGS (or Levenberg-Marquardt in IK). Trajectory boundaries are handled implicitly to ensure correct velocities and accelerations at the endpoints.
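The iteration structure can be sketched on CPU with a toy 1-DoF problem: sample the spline, evaluate a cost stream (here only a goal term, a start term, and a weak acceleration regularizer), and update control points with L-BFGS. This is a stand-in using SciPy's optimizer, not cuRoboV2's GPU solver, and all weights are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

# Uniform cubic B-spline basis matrix.
M = (1.0 / 6.0) * np.array([[1, 4, 1, 0], [-3, 0, 3, 0],
                            [3, -6, 3, 0], [-1, 3, -3, 1]], dtype=float)

def sample(P, n=16):
    """Sample position and acceleration of a 1-DoF spline with control
    points P; each knot span uses four adjacent control points."""
    pos, acc = [], []
    for k in range(len(P) - 3):
        seg = M @ P[k:k + 4]
        for u in np.linspace(0.0, 1.0, n, endpoint=False):
            pos.append(np.array([1, u, u * u, u ** 3]) @ seg)
            acc.append(np.array([0, 0, 2, 6 * u]) @ seg)
    return np.array(pos), np.array(acc)

def cost(P):
    pos, acc = sample(P)
    return ((pos[0] - 0.0) ** 2           # start at joint value 0
            + (pos[-1] - 1.0) ** 2        # goal: reach joint value 1
            + 1e-3 * np.mean(acc ** 2))   # weak acceleration regularizer

# Optimize 8 control points from a zero initialization.
res = minimize(cost, np.zeros(8), method="L-BFGS-B")
```

In the real framework the cost stream additionally includes ESDF collision, self-collision, and bound terms, and the gradient is computed analytically on the GPU rather than by finite differences.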
2. GPU-Native TSDF → Dense ESDF Perception
The perception pipeline partitions the workspace into fixed-size voxel blocks managed in a CAS-protected hash pool, allocating blocks only within a truncation band (e.g., 2 cm) around observed surfaces. Each voxel encodes two float16 channels, (depth_sum, depth_weight), for fusion, plus a geometry SDF for analytic primitives. Voxel-projective integration processes depth images: per-pixel threads compute signed distances for affected blocks and update running averages without atomics. Frustum-aware temporal decay manages block lifetimes, keeping active memory under 400 MB for 100k blocks.
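A CPU sketch of the block-hash bookkeeping may help: a Python dict stands in for the CAS-protected GPU hash pool, and the block size and constants are illustrative, not cuRoboV2's actual values:

```python
import numpy as np

BLOCK = 8      # assumed block side length in voxels (illustrative)
VOXEL = 0.01   # 1 cm voxels (illustrative)
TRUNC = 0.02   # 2 cm truncation band, as in the text

class SparseTSDF:
    """Dict-based stand-in for the sparse voxel-block hash pool."""
    def __init__(self):
        self.blocks = {}  # (bx, by, bz) -> (depth_sum, depth_weight) arrays

    def _block(self, key):
        # Allocate lazily, so only blocks near observed surfaces exist.
        if key not in self.blocks:
            self.blocks[key] = (np.zeros((BLOCK,) * 3, np.float16),
                                np.zeros((BLOCK,) * 3, np.float16))
        return self.blocks[key]

    def integrate(self, point, signed_dist):
        """Fuse one signed-distance observation; skip outside the band."""
        if abs(signed_dist) > TRUNC:
            return
        v = np.floor(point / VOXEL).astype(int)
        key, local = tuple(v // BLOCK), tuple(v % BLOCK)
        s, w = self._block(key)
        s[local] += np.float16(signed_dist)  # running weighted average:
        w[local] += np.float16(1.0)          # tsdf = s / w at query time

    def query(self, point):
        v = np.floor(point / VOXEL).astype(int)
        key, local = tuple(v // BLOCK), tuple(v % BLOCK)
        if key not in self.blocks:
            return None
        s, w = self.blocks[key]
        return float(s[local] / w[local]) if w[local] > 0 else None
```

On the GPU the same update is done by per-pixel threads writing disjoint voxels, which is what makes the atomics-free formulation possible.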
To enable whole-workspace signed distance queries, cuRoboV2 synthesizes dense ESDF from the sparse TSDF via an on-demand, three-stage Parallel Banding Algorithm Plus (PBA+):
- Gather seeding: 3D grid cells are stenciled to locate "site" cells near TSDF zero-crossings.
- Distance propagation: Three sweeps (along the Z, Y, and X axes) perform flood propagation and an exact Euclidean transform via Maurer's parabola intersections, requiring only five kernel launches and work linear in the number of voxels.
- Sign recovery: Unsigned distances are signed via analysis of geometry and TSDF/ESDF consistency.
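The Maurer-style sweeps of PBA+ are beyond a short sketch, but the separable per-axis structure can be illustrated with the closely related Felzenszwalb-Huttenlocher 1D squared-distance transform applied once per axis (NumPy; a large finite constant stands in for "no site on this line"):

```python
import numpy as np

BIG = 1e12  # finite stand-in for "no site on this scan line"

def edt_1d(f):
    """Felzenszwalb-Huttenlocher 1D squared distance transform of a
    sampled cost f (0 at sites, BIG elsewhere)."""
    n = len(f)
    d = np.empty(n)
    v = np.zeros(n, dtype=int)  # parabola vertex positions
    z = np.empty(n + 1)         # boundaries between parabola regions
    z[0], z[1] = -np.inf, np.inf
    k = 0
    for q in range(1, n):
        s = ((f[q] + q * q) - (f[v[k]] + v[k] * v[k])) / (2 * q - 2 * v[k])
        while s <= z[k]:        # pop parabolas hidden by the new one
            k -= 1
            s = ((f[q] + q * q) - (f[v[k]] + v[k] * v[k])) / (2 * q - 2 * v[k])
        k += 1
        v[k] = q
        z[k], z[k + 1] = s, np.inf
    k = 0
    for q in range(n):
        while z[k + 1] < q:
            k += 1
        d[q] = (q - v[k]) ** 2 + f[v[k]]
    return d

def edt_3d(sites):
    """Exact Euclidean distance (in voxels) to True cells, via one
    separable 1D sweep per axis -- the same Z/Y/X pass structure that
    PBA+ uses on the GPU."""
    f = np.where(sites, 0.0, BIG)
    for axis in range(3):
        f = np.apply_along_axis(edt_1d, axis, f)
    return np.sqrt(f)
```

Each axis sweep is embarrassingly parallel across scan lines, which is why the GPU version needs only a handful of kernel launches.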
This approach yields up to 10× speedup and 8× lower memory consumption over nvblox, with 99% collision recall at millimeter-scale ESDF resolution.
3. Scalable GPU-Native Whole-Body Computation
cuRoboV2 deploys topology-aware kinematic and dynamic computation, efficiently scaling to robots with tens of DoF:
- Forward Kinematics: Adaptive dispatch runs efficient fused kernels for small robots (≤100 spheres) and decomposes computation on larger robots, separating transform calculation and sphere/Jacobian evaluation across many threads.
- Gradient Backpropagation: Warps assign threads to links with nonzero gradients, leveraging a topology cache for efficient ancestor lookup and warp synchronization (without global atomics) for joint gradient calculation.
- Inverse Kinematics (IK): Sparse Jacobian construction exploits mechanical coupling and sub-tree pruning using coarse "affects" masks and link chains, supporting scalability in LM-based optimization.
The framework's differentiable RNEA is a CUDA-Python kernel that operates at runtime on arbitrary robot models (including mimic joints and runtime payload changes), with spatial transforms and inertias stored in compact factored forms. The forward (base-to-tips) and backward (tips-to-base) passes compute velocities, accelerations, and joint torques, and support vector-Jacobian-product backward differentiation in work linear in the number of links. For self-collision, a two-stage map-reduce evaluates sphere-sphere potential penetrations (up to 162k pairs on the 48-DoF humanoid), partitioned across GPU blocks: local maxima are computed per block and reduced to a global maximum, converting a memory-bound routine into a shared-memory compute-bound one.
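The two-stage self-collision map-reduce can be sketched on CPU, with array chunks standing in for GPU blocks and their shared-memory reductions (the function name and chunk size are illustrative):

```python
import numpy as np
from itertools import combinations

def max_self_penetration(centers, radii, pairs=None, chunk=1024):
    """Largest sphere-sphere penetration depth over candidate pairs,
    computed as a two-stage map-reduce: per-chunk local maxima (stand-in
    for per-block shared-memory reductions), then one global max."""
    if pairs is None:
        # All pairs; the real system prunes with self-collision masks.
        pairs = np.array(list(combinations(range(len(centers)), 2)))
    local_maxima = []
    for start in range(0, len(pairs), chunk):   # "GPU block" partition
        i, j = pairs[start:start + chunk].T
        gap = (np.linalg.norm(centers[i] - centers[j], axis=1)
               - (radii[i] + radii[j]))
        local_maxima.append(np.max(-gap))       # penetration = -gap
    return max(max(local_maxima), 0.0)          # global reduction
```

Because each chunk reduces to a single scalar before the global step, the traffic to global memory shrinks from one value per pair to one value per block, which is the essence of the compute-bound reformulation described above.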
4. Implementation Patterns and Codebase Design
The codebase is structured for discoverability and productivity, enabling coding assistants to generate novel CUDA kernels and APIs. Key features include:
- Strong typing and self-documenting dataclasses, eschewing hidden YAML configs.
- Single-responsibility files (<300 lines), consistent use of `__all__`, and comprehensive testing (>4k documented tests).
- Installation simplicity via `pip install cuRoboV2`, with CUDA kernels compiled at first use through CUDA Python, decoupling installation from CUDA/PyTorch versions.
- Category-first filenames and CamelCase classes, with documented default constructors.
- As a result, up to 73% of new modules (including hand-optimized CUDA kernels) were generated by an LLM assistant, illustrating the codebase's compatibility with assisted software creation.
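The configuration style the list above describes might look like the following; every field name here is hypothetical, not cuRoboV2's actual API:

```python
from dataclasses import dataclass, field

__all__ = ["SplineOptConfig"]

@dataclass
class SplineOptConfig:
    """Illustrative self-documenting config in the described style:
    typed fields with explicit defaults instead of hidden YAML.
    All names and values are hypothetical."""
    num_control_points: int = 32
    knot_duration_s: float = 0.05
    collision_weight: float = 100.0
    smoothness_weight: float = 1e-3
    joint_limit_margin_rad: float = 0.02
    cost_weights: dict = field(default_factory=lambda: {"goal": 1.0})
```

Typed, defaulted dataclasses of this kind are discoverable by both humans and coding assistants, which is plausibly one reason LLM-generated modules integrate cleanly.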
Critical performance kernels include interleaved-warp B-spline gradient computation, atomics-free voxel TSDF integration, separable PBA+ ESDF propagation, shared-memory RNEA execution, and shared-memory map-reduce for self-collision.
5. Empirical Performance and Benchmarks
cuRoboV2 demonstrates both superior accuracy and scalability:
| Task | cuRoboV2 Result | Baselines |
|---|---|---|
| Payload-aware planning (3 kg, Franka) | 99.7% success (2,600 problems) | cuRobo v1: 77.1%, samplers: 72–75% |
| IK on 48-DoF humanoid (collision-free) | 99.6% success | cuRobo v1/PyRoki: 0% |
| IK speed (48-DoF, standard) | 100% in 34 ms | Newton: 98.4% in 317 ms, PyRoki: 49.8% in 1123 ms |
| Self-collision (48-DoF, speedup) | 61× vs cuRobo v1 | |
| Humanoid retargeting constraint satisfaction | IK: 89.5%, MPC: 96.6% | PyRoki: 61.2%, mink: 40.6% |
| Locomotion tracking (running) | 0 resets, 139 mm MPJPE | PyRoki: 32 resets, 183 mm MPJPE |
| ESDF (Redwood, 10mm, full workspace) | 1.69 ms, 1.63 GB, 92.5% recall | nvblox: 12.68 ms, 11.87 GB, 92% |
Performance advantages stem from full-coverage ESDF, scalable inverse dynamics (up to 14–18× speedup), topology-aware computation, and whole-body collision avoidance enabling policies with 21% lower tracking error and 12× lower cross-seed variance than prior frameworks.
6. Limitations and Future Directions
Current cuRoboV2 limitations include approximate camera segmentation (potentially addressable via learned RGB masks), single-camera workspace coverage (multi-view fusion is pending), and the lack of end-to-end B-spline-based reactive planning via global-to-local MPC warm-starting. LLM-assisted code generation remains contingent on human expert interpretation of hardware feedback (such as register pressure), highlighting the need for better automated understanding of CUDA performance diagnostics. Extending RNEA with differentiable contact and friction effects, enabling contact-implicit planning, is a proposed pathway for future work. Addressing these limitations would further unify the perception, planning, and policy stacks for high-DoF systems (Sundaralingam et al., 5 Mar 2026).