- The paper introduces a novel framework that uses GPU-accelerated simulation and digital twins to generate high-fidelity demonstration data for robotic surgical tasks.
- The framework integrates VR teleoperation with a da Vinci Research Kit (dVRK) model to collect expert demonstrations across a range of surgical subtasks, including tissue retraction and needle handovers.
- The experiments reveal performance trade-offs between RGB and point cloud modalities, underscoring challenges in achieving precise, contact-rich interactions in surgical settings.
This paper introduces SuFIA-BC (Surgical First Interactive Autonomy Assistants - Behavior Cloning), a framework designed to address the challenges of learning complex manipulation skills for robotic surgical assistants (RSAs). The core problem highlighted is the difficulty and expense of obtaining high-quality demonstration data for training visuomotor policies in surgical settings, compounded by the complexity of tasks involving delicate tissues, bimanual coordination, and precise contact-rich interactions.
To tackle this, the authors enhance the Orbit simulator with a surgical digital twin featuring photorealistic human anatomical organs and textures. This digital twin environment allows for GPU-accelerated physics simulation and high-fidelity ray-traced rendering, facilitating the generation of large-scale synthetic demonstration data. The process for creating these realistic assets involves:
- Segmenting anatomical structures from real CT scans using tools like VISTA3D/Auto3DSeg or generating synthetic CTs and segmentations with NVIDIA MAISI.
- Converting segmentations into 3D mesh models using the marching cubes algorithm (see the sketch after this list).
- Refining these meshes through techniques like remeshing, topology optimization, and UV map generation.
- Applying physically based rendering materials, subsurface scattering, and custom shaders in NVIDIA Omniverse for photorealism.
- Assembling the textured models into a unified OpenUSD file for use in the simulator.
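To make the mesh-extraction step concrete, here is a minimal Python sketch of turning a binary organ segmentation into a triangle mesh with marching cubes. It is an illustration under assumed tooling (nibabel, scikit-image, trimesh) and a hypothetical file name, not the authors' pipeline; the remeshing, UV, and material steps would follow in Omniverse or other DCC tools.

```python
# Illustrative only: extract a surface mesh from a CT organ segmentation.
# Assumes a NIfTI label volume "liver_seg.nii.gz" (hypothetical) and the
# nibabel, numpy, scikit-image, and trimesh packages.
import nibabel as nib
import numpy as np
from skimage import measure
import trimesh

seg = nib.load("liver_seg.nii.gz")                 # segmentation from VISTA3D/Auto3DSeg or MAISI
volume = (seg.get_fdata() > 0).astype(np.uint8)    # binarize the organ label

# Marching cubes over the voxel grid; voxel spacing carries CT units into the mesh.
verts, faces, normals, _ = measure.marching_cubes(
    volume, level=0.5, spacing=seg.header.get_zooms()[:3]
)

mesh = trimesh.Trimesh(vertices=verts, faces=faces, vertex_normals=normals)
mesh.export("liver_raw.obj")  # downstream: remeshing, UV unwrapping, PBR materials, OpenUSD assembly
```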
Using this enhanced simulator and the da Vinci Research Kit (dVRK) model, the authors collect expert demonstrations via virtual reality (VR) teleoperation for five distinct surgical tasks: Tissue Retraction, Needle Lift, Needle Handover (bimanual), Suture Pad threading, and Block Transfer. These tasks represent a mix of fundamental surgical maneuvers and training exercises.
The paper then systematically evaluates state-of-the-art behavior cloning (BC) methods on these tasks using the collected demonstrations (50 per task). The primary BC approaches investigated are:
- Action Chunking with Transformers (ACT): Predicts sequences (chunks) of actions using a transformer architecture and aggregates overlapping chunks temporally for smoother execution (see the sketch after this list).
- Diffusion Policy (DP3): Uses a denoising diffusion process to iteratively refine action sequences and can model multi-modal action distributions.
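As a rough illustration of the ACT-style temporal aggregation mentioned above (an assumption based on the ACT literature, not code from this work), the sketch below blends the overlapping chunk predictions that address the current timestep using exponential weights; the chunk length K and coefficient m are hypothetical values.

```python
# Illustrative sketch of ACT-style temporal aggregation. Each control step the
# policy emits a chunk of the next K actions; the overlapping predictions for
# the current step are blended with exponential weights before execution.
import numpy as np
from collections import deque

K = 8    # chunk length (hypothetical)
m = 0.1  # weighting coefficient (hypothetical); controls how quickly newer predictions are incorporated

history = deque(maxlen=K)  # history[j] = action chunk predicted j steps ago, shape (K, action_dim)

def aggregated_action(new_chunk: np.ndarray) -> np.ndarray:
    """Return the exponentially weighted action to execute at the current step."""
    history.appendleft(new_chunk)
    # A chunk predicted j steps ago addresses the current timestep at index j.
    candidates = np.stack([chunk[j] for j, chunk in enumerate(history)])
    ages = np.arange(len(candidates))                      # 0 = newest prediction
    weights = np.exp(-m * (len(candidates) - 1 - ages))    # oldest prediction gets the largest weight
    weights /= weights.sum()
    return weights @ candidates                            # weighted average action for this step
```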
Different visual observation spaces are compared:
- Single Camera: Uses RGB-D images from a single primary view (endoscopic or side view depending on the task).
- Multi Camera: Adds wrist-mounted cameras to the primary view (RGB-D).
- Point Cloud: Uses sparse, segmented point clouds derived from the primary camera's depth data, without color (a back-projection sketch follows this list).
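To illustrate what the point-cloud observation space involves, the sketch below back-projects a segmented depth image into a fixed-size sparse point cloud using pinhole intrinsics. The intrinsics, mask, and random downsampling are placeholder assumptions for illustration, not the paper's exact pipeline.

```python
# Illustrative sketch: build a sparse, segmented point cloud from a depth image.
# Intrinsics (fx, fy, cx, cy), the segmentation mask, and the sampling scheme
# are assumptions, not the exact pipeline used in the paper.
import numpy as np

def depth_to_pointcloud(depth, mask, fx, fy, cx, cy, num_points=1024):
    """depth: (H, W) metric depth image; mask: (H, W) bool mask of task-relevant pixels."""
    v, u = np.nonzero(mask & (depth > 0))     # rows/cols of valid, segmented pixels
    z = depth[v, u]
    x = (u - cx) * z / fx                     # pinhole back-projection into the camera frame
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=1)
    # Downsample to a fixed-size cloud (random here; farthest-point sampling is a common alternative).
    idx = np.random.choice(len(points), size=min(num_points, len(points)), replace=False)
    return points[idx]
```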
Key findings from the experiments include:
- Current SOTA BC methods struggle with the contact-rich, high-precision surgical tasks introduced, regardless of the perception or control architecture used.
- Simpler tasks like Tissue Retraction and Needle Lift achieve high success rates, especially when policies rely more heavily on proprioceptive data; RGB models perform slightly better on Tissue Retraction thanks to the visual cue provided by the red marker.
- For complex tasks requiring precise grasping (Needle Handover, Suture Pad, Block Transfer), performance varies. Point cloud policies often outperform RGB in tasks where spatial relationships are critical and captured well by depth (e.g., Needle Handover), but struggle when color/texture is needed for disambiguation (e.g., Block Transfer).
- Sample efficiency analysis shows failure modes evolving from incomplete execution (10-20 demos) to imprecise grasps (30-40 demos) and persistent grasp instability even with 50 demos, highlighting the need for more adaptive policies.
- Instance generalization tests (using unseen needle types) reveal that RGB-based models (especially multi-camera) generalize better than point cloud models, which tend to overfit to the training object's specific geometry derived from depth.
- Viewpoint robustness tests show point cloud models are significantly more resilient to both small and large camera viewpoint changes compared to RGB models. Multi-camera RGB offers better robustness than single-camera RGB but is still outperformed by point clouds.
The authors conclude that their framework provides a valuable platform for evaluating BC methods in surgery. Their results underscore the limitations of current approaches on complex surgical tasks and highlight the trade-offs between different perception modalities (RGB providing semantic richness, point clouds offering geometric precision and viewpoint robustness). They emphasize the need for customized perception pipelines, control architectures, and larger synthetic datasets specifically tailored for the unique demands of surgical robotics.