- The paper introduces the RPC-TCN method, which fuses RGB and point cloud spatial features with temporal convolutional networks to accurately predict surgical forces from vision data alone.
- Experiments show RPC-TCN achieves low mean absolute errors (0.604% in phantom, 0.427% ex vivo) and high correlation (0.99+) with ground truth forces, outperforming methods using single modalities or frames.
- This research demonstrates that combining multi-modal vision data and temporal modeling enables highly accurate surgical force estimation, reducing the need for potentially complex physical force sensors.
The paper presents a vision-based approach to force prediction during robotic surgery that combines multi-modal inputs and temporal modeling. The proposed method, the RGB-Point Cloud Temporal Convolutional Network (RPC-TCN), integrates both 2D and 3D spatial features along with temporal convolutional features to infer the contact forces during surgical manipulation.
The approach is divided into two main modules:
- Spatial Block:
- Uses an RGB image input of size 224×224 and a depth image input of size 151×151. The depth image is transformed into a 3D point cloud by applying camera intrinsic parameters and normalization. Specifically, the conversion is given by
- $x_{pc} = \frac{(x_D - c_x)\, z_D}{f_x} - \bar{x}_D, \quad y_{pc} = \frac{(y_D - c_y)\, z_D}{f_y} - \bar{y}_D, \quad z_{pc} = z_D - \bar{z}_D,$
- where:
- $x_D, y_D$: pixel indices of the depth image
- $z_D$: depth value at the pixel
- $c_x, c_y$: principal point coordinates
- $f_x, f_y$: focal lengths
- $\bar{x}_D, \bar{y}_D, \bar{z}_D$: mean values used for normalization.
- The method then leverages a pre-trained VGG16 network to extract a 4096-dimensional feature vector from the RGB image and a pre-trained PointNet to extract a 512-dimensional feature from the point cloud (after uniform downsampling from 22,801 points to 2048 points). The concatenation yields a 4608-dimensional feature vector that encodes complementary spatial information.
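To make the Spatial Block concrete, the following is a minimal sketch of the depth-to-point-cloud conversion and feature fusion under stated assumptions: it uses PyTorch/torchvision for the VGG16 backbone, taps one of VGG16's 4096-dimensional fully-connected layers, and treats `pointnet_encoder` and the intrinsics tuple as hypothetical placeholders rather than the authors' exact implementation.

```python
import numpy as np
import torch
import torchvision.models as models

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project an H x W depth image into an (H*W, 3) point cloud and
    zero-center it: x_pc = (x_D - c_x) z_D / f_x - mean_x, and likewise for y, z."""
    h, w = depth.shape
    y_d, x_d = np.mgrid[0:h, 0:w]                    # pixel indices of the depth image
    x = (x_d - cx) * depth / fx
    y = (y_d - cy) * depth / fy
    pc = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    return pc - pc.mean(axis=0, keepdims=True)       # subtract per-axis means

def uniform_downsample(pc, n_points=2048):
    """Uniformly subsample the 151*151 = 22,801 points down to 2,048."""
    idx = np.linspace(0, len(pc) - 1, n_points).astype(int)
    return pc[idx]

# Pre-trained VGG16, truncated so the forward pass returns a 4096-d feature
# (which of the two 4096-d FC layers the paper taps is an assumption here).
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
vgg.classifier = torch.nn.Sequential(*list(vgg.classifier.children())[:-3])
vgg.eval()

def spatial_features(rgb_224, depth_151, intrinsics, pointnet_encoder):
    """Fuse the 4096-d RGB feature and 512-d point cloud feature into a 4608-d vector."""
    fx, fy, cx, cy = intrinsics
    with torch.no_grad():
        f_rgb = vgg(rgb_224.unsqueeze(0))                        # (1, 4096)
        pc = uniform_downsample(depth_to_point_cloud(depth_151, fx, fy, cx, cy))
        pc_t = torch.from_numpy(pc).float().T.unsqueeze(0)       # (1, 3, 2048)
        f_pc = pointnet_encoder(pc_t)                            # (1, 512), assumed encoder API
    return torch.cat([f_rgb, f_pc], dim=1)                       # (1, 4608)
```

The mean subtraction in `depth_to_point_cloud` mirrors the normalization terms $\bar{x}_D, \bar{y}_D, \bar{z}_D$ in the conversion equation above.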
- Temporal Block:
- The concatenated spatial features for each time step are aggregated over a sliding window (15 frames) to form a temporal feature tensor. The temporal block employs a Temporal Convolutional Network (TCN) that processes these sequential features through a hierarchy of convolutional layers. Each convolutional layer in the TCN applies filters $W^{(i)} \in \mathbb{R}^{d \times F_{l-1}}$ with bias $b \in \mathbb{R}^{F_l}$, where the activations of the $l$-th layer are computed by
- $E^{(l)} = f\big(W \ast E^{(l-1)} + b\big),$
- with $f(\cdot)$ denoting a non-linear activation function (ReLU was empirically found to outperform other choices) and $\ast$ representing the convolution operation. Batch normalization follows each convolutional layer to promote stable training.
- The last fully-connected layer employs a linear regression to map the final temporal features to the force prediction, expressed as
- $\hat{Y}_t = U E^{(L)} + c,$
- where $U$ is the filter of the last layer, $c$ is the bias, and $\hat{Y}_t$ is the predicted force at time $t$.
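A minimal sketch of how such a temporal block could be assembled is given below, assuming PyTorch; the number of layers, channel widths, kernel size, and the flattening of the 15-frame window before the final linear regression are illustrative choices, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class TemporalBlock(nn.Module):
    """Illustrative TCN over a 15-frame window of 4608-d fused spatial features."""
    def __init__(self, in_dim=4608, hidden=(512, 128), kernel_size=3, window=15):
        super().__init__()
        layers, prev = [], in_dim
        for ch in hidden:
            # Each stage computes E^(l) = f(W * E^(l-1) + b) with ReLU as f,
            # followed by batch normalization (placed after the convolution).
            layers += [nn.Conv1d(prev, ch, kernel_size, padding=kernel_size // 2),
                       nn.BatchNorm1d(ch),
                       nn.ReLU()]
            prev = ch
        self.tcn = nn.Sequential(*layers)
        # Final fully-connected layer as a linear regression: Y_hat_t = U E^(L) + c.
        self.regressor = nn.Linear(prev * window, 1)

    def forward(self, x):            # x: (batch, 4608, 15), features over the sliding window
        e = self.tcn(x)              # (batch, hidden[-1], 15)
        return self.regressor(e.flatten(1)).squeeze(-1)   # predicted force per window

# Usage: a batch of four 15-frame windows of fused spatial features.
features = torch.randn(4, 4608, 15)
force_hat = TemporalBlock()(features)    # shape (4,)
```

Each `Conv1d` + `BatchNorm1d` + `ReLU` stage mirrors $E^{(l)} = f(W \ast E^{(l-1)} + b)$ with batch normalization, and the final `nn.Linear` plays the role of $\hat{Y}_t = U E^{(L)} + c$.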
Two experimental setups are reported: a phantom study and an ex vivo liver study. Both use synchronized RGB, depth, and force data acquired at 30 fps, with ground-truth force measured by an OptoForce sensor (accuracy up to 12.5 mN). In both studies, the data were segmented and split into 80% for training, 5% for validation, and 15% for testing.
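As a point of reference, the windowing and split could look like the sketch below; pairing each 15-frame window with the force at its last frame and using a simple chronological split are assumptions made for illustration, not details confirmed by the paper.

```python
import numpy as np

def make_windows(features, forces, window=15):
    """Pair each 15-frame window of per-frame spatial features (T, 4608)
    with the force measured at the window's last frame (an assumed alignment)."""
    X = np.stack([features[t - window + 1:t + 1].T
                  for t in range(window - 1, len(features))])
    y = forces[window - 1:]
    return X, y              # X: (N, 4608, 15), y: (N,)

def split_80_5_15(X, y):
    """80% training, 5% validation, 15% testing, in chronological order."""
    n = len(X)
    i_tr, i_va = int(0.80 * n), int(0.85 * n)
    return (X[:i_tr], y[:i_tr]), (X[i_tr:i_va], y[i_tr:i_va]), (X[i_va:], y[i_va:])
```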
Key numerical results include:
- Phantom Study:
- RPC-TCN achieved a mean absolute error (MAE) of 1.45 N, corresponding to a percentage error of 0.604% relative to the maximum measured force magnitude.
- Ex vivo Liver Study:
- The MAE of RPC-TCN is 0.814 N, equivalent to a percentage error of 0.427%.
Additional comparisons were made against alternative methods:
- A single-frame regression approach based solely on RGB features resulted in substantially higher errors (MAE of 7.06 N and 10.4 N in the phantom and liver studies, respectively).
- Temporal models using only RGB features (RGB-TCN) or only point-cloud data (Point Cloud-TCN) performed better than the single-frame method but were still outperformed by the fusion model (RPC-TCN).
The paper also details an analysis of error distribution across force bins. Notably, while the RGB-TCN method showed lower errors for smaller force magnitudes, it suffered from increased error variance at higher forces. In contrast, the point cloud modality provided a more uniform error distribution across different force levels. The fusion approach of RPC-TCN resulted in a consistent error trend across the force spectrum.
Additionally, the authors report high correlation coefficients (0.995 for the phantom and 0.996 for the ex vivo liver experiments) between the predicted and ground truth force values, indicating strong linear correspondence. Visualizations (e.g., correlation matrices and error vs. epoch trend plots) further support the stability and robustness of the proposed approach.
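The reported metrics (MAE, percentage error relative to the maximum force magnitude, and Pearson correlation), along with the per-bin error analysis, can be computed from predicted and ground-truth force traces roughly as in the sketch below; the number and placement of force bins here are illustrative assumptions.

```python
import numpy as np

def force_metrics(y_pred, y_true, n_bins=5):
    """Compute MAE, percentage error w.r.t. the max force magnitude, Pearson
    correlation, and per-bin MAE across force levels (bin edges are assumed)."""
    err = np.abs(y_pred - y_true)
    mae = err.mean()
    pct_error = 100.0 * mae / np.abs(y_true).max()     # error relative to max force magnitude
    corr = np.corrcoef(y_pred, y_true)[0, 1]           # Pearson correlation coefficient
    edges = np.linspace(y_true.min(), y_true.max(), n_bins + 1)
    bin_idx = np.clip(np.digitize(y_true, edges) - 1, 0, n_bins - 1)
    per_bin_mae = [err[bin_idx == b].mean() if np.any(bin_idx == b) else np.nan
                   for b in range(n_bins)]
    return mae, pct_error, corr, per_bin_mae
```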
Limitations and Future Directions:
- The training data were biased toward smaller forces, which reduced prediction accuracy for larger force magnitudes.
- The model was trained and validated on a single phantom and ex vivo liver specimen, suggesting potential overfitting to these specific conditions.
- Future work is proposed to include a more diverse range of tissues and the integration of monocular depth estimation techniques (building on previous studies) to enhance clinical utility.
Overall, the paper demonstrates that the fusion of 2D visual cues with 3D spatial information, in conjunction with temporal convolutional modeling, can significantly enhance the accuracy and robustness of surgical force estimation without the complexities associated with dedicated hardware-based force sensors.