End-to-end Trainable Deep Neural Network for Robotic Grasp Detection and Semantic Segmentation from RGB (2107.05287v2)

Published 12 Jul 2021 in cs.CV and cs.RO

Abstract: In this work, we introduce a novel, end-to-end trainable CNN-based architecture to deliver high quality results for grasp detection suitable for a parallel-plate gripper, and semantic segmentation. Utilizing this, we propose a novel refinement module that takes advantage of previously calculated grasp detection and semantic segmentation and further increases grasp detection accuracy. Our proposed network delivers state-of-the-art accuracy on two popular grasp datasets, namely Cornell and Jacquard. As an additional contribution, we provide a novel dataset extension for the OCID dataset, making it possible to evaluate grasp detection in highly challenging scenes. Using this dataset, we show that semantic segmentation can additionally be used to assign grasp candidates to object classes, which can be used to pick specific objects in the scene.

Citations (104)

Summary

  • The paper proposes an end-to-end CNN architecture for robotic grasp detection and semantic segmentation from RGB images, using a shared ResNet-101/FPN backbone with task-specific branches.
  • A key feature is a grasp refinement module that incorporates semantic information to improve grasp orientation accuracy, demonstrated by achieving state-of-the-art results on Cornell, Jacquard, and OCID_grasp datasets.
  • The integrated approach allows for multitask learning, showing high performance in both grasp accuracy (e.g., 89.02% on OCID_grasp) and segmentation IoU (e.g., 94.05% on OCID_grasp), suggesting viability for real-time, multi-object robotic applications.

Analysis of an End-to-End CNN-Based Architecture for Robotic Grasp Detection and Semantic Segmentation

The paper by Ainetter and Fraundorfer presents an end-to-end trainable CNN architecture that integrates robotic grasp detection and semantic segmentation from RGB images. The work is notable in robotic vision for addressing the challenging task of recognizing and manipulating objects in complex, cluttered environments with high accuracy.

Technical Contributions and Methodology

The authors propose a novel architecture that encompasses two primary components: a shared backbone network based on ResNet-101 integrated with a Feature Pyramid Network (FPN), and task-specific branches for grasp detection and semantic segmentation. The grasp detection branch is inspired by the Faster R-CNN framework, which is modified to predict grasp candidates in addition to object proposals. The segmentation branch relies on a Mini-DeepLab module that generates pixel-wise semantic labels.
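
To make the layout concrete, below is a minimal PyTorch sketch of the shared-backbone, two-branch design: a small convolutional stack stands in for the ResNet-101/FPN backbone, and plain convolutional heads stand in for the Faster R-CNN-style grasp branch and the Mini-DeepLab segmentation branch. All module names, channel sizes, and the per-location grasp parameterization are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GraspSegNet(nn.Module):
    def __init__(self, num_classes=32, feat_ch=256):
        super().__init__()
        # Stand-in for the shared ResNet-101/FPN backbone.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Grasp branch: per-location grasp parameters (cx, cy, w, h, theta) + confidence.
        self.grasp_head = nn.Conv2d(feat_ch, 6, 1)
        # Segmentation branch: pixel-wise class logits (stand-in for Mini-DeepLab).
        self.seg_head = nn.Conv2d(feat_ch, num_classes, 1)

    def forward(self, x):
        feats = self.backbone(x)          # shared features feed both task heads
        return self.grasp_head(feats), self.seg_head(feats)

model = GraspSegNet()
grasp_map, seg_logits = model(torch.randn(1, 3, 224, 224))
print(grasp_map.shape, seg_logits.shape)  # (1, 6, 56, 56) and (1, 32, 56, 56)
```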

A distinct feature of the proposed model is a grasp refinement module that refines initial grasp predictions by incorporating semantic information, significantly improving grasp orientation accuracy. Fusing segmentation and grasp predictions also lets the model identify grasp points specific to object classes, which is crucial in class-dependent tasks such as sorting or selective picking.
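
The exact fusion performed by the refinement module is not reproduced here; the hedged sketch below only illustrates the general idea of feeding segmentation evidence back into the grasp branch, in this case by concatenating segmentation logits with the initial grasp map and regressing a residual correction to the grasp angle.

```python
import torch
import torch.nn as nn

class GraspRefinement(nn.Module):
    def __init__(self, num_classes=32, grasp_ch=6, hidden=64):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(num_classes + grasp_ch, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, 1),  # residual correction for the grasp angle theta
        )

    def forward(self, grasp_map, seg_logits):
        # Concatenate grasp and segmentation evidence, predict a per-location angle offset.
        delta_theta = self.fuse(torch.cat([grasp_map, seg_logits], dim=1))
        refined = grasp_map.clone()
        refined[:, 4:5] = refined[:, 4:5] + delta_theta  # channel 4 holds theta in the sketch above
        return refined

refine = GraspRefinement()
refined = refine(torch.randn(1, 6, 56, 56), torch.randn(1, 32, 56, 56))
```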

The architecture shows robust performance on the Cornell and Jacquard datasets and on a modified version of the OCID dataset, dubbed OCID_grasp, providing a comprehensive evaluation across scenes of varying complexity.

Results and Implications

The model achieves state-of-the-art accuracy on prominent datasets at high frame rates, suggesting its viability for real-time applications. On the Cornell dataset, the grasp detection branch alone reaches an accuracy of 98.2%. On the Jacquard dataset, coupling segmentation with the refinement module yields a modest accuracy gain, highlighting the efficacy of the refinement strategy under strict grasp orientation requirements.
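
Grasp accuracy on Cornell and Jacquard is conventionally measured with the rectangle metric: a predicted grasp counts as correct if its orientation is within 30 degrees of a ground-truth grasp and the Jaccard index (IoU) of the two rotated rectangles exceeds 0.25. Below is a small sketch of this criterion, assuming grasps are given as (cx, cy, w, h, theta) tuples with theta in radians; the thresholds are the customary values, not figures quoted from this paper.

```python
import math
from shapely.geometry import Polygon

def grasp_to_polygon(cx, cy, w, h, theta):
    """Corners of a rotated grasp rectangle as a shapely polygon."""
    dx, dy = w / 2.0, h / 2.0
    corners = [(-dx, -dy), (dx, -dy), (dx, dy), (-dx, dy)]
    c, s = math.cos(theta), math.sin(theta)
    return Polygon([(cx + x * c - y * s, cy + x * s + y * c) for x, y in corners])

def grasp_correct(pred, gt, angle_thresh=math.radians(30), iou_thresh=0.25):
    # Orientation check (grasp angles are equivalent modulo pi).
    angle_diff = abs(pred[4] - gt[4]) % math.pi
    angle_diff = min(angle_diff, math.pi - angle_diff)
    if angle_diff > angle_thresh:
        return False
    # Jaccard index of the two rotated rectangles.
    p, g = grasp_to_polygon(*pred), grasp_to_polygon(*gt)
    inter, union = p.intersection(g).area, p.union(g).area
    return union > 0 and inter / union > iou_thresh

print(grasp_correct((100, 100, 60, 20, 0.1), (102, 98, 55, 22, 0.2)))  # True
```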

Furthermore, extending OCID with annotated grasp candidates broadens the architecture's scope to multi-object scenarios, where it achieves a grasp accuracy of 89.02% and a segmentation IoU of 94.05% on the OCID_grasp dataset.
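
One straightforward way to realize the class-aware picking described in the abstract is to label each grasp candidate with the semantic class found at its center in the segmentation output. The sketch below assumes grasps are parameterized by their center coordinates in image space; the paper's exact assignment rule may differ.

```python
import torch

def assign_grasp_classes(grasps, seg_logits):
    """grasps: (N, 5) tensor of (cx, cy, w, h, theta) in image coordinates.
    seg_logits: (C, H, W) segmentation logits for the same image."""
    class_map = seg_logits.argmax(dim=0)                      # (H, W) per-pixel class ids
    xs = grasps[:, 0].long().clamp(0, class_map.shape[1] - 1)  # column index from cx
    ys = grasps[:, 1].long().clamp(0, class_map.shape[0] - 1)  # row index from cy
    return class_map[ys, xs]                                   # (N,) class id per grasp

grasps = torch.tensor([[120.0, 80.0, 40.0, 15.0, 0.3]])
classes = assign_grasp_classes(grasps, torch.randn(32, 240, 320))
```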

Discussion and Future Directions

By integrating grasp detection with semantic segmentation, the paper underscores the significant advantage of multitask learning in robotic grasping tasks. The design reflects a trend towards more unified architectures that accommodate multiple perceptual tasks within a single network framework, efficiently utilizing computational resources while improving cross-task synergies.
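
In practice, such a unified network is trained with a joint objective, typically a weighted sum of a grasp term and a segmentation term. The sketch below uses placeholder losses and weights rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def multitask_loss(grasp_pred, grasp_target, seg_logits, seg_target,
                   w_grasp=1.0, w_seg=1.0):
    loss_grasp = F.smooth_l1_loss(grasp_pred, grasp_target)  # grasp parameter regression
    loss_seg = F.cross_entropy(seg_logits, seg_target)       # pixel-wise classification
    return w_grasp * loss_grasp + w_seg * loss_seg

loss = multitask_loss(torch.randn(4, 6, 56, 56), torch.randn(4, 6, 56, 56),
                      torch.randn(4, 32, 56, 56), torch.randint(0, 32, (4, 56, 56)))
```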

Future endeavors could explore enhancements using multi-modal inputs such as RGB-D data to capitalize on the additional spatial information depth maps provide, potentially elevating grasp detection precision. Moreover, extending this methodology to end-to-end real-world robotic applications, involving physical robot trials, would be a logical progression to ascertain its practical robustness and utility.

Overall, this work contributes valuably to advancing robotic perception capabilities, broadening the application of CNNs in complex robotic grasping tasks while fostering future inquiry into multitask neural network architectures.
