- The paper introduces a deep convolutional neural network framework with uni-modal (RGB) and multi-modal (RGB-D) grasp predictors, both built on ResNet-50, for efficient grasp detection.
- It replaces traditional sliding-window techniques with a single-pass global prediction, achieving accuracies of 88.84% with RGB and 88.96% with RGB-D data on the Cornell Grasping Dataset.
- The approach operates at 9.71 frames per second, enabling real-time performance and advancing robotic manipulation in unstructured environments.
Robotic Grasp Detection using Deep Convolutional Neural Networks
The research paper "Robotic Grasp Detection using Deep Convolutional Neural Networks" presents a significant advance in robotic grasp detection through deep learning. The authors introduce an approach that leverages deep convolutional neural networks (DCNNs), specifically a ResNet-50 architecture, to predict effective grasp poses for parallel-plate robotic grippers from RGB-D imagery. The work targets grasp detection, a critical component of robotic manipulation, with the broader aim of improving robotic interaction with unstructured environments.
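Concretely, the paper adopts the five-dimensional grasp rectangle representation common in this literature: a grasp is parameterized by its image-plane center, its orientation, and the gripper's opening dimensions. A minimal sketch of that structure (the field names are illustrative, not the paper's code):

```python
from dataclasses import dataclass

@dataclass
class GraspRectangle:
    """Five-dimensional grasp configuration g = (x, y, theta, h, w)."""
    x: float      # rectangle center, horizontal image coordinate (pixels)
    y: float      # rectangle center, vertical image coordinate (pixels)
    theta: float  # gripper orientation relative to the image x-axis (radians)
    h: float      # rectangle height, roughly the gripper jaw size
    w: float      # rectangle width, roughly the gripper opening
```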
Methodological Approach
The research outlines two distinct approaches to grasp detection: a uni-modal grasp predictor using RGB data and a multi-modal grasp predictor using RGB-D data. The uni-modal predictor is built on a ResNet-50 backbone and predicts a grasp configuration directly from the RGB input. It replaces the conventional sliding-window detection technique with a single-pass global grasp prediction, which significantly reduces processing time.
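A minimal PyTorch sketch of this single-pass design follows; the backbone matches the paper (ImageNet-pretrained ResNet-50), but the head sizes are illustrative assumptions rather than the paper's exact layers:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class UniModalGraspPredictor(nn.Module):
    """ResNet-50 backbone regressing one global grasp rectangle per image."""
    def __init__(self):
        super().__init__()
        backbone = resnet50(weights="IMAGENET1K_V1")  # ImageNet pre-training, as in the paper
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop the classifier
        self.head = nn.Sequential(                    # head sizes are illustrative
            nn.Flatten(),
            nn.Linear(2048, 512),
            nn.ReLU(inplace=True),
            nn.Linear(512, 5),                        # (x, y, theta, h, w)
        )

    def forward(self, rgb: torch.Tensor) -> torch.Tensor:
        # One forward pass over the whole image; no sliding window.
        return self.head(self.features(rgb))

# e.g. grasp = UniModalGraspPredictor()(torch.randn(1, 3, 224, 224))  # -> shape (1, 5)
```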
The multi-modal grasp predictor extends this design by incorporating both RGB and depth data, using two parallel, pre-trained ResNet-50 models to extract features from the two modalities simultaneously. The features from the two networks are merged and processed by a shallow convolutional neural network that predicts the grasp configuration. This architecture is noteworthy because it lets the system exploit object features that are evident only from depth, improving grasp prediction accuracy.
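A sketch of that two-stream design, under the same caveats as above; in particular, tiling the single depth channel to three channels so an ImageNet-pretrained trunk can consume it is a common workaround assumed here, not necessarily the paper's exact depth encoding:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class MultiModalGraspPredictor(nn.Module):
    """Two parallel ResNet-50 trunks (RGB and depth) fused by a shallow CNN."""
    def __init__(self):
        super().__init__()
        def trunk():
            m = resnet50(weights="IMAGENET1K_V1")
            return nn.Sequential(*list(m.children())[:-2])  # keep spatial feature maps
        self.rgb_trunk, self.depth_trunk = trunk(), trunk()
        self.fusion = nn.Sequential(                        # shallow fusion net (illustrative sizes)
            nn.Conv2d(2 * 2048, 512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(512, 5),                              # (x, y, theta, h, w)
        )

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # Assumption: 1-channel depth is tiled to 3 channels for the pretrained trunk.
        d = self.depth_trunk(depth.repeat(1, 3, 1, 1))
        r = self.rgb_trunk(rgb)
        return self.fusion(torch.cat([r, d], dim=1))
```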
Experimental Evaluation
The models were evaluated on the Cornell Grasping Dataset, a standard benchmark in the field, under two splitting protocols: image-wise and object-wise. In the image-wise split, test images show objects that also appear in training (in new poses), while the object-wise split holds out every image of selected objects, so the models are tested on entirely novel objects, a stronger test of generalization.
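The two protocols can be sketched as follows, treating the dataset as (image, object_id) pairs; the function name and test fraction are illustrative, not the paper's tooling:

```python
import random

def split_cornell(samples, mode="image-wise", test_frac=0.2, seed=0):
    """Split (image, object_id) pairs the two ways used on the Cornell dataset.

    image-wise:  random split over images; test objects also appear in training.
    object-wise: every image of a held-out object goes to the test set,
                 so test objects are entirely novel.
    """
    rng = random.Random(seed)
    if mode == "image-wise":
        shuffled = list(samples)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * (1 - test_frac))
        return shuffled[:cut], shuffled[cut:]
    object_ids = sorted({obj for _, obj in samples})
    rng.shuffle(object_ids)
    held_out = set(object_ids[:max(1, int(len(object_ids) * test_frac))])
    train = [s for s in samples if s[1] not in held_out]
    test = [s for s in samples if s[1] in held_out]
    return train, test
```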
In quantitative terms, the uni-modal grasp predictor achieved an accuracy of 88.84% with RGB images, and the multi-modal approach reached 88.96% with RGB-D data. These results improve on previously reported state-of-the-art accuracies on this benchmark, and they avoid the slow, exhaustive evaluation that limited earlier sliding-window methods. The proposed multi-modal model runs at 9.71 frames per second, fast enough for real-time operation.
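For context, accuracy on the Cornell dataset is conventionally computed with the rectangle metric: a prediction counts as correct when its orientation is within 30 degrees of a ground-truth rectangle and their Jaccard index (intersection over union) exceeds 25%. A sketch of that check, reusing the GraspRectangle fields from earlier and shapely for the rotated-rectangle overlap:

```python
import math
from shapely.geometry import Polygon  # rotated-rectangle overlap

def rect_corners(g):
    """Corner points of a grasp rectangle g = (x, y, theta, h, w)."""
    dx, dy = g.w / 2, g.h / 2
    c, s = math.cos(g.theta), math.sin(g.theta)
    return [(g.x + c * px - s * py, g.y + s * px + c * py)
            for px, py in [(-dx, -dy), (dx, -dy), (dx, dy), (-dx, dy)]]

def is_correct(pred, gt, angle_tol=math.radians(30), jaccard_min=0.25):
    """Rectangle metric: orientation within 30 degrees AND Jaccard index > 0.25."""
    # Parallel grippers are symmetric, so compare angles modulo 180 degrees.
    d = abs(pred.theta - gt.theta) % math.pi
    if min(d, math.pi - d) > angle_tol:
        return False
    p, g = Polygon(rect_corners(pred)), Polygon(rect_corners(gt))
    inter = p.intersection(g).area
    union = p.area + g.area - inter
    return union > 0 and inter / union > jaccard_min
```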
Implications and Future Work
The implications of this research are substantial for the development of autonomous systems capable of interacting with their environments with human-like dexterity. By addressing both accuracy and real-time processing requirements, this work bridges a critical gap between theoretical grasp detection frameworks and practical robotic applications.
Future work could extend these models to larger and more complex datasets, potentially improving performance and robustness. Further exploration of pre-training strategies for depth data and of a four-channel input configuration, stacking depth with RGB as a single input, could also improve the models. Such advances would strengthen the generalizability and applicability of robotic manipulators in diverse, dynamic operational settings.
This research enhances the knowledge base of robotic grasping by demonstrating that deep networks, through careful architectural design and training on multi-modal data, can achieve high performance in object grasp detection tasks. These findings underscore the potential of deep learning in advancing robotic perception systems.