Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection
(1603.02199v4)
Published 7 Mar 2016 in cs.LG, cs.AI, cs.CV, and cs.RO
Abstract: We describe a learning-based approach to hand-eye coordination for robotic grasping from monocular images. To learn hand-eye coordination for grasping, we trained a large convolutional neural network to predict the probability that task-space motion of the gripper will result in successful grasps, using only monocular camera images and independently of camera calibration or the current robot pose. This requires the network to observe the spatial relationship between the gripper and objects in the scene, thus learning hand-eye coordination. We then use this network to servo the gripper in real time to achieve successful grasps. To train our network, we collected over 800,000 grasp attempts over the course of two months, using between 6 and 14 robotic manipulators at any given time, with differences in camera placement and hardware. Our experimental evaluation demonstrates that our method achieves effective real-time control, can successfully grasp novel objects, and corrects mistakes by continuous servoing.
This paper (Levine et al., 2016) presents a practical learning-based system for robotic grasping that achieves hand-eye coordination directly from monocular images. A key contribution is that the method does not require precise camera-to-robot calibration, relying instead on learned visual feedback to guide the gripper. The approach consists of two main components: a deep convolutional neural network (CNN) trained to predict the probability of grasp success, and a continuous servoing mechanism that uses this network in real time to control the robot's movements.
The core idea is to train a model g(I_t, v_t) that takes the current camera image I_t and a proposed task-space motion vector v_t (relative to the gripper's current position) as input and outputs the predicted probability that executing this motion and then closing the gripper will result in a successful grasp. Crucially, the motion vector v_t is defined in the robot's frame, not the camera frame. This forces the network to learn the spatial relationship between the gripper, the objects, and the robot's motion relative to the camera's view, effectively learning hand-eye coordination.
To train this CNN, the researchers collected a massive dataset of over 800,000 grasp attempts using a cluster of 6 to 14 robotic manipulators over two months. Each robot consisted of a 7 DoF arm, a 2-finger gripper, and a monocular camera mounted over the shoulder. Data collection was largely autonomous, with human intervention only for swapping objects. The diversity in the setup—slight variations in camera poses, lighting conditions, and even wear and tear on the grippers across different robots—provided a rich dataset that encouraged the network to be invariant to these factors, eliminating the need for individual calibration.
For each grasp attempt i, a sequence of images I_t^i and robot poses p_t^i was recorded. The grasp outcome, a binary success/failure label ℓ_i, was determined automatically after the final step T. Training samples were generated as (I_t^i, p_T^i - p_t^i, ℓ_i), teaching the network to predict success based on the current image and the vector from the current pose to the final grasp pose. This is a self-supervised approach: the training labels are generated automatically by the system's own actions and outcome evaluations.
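To make the sample construction concrete, the following is a minimal sketch (not the authors' pipeline) of how one recorded attempt could be turned into training tuples; the GraspAttempt container and its field names are hypothetical.

from dataclasses import dataclass
from typing import List, Tuple
import numpy as np

@dataclass
class GraspAttempt:
    images: List[np.ndarray]   # I_t^i for t = 0..T
    poses: List[np.ndarray]    # task-space gripper poses p_t^i for t = 0..T
    success: bool              # label determined automatically after the final step T

def make_training_samples(attempt: GraspAttempt) -> List[Tuple[np.ndarray, np.ndarray, float]]:
    """Turn one grasp attempt into (image, motion-to-final-pose, label) tuples."""
    final_pose = attempt.poses[-1]            # p_T^i
    label = 1.0 if attempt.success else 0.0   # ℓ_i
    samples = []
    for image, pose in zip(attempt.images[:-1], attempt.poses[:-1]):
        motion = final_pose - pose            # p_T^i - p_t^i, expressed in the robot frame
        samples.append((image, motion, label))
    return samples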
The CNN architecture (illustrated in Figure 3 of the paper) processes concatenated current and pre-grasp images through several convolutional and pooling layers. The task-space motor command vector v_t is processed separately through a fully connected layer and then pointwise added (broadcast across spatial positions) to the feature maps before further convolutional layers. This allows the network to reason about the outcome of a specific motion proposal within the context of the visual scene.
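A heavily condensed PyTorch sketch of this fusion pattern is shown below. It is not the paper's exact architecture, which uses many more convolutional layers; the layer sizes and the motion dimensionality are placeholder assumptions.

import torch
import torch.nn as nn

class GraspSuccessNet(nn.Module):
    """Condensed sketch: image branch, motion branch, pointwise-add fusion."""
    def __init__(self, motion_dim: int = 5, channels: int = 64):
        super().__init__()
        # Image branch: current and pre-grasp RGB images concatenated channel-wise (3 + 3 = 6).
        self.image_conv = nn.Sequential(
            nn.Conv2d(6, channels, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Motion branch: task-space command v_t through a fully connected layer.
        self.motion_fc = nn.Linear(motion_dim, channels)
        # Post-fusion convolutions and success head.
        self.post_conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(channels, 1))

    def forward(self, current_img, pregrasp_img, motion):
        x = self.image_conv(torch.cat([current_img, pregrasp_img], dim=1))
        m = self.motion_fc(motion)            # shape (B, C)
        x = x + m[:, :, None, None]           # broadcast-add over spatial dimensions
        x = self.post_conv(x)
        return torch.sigmoid(self.head(x))    # predicted grasp success probability

The point mirrored here is that the motion command modulates the spatial feature maps additively, so the network evaluates a specific candidate motion in the context of what it sees.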
The second component is the continuous servoing mechanism f(I_t), which uses the trained prediction network g to control the robot in real time. At each time step, given the current image I_t, the system searches for the task-space motion vector v_t* that maximizes the predicted grasp success probability g(I_t, v_t). This optimization is performed efficiently using the cross-entropy method (CEM), a derivative-free optimization technique. In their implementation, they sample 64 candidate motions, select the best 6 based on the network's predictions, fit a Gaussian to these samples, and repeat for a few iterations (e.g., 3). This allows the robot to continuously re-evaluate and adjust its trajectory based on the latest visual feedback.
The servoing mechanism incorporates heuristics based on the predicted success probability of staying put, g(I_t, ∅), relative to the best predicted motion, g(I_t, v_t*). If the probability of immediate success is high relative to moving (g(I_t, ∅)/g(I_t, v_t*) > 0.9), the robot stops and attempts the grasp. If it is low (g(I_t, ∅)/g(I_t, v_t*) ≤ 0.5), the gripper is raised during the motion to avoid collisions while repositioning. Otherwise, the robot executes the best predicted motion v_t* towards the table surface. Pseudocode for this logic (Algorithm 1 in the paper) is shown below.
def servoing_mechanism(I_t, g):
    # 1. Infer best motor command v_t* using g and CEM:
    #    Initialize distribution (e.g., Gaussian centered at the current pose)
    #    For i in range(num_cem_iterations):
    #        Sample N candidate vectors v_i from the current distribution
    #        Evaluate predicted success g(I_t, v_i) for each candidate
    #        Select the M best candidates (e.g., top 10%)
    #        Fit a new distribution (mean and covariance) to the selected candidates
    #    v_t_star = mean of the final distribution, or best candidate from the final samples
    v_t_star = find_best_motion_cem(I_t, g)  # Placeholder for CEM optimization

    # 2. Evaluate predicted success probabilities
    prob_no_motion = g.predict(I_t, zero_motion_vector)  # Predict success if we stop now
    prob_best_motion = g.predict(I_t, v_t_star)          # Predict success for the best motion

    # 3. Apply heuristics based on relative probabilities
    if prob_best_motion > epsilon:  # Avoid division by zero or near-zero
        p = prob_no_motion / prob_best_motion
    else:
        p = 0  # Treat as low confidence in any motion

    if p > 0.9:
        # Probability of success without moving is close to the maximum predicted
        output_command = zero_motion_vector  # Stop movement
        trigger_grasp = True
    elif p <= 0.5:
        # Current position is likely bad and a large move is needed, so raise the gripper
        modified_v_t_star = modify_motion_to_raise_gripper(v_t_star)
        output_command = modified_v_t_star  # Execute raised motion
        trigger_grasp = False
    else:
        # Current position is okay; move towards the best predicted location
        output_command = v_t_star  # Execute best motion towards table height
        trigger_grasp = False

    return output_command, trigger_grasp

def find_best_motion_cem(I_t, g):
    # Implementation of CEM to find v_t that maximizes g(I_t, v_t)
    # ... (details omitted for brevity: sampling, evaluation, refitting a Gaussian)
    # Ensure sampled motions stay within the workspace and obey constraints
    pass

def modify_motion_to_raise_gripper(v_t):
    # Modify the motion vector to include lifting the gripper,
    # e.g., sample a height offset between 4-10 cm above the table
    pass
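The find_best_motion_cem placeholder could be fleshed out roughly as follows. This is a minimal sketch: the motion dimensionality, initial sampling variance, and clip_to_workspace helper are assumptions not specified in the paper, while the batch sizes (64 samples, 6 elites, 3 iterations) follow the description above, and g.predict is assumed to return a scalar success probability.

import numpy as np

MOTION_DIM = 5           # assumed dimensionality of the task-space motion vector
INITIAL_VARIANCE = 0.05  # assumed initial sampling variance
WORKSPACE_LIMIT = 0.10   # assumed symmetric workspace bound, for illustration only

def clip_to_workspace(candidates):
    # Assumed helper: keep sampled motions within workspace bounds.
    return np.clip(candidates, -WORKSPACE_LIMIT, WORKSPACE_LIMIT)

def find_best_motion_cem(I_t, g, num_iterations=3, num_samples=64, num_elites=6):
    """Derivative-free search for the motion v_t that maximizes g(I_t, v_t)."""
    mean = np.zeros(MOTION_DIM)
    cov = np.eye(MOTION_DIM) * INITIAL_VARIANCE
    best_v, best_score = np.zeros(MOTION_DIM), -np.inf
    for _ in range(num_iterations):
        candidates = np.random.multivariate_normal(mean, cov, size=num_samples)
        candidates = clip_to_workspace(candidates)
        scores = np.array([g.predict(I_t, v) for v in candidates])
        elite_idx = np.argsort(scores)[-num_elites:]   # indices of the best candidates
        if scores[elite_idx[-1]] > best_score:
            best_score = scores[elite_idx[-1]]
            best_v = candidates[elite_idx[-1]]
        elites = candidates[elite_idx]
        mean = elites.mean(axis=0)                     # refit the Gaussian to the elites
        cov = np.cov(elites, rowvar=False) + 1e-4 * np.eye(MOTION_DIM)
    return best_v

CEM is a natural fit here because it only requires batched forward evaluations of the network, with no gradients through the image or the motion vector.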
The paper interprets this method as a form of reinforcement learning, specifically related to fitted Q-iteration. By training the network to predict the outcome of moving from p_t to p_T (i.e., the action p_T - p_t), it approximates a Q-function Q(s_t, a_t) = E[R | s_t, a_t], where the state s_t is the image I_t and the action a_t is represented by the pose difference p_T - p_t. This assumes a transitive property of actions, which simplifies training compared to predicting success from the immediate next-step action p_{t+1} - p_t, though it does not capture the potential effects of intermediate movements on the environment.
Experimental results on novel objects not seen during training demonstrate the effectiveness of the approach. Compared with a random baseline, a hand-engineered grasping system using depth sensing and calibration, and an open-loop method using the same CNN and data but requiring calibration, the continuous servoing method achieved significantly lower failure rates, particularly in the challenging "without replacement" setting, where grasped objects are removed from the bin. This highlights the importance of both the data scale and the continuous visual feedback for robustness and adaptation. Results also show that performance improves with increasing dataset size, suggesting that even larger datasets could yield further improvements.
From a practical implementation perspective, deploying this system involves:
Hardware Setup: A robotic arm with enough degrees of freedom (7 DoF in the paper) and a suitable gripper. A monocular camera needs to be mounted to provide an egocentric view that includes the gripper and the workspace. Multiple robots are highly beneficial for scaling data collection, but a single robot could be used for deployment after training.
Data Collection System: Building the infrastructure to autonomously execute grasp attempts, record images and poses, and automatically label success or failure. This requires integrating robot control, vision capture, pose tracking, and outcome-evaluation logic. The success-evaluation method (a gripper state check plus an image-subtraction drop test) is a practical detail; a hedged sketch of such a check appears after this list.
Training Pipeline: Implementing the CNN architecture (Figure 3 of the paper) in a deep learning framework (e.g., TensorFlow, PyTorch). The input processing involves concatenating the images and broadcasting the motion vector; training uses a standard binary cross-entropy loss on the collected dataset, with batch normalization to stabilize training. The dataset size (over 800,000 grasp attempts, each contributing multiple image/pose samples along its trajectory) implies substantial storage and processing requirements; a minimal training-step sketch also follows this list.
Real-time Inference and Control: Implementing the continuous servoing loop. This involves capturing the current image, running the CEM optimization to find the best motion vt⋆ by querying the trained CNN multiple times per control cycle, applying the heuristics (Algorithm 1), and sending the resulting task-space command to the robot's low-level controller. Achieving the reported control frequency (2-5 Hz) requires efficient CNN inference and CEM computation.
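For the outcome-evaluation logic mentioned under Data Collection System, a hedged sketch of a gripper-state check combined with an image-subtraction drop test might look like the following. The thresholds and argument names are illustrative assumptions, not values from the paper.

import numpy as np

def grasp_succeeded(gripper_closed_fully, img_before_drop, img_after_drop,
                    diff_threshold=30, pixel_threshold=2000):
    """Heuristic success check: gripper state plus an image-subtraction drop test.
    Images are assumed to be HxWx3 RGB arrays; thresholds are illustrative."""
    # If the fingers closed completely, nothing was held between them.
    if gripper_closed_fully:
        return False
    # Drop the presumed object back into the bin, then compare images taken
    # before and after the drop: a large change suggests an object actually
    # fell, i.e. it had been held by the gripper.
    diff = np.abs(img_after_drop.astype(np.int32) - img_before_drop.astype(np.int32))
    changed_pixels = np.count_nonzero(diff.max(axis=-1) > diff_threshold)
    return changed_pixels > pixel_threshold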
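And for the Training Pipeline item, a minimal training-step sketch reusing the hypothetical GraspSuccessNet from the architecture sketch above; the batch field names and optimizer settings are assumptions.

import torch
import torch.nn as nn

model = GraspSuccessNet()                      # sketch model defined earlier
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
loss_fn = nn.BCELoss()                         # binary cross-entropy on predicted success

def train_step(batch):
    # Batch fields are assumed names: current/pre-grasp images, motion p_T - p_t, label ℓ.
    pred = model(batch["current_img"], batch["pregrasp_img"], batch["motion"])
    loss = loss_fn(pred.squeeze(1), batch["label"].float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()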
Potential limitations include the assumption that test-time scenarios resemble training scenarios. Generalization to drastically different robot platforms, gripper types (beyond pinch grasps), or environments (e.g., grasping from shelves) may require collecting more diverse data covering these variations. While robust to variations in camera calibration and gripper wear, it might not generalize to entirely new hardware or significant structural changes without retraining on relevant data.
Overall, this work demonstrates a powerful data-driven approach that leverages large-scale self-supervised data and continuous visual servoing to achieve robust, calibration-free robotic grasping. The practical setup and results provide a strong case for the viability of learning complex manipulation skills directly from raw sensor data and interaction experience.