
Q-attention: Enabling Efficient Learning for Vision-based Robotic Manipulation

Published 31 May 2021 in cs.RO, cs.AI, cs.CV, and cs.LG | (2105.14829v2)

Abstract: Despite the success of reinforcement learning methods, they have yet to have their breakthrough moment when applied to a broad range of robotic manipulation tasks. This is partly because reinforcement learning algorithms are notoriously difficult and time-consuming to train, which is exacerbated when training from images rather than full-state inputs. As humans perform manipulation tasks, our eyes closely monitor every step of the process, with our gaze focusing sequentially on the objects being manipulated. With this in mind, we present our Attention-driven Robotic Manipulation (ARM) algorithm, a general manipulation algorithm that can be applied to a range of sparse-rewarded tasks given only a small number of demonstrations. ARM splits the complex task of manipulation into a three-stage pipeline: (1) a Q-attention agent that extracts relevant pixel locations from RGB and point cloud inputs, (2) a next-best pose agent that accepts crops from the Q-attention agent and outputs poses, and (3) a control agent that takes the goal pose and outputs joint actions. We show that current learning algorithms fail on a range of RLBench tasks, whilst ARM is successful.

Citations (106)

Summary

  • The paper introduces the ARM algorithm, integrating a novel Q-attention module to extract key pixel information from RGB and point cloud inputs.
  • It employs a three-stage process with a next-best pose agent and control agent, achieving superior performance on RLBench tasks.
  • Ablation studies show that demo augmentation and a confidence-aware critic enhance learning stability and robustness in sparse reward settings.

The paper "Q-attention: Enabling Efficient Learning for Vision-based Robotic Manipulation" (2105.14829) introduces an Attention-driven Robotic Manipulation (ARM) algorithm for manipulation tasks with sparse rewards. This algorithm uses a novel Q-attention module to extract relevant pixel locations from RGB and point cloud inputs, a next-best pose agent, and a control agent to output joint actions. The approach achieves improved performance on a set of RLBench tasks compared to existing RL algorithms.

Methodological Overview

The ARM algorithm pipelines manipulation into three stages. First, the Q-attention module identifies relevant pixel locations from RGB images and point clouds. This module treats the image as an environment and pixel locations as actions, learning to focus on key areas. Second, a next-best pose agent receives crops from the Q-attention module and predicts 6D poses using a confidence-aware critic. Finally, a control agent uses these poses to output joint actions, controlling the robot's movements. The method incorporates demonstrations to improve initial exploration, using a keyframe discovery strategy to choose relevant frames and a demo augmentation method to increase the proportion of informative transitions in the replay buffer.
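As a rough illustration of the first stage, the 2D argmax over the per-pixel Q map and the subsequent crop could be sketched as follows (the function name, crop size, and toy inputs are hypothetical; the real Q-attention network produces the Q map from RGB and point cloud inputs):

```python
import numpy as np

def q_attention_crop(rgb, q_values, crop_size=16):
    """Pick the highest-Q pixel and crop around it (illustrative sketch).

    rgb:      (H, W, 3) image
    q_values: (H, W) per-pixel Q map
    """
    h, w = q_values.shape
    y, x = divmod(int(np.argmax(q_values)), w)  # 2D argmax over pixel "actions"
    half = crop_size // 2
    # Clamp the window so the crop stays inside the image bounds
    y0 = min(max(y - half, 0), h - crop_size)
    x0 = min(max(x - half, 0), w - crop_size)
    return (x, y), rgb[y0:y0 + crop_size, x0:x0 + crop_size]

# Toy example: the Q map peaks at pixel (row 40, col 25)
rgb = np.zeros((64, 64, 3))
q = np.zeros((64, 64))
q[40, 25] = 1.0
(px, py), crop = q_attention_crop(rgb, q)
print(px, py, crop.shape)  # 25 40 (16, 16, 3)
```

The crop is then handed to the next-best pose agent, which sees only the attended region rather than the full image.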

Algorithm 1 outlines the ARM procedure:

Algorithm 1: ARM
Input: Initial Q-attention network Q_ψ, twin critic networks Q_θ1 and Q_θ2, and actor network π_φ
Initialize: Target networks Q_ψ′ ← Q_ψ, Q_θ1′ ← Q_θ1, Q_θ2′ ← Q_θ2
Initialize: Replay buffer D with demos; apply keyframe discovery and demo augmentation
for each iteration do
    for each environment step do
        (b_t, p_t, z_t) ← o_t {observation: RGB, point cloud, proprioception}
        (x_t, y_t) ← argmax2D_{a′} Q_ψ((b_t, p_t), a′) {Q-attention: pixel coordinates}
        (b′_t, p′_t) ← crop(b_t, p_t, (x_t, y_t)) {crop RGB and point cloud}
        a_t ∼ π_φ(b′_t, p′_t, z_t) {sample pose from policy}
        o_{t+1}, r ← env.step(a_t) {execute action}
        D ← D ∪ {(o_t, a_t, r, o_{t+1}, (x_t, y_t))} {store transition}
    end
    for each gradient step do
        ψ ← ψ − ∇_ψ J_Q(ψ) {update Q-attention}
        θ_i ← θ_i − ∇_{θ_i} J_Q(θ_i) for i ∈ {1, 2} {update critics}
        φ ← φ − ∇_φ J_π(φ) {update policy}
        ψ′ ← τψ + (1 − τ)ψ′ {soft-update Q-attention target}
        θ_i′ ← τθ_i + (1 − τ)θ_i′ for i ∈ {1, 2} {soft-update critic targets}
    end
end
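The target updates at the end of Algorithm 1 are standard Polyak averaging. A minimal sketch, treating network parameters as plain lists of floats purely for illustration:

```python
def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging, matching the last two update lines of Algorithm 1:
    target ← τ·online + (1 − τ)·target. tau value is a typical default."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_params, target_params)]

new_target = soft_update([0.0], [1.0], tau=0.1)
print(new_target)  # [0.1]
```

Keeping τ small makes the targets trail the online networks slowly, which stabilizes the bootstrapped critic updates.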

Key Innovations

The Q-attention mechanism is a key contribution: an off-policy hard attention mechanism learned via Q-learning. The confidence-aware Q function, which predicts pixel-wise Q values along with per-pixel confidence, improves actor-critic stability. Additionally, the keyframe discovery and demo augmentation methods improve how demonstrations are utilized in RL.
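One way such a confidence-aware critic could weight its TD objective is sketched below. This is an illustrative heteroscedastic-style weighting, not the paper's exact loss; the function name and regularizer coefficient are assumptions:

```python
import numpy as np

def confidence_weighted_td_loss(q_pred, q_target, confidence, reg=0.1):
    """Illustrative confidence-weighted TD loss (assumed form).

    Squared TD errors are scaled by the predicted confidence, while the
    -reg*log(c) term keeps the critic from driving confidence to zero
    just to escape the error penalty.
    """
    q_pred, q_target, confidence = map(np.asarray, (q_pred, q_target, confidence))
    td_error = (q_pred - q_target) ** 2
    return float(np.mean(confidence * td_error - reg * np.log(confidence)))

# A perfect prediction at full confidence incurs zero loss
loss = confidence_weighted_td_loss([1.0], [1.0], [1.0])
print(loss)  # 0.0
```

Under this kind of weighting, the critic can express low confidence on uncertain pixels instead of propagating large, noisy TD errors into the actor update.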

Experimental Results

The ARM algorithm was evaluated on eight RLBench tasks, demonstrating its ability to solve challenging, sparsely rewarded manipulation tasks. The algorithm outperforms baseline methods, including behavioral cloning, SAC+AE, DAC, SQIL, and DrQ. Ablation studies validate the importance of the Q-attention module, with the confidence-aware critic and demo augmentation contributing to overall stability and performance. The method is also robust to varying numbers of demonstrations and crop sizes.

Implementation Details

The Q-attention network uses a lightweight U-Net architecture. The next-best pose agent is a modified version of SAC with a confidence-aware soft Q-function. The control agent performs motion planning with the SBL planner from OMPL. Keyframe discovery identifies frames where the gripper state changes or joint velocities are near zero. Demo augmentation stores transitions from intermediate points along each demonstration to the keyframe states, maximizing the utility of each demonstration.
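The keyframe discovery heuristic described above can be sketched as follows (the function name and velocity threshold are assumptions, not values from the paper):

```python
def discover_keyframes(gripper_states, joint_velocities, vel_eps=1e-3):
    """Mark frames where the gripper toggles or the joints are near-stationary,
    following the heuristic described for ARM (threshold is an assumption)."""
    keyframes = []
    for t in range(1, len(gripper_states)):
        gripper_changed = gripper_states[t] != gripper_states[t - 1]
        near_stationary = max(abs(v) for v in joint_velocities[t]) < vel_eps
        if gripper_changed or near_stationary:
            keyframes.append(t)
    return keyframes

# Toy trajectory: gripper closes at t=2, arm comes to rest at t=3
grippers = [0, 0, 1, 1]
vels = [[0.5], [0.4], [0.3], [0.0]]
print(discover_keyframes(grippers, vels))  # [2, 3]
```

Demo augmentation would then store transitions from the intermediate frames between keyframes to the next keyframe state, so each demonstration yields many informative replay-buffer entries rather than one.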

Implications and Future Directions

The ARM algorithm represents a step toward more efficient and generalizable robotic manipulation. The Q-attention mechanism and confidence-aware critic offer potential for broader application in RL. Future research could focus on extending the approach to dynamic environments, integrating multiple camera inputs, and improving sample efficiency for real-world training.
