
DSGD: Densely Supervised Grasp Detector

Updated 6 January 2026
  • The paper demonstrates a novel hierarchical architecture combining global, region, and pixel-level supervision, achieving up to 97.5% detection accuracy.
  • The framework employs DenseNet-style layer-wise feature fusion across seven dense blocks to effectively merge low- and high-level features for robust grasp prediction.
  • Real-time inference with a confidence-based branch selection strategy yields over 90% robotic grasping success while processing at 9–12 fps.

The Densely Supervised Grasp Detector (DSGD) is a deep learning framework for robotic grasp detection that incorporates a hierarchical architecture with parallel global-, region-, and pixel-level supervision. The method leverages a DenseNet-style convolutional neural network backbone to enable layer-wise feature fusion, supporting precise and robust grasp detection at multiple levels of granularity within RGB-D images. DSGD generates and evaluates grasp candidates at multiple image scales and uses a confidence-based selection mechanism during inference, leading to state-of-the-art grasp detection accuracy and high robotic success rates in real time (Asif et al., 2018).

1. Architectural Overview

DSGD is structured around a shared DenseNet-style backbone that extracts multi-scale features from the input RGB-D image. Notably, the input is represented as a 3-channel image with depth replacing the Blue channel and normalized to [0,255]. The architecture is organized into three parallel branches:

  • Global Grasp Network (GGN): Predicts a single five-parameter grasp for the entire image along with an associated confidence score.
  • Region Grasp Network (RGN): Integrates a Salient Region Network (SRN) which proposes axis-aligned candidate rectangles—salient parts—each processed by a Region Grasp Prediction Network (RGPN). The RGPN outputs a class-specific grasp, segmentation mask, and confidence for each region.
  • Pixel Grasp Network (PGN): Implements a fully convolutional branch that, after upsampling, outputs per-pixel heatmaps corresponding to the five grasp parameters.

During inference, each branch independently produces grasp candidates and corresponding confidence scores. The selection strategy prioritizes the highest-confidence grasp, first from the RGN (if above threshold δ_rgn), then PGN (if above δ_pgn), and finally GGN as a fallback. This ensembling process overcomes weaknesses intrinsic to each individual branch.
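As a minimal sketch of the RG-D input encoding mentioned above (depth replacing the Blue channel, normalized to [0,255]), the snippet below assumes an 8-bit RGB image and a raw float depth map; the min-max normalization details are illustrative assumptions, not the paper's exact preprocessing:

```python
import numpy as np

def make_rgd_input(rgb, depth):
    """Build the 3-channel RG-D input: depth replaces the Blue channel.

    rgb:   (H, W, 3) uint8 image
    depth: (H, W) float depth map (e.g., metres from a Kinect)
    Returns an (H, W, 3) uint8 array with channels [R, G, D].
    """
    # Normalize depth to [0, 255] (illustrative min-max scaling).
    d = depth.astype(np.float32)
    d = (d - d.min()) / max(d.max() - d.min(), 1e-6) * 255.0

    rgd = rgb.copy()
    rgd[..., 2] = d.astype(np.uint8)  # replace Blue channel with depth
    return rgd
```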

2. Layer-wise Feature Fusion via DenseNet

The DSGD backbone employs seven dense blocks, each implementing dense connectivity such that every block receives input from all previous blocks via channel-wise concatenation:

$$X_l = \text{Conv}_{3 \times 3}\big(\text{ReLU}\big(\text{Conv}_{1 \times 1}([X_0, \ldots, X_{l-1}])\big)\big)$$

This facilitates direct propagation of both low-level and high-level features across the architecture. Feature taps for the three branches occur at discrete depths: GGN utilizes the pooled output from Dense Block 5, RGN employs ROI-pooled features from Block 4 (propagated through Block 5), and PGN accesses Block 7 output followed by dedicated upsampling and convolution for spatial localization. The layer-wise fusion provided by dense connectivity offers implicit skip connections and multi-scale information flow without explicit skip links, supporting accurate and robust grasp representation learning.
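A PyTorch sketch of this dense connectivity is shown below; it follows the composite function in the equation above, while the 1×1 bottleneck width and growth rate chosen here are illustrative assumptions rather than DSGD's reported configuration:

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """One layer of a dense block: X_l = Conv3x3(ReLU(Conv1x1([X_0, ..., X_{l-1}])))."""

    def __init__(self, in_channels, growth_rate):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, 4 * growth_rate, kernel_size=1)
        self.conv3 = nn.Conv2d(4 * growth_rate, growth_rate, kernel_size=3, padding=1)

    def forward(self, features):
        # 'features' is the list [X_0, ..., X_{l-1}]; concatenate channel-wise.
        x = torch.cat(features, dim=1)
        return self.conv3(torch.relu(self.conv1(x)))

class DenseBlock(nn.Module):
    """A dense block: every layer sees the concatenation of all previous outputs."""

    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList(
            DenseLayer(in_channels + i * growth_rate, growth_rate)
            for i in range(num_layers)
        )

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(features))
        return torch.cat(features, dim=1)
```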

3. Grasp Parameterization, Confidence, and Loss Functions

In DSGD, every grasp is modeled as an oriented rectangle parameterized by:

$$g = [x, y, w, h, \theta]^T$$

where $(x, y)$ denotes the center pixel coordinates, $w$ and $h$ are the width and height, and $\theta$ is the rotation angle within $[-\pi/2, \pi/2)$. Each branch also predicts a scalar confidence $\rho \in [0, 1]$ per grasp. In the RGN, grasps are class-specific (graspable vs. non-graspable):

$$G_r = [x_r^n,\, y_r^n,\, w_r^n,\, h_r^n,\, \theta_r^n,\, \rho_r^n], \quad n \in \{0, 1\}$$

For the PGN, each parameter is encoded as a heatmap:

$$G_p = \{M_{xy}, M_w, M_h, M_\theta\} \in \mathbb{R}^{s \times W \times H}$$

with $s = 1$ for the position and size heatmaps and $s = N_\theta = 50$ for the angle heatmap. Supervision is realized through regression and classification losses:

  • Regression: $L_\text{reg}(R, R^*) = \|R - R^*\|_2 / \|R^*\|_2$
  • Classification (cross-entropy): $L_\text{cls}(p, p^*) = -\sum_c p^*_c \log p_c$
  • Hyperparameters $\lambda_1 = \lambda_2 = \lambda_3 = 0.4$ balance regression and classification terms in all experiments.

Detailed loss function compositions are specified per-branch, incorporating grasp location/size regression, angle and confidence classification, and, for RGN, a segmentation loss:

$$L_\text{seg} = -\frac{1}{|S_i|} \sum_{j \in S_i} \big[ y_j \log \hat{y}_j + (1 - y_j) \log (1 - \hat{y}_j) \big]$$
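The sketch below illustrates these loss terms in PyTorch. The individual terms follow the formulas above; however, the exact per-branch composition and which weight multiplies which term are assumptions of this sketch, since the paper specifies them separately for each branch:

```python
import torch
import torch.nn.functional as F

def regression_loss(pred, target, eps=1e-8):
    """L_reg(R, R*) = ||R - R*||_2 / ||R*||_2 (relative L2 error)."""
    return torch.norm(pred - target, p=2) / (torch.norm(target, p=2) + eps)

def classification_loss(logits, target_class):
    """Cross-entropy over classes (e.g., angle bins or graspable vs. non-graspable)."""
    return F.cross_entropy(logits, target_class)

def segmentation_loss(mask_logits, mask_gt):
    """Per-pixel binary cross-entropy over the predicted region mask."""
    return F.binary_cross_entropy_with_logits(mask_logits, mask_gt)

def branch_loss(loc_pred, loc_gt, angle_logits, angle_gt,
                conf_logits, conf_gt, mask_logits, mask_gt,
                lambdas=(0.4, 0.4, 0.4)):
    """Illustrative composite loss for one branch (term weighting is an assumption)."""
    l1, l2, l3 = lambdas
    return (regression_loss(loc_pred, loc_gt)
            + l1 * classification_loss(angle_logits, angle_gt)
            + l2 * classification_loss(conf_logits, conf_gt)
            + l3 * segmentation_loss(mask_logits, mask_gt))
```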

4. Hierarchical Branch Design

4.1 Global Grasp Network (GGN)

GGN combines a Global Grasp Prediction Network (GGPN), which regresses the five grasp parameters, and a Grasp Evaluation Network (GEN), which assesses confidence by reinjecting the rectangle prediction into the input and forwarding through a small FCN.
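One plausible way to "reinject" the predicted rectangle into the input for evaluation is to rasterize it as an extra channel stacked onto the image, which a small FCN can then score; the exact encoding used by GEN is not specified here, so treat this sketch as an assumption:

```python
import cv2
import numpy as np

def reinject_grasp(image_rgd, grasp):
    """Rasterize a predicted grasp rectangle and stack it with the input image.

    grasp = (x, y, w, h, theta) with theta in radians.
    Returns an (H, W, 4) array: the RG-D image plus a binary rectangle mask,
    which a small FCN could consume to predict grasp confidence.
    """
    x, y, w, h, theta = grasp
    corners = cv2.boxPoints(((x, y), (w, h), np.degrees(theta)))
    mask = np.zeros(image_rgd.shape[:2], dtype=np.uint8)
    cv2.fillPoly(mask, [corners.astype(np.int32)], 255)
    return np.dstack([image_rgd, mask])
```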

4.2 Region Grasp Network (RGN)

RGN is composed of the SRN for salient part proposal (128 axis-aligned rectangles) and the RGPN, which processes each proposal via an ROI-pooled crop and predicts local grasp parameters, angle, confidence, and a segmentation mask at $14 \times 14$ resolution.
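A sketch of the region cropping step is given below, using torchvision's `roi_align` as a stand-in for the ROI pooling described above; the crop size, box format, and scaling assumptions are illustrative:

```python
import torch
from torchvision.ops import roi_align

def crop_region_features(features, proposals, image_size, output_size=14):
    """ROI-align each salient-part proposal to a fixed-size crop for the RGPN.

    features:  (N, C, H', W') backbone feature map (e.g., Dense Block 4 output)
    proposals: list of (K, 4) tensors of axis-aligned boxes (x1, y1, x2, y2)
               in image coordinates, e.g., the 128 SRN proposals per image.
    """
    # Map image coordinates to feature-map coordinates (assumes a square input).
    spatial_scale = features.shape[-1] / image_size
    return roi_align(features, proposals, output_size=output_size,
                     spatial_scale=spatial_scale)
```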

4.3 Pixel Grasp Network (PGN)

PGN operates fully convolutionally, decoding the final dense block feature map through upsampling and convolution to yield four heatmaps representing (x, y), w, h, and θ (with angular classification over 50 bins), providing dense supervisory signal for spatially precise local grasping.
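A minimal decoding sketch for the PGN heatmaps follows; the peak-selection rule and the angle-bin-to-radian mapping are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def decode_pgn(m_xy, m_w, m_h, m_theta, n_bins=50):
    """Decode PGN heatmaps into a single grasp [x, y, w, h, theta] plus a score.

    m_xy, m_w, m_h : (H, W) heatmaps for position confidence, width, height
    m_theta        : (n_bins, H, W) angle-classification heatmap
    """
    y, x = np.unravel_index(np.argmax(m_xy), m_xy.shape)  # peak of the position map
    w, h = m_w[y, x], m_h[y, x]
    bin_idx = int(np.argmax(m_theta[:, y, x]))
    theta = -np.pi / 2 + (bin_idx + 0.5) * np.pi / n_bins  # bin centre in [-pi/2, pi/2)
    return np.array([x, y, w, h, theta]), float(m_xy[y, x])
```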

5. Inference, Confidence-based Selection, and Robustness

Inference executes all three branches in parallel, each outputting candidate grasps and confidence scores:

  1. Select the RGN's highest-confidence grasp if $\rho_\text{rgn} \geq \delta_\text{rgn} = 0.95$.
  2. Else, select the PGN's highest-confidence grasp if $\rho_\text{pgn} \geq \delta_\text{pgn} = 0.90$.
  3. Otherwise, revert to the GGN grasp.

This hierarchical selection ensures prioritization of precise part-based or pixelwise solutions when reliable, but enables fallback to the global hypothesis to avoid system failure. The design addresses limitations such as global averaging errors (for symmetric objects), pixel noise in local predictions, and false salient region proposal outliers.
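The three-step selection described above reduces to a short cascade; the sketch below assumes each branch exposes its best (grasp, confidence) pair:

```python
def select_grasp(rgn, pgn, ggn, delta_rgn=0.95, delta_pgn=0.90):
    """Hierarchical branch selection: RGN first, then PGN, with GGN as fallback.

    Each argument is a (grasp, confidence) pair holding that branch's
    highest-confidence prediction.
    """
    rgn_grasp, rgn_conf = rgn
    pgn_grasp, pgn_conf = pgn
    ggn_grasp, _ = ggn

    if rgn_conf >= delta_rgn:
        return rgn_grasp
    if pgn_conf >= delta_pgn:
        return pgn_grasp
    return ggn_grasp
```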

6. Experimental Protocol and Empirical Results

DSGD was validated across several benchmarks:

  • Cornell Grasp Dataset (885 RGB-D images, 240 isolated objects): 97.5% accuracy (object-wise split), surpassing prior best of ~89%.
  • Multi-object dataset (6,896 RGB-D scenes, 2–8 objects/scene): 97.2% detection accuracy (DenseNet) and 95.8% (ResNet), outperforming standalone PGN (87.4%) and RGN (94.7%).
  • Real-robot evaluation: Deployed on a 7-DOF arm with parallel gripper and Kinect sensor, 90% grasping success rate over 200 trials.

Evaluation employed the rectangle metric: a grasp prediction is correct iff $|\theta_\text{pred} - \theta_\text{gt}| < 30^\circ$ and the Jaccard index $J(\mathcal{R}_\text{pred}, \mathcal{R}_\text{gt}) > 25\%$, where $J(\mathcal{R}^*, \mathcal{R}) = |\mathcal{R}^* \cap \mathcal{R}| / |\mathcal{R}^* \cup \mathcal{R}|$. Robotic success required the robot to lift and hold an object for 3 seconds, confirmed by a force sensor.
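The rectangle metric can be computed directly from oriented rectangle corners, for example with shapely polygons; this is a sketch of the metric as stated above, not the authors' evaluation code:

```python
import numpy as np
from shapely.geometry import Polygon

def rect_corners(g):
    """Corner points of an oriented grasp rectangle g = [x, y, w, h, theta]."""
    x, y, w, h, theta = g
    c, s = np.cos(theta), np.sin(theta)
    half = np.array([[-w / 2, -h / 2], [w / 2, -h / 2],
                     [w / 2,  h / 2], [-w / 2,  h / 2]])
    rot = np.array([[c, -s], [s, c]])
    return half @ rot.T + np.array([x, y])

def rectangle_metric(pred, gt, angle_tol=np.deg2rad(30), jaccard_thresh=0.25):
    """Correct iff the angle error is below 30 degrees and the Jaccard index exceeds 25%."""
    angle_ok = abs(pred[4] - gt[4]) < angle_tol
    p, q = Polygon(rect_corners(pred)), Polygon(rect_corners(gt))
    jaccard = p.intersection(q).area / p.union(q).area
    return angle_ok and jaccard > jaccard_thresh
```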

Inference runs in real time (9–12 fps, "DSGD-lite" variant at 12 fps with 97.1% Cornell accuracy). Ablation studies demonstrate the incremental value of each branch:

Branch accuracy (Cornell, object-wise split, DenseNet backbone):
  • GGN alone: 88.9%
  • PGN alone: 95.4%
  • RGN alone: 96.8%
  • Full DSGD: 97.5%

Altering the dense block growth rate $W$ and layer count $N_l$ trades accuracy for speed.

7. Significance, Limitations, and Practical Implications

DSGD's combination of dense feature fusion and triply-supervised architecture achieves state-of-the-art grasp detection accuracy (>97%) and high robotic grasping success (>90%) on both previously seen and unseen objects, including in cluttered environments. The multi-level supervision and hierarchical ensemble enable robustness to diverse failure modes: global contextual errors, pixel noise, and local proposal inaccuracies. The system's real-time inference and hierarchical decision strategy offer practical benefits for deployment in robotic systems where efficiency, accuracy, and failure tolerance are essential (Asif et al., 2018).

References
  1. Asif et al. (2018). DSGD: Densely Supervised Grasp Detector.
