HDCNet: Hybrid Architecture for Depth Completion

Updated 30 June 2026

HDCNet is a hybrid deep learning architecture that combines a Transformer branch and a CNN branch for dense depth completion, targeting transparent and reflective objects.
It implements an encode–fuse–decode pipeline with shallow and bottleneck fusion modules that integrate semantic (RGB) and geometric (depth) features.
In real-robot grasping trials, depth maps completed by HDCNet raise overall grasp success from 15.6% to 75.6% by improving perception of non-Lambertian surfaces.

HDCNet is a hybrid deep learning architecture designed for dense depth completion in challenging robotic settings, particularly scenes with transparent and reflective objects, where conventional RGB-D sensors fail because of non-Lambertian surface properties. Integrated into manipulation pipelines, HDCNet produces dense, high-fidelity depth reconstructions that support precise perception and robust grasping in scenes containing transparent or specular materials (Xie et al., 10 Nov 2025).

1. Architectural Design and Core Modules

HDCNet implements an encode–fuse–decode paradigm with critical innovations in both multi-modal representation and fusion.

Dual-Branch Transformer–CNN Encoder:

The RGB branch uses a Swin Transformer backbone to construct multi-scale feature maps $\{F^{(r)}_1, F^{(r)}_2, F^{(r)}_3, F^{(r)}_4\}$, using shifted-window attention to capture long-range semantic and appearance information.
The depth branch adopts a ResNet-style CNN to yield the corresponding features $\{F^{(d)}_1, \ldots, F^{(d)}_4\}$, focusing on detailed local geometry.

Shallow Multimodal Fusion Module (SMFM):

At encoder stages $i=1,2,3$, modality-aligned features are fused by channel-wise aggregation, bottleneck compression/expansion, gating, and cross-modal summation:

$\mathbf{z}_{m} = \mathrm{GAP}(\mathbf{F}_{m}) \in \mathbb{R}^{C},\quad m \in \{r,d\}$

$\mathbf{s}_{\mathrm{sq}} = \mathrm{Concat}(\delta(\mathbf{W}_j\mathbf{z}_m))_{j=1}^4$

$\mathbf{s}_{\mathrm{ex}} = \sigma(\mathbf{W}_5 \mathbf{s}_{\mathrm{sq}})$

$\tilde{\mathbf{F}}_{m} = \mathbf{s}_{\mathrm{ex}} \odot \mathbf{F}_{m}$

$\mathbf{F}_{fused} = \tilde{\mathbf{F}}_{r} + \tilde{\mathbf{F}}_{d}$

This module is lightweight and enables effective integration of appearance and geometric cues at early network stages.
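
The SMFM equations above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: $\delta$ is taken to be ReLU, each of the four squeeze projections maps $C$ to $C/4$ so that their concatenation has $C$ entries, each modality has its own weights, and the helper names (`smfm_gate`, `smfm_fuse`, `random_params`) are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def smfm_gate(F, W_sq, W_ex):
    """Gate one modality's feature map F of shape (C, H, W)."""
    z = F.mean(axis=(1, 2))                        # global average pooling -> (C,)
    # Four bottleneck projections with ReLU (assumed delta), concatenated to (C,)
    s_sq = np.concatenate([np.maximum(Wj @ z, 0.0) for Wj in W_sq])
    s_ex = sigmoid(W_ex @ s_sq)                    # per-channel gates in (0, 1)
    return s_ex[:, None, None] * F                 # channel-wise reweighting

def smfm_fuse(F_r, F_d, params_r, params_d):
    """Cross-modal summation of the gated RGB and depth features."""
    return smfm_gate(F_r, *params_r) + smfm_gate(F_d, *params_d)

def random_params(rng, C):
    """Hypothetical random weights, used only to exercise the shapes."""
    W_sq = [0.1 * rng.standard_normal((C // 4, C)) for _ in range(4)]
    W_ex = 0.1 * rng.standard_normal((C, C))
    return W_sq, W_ex

rng = np.random.default_rng(0)
C, H, Wd = 24, 8, 8
F_r = rng.standard_normal((C, H, Wd))
F_d = rng.standard_normal((C, H, Wd))
fused = smfm_fuse(F_r, F_d, random_params(rng, C), random_params(rng, C))
print(fused.shape)  # (24, 8, 8)
```

Because the gate is a vector of $C$ scalars per modality, the module adds only a few small matrix-vector products per stage, which is consistent with its description as lightweight.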

Bottleneck Transformer–Mamba Fusion Module (BTMFM):

At the bottleneck, where spatial resolution is lowest, the fused features are further integrated by applying, in sequence:

Multi-head self-attention (MHSA) for global context,
Mamba (state-space model) for long-range sequential dependency modeling,
Feed-forward MLP with layer normalization.

The sequence:

$\mathbf{F}' = \mathrm{LN}(\mathbf{F}_f + \mathrm{MHSA}(\mathbf{F}_f))$

$\mathbf{F}'' = W_{\mathrm{down}}\left( \mathrm{SSM}(\delta(\mathrm{Conv}(W_{\mathrm{up}}\mathbf{F}'))) \right) \odot \delta(W_{\mathrm{up}}\mathbf{F}')$

The feed-forward MLP with layer normalization is then applied to $\mathbf{F}''$ to produce the bottleneck output.

This enables deep fusion of semantics and context, crucial for resolving ambiguous or sensor-missing regions.
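
The gated state-space step in the middle of the BTMFM can be illustrated with a small NumPy sketch. Several things here are assumptions rather than details from the source: the bottleneck features are flattened to a token sequence of shape (L, C), $\delta$ is taken to be SiLU, the convolution is a causal depthwise 1D convolution, and the SSM is a fixed diagonal linear recurrence rather than Mamba's input-dependent (selective) scan. The sketch also applies the gate before the down-projection so that shapes match when the expanded width differs from $C$; that ordering is an assumption too. All function names are hypothetical.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def causal_depthwise_conv(u, kernel):
    """u: (L, E), kernel: (K, E). Each channel is filtered along the sequence;
    the output at step t depends only on steps <= t."""
    K = kernel.shape[0]
    padded = np.vstack([np.zeros((K - 1, u.shape[1])), u])
    return sum(kernel[k] * padded[k:k + u.shape[0]] for k in range(K))

def diagonal_ssm(x, a, b, c):
    """Simplified SSM: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t, per channel."""
    h = np.zeros(x.shape[1])
    y = np.empty_like(x)
    for t in range(x.shape[0]):
        h = a * h + b * x[t]
        y[t] = c * h
    return y

def gated_ssm_branch(F1, W_up, W_down, kernel, a, b, c):
    u = F1 @ W_up                                        # expand: (L, C) -> (L, E)
    y = diagonal_ssm(silu(causal_depthwise_conv(u, kernel)), a, b, c)
    return (y * silu(u)) @ W_down                        # gate, then project to (L, C)

rng = np.random.default_rng(0)
L, C, E = 16, 24, 48
F1 = rng.standard_normal((L, C))
out = gated_ssm_branch(
    F1,
    0.1 * rng.standard_normal((C, E)),
    0.1 * rng.standard_normal((E, C)),
    0.1 * rng.standard_normal((3, E)),
    a=np.full(E, 0.9), b=np.ones(E), c=np.ones(E),
)
print(out.shape)  # (16, 24)
```

The recurrence carries information along the whole token sequence at linear cost in L, which is the property that motivates pairing it with self-attention at the bottleneck.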

Multi-Scale Decoder:

A progressive, multi-stage decoder upsamples and fuses features using spatial and channel attention, adaptive weighting, and residual connections, culminating in a full-resolution dense depth map.

2. Training Regimen and Loss Functions

HDCNet is trained end-to-end with a composite loss comprising:

Mean-Squared Error (MSE): $\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N}\sum_{i=1}^{N}(\hat{D}_i - D_i)^2$, where $\hat{D}$ is the predicted depth, $D$ the corresponding ground truth, and $N$ the number of evaluated pixels.
Normal-Guided Smoothness: a term $\mathcal{L}_{\mathrm{smooth}}$ defined on surface normals computed from the depth maps.
Combined Objective: $\mathcal{L} = \mathcal{L}_{\mathrm{MSE}} + \lambda\,\mathcal{L}_{\mathrm{smooth}}$, with the weight $\lambda$ tuned empirically.
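
The composite objective above can be sketched as follows. The source fixes only the two ingredients, an MSE term and a normal-guided term. Everything else in this sketch is an assumption: normals are approximated from pixel-space depth gradients (no camera intrinsics), the normal term is taken as one minus the cosine similarity between predicted and ground-truth normals, and the weight `lam=0.1` is a placeholder, not a value from the paper.

```python
import numpy as np

def surface_normals(depth):
    """Unit normals of shape (H, W, 3) from finite-difference depth gradients."""
    dz_dy, dz_dx = np.gradient(depth)
    n = np.stack([-dz_dx, -dz_dy, np.ones_like(depth)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

def composite_loss(pred, gt, lam=0.1):
    """MSE plus a weighted normal-consistency term (assumed cosine form)."""
    mse = np.mean((pred - gt) ** 2)
    cos = np.sum(surface_normals(pred) * surface_normals(gt), axis=-1)
    smooth = np.mean(1.0 - cos)
    return mse + lam * smooth

rng = np.random.default_rng(0)
gt = rng.random((6, 7)) + 1.0
pred = gt + 0.05 * rng.standard_normal(gt.shape)
print(composite_loss(pred, gt) > composite_loss(gt, gt))  # True
```

A constant depth offset changes only the MSE term, since it leaves the gradients and therefore the normals unchanged. The normal term instead penalizes errors in local surface orientation.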

Key hyperparameters:

Optimizer: AdamW, with the initial learning rate given in Xie et al. (10 Nov 2025)
Batch size: 8, training for 40 epochs
Input: resolution as given in Xie et al. (10 Nov 2025); C=24 channels per stage

No data augmentation beyond normalization and resizing is used.

3. Benchmarks and Quantitative Performance

HDCNet achieves state-of-the-art results on public real-world transparent-object datasets:

Dataset	RMSE ↓	REL ↓	MAE ↓	δ₁.₀₅ ↑	δ₁.₁₀ ↑	δ₁.₂₅ ↑
TransCG	0.012	0.017	0.008	92.70%	98.09%	99.89%
ClearGrasp	0.021	0.028	0.016	84.60%	96.21%	99.72%

On TransCG, HDCNet matches or outperforms prior models, notably TDCNet (RMSE 0.012 / REL 0.017 / δ₁.₀₅ 92.25%). On ClearGrasp, HDCNet (0.021 / 0.028 / 84.60%) improves on TDCNet (0.022 / 0.031 / 82.26%) and earlier baselines on the same three metrics. The ablation on TransCG indicates that SMFM and BTMFM are most effective in combination, which gives the best REL and δ₁.₀₅ (Xie et al., 10 Nov 2025).
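
For reference, the metrics in the table can be computed as in the sketch below. It uses the definitions that are standard in depth-completion benchmarks (root-mean-square error, mean absolute relative error, mean absolute error, and the percentage of pixels whose ratio max(pred/gt, gt/pred) falls below a threshold). The source does not spell these out, so the exact definitions and the function name `depth_metrics` are assumptions.

```python
import math

def depth_metrics(pred, gt, thresholds=(1.05, 1.10, 1.25)):
    """pred, gt: equal-length sequences of positive depths over valid pixels."""
    n = len(gt)
    errs = [p - g for p, g in zip(pred, gt)]
    out = {
        "RMSE": math.sqrt(sum(e * e for e in errs) / n),
        "REL": sum(abs(e) / g for e, g in zip(errs, gt)) / n,
        "MAE": sum(abs(e) for e in errs) / n,
    }
    # Threshold accuracy: share of pixels whose depth ratio is within t
    ratios = [max(p / g, g / p) for p, g in zip(pred, gt)]
    for t in thresholds:
        out[f"delta_{t:.2f}"] = 100.0 * sum(r < t for r in ratios) / n
    return out

# Toy example with four pixels (values are illustrative, not from any dataset)
metrics = depth_metrics(pred=[1.02, 0.50, 0.90, 1.20], gt=[1.00, 0.50, 0.80, 1.20])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Lower is better for RMSE, REL, and MAE, while higher is better for the threshold accuracies, matching the arrows in the table header.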

4. Robotic Grasping Empirical Results

In physical experiments using a Franka Emika Panda 7-DOF manipulator with AnyGrasp planning [Fang et al., 2023], depth maps completed by HDCNet result in significantly enhanced grasp success for challenging objects:

Object	AnyGrasp	HDCNet+AnyGrasp
Water bottle 1	0/5	4/5
Reflective foam board	3/5	5/5
Reflective box	1/5	1/5
Beverage bottle 1	0/5	5/5
Milk bottle	0/5	2/5
Water bottle 2	1/5	4/5
Beverage bottle 2	1/5	4/5
Water bottle 3	1/5	5/5
Detergent bottle	0/5	4/5
Overall Success	15.6%	75.6%

By addressing the depth ambiguity inherent to transparent/reflective surfaces, HDCNet elevates overall grasp success from 15.6% to 75.6%. These results empirically link dense, high-fidelity depth completion to downstream manipulation performance (Xie et al., 10 Nov 2025).
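
The overall rates follow directly from the per-object counts in the table, as the short check below shows. The counts are copied from the table; the variable names are illustrative.

```python
# (successes with AnyGrasp alone, successes with HDCNet+AnyGrasp), 5 trials each
trials = {
    "Water bottle 1": (0, 4),
    "Reflective foam board": (3, 5),
    "Reflective box": (1, 1),
    "Beverage bottle 1": (0, 5),
    "Milk bottle": (0, 2),
    "Water bottle 2": (1, 4),
    "Beverage bottle 2": (1, 4),
    "Water bottle 3": (1, 5),
    "Detergent bottle": (0, 4),
}
total = 5 * len(trials)
base = sum(b for b, _ in trials.values())
hdc = sum(h for _, h in trials.values())
base_rate = 100 * base / total
hdc_rate = 100 * hdc / total

print(f"AnyGrasp:        {base}/{total} = {base_rate:.1f}%")   # 7/45 = 15.6%
print(f"HDCNet+AnyGrasp: {hdc}/{total} = {hdc_rate:.1f}%")    # 34/45 = 75.6%
print(f"Gain: {hdc_rate - base_rate:.1f} percentage points")  # 60.0
```

The reflective box is the only object with no improvement (1/5 in both settings).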

5. Ablation Analyses and Innovation Attribution

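Ablation on TransCG shows that the two fusion modules (SMFM, BTMFM) are most effective in combination. The full model attains the best REL and δ₁.₀₅, SMFM alone gives a smaller gain, and BTMFM alone does not improve on the baseline. RMSE, MAE, and δ₁.₂₅ are identical across all variants, and δ₁.₁₀ is marginally highest for the baseline: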

SMFM	BTMFM	RMSE	REL	MAE	δ₁.₀₅	δ₁.₁₀	δ₁.₂₅
		0.012	0.019	0.008	92.38	98.19	99.89
✓		0.012	0.018	0.008	92.52	98.18	99.89
	✓	0.012	0.019	0.008	92.21	97.99	99.89
✓	✓	0.012	0.017	0.008	92.70	98.09	99.89
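
To read the ablation table, it can help to subtract the baseline row (no fusion module) from each variant. The snippet below does this for the three columns that vary (REL, δ₁.₀₅, δ₁.₁₀). The values are copied from the table, and the dictionary keys are illustrative names.

```python
rows = {
    "baseline":   {"REL": 0.019, "d105": 92.38, "d110": 98.19},
    "SMFM only":  {"REL": 0.018, "d105": 92.52, "d110": 98.18},
    "BTMFM only": {"REL": 0.019, "d105": 92.21, "d110": 97.99},
    "SMFM+BTMFM": {"REL": 0.017, "d105": 92.70, "d110": 98.09},
}
base = rows["baseline"]

# Difference of each variant from the baseline, rounded to table precision
deltas = {
    name: {k: round(v - base[k], 3) for k, v in row.items()}
    for name, row in rows.items()
    if name != "baseline"
}
for name, d in deltas.items():
    print(f"{name:11s} REL {d['REL']:+.3f}  d1.05 {d['d105']:+.2f}  d1.10 {d['d110']:+.2f}")
```

A negative REL difference and a positive δ difference are improvements over the baseline.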

The encoder pairs a Swin Transformer (semantic, appearance) with a ResNet (structural, geometric), so the two branches supply complementary features. Hierarchical fusion, divided into lightweight shallow stages and a context-rich bottleneck, provides a mechanism for integrating multi-modal cues at both local and global scales.

6. Practical Impact, Limitations, and Extensions

HDCNet’s key contributions are:

A hybrid, modality-specialized encoder that uses a Transformer for RGB and a CNN for depth, exploiting the strengths of each.
Hierarchical, multi-stage fusion, with SMFM for early low-level integration and BTMFM for late-stage context disambiguation.
State-of-the-art depth completion performance with practical robotic relevance, specifically a 60-percentage-point absolute improvement (15.6% to 75.6%) in grasp success on transparent and reflective objects.
Efficient training and inference recipes applicable without reliance on task-specific augmentation.

Noted limitations include the computational expense of full-resolution Transformer modules, motivating exploration of mobile-former variants for real-time latency reduction. Reliance on annotation-heavy supervised settings may be addressed in future work by extending to self-supervised or domain-adaptive regimes. There is also an opportunity to further exploit multi-view or polarization cues to address the most challenging specular or refractive conditions.

Taken together, these results suggest that HDCNet sets a new system-level baseline for perception-driven manipulation in non-Lambertian domains and provides a basis for further domain-robust extensions, particularly where annotated real-world data and real-time constraints matter (Xie et al., 10 Nov 2025).

References

Xie et al. (2025). HDCNet: A Hybrid Depth Completion Network for Grasping Transparent and Reflective Objects.