3D Gated Recurrent Fusion Net (GRFNet)

Updated 19 May 2026

The paper introduces a neural architecture that uses a 3D convolutional GRU to fuse RGB and depth features for enhanced semantic scene completion.
It employs dual 2D–3D backbones and a GRF module to recurrently integrate spatial features, demonstrating improved IoU metrics on NYU and NYUCAD datasets.
The multi-stage fusion strategy captures both low- and high-level inter-modal dependencies, offering robust performance for complex 3D sensor fusion tasks.

A 3D Gated Recurrent Fusion Network (GRFNet) is a neural architecture designed for adaptive multimodal data fusion, particularly in the context of semantic scene completion (SSC) and, in closely related work, temporal sensor fusion for sequential decision tasks. GRFNet leverages recurrent gating mechanisms to selectively integrate spatial or multimodal features over stages or time, enabling the network to model complex inter-modal and inter-stage correlations. The GRFNet architecture was introduced for RGB-D semantic scene completion, where it fuses texture cues from RGB and geometric cues from depth using a three-dimensional convolutional GRU within a 2D-to-3D convolutional backbone (Liu et al., 2020). Related gating and recurrent fusion designs have also been used for sequential sensor fusion in autonomous driving (Narayanan et al., 2019).

1. Architectural Overview

The 3D Gated Recurrent Fusion Network (GRFNet) extends conventional encoder-decoder pipelines for semantic scene completion by introducing a recurrent fusion unit explicitly designed for volumetric feature fusion. The architecture comprises:

Two-branch feature extractors: Parallel 2D–3D convolutional backbones process RGB and depth modalities independently. Features are projected from 2D space to a common 3D voxel grid, facilitating elementwise correspondence.
GRF fusion block: At selected layers, the network applies a 3D convolutional GRU-based module to adaptively integrate features from both branches, using learned reset and update gates to control the memory content.
Multi-stage fusion: Fusion is implemented either at a single semantic level or recursively over multiple stages (typically four), capturing both low-level and high-level inter-modal dependencies.
Prediction head: A lightweight 3D Atrous Spatial Pyramid Pooling (ASPP) module and output convolutions produce voxel-wise predictions for occupancy and semantic class.
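The 2D-to-3D projection step in the list above can be pictured as a scatter of per-pixel features into voxels. The following is a minimal NumPy sketch, not the authors' implementation: the function name `project_2d_to_3d` and the precomputed `voxel_index` map (which in practice would come from depth and camera parameters) are assumptions made for illustration.

```python
import numpy as np

def project_2d_to_3d(feat2d, voxel_index, grid_shape):
    """Scatter per-pixel features into a voxel grid (illustrative sketch).

    feat2d:      (H_img, W_img, C) feature map from a 2D backbone.
    voxel_index: (H_img, W_img) flat voxel index for each pixel, or -1 when
                 the pixel has no valid depth or falls outside the grid.
    grid_shape:  (D, H, W) of the target voxel grid.
    """
    d, h, w = grid_shape
    c = feat2d.shape[-1]
    volume = np.zeros((d * h * w, c), dtype=feat2d.dtype)
    flat_idx = voxel_index.reshape(-1)
    valid = flat_idx >= 0
    # If several pixels map to the same voxel, only one of them is kept.
    volume[flat_idx[valid]] = feat2d.reshape(-1, c)[valid]
    return volume.reshape(d, h, w, c)

# Tiny example: a 2x2 image with 3 channels scattered into a 2x2x2 grid.
feat = np.arange(12, dtype=float).reshape(2, 2, 3)
idx = np.array([[0, -1], [5, 7]])
vol = project_2d_to_3d(feat, idx, (2, 2, 2))
```

Once both modalities are scattered into the same grid, voxel `(d, h, w)` of the RGB volume and of the depth volume refer to the same location, which is what makes elementwise fusion meaningful.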

In the driving behavior context of LGRF (Late Gated Recurrent Fusion) (Narayanan et al., 2019), analogous gated fusion operates on sequential embeddings of video, LiDAR, and CAN/odometry signals using scalar fusion gates and parallel LSTM cells.

2. GRF Fusion Block: Mechanism and Formulation

The central mechanism of GRFNet is the GRF fusion block, which generalizes a GRU to three dimensions and acts on spatially aligned volumetric features. At each fusion step $p$ :

The block accepts as input the feature map $f_{p}$ (from either RGB or depth branch) and the previous hidden state $h_{p}$ .
Gating equations (using $*$ for 3D convolution, $[\cdot ; \cdot]$ for channel-wise concatenation, $\odot$ for Hadamard product):
\begin{align*}
r_p &= \sigma(W_r * [f_p; h_p]) \\
z_p &= \sigma(W_z * [f_p; h_p]) \\
\tilde{h}_p &= \tanh(W_h * [f_p; r_p \odot h_p]) \\
h_{p+1} &= z_p \odot h_p + (1 - z_p) \odot \tilde{h}_p
\end{align*}
$W_r$ , $W_z$ , $W_h$ are $3 \times 3 \times 3$ convolutions with $f_{p}$ 0, $f_{p}$ 1.

This recurrent unit is shared across fusion stages, enabling inter-stage correlations to be captured with minimal parameter overhead. Features at increasing semantic abstraction are fused recurrently, with the hidden state transmitting inter-level and inter-modal context.
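The gating equations and the weight sharing described above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions rather than the published implementation: stride 1 with zero padding, a zero initial hidden state, the depth-then-RGB input order, and the small channel count and grid size are all assumptions chosen to keep the example runnable.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv3d_same(x, w):
    """3x3x3 convolution with stride 1 and zero padding (both assumed).

    x: (D, H, W, C_in) feature volume; w: (3, 3, 3, C_in, C_out) kernel.
    """
    d, h, wd, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (1, 1), (0, 0)))
    out = np.zeros((d, h, wd, w.shape[-1]))
    for i in range(3):
        for j in range(3):
            for k in range(3):
                # Shifted view of the input times the kernel slice at this offset.
                out += xp[i:i + d, j:j + h, k:k + wd, :] @ w[i, j, k]
    return out

def grf_step(f, h, W_r, W_z, W_h):
    """One fusion step: returns h_{p+1} from the input f_p and the state h_p."""
    fh = np.concatenate([f, h], axis=-1)
    r = sigmoid(conv3d_same(fh, W_r))   # reset gate
    z = sigmoid(conv3d_same(fh, W_z))   # update gate
    h_tilde = np.tanh(conv3d_same(np.concatenate([f, r * h], axis=-1), W_h))
    return z * h + (1.0 - z) * h_tilde

rng = np.random.default_rng(0)
C, grid = 4, (6, 5, 6)                  # toy sizes, not the paper's
W_r, W_z, W_h = (0.1 * rng.standard_normal((3, 3, 3, 2 * C, C)) for _ in range(3))
f_depth = rng.standard_normal(grid + (C,))
f_rgb = rng.standard_normal(grid + (C,))

h = np.zeros(grid + (C,))               # assumed zero initial state
for f in (f_depth, f_rgb):              # assumed input order; same weights each step
    h = grf_step(f, h, W_r, W_z, W_h)
```

Because the same three kernels are reused at every step, adding further fusion steps or stages in this sketch adds compute but no new GRU weights, which mirrors the parameter-sharing argument in the text.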

In the temporal sensor fusion regime, LGRF analogously introduces scalar fusion gates over modality embeddings, with per-sensor LSTMs and global state summation, as detailed in the respective equations in (Narayanan et al., 2019).

3. Layerwise Architecture, Feature Dimensions, and Fusion Strategies

The GRFNet layerwise processing, for each modality, involves:

Module	Operation	Output Size (D×H×W×C)	Notes
2D DDR backbone	3×3 conv ×2	640×480×8 (2D)	Double DDR in image space
2D→3D projection	-	240×144×240×8	Projected to voxel grid
3D Downsample	3×3 conv	120×72×120×16	Max-pool+stride-2 conv
3D DDR	3×3 conv	120×72×120×16, ...	Deepened voxel features
3D GRF block	GRF fusion	60×36×60×64	Single/multi-stage, after 2–4 scales
LW-ASPP	Bottleneck+ASPP	60×36×60×320	Multi-dilated 3D DDRs & global pooling
Output head	1×1×1 conv, softmax	60×36×60×12	11 object classes + empty
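The spatial sizes in the table are consistent with two successive halvings of the 240×144×240 voxel grid. The short check below only restates that arithmetic; the helper name `halve` is hypothetical.

```python
def halve(shape):
    """Halve each spatial dimension, as a stride-2 stage would."""
    return tuple(s // 2 for s in shape)

full = (240, 144, 240)            # projected voxel grid from the table
half = halve(full)                # size after the 3D downsample row
quarter = halve(half)             # size at the GRF block and output head

voxels_out = quarter[0] * quarter[1] * quarter[2]
scores_out = voxels_out * 12      # 11 object classes + empty per voxel
print(half, quarter, voxels_out, scores_out)
```

Under these sizes the output head produces one 12-way score vector for each of 129,600 voxels.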

Several alternative fusion strategies are benchmarked for comparison:

Sum fusion: $f = f^{\mathrm{rgb}} + f^{\mathrm{d}}$ (voxelwise sum of the RGB and depth features)
Max fusion: voxelwise maximum
Concatenation + projection: $[f^{\mathrm{rgb}}; f^{\mathrm{d}}]$ then pointwise conv
Gated weighting: $g = \sigma(W_g * [f^{\mathrm{rgb}}; f^{\mathrm{d}}])$, $f = g \odot f^{\mathrm{rgb}} + (1 - g) \odot f^{\mathrm{d}}$
Conv-LSTM fusion: convolutional LSTM block
GRF (proposed): convolutional GRU block as above
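For comparison, the simpler baselines in the list above can be written in a few lines each. This is a hedged NumPy sketch: the function names are hypothetical, the pointwise convolution is expressed as a matrix product over the channel axis, and the exact form of the gated-weighting baseline (a sigmoid gate computed from the concatenated features) is an assumption rather than something confirmed by the source.

```python
import numpy as np

def sum_fusion(f_rgb, f_d):
    return f_rgb + f_d

def max_fusion(f_rgb, f_d):
    return np.maximum(f_rgb, f_d)          # voxelwise, channelwise maximum

def concat_fusion(f_rgb, f_d, w_proj):
    """Concatenate channels, then project back with a pointwise (1x1x1) conv.

    w_proj: (2C, C) weight matrix of the pointwise convolution.
    """
    return np.concatenate([f_rgb, f_d], axis=-1) @ w_proj

def gated_fusion(f_rgb, f_d, w_gate):
    """Assumed form: g = sigmoid(pointwise conv of [f_rgb; f_d])."""
    g = 1.0 / (1.0 + np.exp(-(np.concatenate([f_rgb, f_d], axis=-1) @ w_gate)))
    return g * f_rgb + (1.0 - g) * f_d

# One voxel with two channels per modality.
a = np.array([[1.0, -2.0]])
b = np.array([[3.0, 0.5]])
fused_sum = sum_fusion(a, b)
fused_max = max_fusion(a, b)
```

None of these baselines carries state from one fusion stage to the next, which is the capability the recurrent GRF block adds.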

Multi-stage fusion is performed by recurring application of the GRF block over progressively deeper (abstract) feature levels, sharing all parameters (Liu et al., 2020).

4. Training Procedures and Loss Functions

Semantic scene completion is formulated as a dense 3D semantic labeling problem with a single cross-entropy loss over all voxels in the camera frustum: $\mathcal{L} = -\sum_{i \in \mathcal{V}} \sum_{c} w_c \, y_{i,c} \log \operatorname{softmax}(p_i)_c$, where $\mathcal{V}$ is the set of voxels, $y_i$ is the one-hot ground truth, $p_i$ is the predicted logit, and $w_c$ are class weights to mitigate imbalance (the weight for the empty class is initially set to 0.05 and increased by 0.05 every 40 epochs).
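A class-weighted cross-entropy with the empty-class weight schedule described above might look like the following NumPy sketch. Several details are assumptions for illustration: the empty class is taken to be index 0, all other classes get weight 1, the loss is summed rather than averaged over voxels, and the helper names are hypothetical.

```python
import numpy as np

def empty_class_weight(epoch, start=0.05, step=0.05, every=40):
    """Empty-class weight: starts at 0.05 and grows by 0.05 every 40 epochs."""
    return start + step * (epoch // every)

def weighted_cross_entropy(logits, labels, class_weights):
    """Class-weighted softmax cross-entropy, summed over voxels.

    logits: (N, C) per-voxel scores; labels: (N,) integer class ids;
    class_weights: (C,) per-class weights.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    picked = log_probs[np.arange(len(labels)), labels]
    return float(-(class_weights[labels] * picked).sum())

# Two voxels, three classes; class 0 plays the role of "empty" (assumed).
weights = np.array([empty_class_weight(0), 1.0, 1.0])
loss = weighted_cross_entropy(np.zeros((2, 3)), np.array([0, 1]), weights)
```

With uniform logits each voxel contributes log 3 times its class weight, so the down-weighted empty voxel contributes only 5% of what the occupied voxel does at the start of training.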

Training setup:

Datasets: NYU (v1) and NYUCAD (795 train / 654 test, 11 classes + empty voxel)
Optimizer: SGD with momentum 0.9, weight decay $h_{p}$ 2
Learning rate: initialized at 0.01, multiplied by $h_{p}$ 3 every 10 epochs
Batch size: 4
End-to-end training for ~50 epochs; no geometric augmentation described
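The step-wise learning-rate schedule in the setup above can be sketched as below. The decay factor is not recoverable from this text, so it is left as a required argument `gamma`; the value 0.1 used in the example is purely illustrative and not taken from the source. The function name `step_lr` is hypothetical.

```python
def step_lr(epoch, gamma, base_lr=0.01, every=10):
    """Learning rate after multiplying base_lr by gamma once per `every` epochs."""
    return base_lr * gamma ** (epoch // every)

# Illustrative only: gamma=0.1 is an assumed value, not the paper's setting.
schedule = [step_lr(e, gamma=0.1) for e in (0, 9, 10, 25, 49)]
```

Over a run of about 50 epochs, any such schedule applies the decay four times after the initial rate of 0.01.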

The driving behavior LGRF employs Adam optimizer with $h_{p}$ 4 (classification) or $h_{p}$ 5 (regression), sequence length 90 for classification (truncated BPTT), or 4 for regression, and early stopping via mAP plateau (Narayanan et al., 2019).

5. Experimental Results and Benchmarks

Quantitative results for GRFNet on semantic scene completion (IoU):

Method	NYU SC IoU	NYU SSC avg IoU	NYUCAD SC IoU	NYUCAD SSC avg IoU
SSCNet	55.1	24.7	73.2	40.0
DDR-SSC	61.0	30.4	79.4	42.8
LSTM fusion	59.6	—	—	—
GRFNet	61.2	32.9	80.1	45.3

Key findings:

Single-stage GRF outperforms sum, max, concatenate, and gated-weighting fusions by 1.5–2.6% IoU on NYU, and multi-stage GRF further adds 1.1% SC IoU and 1.9% SSC IoU over single-stage.
Per-class IoUs demonstrate GRFNet achieves best or second-best class-wise IoU in nearly all categories.
For driving behavior, LGRF shows a +10% mAP improvement on HDD (classification) and ≈20% drop in MSE on TORCS steering regression over best prior fusion baselines (Narayanan et al., 2019).
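The SC and SSC IoU figures quoted in this section follow the usual intersection-over-union definition: SC IoU treats every non-empty voxel as occupied, while SSC IoU is averaged over object classes. The sketch below is a simplified illustration; it assumes label 0 denotes empty, skips classes absent from both prediction and ground truth, and omits the evaluation mask that benchmark protocols typically apply.

```python
import numpy as np

def sc_iou(pred, gt, empty=0):
    """Scene-completion IoU on binary occupancy (non-empty vs empty)."""
    p, g = pred != empty, gt != empty
    union = np.logical_or(p, g).sum()
    return float(np.logical_and(p, g).sum() / union) if union else float("nan")

def ssc_mean_iou(pred, gt, num_classes=12, empty=0):
    """Mean per-class IoU over the non-empty classes present in pred or gt."""
    ious = []
    for c in range(num_classes):
        if c == empty:
            continue
        union = np.logical_or(pred == c, gt == c).sum()
        if union:
            ious.append(np.logical_and(pred == c, gt == c).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")

pred = np.array([0, 1, 1, 2, 2, 0])
gt = np.array([0, 1, 2, 2, 2, 1])
print(sc_iou(pred, gt), ssc_mean_iou(pred, gt))
```

In this toy case the occupancy IoU is 4/5 while the class-averaged IoU is the mean of 1/3 and 2/3, showing how a model can complete geometry well and still lose points on labeling.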

6. Complexity, Limitations, and Future Directions

GRFNet's parameter count and computational cost scale with the number of fusion stages:

Four-stage GRFNet: ≈820 K parameters, 713 G FLOPs (about 193 G per fusion stage, and +9 K params per stage due to weight sharing).
The intermediate feature volume (60×36×60×64) and the ASPP module fit in ~4 GB of GPU memory at batch size 4.
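To put the memory figure in perspective, the size of a single activation tensor can be estimated directly from its shape. The sketch below assumes 32-bit floats and counts only one forward activation per tensor; gradients, intermediate gate activations, and optimizer state add to the total. The helper name `tensor_mib` is hypothetical.

```python
def tensor_mib(shape, batch=4, bytes_per_value=4):
    """Size in MiB of one activation tensor of the given shape (float32 assumed)."""
    n = batch
    for s in shape:
        n *= s
    return n * bytes_per_value / 2**20

fused = tensor_mib((60, 36, 60, 64))    # fused feature volume
aspp = tensor_mib((60, 36, 60, 320))    # LW-ASPP output volume
print(fused, aspp)
```

At batch size 4 these come to roughly 127 MiB and 633 MiB respectively, so a few dozen such tensors held for backpropagation would account for a budget on the order of a few gigabytes.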

Noted limitations include:

High computational cost of full 3D convolutions in GRF blocks.
Single softmax loss couples geometric scene completion and semantic labeling, which could limit task-specific specialization.
No geometric data augmentation is used; augmenting with 3D rotations/flipping may improve generalization.

Future work is suggested to:

Factorize the GRF block's $3 \times 3 \times 3$ convolutions using a light DDR scheme.
Develop more sophisticated loss terms (geometry reconstruction, decoupled task objectives).
Incorporate data augmentation for improved robustness (Liu et al., 2020).

GRFNet and LGRF represent state-of-the-art approaches that integrate gating and recurrent mechanisms into the sensor and feature fusion pipeline for both spatial (3D) and sequential (temporal) reasoning. GRFNet generalizes previous fusion strategies (sum, max, gating, LSTM) by leveraging gated recurrent memory over multi-scale 3D features. In the sensor fusion literature, the LGRF model introduced the idea of modality-specific LSTMs with scalar fusion gates, outperforming both early integration and temporal convolutional baselines for driver behavior and control prediction tasks.

These methodologies provide improved flexibility and accuracy for multimodal learning problems where complementary cues from structure, appearance, and dynamics must be adaptively integrated (Narayanan et al., 2019, Liu et al., 2020).


References (2)

Liu et al. (2020). 3D Gated Recurrent Fusion for Semantic Scene Completion.

Narayanan et al. (2019). Sensor Fusion: Gated Recurrent Fusion to Learn Driving Behavior from Temporal Multimodal Data.
