
Volume-DROID SLAM Framework

Updated 6 January 2026
  • Volume-DROID is a SLAM framework that integrates real-time trajectory estimation with learned volumetric mapping to produce dense, semantically annotated 3D maps.
  • It utilizes CNN-based dense feature extraction, recurrent pose and depth inference, and dense bundle adjustment to achieve robust performance.
  • The system operates at low latency (under 200 ms per frame) on commodity GPUs while processing monocular, stereo, or RGB-D inputs and fusing 2D segmentation into 3D volumes.

Volume-DROID is a real-time Simultaneous Localization and Mapping (SLAM) framework that integrates Differentiable Recurrent Optimization-Inspired Design (DROID-SLAM) for accurate trajectory estimation with learned volumetric mapping via Convolutional Bayesian Kernel Inference (ConvBKI). Processing monocular, stereo, or RGB-D camera inputs, Volume-DROID simultaneously estimates the robot’s camera trajectory and generates a dense, semantically annotated 3D volumetric map in real time, with open-source implementation available in Python (Stratton et al., 2023).

1. Pipeline Architecture and Data Flow

Volume-DROID ingests a stream of images or video frames that may be monocular, stereo, or RGB-D. The processing sequence is as follows:

  • Dense Feature Extraction: Each input frame $I_i$ passes through a dense CNN $f_\theta$, yielding per-pixel feature maps $F_i$.
  • Correlation-Volume Construction: For frame pairs $(i, j)$ in the active keyframe window, a 4D correlation volume is constructed as $C_{ij}(u, v) = \langle F_i(u), F_j(v) \rangle$, with a lookup operator $L_r$ for sampling arbitrary locations (a minimal sketch of this construction follows the list).
  • Recurrent Pose and Depth Inference (DROID-SLAM):
    • Initialize per-pixel optical flow $p^0_{ij}(u)$ and inverse depth $d^0_i(u)$.
    • For $t = 0, \ldots, T-1$:
    • Lookup and encode correlation context: $c^t_{ij}(u) = L_r(C_{ij}, p^{t}_{ij}(u))$, $F^t_{ij}(u) \leftarrow \mathrm{CNN}(c^t_{ij}(u), p^{t}_{ij}(u), d^t_i(u))$
    • ConvGRU update: $h^{t+1}_{ij}(u) = \mathrm{ConvGRU}(h^t_{ij}(u), F^t_{ij}(u))$
    • Predict flow correction $\Delta p^t_{ij}(u)$ and confidence $w^t_{ij}(u)$.
    • Dense Bundle Adjustment: Collect Jacobians and residuals to solve a single damped Gauss–Newton optimization for poses $\{T_i\}$ and depths $\{d_i\}$.
  • Point Cloud Generation: For RGB-D frames, depth pixels $u$ from frame $i$ are back-projected to world coordinates $x_i(u)$ using camera intrinsics $K$ and pose $T_i$.
  • 2D-to-3D Semantic Lifting: An off-the-shelf 2D semantic segmentation network computes per-pixel class probability vectors $s_i(u)$, which are inherited by the 3D points $x_i(u)$.
  • Volumetric Assimilation (ConvBKI): The environment is discretized into a 3D voxel grid. For each voxel $v$, semantic measurement histograms $M_v$ are accumulated, and Bayesian updates are performed using a learned 3D kernel.
  • Output/Visualization: The resulting trajectory $\{T_i\}$ is emitted as ROS TF frames, and the semantically labeled 3D voxel grid is visualized in RViz.
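As a concrete illustration of the correlation-volume step, the following PyTorch sketch builds $C_{ij}$ from two dense feature maps. The feature dimension, resolution, and function names are illustrative assumptions, not the Volume-DROID source.

```python
# A minimal sketch (not the Volume-DROID source): the 4D correlation volume
# C_ij(u, v) = <F_i(u), F_j(v)> between two dense per-pixel feature maps.
import torch

def correlation_volume(f_i: torch.Tensor, f_j: torch.Tensor) -> torch.Tensor:
    """f_i, f_j: (D, H, W) feature maps for frames i and j.
    Returns an (H, W, H, W) volume of dot products between all pixel pairs."""
    D, H, W = f_i.shape
    corr = f_i.reshape(D, H * W).t() @ f_j.reshape(D, H * W)  # (N, N) dot products
    return corr.reshape(H, W, H, W) / D**0.5                  # scale by sqrt(D)

# Example with random features standing in for the outputs of the CNN f_theta
f_i, f_j = torch.randn(2, 64, 48, 64).unbind(0)
C_ij = correlation_volume(f_i, f_j)   # the lookup operator L_r samples local
print(C_ij.shape)                     # windows of this volume during iteration
```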

2. DROID-SLAM Optimization Formulation

DROID-SLAM formulates the dense joint optimization of poses and depths as follows (notation adapted from DROID-SLAM (Teed et al., 2021)):

  • Photometric Residual:

$$r^I_{ij}(u) = I_j\left(\pi\left(T_j^{-1} T_i\, \pi^{-1}(u, d_i(u))\right)\right) - I_i(u)$$

  • Geometric (Inverse-Depth) Residual:

$$r^G_{ij}(u) = d_j\left(\pi\left(T_j^{-1} T_i\, \pi^{-1}(u, d_i(u))\right)\right) - d_i(u)$$

  • Total Loss:

$$E(\{T\}, \{d\}) = \sum_{(i, j) \in \text{Edges}} \sum_{u \in \Omega} w^I_{ij}(u) \, \rho_I\!\left(r^I_{ij}(u)\right) + w^G_{ij}(u) \, \rho_G\!\left(r^G_{ij}(u)\right)$$

where $\rho$ are robust penalty functions (e.g., L₁ or Huber), and $w^{I}, w^{G}$ are learned confidences.

  • Gauss–Newton Update:

$$(H + \lambda D) \, \delta = -g$$

$$x \leftarrow x^0 + \delta$$

where $H = J^T W J$ and $g = J^T W r(x^0)$. Updates are realized via a “Dense Bundle Adjustment” layer integrated with a ConvGRU for recurrent refinement.
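The damped Gauss–Newton step can be written compactly in PyTorch. The sketch below assumes the Jacobian $J$, diagonal weights $W$, and residuals $r$ are already assembled; in DROID-SLAM this solve is carried out by the differentiable dense bundle adjustment layer over a structured system, not the dense one shown here.

```python
# Minimal sketch of one damped Gauss-Newton step (H + lambda*D) delta = -g,
# with H = J^T W J and g = J^T W r(x0). Shapes and the dense solve are
# illustrative assumptions, not the DBA layer's actual implementation.
import torch

def damped_gn_step(J, W, r, x0, lam=1e-4):
    """J: (M, N) Jacobian, W: (M,) residual weights, r: (M,) residuals at x0,
    x0: (N,) current poses/depths (stacked). Returns the updated state."""
    JW = J * W.unsqueeze(1)                 # apply the diagonal weights row-wise
    H = JW.t() @ J                          # Gauss-Newton Hessian approximation
    g = JW.t() @ r
    D = torch.diag(torch.diag(H))           # Levenberg-Marquardt style damping
    delta = torch.linalg.solve(H + lam * D, -g)
    return x0 + delta
```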

3. Batched Point Cloud and Semantic Projection

RGB-D pixel data is back-projected efficiently:

  • Camera Intrinsics:

$$K = \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix}, \quad K_h = \begin{pmatrix} K & 0 \\ 0 & 1 \end{pmatrix}$$

  • Back-projection:

$$x_{\mathrm{cam}} = z \cdot K^{-1} [u, v, 1]^T$$

$$x_{\mathrm{world}} = T_i^{-1} [x_{\mathrm{cam}}; 1]$$

  • Vectorized Form:

$$X = \left(T_i^{-1} K_h^{-1}\right) U \odot [z_1 \ldots z_N]$$

where $U$ gathers the homogeneous pixel coordinates of all valid depth pixels and $z_1, \ldots, z_N$ are the corresponding depths.

Per-pixel class probabilities $s_i(u)$ (from the segmentation network) are retained with each 3D point $x_i(u)$. This allows efficient mapping from 2D semantic predictions to 3D spatial annotations in the voxel grid.
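A vectorized back-projection of this form is straightforward to express in PyTorch. The following sketch assumes metric depth, a (3, 3) intrinsics matrix, and a (4, 4) world-to-camera pose; names and conventions are illustrative assumptions rather than the project's exact code.

```python
# Minimal sketch: batched back-projection x_world = T_i^{-1} (z * K^{-1} [u, v, 1]^T)
# for a full RGB-D frame, using one matrix product per frame.
import torch

def backproject(depth: torch.Tensor, K: torch.Tensor, T_i: torch.Tensor) -> torch.Tensor:
    """depth: (H, W) metric depths, K: (3, 3) intrinsics, T_i: (4, 4) world-to-camera pose.
    Returns an (H*W, 3) point cloud in world coordinates."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).reshape(-1, 3).float()
    x_cam = (pix @ torch.linalg.inv(K).t()) * depth.reshape(-1, 1)    # z * K^{-1} [u, v, 1]
    x_cam_h = torch.cat([x_cam, torch.ones(H * W, 1)], dim=-1)        # homogeneous coordinates
    x_world = x_cam_h @ torch.linalg.inv(T_i).t()                     # apply T_i^{-1}
    return x_world[:, :3]

points = backproject(torch.rand(480, 640) * 5.0, torch.eye(3), torch.eye(4))
```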

4. Convolutional Bayesian Kernel Inference (ConvBKI)

The semantic measurement $M_v \in \mathbb{R}^C$ for voxel $v$ is assembled by summing the class-probability vectors of all points falling within $v$:

$$M_v = \sum_{i,\, u \,:\, x_i(u) \in v} s_i(u)$$

ConvBKI implements a learned Bayesian update, parametrized by a 3D convolutional kernel $K(\tau)$, over the histogram field. In log-probability space,

$$\log P_v^{\mathrm{new}} = \log P_v^{\mathrm{old}} + \sum_{\tau \in \Omega_k} K(\tau) \cdot M_{v + \tau}$$

where $\Omega_k$ denotes a selected voxel neighborhood (e.g., a $5 \times 5 \times 5$ window). The updated voxel probabilities are normalized:

$$P_v^{\mathrm{new}} \leftarrow \mathrm{softmax}\!\left( \log P_v^{\mathrm{new}} \right)$$

This ensures robust integration of spatial context and measurement uncertainty.
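Under these definitions, the ConvBKI update reduces to a depthwise 3D convolution over the histogram field followed by a log-space addition and a per-voxel softmax. The sketch below uses a reduced grid and random tensors purely for illustration; the kernel would be learned in practice, and Volume-DROID's actual layout may differ.

```python
# Minimal sketch of a ConvBKI-style update in log-probability space. Grid
# size, class count, and kernel size are illustrative (the paper's grid is
# larger, e.g., 200 x 200 x 100); the kernel K(tau) is learned in practice.
import torch
import torch.nn.functional as F

C, X, Y, Z, k = 10, 64, 64, 32, 5          # classes, voxel grid dims, kernel size
log_P = torch.zeros(1, C, X, Y, Z)         # uniform prior in log space
M = torch.rand(1, C, X, Y, Z)              # accumulated semantic histograms M_v
kernel = torch.rand(C, 1, k, k, k)         # one spatial kernel per class

# Depthwise conv3d: each class channel is aggregated over its Omega_k neighborhood,
# then added to the running log-posterior and renormalized per voxel.
log_P = log_P + F.conv3d(M, kernel, padding=k // 2, groups=C)
P = torch.softmax(log_P, dim=1)            # per-voxel class posteriors P_v
```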

5. Semantic Segmentation and Fusion in Volumetric Mapping

Volume-DROID leverages off-the-shelf 2D segmentation networks (e.g., dilated ConvNets) to produce softmax class-probabilities per pixel. Through back-projection, these vectors are assigned to corresponding 3D points. The ConvBKI operation fuses all such per-voxel semantic measurements in a Bayesian fashion, continuously refining per-voxel class posteriors. The prior for each voxel is initialized as uniform and updated as new frames arrive, yielding consistent probabilistic semantic labeling in the 3D map.
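The per-voxel accumulation itself can be implemented as a scatter-add of the per-point probability vectors into the voxel grid, as in the sketch below; voxel size, grid origin, grid dimensions, and names are assumptions for illustration.

```python
# Minimal sketch: accumulate per-point class probabilities s_i(u) into the
# per-voxel semantic histograms M_v consumed by ConvBKI.
import torch

def accumulate_histograms(points, probs, grid_dims=(64, 64, 32),
                          voxel_size=0.2, origin=(0.0, 0.0, 0.0)):
    """points: (N, 3) world coordinates, probs: (N, C) softmax class probabilities.
    Returns M: (X, Y, Z, C) with M_v = sum of s_i(u) over points in voxel v."""
    X, Y, Z = grid_dims
    C = probs.shape[1]
    idx = ((points - torch.tensor(origin)) / voxel_size).long()        # voxel index per point
    valid = ((idx >= 0) & (idx < torch.tensor(grid_dims))).all(dim=1)  # drop out-of-grid points
    idx, probs = idx[valid], probs[valid]
    flat = idx[:, 0] * Y * Z + idx[:, 1] * Z + idx[:, 2]               # linearized voxel ids
    M = torch.zeros(X * Y * Z, C)
    M.index_add_(0, flat, probs)                                       # scatter-add per voxel
    return M.reshape(X, Y, Z, C)
```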

6. System Implementation and Real-Time Performance

The software stack is built in Python with PyTorch, utilizing ROS Noetic for messaging and RViz for visualization. The system is containerized using Docker, including a GUI interface via noVNC.

Critical operations—feature extraction, correlation computation, ConvGRU recurrent updates, and 3D convolutions for ConvBKI—are executed on the GPU. Batched point cloud computation leverages large matrix multiplications for high throughput. Asynchronous ROS callbacks, with separate “SLAM” and “mapping” threads, decouple image ingestion from optimization and mapping. Keyframe window size is limited (8–12 frames) and FP16 precision is used for efficiency.
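A minimal sketch of that decoupling is shown below: an image callback feeds a bounded queue, and a separate worker thread runs the GPU-heavy optimization. The ROS wiring, message types, and function names are placeholder assumptions, not the project's actual API.

```python
# Minimal sketch: asynchronous ingestion decoupled from SLAM/mapping work via
# a bounded queue and a background thread (placeholder callables, not the real API).
import queue
import threading

frames = queue.Queue(maxsize=12)          # bounded, like the keyframe window

def image_callback(frame):
    """Called from the ROS subscriber; never blocks the message loop."""
    if not frames.full():
        frames.put(frame)                 # drop frames if the optimizer lags behind

def slam_worker(run_slam, publish_pose):
    """Runs DROID-SLAM updates and hands results to the mapping/visualization side."""
    while True:
        pose = run_slam(frames.get())     # GPU-side pose/depth update
        publish_pose(pose)                # e.g., emitted as a ROS TF frame

threading.Thread(target=slam_worker, args=(lambda f: f, print), daemon=True).start()
```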

Observed performance using a server with 4×RTX 3090 GPUs includes:

  • DROID-SLAM inference: ≈10 Hz
  • Point cloud and ConvBKI update: ≈5 Hz (on a $200 \times 200 \times 100$ voxel grid)
  • End-to-end pipeline: ≈5 Hz, <200 ms latency per frame
  • Volumetric mapping introduces <5% overhead compared to DROID-SLAM alone once ConvBKI is fully initialized on the GPU.

7. Quantitative Evaluation and Accuracy

On the TartanAir “neighborhood” subset (using ground-truth segmentation as a stand-in for real 2D networks), measured metrics are:

| Metric | Value |
| --- | --- |
| Absolute Trajectory Error (ATE) | 0.01755 m |
| Relative Pose Error (RPE) | 0.003769 m / 0.06087 rad (translation / rotation) |
| KITTI sub-trajectory score | 0.01088 m / 0.002345 rad |
| Semantic-map accuracy | Patchy (due to class-mapping and synthetic segmentation) |
| Runtime (4 × RTX 3090) | ≈5 Hz overall, <200 ms latency per frame |

These results suggest competitive real-time localization and mapping performance, with semantic map quality likely to improve when using a trained 2D segmentation model (Stratton et al., 2023).


Volume-DROID exemplifies a tightly coupled integration of learned recurrent Gauss–Newton SLAM algorithms and differentiable volumetric mapping. The mathematical foundations—joint photometric/geometric registration, efficient batched 3D point cloud generation, and convolutional Bayesian semantic updates—enable accurate and low-latency SLAM with semantic labeling on commodity GPU hardware.
