RADIO-ViPE: Unified Semantic SLAM

Updated 3 May 2026

RADIO-ViPE is an online semantic SLAM system that fuses visual and language embeddings to generate precise 3D maps from uncalibrated monocular video streams.
It employs a sliding-window factor graph and joint optimization, integrating depth, pose, and multi-modal features for accurate spatial representations.
Adaptive robust kernels are used to handle dynamic elements, ensuring temporally consistent mapping and state-of-the-art performance in challenging environments.

RADIO-ViPE (Reduce All Domains Into One: Video Pose Engine) is an online semantic SLAM system designed for geometry-aware, open-vocabulary grounding in dynamic environments using unconstrained monocular RGB video streams. Unlike prior approaches that require calibrated RGB-D inputs or pre-initialized camera poses, RADIO-ViPE operates without prior knowledge of camera intrinsics and without depth sensors or calibration targets. The system tightly fuses multi-modal visual and language embeddings, obtained from agglomerative foundation models such as RADIO, with continually optimized geometric scene information. This coupling is present during initialization, during optimization, and in the structure of the underlying factor graph, resulting in a temporally consistent and robust 3D semantic map capable of associating arbitrary language queries with precise spatial locations (Nasser et al., 28 Apr 2026).

1. Design and Architecture

RADIO-ViPE processes an uncalibrated monocular RGB video stream as input and produces a metric 3D point (or voxel) map. Each 3D point is endowed with a compressed, dense visual-language embedding, enabling open-vocabulary semantic grounding for queries such as “chair,” “mug,” or arbitrary natural-language expressions.

The architecture consists of the following high-level components:

Camera Initialization: Intrinsics $\mathbf{K} = [f_x, f_y, c_x, c_y]$ are bootstrapped via GeoCalib over a sparse subset of input frames, then refined along with poses and disparities in joint bundle adjustment (BA). No specialized hardware or calibration patterns are used.
Keyframe Selection and Graph Construction: Dense optical flow (using DROID-SLAM) identifies candidate frames for keyframing based on parallax thresholds. These keyframes populate a sliding-window factor graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, with additional co-visibility edges inserted when the global RADIO descriptor similarity exceeds a cosine threshold $\eta$.
Multi-Modal Feature Extraction: Each keyframe yields dense feature grids $\mathbf{Z}_i \in \mathbb{R}^{K \times H' \times W'}$ via the RADSeg segmentation head operating on the foundation RADIO model. Features are downsampled to $H'=H/8,\,W'=W/8$ and post-processed with PCA to $D=256$ dimensions.
Depth Estimation: An off-the-shelf monocular metric depth model (e.g., UniDepth, MoGe, Metric3DV2) produces inverse depth predictions per pixel, which are refined in downstream optimization.
Semantic Flow Initialization: Semantic correspondences are established by blending geometric reprojection priors with RADIO embedding similarity, according to the formula

$\boldsymbol\Omega^{\rm prior}(\mathbf{u}) = \beta\,\boldsymbol\Omega^{\rm geo}(\mathbf{u}) + (1-\beta)\,\boldsymbol\Omega^{\rm sem}(\mathbf{u})\,.$

Joint Bundle Adjustment: Poses $\{\mathbf{T}_i\}$, intrinsics $\mathbf{K}$, and per-pixel disparities $\{d_i\}$ are co-optimized by minimizing a composite energy containing photometric, embedding-similarity, and depth-prior terms.
Non-Keyframe Pose Estimation: Intermediate frames are photometrically aligned to their nearest keyframes with parallelized pose updates to manage computational cost.
Open-Vocabulary Grounding: The 3D map's RADIO embeddings are projected into SigLip latent space. Candidate text queries are embedded with the SigLip text encoder, and 3D locations are retrieved through cosine similarity thresholding.
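
The blended prior from the Semantic Flow Initialization step can be sketched as a per-pixel convex combination. This is a minimal illustration, not the authors' implementation: the function name `blend_prior`, the treatment of $\boldsymbol\Omega^{\rm geo}$ and $\boldsymbol\Omega^{\rm sem}$ as arrays of equal shape, and the example values are all assumptions, since the source does not state what form these quantities take.

```python
import numpy as np

def blend_prior(omega_geo, omega_sem, beta):
    """Per-pixel blend: beta * geometric prior + (1 - beta) * semantic prior.

    Assumption: both priors are arrays of the same shape, and beta is a
    scalar in [0, 1]. Neither detail is specified in the source text.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    omega_geo = np.asarray(omega_geo, dtype=np.float64)
    omega_sem = np.asarray(omega_sem, dtype=np.float64)
    if omega_geo.shape != omega_sem.shape:
        raise ValueError("both priors must have the same shape")
    return beta * omega_geo + (1.0 - beta) * omega_sem

# Hypothetical 2x3 prior maps with constant values.
omega_geo = np.full((2, 3), 0.8)
omega_sem = np.full((2, 3), 0.4)
omega_prior = blend_prior(omega_geo, omega_sem, beta=0.75)
```

With $\beta = 1$ the prior is purely geometric, and with $\beta = 0$ it is purely semantic; intermediate values trade one cue against the other.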

2. Multi-Modal Embedding Fusion

The RADIO-ViPE pipeline tightly integrates visual, linguistic, and geometric signal pathways at all stages:

Visual Embedding: At pixel location $\mathbf{u}$ of keyframe $i$, the embedding is the feature vector of $\mathbf{Z}_i$ at that location.
Language Embedding: Text queries are mapped to embedding vectors through the SigLip text encoder.

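For spatial-temporal alignment, feature-matched projections are computed using bundle-adjusted pose estimates and disparities, with the projected coordinates given by

$\mathbf{u}_{ij} = \Pi_{\mathbf{K}}\!\left(\mathbf{T}_{ij}\,\Pi_{\mathbf{K}}^{-1}\big(\mathbf{u},\, d_i(\mathbf{u})\big)\right),$

where $\Pi_{\mathbf{K}}$ is the pinhole projection with intrinsics $\mathbf{K}$, $\Pi_{\mathbf{K}}^{-1}$ back-projects pixel $\mathbf{u}$ using its disparity $d_i(\mathbf{u})$, and $\mathbf{T}_{ij}$ is the relative pose from keyframe $i$ to keyframe $j$.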
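
As a rough illustration of projecting a pixel from one keyframe into another, the sketch below uses a standard pinhole model with inverse depth. The function name `reproject`, the 4x4 homogeneous matrix convention for the relative pose, and the numeric values are assumptions made for this example; the source does not specify them.

```python
import numpy as np

def reproject(u, v, disparity, K, T_ij):
    """Project pixel (u, v) of frame i into frame j.

    K     : (fx, fy, cx, cy) pinhole intrinsics.
    T_ij  : 4x4 homogeneous transform taking frame-i camera coordinates to
            frame-j camera coordinates (an assumed convention).
    disparity is treated as inverse depth, so depth = 1 / disparity.
    """
    fx, fy, cx, cy = K
    depth = 1.0 / disparity
    # Back-project the pixel to a 3D point in frame i (homogeneous).
    p_i = np.array([(u - cx) / fx * depth, (v - cy) / fy * depth, depth, 1.0])
    # Move the point into frame j.
    p_j = T_ij @ p_i
    # Perspective projection back to pixel coordinates.
    u_j = fx * p_j[0] / p_j[2] + cx
    v_j = fy * p_j[1] / p_j[2] + cy
    return u_j, v_j

K = (500.0, 500.0, 320.0, 240.0)          # hypothetical intrinsics
T_identity = np.eye(4)
T_shift = np.eye(4)
T_shift[0, 3] = 0.1                        # 0.1 units along the x axis

same = reproject(320.0, 240.0, disparity=0.5, K=K, T_ij=T_identity)
moved = reproject(320.0, 240.0, disparity=0.5, K=K, T_ij=T_shift)
```

Because the pixel depends on both the intrinsics and the disparity, errors in either one shift the projected location, which is why all three quantities are refined together in bundle adjustment.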

Cosine similarity of features is defined as

$\operatorname{sim}(\mathbf{a}, \mathbf{b}) = \dfrac{\mathbf{a}^\top \mathbf{b}}{\|\mathbf{a}\|_2\,\|\mathbf{b}\|_2}\,.$

Global descriptors for keyframes are computed by mean-pooling and $\ell_2$-normalization; high-similarity non-adjacent frames (cosine similarity above $\eta$) are linked in the factor graph, reinforcing multi-view semantic consistency.
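
The descriptor pooling and edge insertion described above might look like the following sketch. The helper names `global_descriptor` and `covisibility_edges`, the `min_gap` parameter that defines "non-adjacent", and the toy feature grids are hypothetical and only serve to illustrate the idea.

```python
import numpy as np

def global_descriptor(feature_grid):
    """Mean-pool a (D, H, W) feature grid over space, then l2-normalize."""
    pooled = feature_grid.reshape(feature_grid.shape[0], -1).mean(axis=1)
    return pooled / (np.linalg.norm(pooled) + 1e-12)

def covisibility_edges(descriptors, eta, min_gap=2):
    """Return index pairs (i, j) with j - i >= min_gap whose cosine
    similarity exceeds eta. min_gap is an assumed notion of non-adjacency."""
    D = np.stack(descriptors)
    sim = D @ D.T                      # descriptors are unit-norm
    n = len(descriptors)
    return [(i, j)
            for i in range(n)
            for j in range(i + min_gap, n)
            if sim[i, j] > eta]

def constant_grid(channel_values, h=2, w=2):
    """Toy (D, h, w) grid whose channels are constant over space."""
    c = np.asarray(channel_values, dtype=np.float64)
    return np.ones((c.size, h, w)) * c[:, None, None]

# Four toy keyframes; frames 0 and 3 look alike, the others do not.
grids = [constant_grid(c) for c in
         ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.1, 0.0])]
descriptors = [global_descriptor(g) for g in grids]
edges = covisibility_edges(descriptors, eta=0.9)
```

Since the descriptors are unit vectors, the matrix product gives all pairwise cosine similarities in one step.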

3. Joint Optimization in Factor Graphs

The multi-modal factor graph formulation encompasses:

Pose Variables: $\mathbf{T}_i$ in $SE(3)$
Disparities: $d_i$
Camera Intrinsics: $\mathbf{K} = [f_x, f_y, c_x, c_y]$

For each edge $(i, j) \in \mathcal{E}$, three key factor types are jointly optimized:

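Photometric-Flow Term: a per-pixel residual that compares the dense flow correspondence between keyframes $i$ and $j$ with the projected coordinates.

Embedding-Similarity Term: a per-pixel residual that penalizes low cosine similarity between the embedding at a pixel in keyframe $i$ and the embedding at its projected location in keyframe $j$.

Depth-Prior Regularizer: a per-pixel residual that keeps each optimized disparity $d_i$ close to the inverse depth predicted by the monocular depth model.

The global objective sums these three terms over all edges of the factor graph and is minimized jointly over poses, disparities, and intrinsics. Each non-linear residual is wrapped by Barron's general robust loss $\rho(\cdot\,;\alpha)$ with a per-pixel shape parameter $\alpha$, and weighted according to the adaptive kernel regime. Initialization is calibration-free through GeoCalib-initialized intrinsics, foundation-model disparities, and DROID-SLAM geometric flows.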
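
For reference, Barron's general robust loss can be written compactly in code. The sketch below follows the published closed form with its usual special cases; the scale parameter `c` comes from Barron's formulation rather than from this article, and the function handles a scalar shape only, whereas a per-pixel shape would need a vectorized variant.

```python
import numpy as np

def barron_loss(x, alpha, c=1.0):
    """General robust loss rho(x; alpha, c) for a scalar shape alpha.

    alpha = 2     -> quadratic (L2) loss
    alpha = 1     -> smooth L1 (Charbonnier / pseudo-Huber) loss
    alpha = 0     -> Cauchy (Lorentzian) loss
    alpha = -inf  -> Welsch loss
    """
    x = np.asarray(x, dtype=np.float64)
    z = (x / c) ** 2
    if alpha == 2.0:
        return 0.5 * z
    if alpha == 0.0:
        return np.log1p(0.5 * z)
    if np.isneginf(alpha):
        return 1.0 - np.exp(-0.5 * z)
    b = abs(alpha - 2.0)
    return (b / alpha) * ((z / b + 1.0) ** (alpha / 2.0) - 1.0)

# The same large residual is penalized less as alpha decreases.
residual = 10.0
loss_l2 = barron_loss(residual, 2.0)
loss_smooth_l1 = barron_loss(residual, 1.0)
loss_cauchy = barron_loss(residual, 0.0)
```

Lower shape values flatten the loss for large residuals, which is what lets outlier pixels contribute less to the optimization.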

4. Handling Dynamics via Adaptive Robust Kernels

RADIO-ViPE departs from explicit dynamic segmentation by employing adaptive, temporally consistent robust loss kernels for every per-pixel residual:

Temporal Stability Field: each pixel is assigned a temporal stability score. Static surfaces achieve high stability, while dynamic or rearranged areas receive low stability.

Three-Regime Shape Mapping: the stability score is mapped to the Barron loss shape $\alpha$ by a differentiable piecewise-linear function, promoting a smooth transition between $\ell_2$ (static), Huber, and Cauchy-like (dynamic) losses.

All photometric and embedding residuals are downweighted in low-stability (dynamic) areas, effectively attenuating contributions from moving or rearranged regions, while reinforcing stable structure.
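
One way to picture the three-regime mapping and the resulting downweighting is the sketch below. It assumes a stability score in $[0, 1]$ and uses hypothetical breakpoints (0.2, 0.4, 0.6, 0.8) to form three plateaus joined by linear ramps; none of these values come from the source. The weight function is the standard iteratively reweighted least-squares weight implied by Barron's loss.

```python
import numpy as np

# Hypothetical breakpoints: low stability -> alpha 0 (Cauchy-like),
# medium -> alpha 1 (Huber-like), high -> alpha 2 (L2).
STABILITY_KNOTS = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
ALPHA_KNOTS = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0]

def stability_to_alpha(stability):
    """Piecewise-linear map from stability in [0, 1] to a loss shape."""
    s = np.clip(np.asarray(stability, dtype=np.float64), 0.0, 1.0)
    return np.interp(s, STABILITY_KNOTS, ALPHA_KNOTS)

def irls_weight(residual, alpha, c=1.0):
    """Weight (d rho / d x) / x of Barron's loss, per pixel.

    For alpha = 2 the weight is constant (plain least squares); for
    smaller alpha it decays as the residual grows.
    """
    residual = np.asarray(residual, dtype=np.float64)
    alpha = np.asarray(alpha, dtype=np.float64)
    z = (residual / c) ** 2
    b = np.abs(alpha - 2.0)
    safe_b = np.where(b < 1e-6, 1.0, b)       # avoid division by zero at alpha = 2
    w = (z / safe_b + 1.0) ** (alpha / 2.0 - 1.0)
    return np.where(b < 1e-6, 1.0, w) / c ** 2

stability = np.array([0.0, 0.5, 1.0])          # dynamic, uncertain, static
alpha = stability_to_alpha(stability)
weights = irls_weight(np.full(3, 3.0), alpha)  # same residual at each pixel
```

For an identical residual, the weight shrinks as stability drops, so pixels on moving objects influence the pose and depth updates less than pixels on static structure.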

5. Implementation Specifics

Sliding Window: 8–10 keyframes to manage memory/compute requirements.
Dense Embeddings: RADSeg operates on overlapping patches with self-attention; output features are projected to 256 dimensions after an initial BA burn-in.
Bundle Adjustment: Solved using Gauss-Newton on the sparse factor graph, with all terms evaluated at $H/8 \times W/8$ resolution.
Graph Growth: New keyframes are compared by cosine similarity against non-recent frames to add non-trivial co-visibility edges.
Non-Keyframe Processing: Non-keyframes are photometrically aligned to their nearest keyframes to reduce overhead.
Runtime: Achieves approximately 8–10 FPS on a Xeon Gold 5320 CPU and NVIDIA RTX 4090 GPU.
Open-Vocabulary Grounding: 3D point embeddings are projected to SigLip latent space, text queries are similarly embedded, and a cosine threshold is used for 3D query localization.
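
The query-localization step can be sketched as a cosine-similarity threshold over map points. The example assumes that point and text embeddings already live in the same latent space; the projection into SigLip space and the text encoder are not modeled here, and the arrays, the function name `ground_query`, and the threshold are stand-ins.

```python
import numpy as np

def ground_query(point_embeddings, query_embedding, threshold):
    """Return indices of map points whose embedding has cosine similarity
    above `threshold` with the query embedding, plus all scores."""
    P = np.asarray(point_embeddings, dtype=np.float64)
    q = np.asarray(query_embedding, dtype=np.float64)
    P = P / np.linalg.norm(P, axis=1, keepdims=True)
    q = q / np.linalg.norm(q)
    scores = P @ q
    return np.flatnonzero(scores > threshold), scores

# Toy map: four 3D points with 3-dimensional stand-in embeddings.
xyz = np.array([[0.0, 0.0, 1.0],
                [0.1, 0.0, 1.0],
                [2.0, 1.0, 3.0],
                [4.0, 0.5, 2.0]])
point_embeddings = np.array([[1.0, 0.0, 0.0],
                             [0.9, 0.1, 0.0],
                             [0.0, 1.0, 0.0],
                             [0.0, 0.0, 2.0]])
query_embedding = np.array([2.0, 0.0, 0.0])   # stand-in for an encoded text query

hit_indices, scores = ground_query(point_embeddings, query_embedding, threshold=0.8)
hit_locations = xyz[hit_indices]               # 3D positions matching the query
```

Normalizing both sides first makes the threshold independent of embedding magnitude, so only direction in the latent space matters.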

6. Empirical Results

Dynamic SLAM on TUM-RGBD

Method	Avg. ATE (cm)
Dyna-SLAM II	2.00
V3D-SLAM	2.10
ViPE (SAM-guided)	2.17
RADIO-ViPE	1.90
RADIO-ViPE₍ark₎	1.63

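RADIO-ViPE₍ark₎, the variant with adaptive robust kernels, achieves a state-of-the-art average ATE of 1.63 cm, compared with 1.90 cm for the base system and 2.00 cm for the best listed prior method (Dyna-SLAM II).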
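
For readers unfamiliar with the metric, absolute trajectory error (ATE) is commonly reported as the root-mean-square distance between estimated and ground-truth camera positions after aligning the two trajectories. The sketch below uses a rigid (rotation plus translation) Kabsch alignment; this is a simplification chosen for the example, and the evaluation protocol behind the table above is not described in this article. The trajectories are synthetic.

```python
import numpy as np

def align_rigid(est, gt):
    """Least-squares R, t such that R @ est_k + t approximates gt_k (Kabsch)."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    H = (est - mu_e).T @ (gt - mu_g)           # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection in the recovered rotation.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_g - R @ mu_e
    return R, t

def ate_rmse(est, gt):
    """RMSE of position error after rigid alignment of est onto gt."""
    R, t = align_rigid(est, gt)
    aligned = est @ R.T + t
    err = np.linalg.norm(aligned - gt, axis=1)
    return float(np.sqrt(np.mean(err ** 2)))

rng = np.random.default_rng(0)
gt = rng.normal(size=(20, 3))                  # synthetic ground-truth positions (metres)

# An estimate that differs from gt only by a rigid transform.
angle = np.deg2rad(30.0)
R0 = np.array([[np.cos(angle), -np.sin(angle), 0.0],
               [np.sin(angle),  np.cos(angle), 0.0],
               [0.0,            0.0,           1.0]])
t0 = np.array([0.5, -0.2, 1.0])
est_exact = (gt - t0) @ R0

# The same estimate with small position noise added.
est_noisy = est_exact + 0.01 * rng.normal(size=gt.shape)

ate_exact = ate_rmse(est_exact, gt)
ate_noisy_cm = 100.0 * ate_rmse(est_noisy, gt)  # metres to centimetres
```

A purely rigid offset between the trajectories is removed by the alignment, so only the residual shape error is reported. Monocular systems are often evaluated with an additional scale factor in the alignment, which this sketch omits.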

Open-Vocabulary 3D Segmentation on Replica

Method	mIoU↑ (no BG)	f-mIoU↑	Acc↑	Online	Calib-free	Depth-free	Pose-free
ConceptFusion	21.07	31.51	35.65	✗	✗	✗	✗
RayFronts	39.37	62.03	68.80	✓	✗	✗	✗
RADIO-ViPE₍GT₎	29.51	52.24	59.80	✓	✗	✗	✗
RADIO-ViPE	24.25	50.63	59.25	✓	✓	✓	✓

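Among the listed methods, RADIO-ViPE is the only one that is online and free of calibration, depth, and pose inputs. Relative to the ground-truth-aided variant RADIO-ViPE₍GT₎, it loses 5.26 points of mIoU (29.51 to 24.25), 1.61 points of f-mIoU (52.24 to 50.63), and 0.55 points of accuracy (59.80 to 59.25).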

7. Limitations and Prospects

Structural Classes: Background categories such as wall and floor remain challenging, especially on Replica with background labels. This suggests that future improvements could derive from integrating background consistency cues.
Map Scale Drift: The system can accumulate scale drift over extended video sequences in the absence of explicit loop closures. A plausible implication is that incorporating semantic cues via open-vocabulary object or language landmarks could reduce global drift.
Embedding Optimization: The current visual-language embedding reduction chain (RADSeg→PCA→SigLip) offers further room for task-driven compression or instance-level segmentation improvements.
Extensibility: Fusing data from multiple cameras, inertial sensors, or limited stereo signals is a straightforward architectural extension to RADIO-ViPE.

RADIO-ViPE constitutes the first online, calibration-free SLAM framework that jointly optimizes scene geometry, vision–language embeddings, and adaptive robust kernels for dynamic, in-the-wild monocular video, establishing new state-of-the-art results in robustness and semantic mapping performance (Nasser et al., 28 Apr 2026).


References (1)

RADIO-ViPE: Online Tightly Coupled Multi-Modal Fusion for Open-Vocabulary Semantic SLAM in Dynamic Environments (2026)
