NeU-NBV: Uncertainty-Driven NBV for Robotics
- The paper introduces a novel framework that selects next-best views by maximizing predicted rendering uncertainty, eliminating the need for explicit 3D map construction.
- It employs an adaptive ray-marching LSTM with a probabilistic output head to efficiently estimate per-pixel uncertainty, enhancing view synthesis and scene reconstruction.
- A domain-invariant variant integrates a pole-like landmark detector and deep Q-learning, enabling robust cross-domain self-localization under varying environmental conditions.
The NeU-NBV framework is a paradigm for active perception in robotics that addresses next-best-view (NBV) planning for scene exploration and for self-localization under domain shift. It combines a mapless, information-seeking approach grounded in uncertainty-aware neural rendering ("NeU-NBV"; Jin et al., 2023) with a domain-invariant, cue-driven RL policy for cross-domain self-localization ("Domain-invariant NBV Planner for Active Cross-domain Self-localization"; Tanaka, 2021). The core innovation is to drive the view-acquisition policy by maximizing information value, either predicted renderer uncertainty or robust, domain-invariant landmarks, rather than by heuristics or explicit 3D map construction.
1. System Architecture and NBV Problem Formulation
The NeU-NBV framework formalizes NBV selection as an iterative, data-driven process in which the system maintains:
- A growing reference set $\mathcal{I}$ of RGB images with associated camera poses.
- An image-based neural renderer $f_\theta$, trained offline on diverse scenes and fixed at test time.
At each acquisition step, a discrete candidate set $V$ of views is sampled within neighborhood constraints (bounded azimuth/elevation). For each candidate $v \in V$, the $N$ closest existing references in pose space are selected, and the renderer predicts an uncertainty map $U^v \in \mathbb{R}^{H_r \times W_r \times 3}$ over all pixels and channels. The mean uncertainty is computed as

$$\bar{U}(v) = \frac{1}{3 H_r W_r} \sum_{i,j,c} U^v_{ijc}.$$

The NBV is selected by maximizing this criterion: $v^* = \arg\max_{v \in V} \bar{U}(v)$.
No explicit 3D map is constructed or updated. Instead, the approach uses the internal uncertainty of a photometric renderer as a proxy for unexplored or ambiguous regions of the scene. After capturing the real image at $v^*$, the observation is added to $\mathcal{I}$, and the process continues until the measurement budget is exhausted.
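As a concrete illustration of the candidate-sampling step, the sketch below draws $K$ views in a bounded neighborhood of the current pose on a view sphere. The `(azimuth, elevation, radius)` parameterization, the angular bounds, and the function name are assumptions for exposition, not details taken from the paper.

```python
import numpy as np

def sample_K_candidate_views(current_pose, K,
                             max_dazim=np.pi / 6, max_delev=np.pi / 12,
                             seed=None):
    """Sample K candidate viewpoints near the current pose
    (hypothetical view-sphere parameterization)."""
    azim, elev, radius = current_pose
    rng = np.random.default_rng(seed)
    d_azim = rng.uniform(-max_dazim, max_dazim, size=K)   # bounded azimuth offsets
    d_elev = rng.uniform(-max_delev, max_delev, size=K)   # bounded elevation offsets
    return [(azim + da, float(np.clip(elev + de, 0.0, np.pi / 2)), radius)
            for da, de in zip(d_azim, d_elev)]
```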
A related, domain-invariant variant (Tanaka, 2021) targets active self-localization under changing appearance (season, weather). Here, the architecture includes:
- A multi-scale pole-like landmark detector (PLD) CNN, yielding a compact 4-dimensional feature summarizing the likelihood of domain-stable geometric cues.
- A lightweight deep Q-network (DQN) policy trained to maximize pole detection rates while minimizing movement cost.
- An experience replay buffer and Bag-of-Words pose retrieval module.
The pose-estimation process is triggered opportunistically when sufficient landmark cues are detected.
2. Neural Rendering and Uncertainty Estimation
NeU-NBV builds on PixelNeRF but incorporates two critical changes:
- Adaptive ray-marching LSTM: Instead of dense volumetric sampling, an LSTM dynamically determines the next sample point along each ray, leveraging previous feature aggregation for efficient view synthesis.
- Probabilistic output head: For each pixel, the network predicts a logit-space mean $\mu$ and standard deviation $\sigma$ for each color channel. Each RGB value $c$ is modeled as logistic-normal: $c = \mathrm{sigmoid}(s)$ with $s \sim \mathcal{N}(\mu, \sigma^2)$.
This enables direct aleatoric per-pixel uncertainty estimation without ensembles or dropout. At inference, per-pixel uncertainty is computed as the variance of sigmoid-transformed samples drawn from $\mathcal{N}(\mu, \sigma^2)$.
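A minimal sketch of this Monte Carlo estimate, assuming NumPy arrays for the predicted logit-space parameters (the sample count and function name are illustrative):

```python
import numpy as np

def pixel_uncertainty(mu, sigma, n_samples=100, seed=None):
    """Per-pixel aleatoric uncertainty: variance of sigmoid-transformed
    samples drawn from N(mu, sigma^2) in logit space.

    mu, sigma: predicted logit-space mean/std, e.g. shape (H, W, 3).
    Returns an uncertainty map of the same shape."""
    rng = np.random.default_rng(seed)
    s = rng.normal(mu, sigma, size=(n_samples,) + np.shape(mu))  # logit-space draws
    c = 1.0 / (1.0 + np.exp(-s))                                 # sigmoid -> colors in (0, 1)
    return c.var(axis=0)                                         # variance over samples
```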
The network is trained using the negative log-likelihood of the logistic-normal model, which, up to additive terms independent of the network outputs, is

$$\mathcal{L}_{\mathrm{NLL}} = \sum_{\text{pixels, channels}} \left[ \frac{\big(\mathrm{logit}(c) - \mu\big)^2}{2\sigma^2} + \log \sigma \right],$$

where $c$ is the observed color value and $\mathrm{logit} = \mathrm{sigmoid}^{-1}$.
No additional regularization, depth supervision, or adversarial loss is applied.
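A PyTorch sketch of this loss, under the assumption that the head emits $\log\sigma$ for numerical stability (the paper's exact parameterization may differ):

```python
import torch

def logistic_normal_nll(mu, log_sigma, rgb, eps=1e-5):
    """Negative log-likelihood of the logistic-normal color model,
    dropping the additive terms that do not depend on mu or sigma.

    mu, log_sigma: predicted logit-space parameters; rgb: observed
    colors in (0, 1); all tensors share the same shape."""
    c = rgb.clamp(eps, 1.0 - eps)                 # keep the logit finite
    logit_c = torch.log(c) - torch.log1p(-c)      # inverse sigmoid of the target
    sigma = log_sigma.exp()
    return (((logit_c - mu) ** 2) / (2.0 * sigma ** 2) + log_sigma).mean()
```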
3. NBV Selection Algorithm and Planning Loop
At runtime, NeU-NBV executes the following procedure until the acquisition budget is reached:
```
Input: pretrained renderer f_theta, initial reference set I,
       budget B, candidate count K, nearest-reference count N

for step = 1 to (B - |I|):
    V = sample_K_candidate_views(current_pose, K)
    best_score, best_view = -inf, None
    for v in V:
        refs  = find_N_closest_refs(I, v, N)
        U     = f_theta.predict_uncertainty(v, refs)
        score = mean(U)              # average over H_r x W_r x 3
        if score > best_score:
            best_score, best_view = score, v
    (I_new, T_new) = capture_image(best_view)
    I.append((I_new, T_new))
return I
```
This policy requires only local operations (nearest-neighbor pose search, feedforward inference, empirical averaging) and is mapless: no volumetric or geometric scene model is built or maintained. Because $f_\theta$ is pretrained, there is no per-scene retraining.
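A sketch of the nearest-neighbor pose search, assuming 4x4 camera-to-world matrices and an illustrative distance that mixes translation with geodesic rotation angle; the paper does not prescribe this exact metric.

```python
import numpy as np

def find_N_closest_refs(references, v_pose, N, rot_weight=1.0):
    """Select the N references whose poses are closest to candidate v_pose.

    references: list of (image, 4x4 camera-to-world pose) pairs.
    rot_weight trades off rotation (radians) against translation; its
    value here is an assumption, not taken from the paper."""
    def pose_distance(T_a, T_b):
        t = np.linalg.norm(T_a[:3, 3] - T_b[:3, 3])              # translation gap
        R = T_a[:3, :3].T @ T_b[:3, :3]                          # relative rotation
        angle = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
        return t + rot_weight * angle
    order = np.argsort([pose_distance(T, v_pose) for _, T in references])
    return [references[i] for i in order[:N]]
```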
4. Domain-Invariant NBV for Active Self-Localization
The variant in (Tanaka, 2021) introduces several components to address visual domain shifts:
- Pole-like Landmark Detector (PLD): A multi-encoder CNN inspired by HED, trained on pole endpoint annotations, robustly detects pole-like structures whose geometry remains stable across appearance variations.
- Spatial Landmark Aggregation (SLA): The PLD's output is binned horizontally and aggregated to form the compact 4-dimensional feature.
- Deep Q-Learning Policy: A model-free DQN maps the SLA feature to discrete forward-motion actions; rewards favor observations in which pole cues are detected and penalize unnecessary moves (see the sketch after this list).
- Passive Self-Localization (PSL): Upon pole detection, a Bag-of-Words-based retrieval estimates pose by matching to a database.
- Domain Generalization: The PLD is pretrained on a source domain and transfers directly, without domain adaptation or adversarial losses. Conditioning the policy on a compact, geometry-driven feature enables robust performance across environmental changes.
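A minimal PyTorch sketch of such a Q-network over the 4-dimensional SLA feature, with Boltzmann exploration using the annealed temperature mentioned in Section 5; the layer sizes and action count are illustrative assumptions, not the architecture from Tanaka (2021).

```python
import torch
import torch.nn as nn

class PoleNBVPolicy(nn.Module):
    """Hypothetical DQN head: 4-D SLA feature -> Q-values over discrete
    forward-motion actions (sizes are assumptions for illustration)."""
    def __init__(self, n_actions=3, hidden=64):
        super().__init__()
        self.q = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, sla_feature):
        return self.q(sla_feature)

def select_action(policy, sla_feature, temperature):
    """Boltzmann exploration: sample an action with probability
    proportional to exp(Q / temperature); annealing the temperature
    from 1.0 to 0.1 shifts the policy from exploration toward greed."""
    with torch.no_grad():
        q = policy(sla_feature)
        probs = torch.softmax(q / temperature, dim=-1)
        return int(torch.multinomial(probs, 1).item())
```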
5. Experimental Protocols and Benchmark Results
Datasets:
- Real: DTU multiview stereo (49 views per scene; 88 training and 15 test scenes), 400x300 px.
- Synthetic: ShapeNet (car, moto, camera, ship), 100 views/object, 200x200 px.
- Domain-invariant NBV: University of Michigan NCLT dataset, four seasons, 26k images/sequence.
Training:
- NeU-NBV: Adam optimizer; a fixed number of LSTM ray-sampling iterations; roughly 2 days of training on one RTX A5000; 3-5 random reference views per scene.
- DQN NBV: Adam optimizer, batch size 32, experience replay buffer, target-network update every 1k steps; exploration temperature annealed from 1.0 to 0.1.
Evaluation:
- Uncertainty Calibration: Spearman's rank correlation coefficient (SRCC) between predicted uncertainty and true MSE, and Area Under the Sparsification Error curve (AUSE); a sketch of both metrics follows this list.
- Aleatoric uncertainty: NeU-NBV's SRCC compares favorably with competing methods' ($0.27$–$0.83$), and its AUSE with theirs ($0.26$–$0.50$).
- Planning Quality: Test-time PSNR and SSIM on held-out images after fixed-budget planning (DTU: 9 images, ShapeNet/indoor: 20 images, 50 candidates/step).
- Uncertainty-based NBV outperforms random and max-distance planners on both DTU and simulator setups.
- Impact on Downstream Reconstruction: Instant-NGP trained on data acquired by NeU-NBV yields higher PSNR/SSIM than models trained on random or max-distance acquisitions.
- Domain Transfer in NBV DQN: The learned policy achieves a lower median rank of the ground-truth pose, in fewer moves, than baseline heuristics.
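A minimal sketch of the two calibration metrics referenced above, assuming flattened per-pixel arrays; the sparsification grid and the uniform-grid averaging are simplifications of the standard definitions.

```python
import numpy as np
from scipy.stats import spearmanr

def srcc(uncertainty, mse):
    """Spearman's rank correlation between predicted uncertainty
    and the true per-pixel squared error."""
    return spearmanr(uncertainty.ravel(), mse.ravel()).correlation

def ause(uncertainty, mse, steps=20):
    """Approximate Area Under the Sparsification Error curve (lower is
    better): remaining error when removing pixels by predicted
    uncertainty, minus the oracle that removes them by true error."""
    u, e = uncertainty.ravel(), mse.ravel()
    by_pred   = np.argsort(-u)                  # most uncertain first
    by_oracle = np.argsort(-e)                  # largest error first
    diffs = []
    for frac in np.linspace(0.0, 0.95, steps):  # fraction of pixels removed
        k = int(len(e) * frac)
        diffs.append(e[by_pred[k:]].mean() - e[by_oracle[k:]].mean())
    return float(np.mean(diffs))                # uniform-grid area estimate
```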
6. Strengths, Limitations, and Future Prospects
Strengths:
- NeU-NBV achieves efficient, mapless, uncertainty-driven view planning with no per-scene retraining or explicit 3D map construction.
- Uncertainty estimates are strongly correlated with actual reconstruction error, enabling effective budget utilization.
- Domain-invariant variant leverages geometry-driven cues, providing robust NBV policies across seasons/lighting without retraining.
Limitations:
- In domain-invariant NBV, the PLD may mistake unrelated vertical structures for poles, especially in cluttered scenes.
- Neither approach explicitly handles full occlusions, e.g., poles obstructed by dynamic obstacles.
- The reward structure in the RL-based variant is sparse; more informative shaping (e.g., retrieval-score gain) could accelerate learning.
- The view planner's action space is restricted (e.g., forward motion only) in the RL variant.
Future Directions:
- Integrating richer or multi-cue representations (e.g., combining geometric and photometric uncertainty) could further extend robustness.
- Mechanisms for active disambiguation under occlusion, and support for higher-dimensional navigation policies, are natural extensions.
- Exploration of adversarial or contrastive domain-alignment methods may further improve invariance.
7. Context within Active Perception and Neural Rendering
The NeU-NBV framework sits at the intersection of active perception, deep photometric rendering, and robust landmark-based reasoning. By eschewing explicit geometric models in favor of information-driven rendering uncertainty or domain-invariant geometric cues, it addresses key bottlenecks of earlier NBV planners: computational scalability, sensitivity to domain shift, and sample efficiency. It contributes both a practical methodology for data acquisition in scene understanding and a benchmark for uncertainty-driven planning in neural rendering and robotic self-localization tasks.