BENBV-Net: Rapid Boundary Exploration in 3D Scanning
- The paper introduces BENBV-Net, which predicts optimal boundary exploration targets directly from point cloud data without needing a reference model.
- BENBV-Net uses hierarchical point and normal encoders combined with boundary-feature extraction, context fusion, and self-attention to efficiently score candidate views.
- Experimental results show BENBV-Net achieves near model-based performance with up to an 8× reduction in runtime, enabling real-time, robust 3D scanning.
The Boundary Exploration Next Best View Network (BENBV-Net) is a specialized deep neural network framework designed for the Next Best View (NBV) problem in 3D robotic scanning, addressing both scan coverage maximization and robust registration through intrinsic overlap awareness. By predicting the optimal boundary exploration target directly from empirical point cloud data—without reliance on a reference model—BENBV-Net achieves near model-based performance at significantly reduced inference times, facilitating efficient and practical deployment in unknown object scanning scenarios (Li et al., 2024).
1. NBV Problem Formulation
The NBV task is formalized using stepwise 3D surface point clouds, where at each step $t$ the acquired scan data are represented as a point cloud $P_t$. Each candidate view is parameterized as $v = (c, f)$, comprising the camera position $c$ and a focal point $f$ constrained to boundary points of the current cloud. The central objective is to select the next view maximizing a utility function that balances coverage and overlap. The coverage ratio is defined as $C_t = |P_t \cap M| / |M|$, and the overlap ratio as $O_t = |P^{\mathrm{new}}_t \cap P_{t-1}| / |P^{\mathrm{new}}_t|$, where $M$ is the reference model and $P^{\mathrm{new}}_t \cap P_{t-1}$ denotes the newly acquired, already-seen points. This dual emphasis enables maximally informative and registration-robust scanning.
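The two ratios above can be sketched with voxel-hashed point sets; the voxelization helper and cell size are illustrative assumptions, not details from the paper:

```python
def voxelize(points, size=0.01):
    # Hash xyz points into voxel cells so set operations approximate
    # surface membership. The cell size is an assumed parameter.
    return {(round(x / size), round(y / size), round(z / size))
            for x, y, z in points}

def coverage_ratio(scanned, model):
    # Fraction of the reference model's voxels observed so far.
    return len(scanned & model) / len(model)

def overlap_ratio(new_scan, seen):
    # Fraction of newly acquired voxels that were already seen; high
    # overlap is what makes registration of the new scan reliable.
    return len(new_scan & seen) / len(new_scan)
```

In practice `scanned`, `model`, and `new_scan` would come from `voxelize` applied to the accumulated cloud, the reference mesh samples, and the latest depth frame respectively.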
2. Model-Based Boundary-Exploration NBV Policy
The foundational model-based approach iteratively searches for NBVs by evaluating candidate boundary views using a composite score $s(v) = w \cdot C(v) + (1 - w) \cdot O(v)$, where $w$ is a sigmoid coverage weight that shifts emphasis from overlap to coverage once coverage exceeds $0.6$. At each scan iteration, boundaries are detected via an angle threshold (120°) and clustered into candidates via K-means; for each boundary, a local frame is estimated, the normal is randomly perturbed, and the camera is positioned at a controllable working distance along the (possibly perturbed) normal. Candidate views are evaluated in simulation, with the highest-scoring candidate selected as the NBV.
This search process not only provides a near-optimal NBV policy but also generates supervised (view, score) pairs to train BENBV-Net.
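The sigmoid-weighted scoring of candidate views can be sketched as follows; the sigmoid steepness and the candidate tuple layout are assumptions for illustration:

```python
import math

def coverage_weight(current_coverage, pivot=0.6, steepness=20.0):
    # Sigmoid weight: below ~60% coverage the score favors overlap
    # (safe registration); above it, emphasis shifts to coverage
    # (exploration). The steepness value is an assumption.
    return 1.0 / (1.0 + math.exp(-steepness * (current_coverage - pivot)))

def candidate_score(coverage_gain, overlap, current_coverage):
    w = coverage_weight(current_coverage)
    return w * coverage_gain + (1.0 - w) * overlap

def select_nbv(candidates, current_coverage):
    # candidates: list of (view, coverage_gain, overlap) tuples,
    # each simulated from one clustered boundary.
    return max(candidates,
               key=lambda c: candidate_score(c[1], c[2], current_coverage))[0]
```

Early in a scan (low coverage) the weight is near zero, so overlap-rich views win; late in a scan the same candidates would be ranked by coverage gain instead.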
3. BENBV-Net Architecture and Training
BENBV-Net receives as input: the downsampled current point cloud $P_t$ (per-point xyz coordinates and normals), a set of 20 boundary points (selected per the model-based step), and a per-boundary context vector encapsulating local point density and view-order index.
Architectural Components
- Point-Feature Encoder: Processes the xyz coordinates of $P_t$ using PointNet-like MLPs with max pooling to yield a global point feature.
- Normal-Feature Encoder: Processes the normals of $P_t$ through MLPs to obtain a global normal feature.
- Boundary-Feature Extractor: Encodes xyz+normals for each of the 20 boundary points via lightweight MLPs, yielding per-boundary features.
- Context Fusion Module: Fuses per-boundary local density and the normalized view-order index through MLPs, producing per-boundary context features.
- Multi-Scale Residual Fusion: Broadcasts the global features to every boundary and aggregates the point, normal, boundary, and context features via residual MLP blocks into per-boundary tokens.
- Self-Attention Layer: A multi-head self-attention module models dependencies between the 20 boundary candidates.
- Prediction Head: Outputs scalar scores per boundary through MLP with dropout.
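The permutation-invariant global encoding used by the point and normal encoders can be sketched in pure Python; the single-layer MLP stands in for the paper's deeper shared MLPs:

```python
def linear_relu(x, weights, bias):
    # One shared fully connected layer with ReLU, applied per point.
    # weights: list of per-output-channel weight vectors.
    return [max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)
            for col, b in zip(weights, bias)]

def pointnet_global_feature(points, weights, bias):
    # Shared per-point MLP followed by channel-wise max pooling. The
    # max makes the result permutation-invariant, which is why the
    # encoders can consume unordered scan data directly.
    per_point = [linear_relu(p, weights, bias) for p in points]
    return [max(f[k] for f in per_point) for k in range(len(bias))]
```

Reordering the input points leaves the pooled feature unchanged, which is the key property a set encoder needs.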
Loss Function
The learning objective is a position-aware weighted regression loss of the form $\mathcal{L} = \frac{1}{N} \sum_i w_i (\hat{s}_i - s_i)^2$, where $\hat{s}_i$ is the predicted score for boundary $i$, $s_i$ is the model-based target score, and $w_i$ is a weight depending on the view's position in the scan sequence. This weighting scheme assigns different importance to early and late views, which proves crucial for early-stage overlap performance.
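A minimal sketch of such a position-aware weighted loss is shown below; the linear decay schedule and the `early_weight` value are assumptions, not the paper's exact weighting:

```python
def position_weight(view_index, num_views=15, early_weight=2.0):
    # Linearly decaying weight that emphasizes early views, where
    # overlap errors are most damaging. The schedule is an assumption.
    t = view_index / max(num_views - 1, 1)
    return early_weight * (1.0 - t) + 1.0 * t

def weighted_regression_loss(pred, target, view_indices, num_views=15):
    # Position-aware weighted MSE over a batch of boundary scores.
    total = 0.0
    for p, s, i in zip(pred, target, view_indices):
        total += position_weight(i, num_views) * (p - s) ** 2
    return total / len(pred)
```

Under this schedule an error of a given size on the first view costs twice as much as the same error on the last view.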
Training is conducted end-to-end with the Adam optimizer (batch size 128, 150 epochs) and requires approximately two hours.
4. NBV Prediction and Execution
During inference, boundary detection and clustering produce 20 candidates. BENBV-Net forward-passes these as boundary tokens to obtain scores $\{s_i\}_{i=1}^{20}$ and selects the highest-scoring index $i^{*} = \arg\max_i s_i$. The NBV is thus:
- Target: the boundary point $b_{i^{*}}$
- Camera: $c = b_{i^{*}} + d \, \hat{n}_{i^{*}}$, with $d$ the adjustable working distance and $\hat{n}_{i^{*}}$ the (perturbed) boundary normal.
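The selection-and-placement step can be sketched as a simple argmax plus a step along the normal; the default working distance is a placeholder, not a value from the paper:

```python
def choose_nbv(boundary_points, normals, scores, working_distance=0.3):
    # Pick the boundary with the highest predicted score; the camera
    # sits at the working distance along that boundary's unit normal,
    # looking back at the boundary point. The distance is a
    # sensor-dependent, test-time-adjustable parameter.
    i = max(range(len(scores)), key=lambda k: scores[k])
    target = boundary_points[i]
    camera = tuple(t + working_distance * n
                   for t, n in zip(target, normals[i]))
    return target, camera
```

Because the distance enters only in this final placement, swapping sensors at inference time amounts to changing one argument.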
Notably, BENBV-Net does not embed the working distance $d$ as a learnable parameter; instead, $d$ remains directly configurable at inference to accommodate sensor-specific requirements.
5. Experimental Evaluation and Performance
Benchmarks were performed on the ShapeNetV1, ModelNet40, and Stanford 3D Repository datasets, comprising 24,000 scans for training and testing plus 128 scans for generalization evaluation. The evaluation protocol utilized:
- Final coverage (%) after 15 scans
- Early overlap (%) over initial 5 scans
- Chamfer and Hausdorff distances to ground truth
- Scanning efficiency $E = C / n$, with $C$ the final coverage (%) and $n$ the number of views needed to reach 90% coverage
- Number of views to reach specified coverage milestones (50%, 80%, 90%)
- Overlap ratio at scan intervals
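The milestone and efficiency metrics can be computed from a per-step cumulative coverage curve; reading efficiency as final coverage (%) over views-to-90% is inferred from the reported tables rather than stated verbatim:

```python
def views_to_reach(coverage_curve, milestone):
    # Number of views before cumulative coverage first hits a milestone.
    for n, c in enumerate(coverage_curve, start=1):
        if c >= milestone:
            return n
    return None  # milestone never reached within the scan budget

def scanning_efficiency(coverage_curve, milestone=0.90):
    # Final coverage (%) divided by views needed to reach 90% coverage.
    n = views_to_reach(coverage_curve, milestone)
    if n is None:
        return None
    return 100.0 * coverage_curve[-1] / n
```

For example, a method whose coverage curve is [0.5, 0.8, 0.92, 0.95] reaches 90% in 3 views, so its efficiency is 95/3.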
| Method | ShapeNet (Coverage/Overlap/Efficiency) | ModelNet40 (Coverage/Overlap/Efficiency) | Repository (Coverage/Overlap/Efficiency) |
|---|---|---|---|
| BENBV | 89.1% / 59.3% / 8.76 | 89.1% / 62.2% / 9.01 | 95.0% / 53.5% / 13.0 |
| BENBV-Net | 85.9% / 55.3% / 7.51 | 87.3% / 58.2% / 8.03 | 94.3% / 47.0% / 11.2 |
| PC-NBV | 87.4% / 33.8% / 7.18 | 88.2% / 33.1% / 7.49 | 91.9% / 32.6% / 8.59 |
| SEE | 62.9% / 55.2% / 4.13 | 65.8% / 56.2% / 4.40 | 77.7% / 57.9% / 5.53 |
BENBV-Net attains 85.9–94.3% coverage and 47.0–55.3% overlap, achieving 7.51–11.2 efficiency; these values approach model-based BENBV and consistently outperform the prior works PC-NBV and SEE, especially on the practically important early-overlap and coverage-milestone metrics. BENBV-Net reduces per-object runtime to at most $7.8$ s, compared with up to $67$ s for model-based search, facilitating near real-time NBV selection.
Ablation studies indicate the position-aware loss and context fusion significantly enhance early overlap; omitting view-order weighting degrades performance by approximately 5% in initial scans.
6. Applications, Flexibility, and Future Directions
BENBV-Net is architected for efficient NBV selection in object-agnostic, model-free 3D scanning contexts where rapid adaptation to scene geometry is critical. The design enables:
- Flexible deployment on varying sensors and working distances, as the working distance $d$ is trivially adjustable at test time.
- Intrinsic registration robustness via overlap-aware boundary prioritization, balancing discovery and scan alignment as coverage accumulates.
However, certain limitations exist: the pipeline is not fully end-to-end, as boundary extraction is performed separately, and key hyperparameters (e.g., cluster count, angle threshold, loss weighting) require tuning. Anticipated extensions include integration of learned boundary detectors, explicit robot motion planning, and joint optimization of camera placement and orientation.
A plausible implication is that the boundary exploration paradigm promoted by BENBV-Net may generalize to active vision frameworks beyond point clouds, especially where incremental, registration-aware information gain is paramount.
7. Summary and Key Insights
BENBV-Net exemplifies a practical learning-based NBV predictor in which hierarchical point/boundary-feature encoders and context-fusion strategies enable rapid, boundary-specific NBV prediction. The network’s position-aware regression loss, fusion of geometric context (local density and boundary sequence), and inter-boundary self-attention are central to its efficacy. BENBV-Net closely approximates model-based NBV search accuracy with an order-of-magnitude speed improvement, facilitating real-time usage in complex and unstructured 3D scanning environments (Li et al., 2024).