
On-the-Fly 3D Reconstruction

Updated 11 December 2025
  • On-the-fly 3D reconstruction frameworks are systems that incrementally convert sensor streams into detailed 3D scene models with real-time feedback and robust global consistency.
  • They leverage hierarchical optimization and fusion techniques across modalities—such as RGB-D, LiDAR, and neural implicit methods—to enhance accuracy and speed.
  • Applications span robotics, medical imaging, and photogrammetry, offering low-latency performance and scalable solutions for dynamic, large-scale environments.

On-the-Fly 3D Reconstruction Framework

On-the-fly 3D reconstruction frameworks enable real-time, causal, and incremental conversion of sensing input (images, depth, or other modalities) into high-fidelity geometric and appearance representations of static or dynamic scenes. Such frameworks interleave data acquisition, pose or camera-trajectory estimation, and geometry reconstruction, optimizing intermediate scene representations and visualizations at interactive rates. Rather than relying on static, fully captured datasets and batch optimization routines, these systems process sensor streams as they arrive, updating scene models in situ and providing immediate feedback or visual output. On-the-fly paradigms span computer vision, robotics, medical imaging, photogrammetry, graphics, and industrial metrology, and underlie technologies ranging from SLAM to neural rendering, drawing on geometry processing, differentiable graphics, and real-time optimization.
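
The interleaved acquire-track-integrate loop common to these systems can be summarized in a toy Python sketch. Every name below is hypothetical and the tracking step is a stub for real pose estimation (e.g., frame-to-model ICP); the point is the causal structure, not any cited implementation:

```python
import numpy as np

# Toy causal reconstruction loop: each frame is tracked and fused on arrival.

def track(frame, prev_pose):
    # Stand-in for incremental pose estimation against the current model.
    return prev_pose + frame["odometry"]

def integrate(scene_points, frame, pose):
    # Fuse the new observation into the running scene model; a translation-only
    # "rigid transform" keeps the toy minimal.
    return np.vstack([scene_points, frame["points"] + pose])

scene = np.empty((0, 3))
pose = np.zeros(3)
stream = ({"points": np.random.rand(100, 3),
           "odometry": np.array([0.1, 0.0, 0.0])} for _ in range(5))

for frame in stream:                       # frames processed as they arrive
    pose = track(frame, pose)              # front end: incremental pose update
    scene = integrate(scene, frame, pose)  # geometry fused immediately
    # a full system would render feedback and check for loop closure here

print(scene.shape)  # (500, 3): the model grows online, frame by frame
```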

1. Core System Architectures and Sensor Modalities

On-the-fly 3D reconstruction systems are instantiated across diverse sensor types and computational architectures:

  • RGB-D and LiDAR-based SLAM: Classical on-the-fly systems fuse dense or sparse depth sensors with camera input to incrementally construct volumetric (e.g., TSDF), surfel, or hybrid representations of static or dynamic environments. Systems such as BundleFusion employ hierarchical global pose optimization combined with per-chunk surface re-integration in sparse voxel grids to ensure globally consistent, real-time meshes at frame rates >30 Hz (Dai et al., 2016). HRBF-Fusion introduces closed-form Hermite RBF implicit surfaces, allowing robust, noise-resistant incremental fusion without dense grids (Xu et al., 2022). PSDF Fusion models the SDF and inlier probabilities jointly, yielding on-the-fly uncertainty-aware meshing through a hybrid spatial data structure (Dong et al., 2018). A minimal TSDF fusion update is sketched after this list.
  • Monocular and Multi-View Photogrammetry: Recent frameworks such as SfM on-the-fly v2 use hierarchical HNSW-based image matching and adaptive local bundle adjustment to build globally consistent sparse 3D maps in real time, supporting collaborative or multi-agent data-acquisition workflows and submap merging (Zhan et al., 4 Jul 2024). Feedback-enabled extensions generate incremental coarse meshes, assess mesh quality, and plan next-best-view trajectories for UAV photogrammetry (Lou et al., 2 Dec 2025).
  • Neural Implicit and Gaussian Splatting: Neural radiance field (NeRF) and 3D Gaussian Splatting (3DGS) methods are extended to the on-the-fly setting via progressive integration of images, local/global optimization over scene primitives, and incremental pose estimation. On-the-Fly GS interleaves local and global optimization stages, introducing hierarchical weighting, adaptive per-image learning rates, and load-balancing terms to maintain both speed and global coherence (Xu et al., 17 Mar 2025). ARTDECO unifies SLAM-based pose refinement with a hierarchical Gaussian level-of-detail scene structure (Li et al., 9 Oct 2025). EGG-Fusion blends dense RGB-D surfel fusion, information-filter-based uncertainty representation, and differentiable Gaussian rendering in real time (Pan et al., 1 Dec 2025).
  • Medical and Scientific Imaging: DeepOrganNet realizes mesh reconstruction of 3D/4D lung models from single-view X-ray or CBCT images, using deep latent encoding and trivariate tensor-product free-form deformation (FFD) to produce accurate organ geometries within milliseconds, dramatically reducing the number of required input projections and the patient dose (Wang et al., 2019). MRI-focused frameworks adapt k-space sampling in real time using local spectral moment criteria, optimizing acquisition patterns to minimize g-factor and noise amplification (Levine et al., 2017).
  • Multi-Camera and Dynamic Scenes: On-the-fly frameworks for multi-camera rigs include hierarchical (uncalibrated) camera initialization, multi-camera BA, and redundancy-free Gaussian sampling, allowing fusion of raw multi-camera videos into unified, drift-free 3D scenes over hundreds of meters in minutes (Guo et al., 9 Dec 2025). Streaming 4DGS approaches leverage temporal inheritance, dynamic/static primitive differentiation, and error-guided densification for real-time, framewise photorealistic reconstructions of dynamic scenes (Liu et al., 22 Nov 2024, Sun et al., 3 Mar 2024).
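
As referenced in the first item above, volumetric pipelines in the BundleFusion family maintain a truncated signed distance field (TSDF) updated per voxel by a weighted running average. A minimal sketch of that update follows; the truncation distance, observation weight, and weight cap are illustrative values, not any specific system's settings:

```python
import numpy as np

def tsdf_update(D, W, sdf_obs, trunc=0.05, w_obs=1.0, w_max=100.0):
    """Weighted running-average TSDF fusion for a batch of voxels.

    D, W    : current truncated SDF values and accumulated weights
    sdf_obs : signed distances observed for those voxels in the new frame
    """
    tsdf_obs = np.clip(sdf_obs / trunc, -1.0, 1.0)    # truncate and normalize
    D_new = (W * D + w_obs * tsdf_obs) / (W + w_obs)  # running weighted mean
    W_new = np.minimum(W + w_obs, w_max)              # cap to stay adaptive
    return D_new, W_new

# Example: two voxels, both initially unobserved.
D, W = np.zeros(2), np.zeros(2)
D, W = tsdf_update(D, W, sdf_obs=np.array([0.02, -0.10]))
print(D, W)  # [0.4 -1.0] after 5 cm truncation; weights [1. 1.]
```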

2. Incremental Optimization, Global Consistency, and Data Structures

Central to on-the-fly operation is the ability to incrementally update the scene model while ensuring global consistency and minimizing drift.

  • Pose and Geometry Optimization: Many architectures split optimization into parallel front- and back-end threads. BundleFusion and GO-SLAM incorporate hierarchical local/global bundle adjustment, integrating both dense geometric/photometric and sparse (SIFT, RAFT optical flow) constraints, with loop closure for drift correction (Dai et al., 2016, Zhang et al., 2023). On-the-fly SfM pipelines trigger local BA within a small adaptive neighborhood, using self-adaptive weighting derived from hierarchical association trees to maintain sub-second update rates (Zhan et al., 4 Jul 2024).
  • Progressive Scene Fusion: Systems such as HRBF-Fusion maintain continuous implicit surface models, supporting closed-form incremental fusion, fast kernel neighborhood search, and per-pixel confidence. PSDF approaches maintain a joint posterior over the SDF and an inlier probability per voxel, enabling Bayesian fusion as new measurements arrive and discarding low-confidence data for redundancy reduction (Dong et al., 2018, Xu et al., 2022); a simplified per-voxel update is sketched after this list.
  • Hybrid and Hierarchical Representations: ARTDECO and related frameworks combine per-Gaussian spherical-harmonic (SH) color/opacity/geometry fields with level-of-detail-aware rendering: fine Gaussians are rasterized only near the camera, coarse ones dominate at distance, and Laplacian-based probabilistic insertion avoids spatial redundancy (Li et al., 9 Oct 2025). Anchor and clustering mechanisms are employed to control memory and computation cost in large-scale scenes (Meuleman et al., 5 Jun 2025).
  • Failure Recovery and Relocalization: Hierarchical feature matching and chunk-based relocalization strategies are crucial for robust operation in the presence of tracking loss or ambiguous observations (Dai et al., 2016).
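
The PSDF-style joint fusion noted in the second item can be illustrated with a simplified per-voxel update that fuses gated SDF observations and tracks a Beta-distributed inlier probability. The gating rule and Beta parameterization are simplifying assumptions for illustration, not the exact model of the cited papers:

```python
def psdf_style_update(mu, w, alpha, beta, sdf_obs, sigma=0.02, gate=3.0):
    """Simplified joint update of fused SDF and inlier probability per voxel.

    mu, w       : fused SDF estimate and its accumulated weight
    alpha, beta : Beta-distribution counts behind the inlier probability
    sdf_obs     : new SDF observation; sigma is an assumed sensor noise scale
    """
    inlier = abs(sdf_obs - mu) < gate * sigma if w > 0 else True
    if inlier:
        mu = (w * mu + sdf_obs) / (w + 1.0)  # fuse consistent measurement
        w += 1.0
        alpha += 1.0                         # reinforce inlier belief
    else:
        beta += 1.0                          # observation treated as outlier
    return mu, w, alpha, beta, alpha / (alpha + beta)

mu, w, a, b = 0.0, 0.0, 1.0, 1.0             # uninformative Beta(1, 1) prior
for obs in [0.010, 0.012, 0.300, 0.011]:     # third value is a depth outlier
    mu, w, a, b, p = psdf_style_update(mu, w, a, b, obs)
print(round(mu, 3), round(p, 2))             # 0.011 0.67: outlier rejected
```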

3. Differentiable Rendering and Learning for Real-Time Fitting

Recent progress in differentiable rendering and neural scene fitting has made real-time, high-fidelity on-the-fly reconstruction feasible:

  • Differentiable Splatting and Surfel Fusion: Differentiable rasterizers compute per-pixel visibility, color, and normals based on analytic projection of Gaussian surfels or 3D kernels, supporting direct gradient-based optimization with photometric and multi-view constraints (Pan et al., 1 Dec 2025, Xu et al., 17 Mar 2025).
  • Meta-Optimization and Adaptive Scheduling: On-the-Fly GS and EGG-Fusion use adaptive per-image or per-surfel learning-rate schedules, balancing optimization effort between newly added and previously seen images, and integrate global passes to avoid overfitting or memory imbalance (Xu et al., 17 Mar 2025, Pan et al., 1 Dec 2025); a minimal per-image schedule is sketched after this list.
  • Online Dynamic Fusion: Streaming 4D frameworks (e.g., 3DGStream, DASS) introduce neural transformation caches, selective inheritance of primitives, and dynamic/static decomposition with hash-MLP deformation fields, optimizing storage and convergence speed for dynamic or topologically evolving scenes (Sun et al., 3 Mar 2024, Liu et al., 22 Nov 2024).
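
One simple way to realize the per-image scheduling referenced above is to decay each image's learning rate with its own optimization count, so newly added images train aggressively while converged ones stay stable. The exponential decay and its constants are illustrative assumptions, not the published schedules:

```python
class PerImageLR:
    """Per-image learning-rate schedule: each image decays independently."""

    def __init__(self, base_lr=1e-2, min_lr=1e-4, half_life=50):
        self.base_lr, self.min_lr = base_lr, min_lr
        self.half_life = half_life  # optimization steps until the rate halves
        self.steps = {}             # per-image step counters

    def lr(self, image_id):
        n = self.steps.get(image_id, 0)
        self.steps[image_id] = n + 1
        return max(self.base_lr * 0.5 ** (n / self.half_life), self.min_lr)

sched = PerImageLR()
for _ in range(100):
    sched.lr("img_000")               # an old, heavily optimized image
print(round(sched.lr("img_000"), 4))  # 0.0025: mostly converged
print(round(sched.lr("img_101"), 4))  # 0.01: a new image starts at base_lr
```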

4. Domain-Specific Strategies and Evaluation Protocols

On-the-fly frameworks address domain- and application-specific constraints:

  • Medical Imaging: DeepOrganNet employs end-to-end synthetic data generation, per-organ FFD, and template selection via weighted Chamfer loss to bypass expensive multi-projection CT and 3D segmentation, enabling rapid, dose-efficient 3D organ mesh inference from minimal input (Wang et al., 2019). Adaptive k-space MRI sampling algorithms optimize the distribution of new measurements based on the spectral properties of the information matrix, using best-candidate heuristics and priority queues for sub-second mask generation (Levine et al., 2017).
  • Geospatial and Photogrammetric Feedback: Predictive path planning using mesh quality indicators (GSD, visibility, reprojection error) and greedy clustering or trajectory optimization enables UAVs to actively close coverage gaps, reducing reflight cost by 50% and improving mesh completeness mid-flight (Lou et al., 2 Dec 2025).
  • Collaborative and Multi-Agent Mapping: Real-time merging of partial reconstructions acquired by multiple agents is achieved via HNSW feature indexing, shared 3D point correspondences, and Procrustes-based 3D similarity adjustment, allowing fully arbitrary, crowd-sourced data collection (Zhan et al., 4 Jul 2024). A compact similarity-alignment sketch follows this list.
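
The Procrustes-based similarity adjustment in the last item estimates the scale, rotation, and translation that align shared 3D points between two agents' maps. Below is the standard Umeyama closed-form solution, stated generically rather than as code from the cited work:

```python
import numpy as np

def umeyama_similarity(src, dst):
    """Closed-form similarity transform (s, R, t) with dst ≈ s * R @ src + t.

    src, dst : (N, 3) arrays of corresponding 3D points shared between maps.
    """
    mu_s, mu_d = src.mean(0), dst.mean(0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)            # cross-covariance matrix
    U, S, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(U @ Vt))          # guard against reflections
    D = np.diag([1.0, 1.0, d])
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / src_c.var(0).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t

# Example: recover a known similarity transform from exact correspondences.
rng = np.random.default_rng(0)
src = rng.normal(size=(200, 3))
R_true = np.linalg.qr(rng.normal(size=(3, 3)))[0]
R_true *= np.sign(np.linalg.det(R_true))        # ensure a proper rotation
dst = 2.0 * src @ R_true.T + np.array([1.0, -2.0, 0.5])
s, R, t = umeyama_similarity(src, dst)
print(round(s, 3), np.allclose(R, R_true, atol=1e-6))  # 2.0 True
```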

Evaluation relies on standardized accuracy, completeness, and error metrics appropriate to each data domain: Chamfer/EMD/IoU for meshes, ATE/RPE for localization, PSNR/SSIM/LPIPS for photorealistic rendering, IoU/log-odds/correlation for occupancy maps, and g-factor or RMSE for medical imaging (Pan et al., 1 Dec 2025, Xu et al., 17 Mar 2025, Levine et al., 2017, Lou et al., 2 Dec 2025).
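
For reference, the symmetric Chamfer distance used for mesh and point-cloud evaluation can be computed with nearest-neighbor queries in both directions. The variant below averages unsquared distances; some benchmarks square them instead:

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(P, Q):
    """Symmetric Chamfer distance between point sets P and Q of shape (N, 3)."""
    d_pq = cKDTree(Q).query(P)[0]  # nearest-neighbor distances P -> Q
    d_qp = cKDTree(P).query(Q)[0]  # nearest-neighbor distances Q -> P
    return d_pq.mean() + d_qp.mean()

# Example: the distance shrinks as the reconstruction approaches ground truth.
rng = np.random.default_rng(0)
gt = rng.uniform(size=(1000, 3))
recon = gt + rng.normal(scale=0.01, size=gt.shape)  # ~1 cm perturbation
print(round(chamfer_distance(gt, recon), 4))        # small, noise-scale value
```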

5. Quantitative Performance and Application Metrics

On-the-fly 3D reconstruction frameworks consistently demonstrate real-time or near-real-time performance:

  • Latency and Throughput: Neural 3DGS-based systems achieve optimization times of 2–3 seconds per incoming image (including pose estimation), and maintain rendering rates of 180–250 FPS for streaming scenes (Xu et al., 17 Mar 2025, Pan et al., 1 Dec 2025, Sun et al., 3 Mar 2024). Medical organ mesh reconstruction with DeepOrganNet is achieved in <25 ms per 10K-vertex dual-lung mesh, compared to minutes for classical CT (Wang et al., 2019). Graph-cut meshing and path planning in UAV photogrammetry pipelines run in <1–1.5 s per step at ~1K images (Lou et al., 2 Dec 2025).
  • Accuracy and Fidelity: Scene mesh/point cloud reconstructions reach sub-centimeter surface RMS errors and high PSNR/SSIM in photo-realistic views (e.g., EGG-Fusion achieves 0.6 cm surface error on Replica, PSNR >29 dB), outperforming or matching state-of-the-art offline and batch methods (Pan et al., 1 Dec 2025). Monocular neural methods realize state-of-the-art Chamfer distances and completeness in indoor benchmarks (Zou et al., 2022).
  • Scalability and Robustness: Large-scale mapping (e.g., CityWalk—a 1.1 km trajectory, 4K frames) is reconstructed in under 25 min end-to-end with <1 px drift, an order of magnitude faster than conventional pipelines (Meuleman et al., 5 Jun 2025). Multi-camera rigs reconstruct 100K m² scenes within 2 minutes, sustaining alignment across camera views (Guo et al., 9 Dec 2025).

6. Limitations and Research Directions

Despite rapid progress, several challenges remain:

  • Dynamic and Non-Rigid Scenes: Dynamic-object-aware fusion requires explicit modeling of motion, object boundary tracking, or deformation fields (e.g., dynamics-aware shift in DASS (Liu et al., 22 Nov 2024), multi-group RANSAC (Caccamo et al., 2018)). Robustness to severe occlusion and topological changes remains an active research area.
  • Input Noise and Uncertainty: Systems such as HRBF-Fusion and PSDF Fusion model measurement uncertainty and adapt fusion weights or surface extraction accordingly, but their optimal integration with learning-based methods remains an open problem (Xu et al., 2022, Dong et al., 2018).
  • Domain Generalization: Monocular/depth/SLAM fusion approaches often assume static, rigid, and moderately textured environments. Failures arise in the presence of extreme motion, unmodeled illumination, or out-of-distribution scenes. Reliance on pretrained foundation models introduces prior sensitivity (Li et al., 9 Oct 2025).
  • Resource Constraints: Memory and compute minimization are addressed via progressive clustering, multi-level LOD, and neural transformation caches, but explicit scaling to extremely dense scenes or resource-limited platforms is still a bottleneck (Meuleman et al., 5 Jun 2025, Sun et al., 3 Mar 2024).

Future work targets joint modeling of appearance and geometry for dynamic, multimodal streams; improved priors and uncertainty modeling; integration with active exploration policies; and scaling to collaborative, multi-platform data acquisition.

