TurboMap: GPU-Accelerated Local Mapping for SLAM

Updated 10 November 2025
  • The paper introduces TurboMap, a local mapping module that leverages GPU offloading for tasks like triangulation, map-point fusion, and bundle adjustment to boost visual SLAM performance.
  • It partitions computational tasks by executing parallel correspondence searches and Schur-complement bundle adjustment on the GPU while using the CPU for efficient keyframe culling.
  • Experimental evaluations on EuRoC and TUM-VI datasets show TurboMap delivers 1.3×–1.6× speedups with minimal impact on mapping accuracy, validating its real-time applicability on both desktop and embedded platforms.

TurboMap is a GPU-accelerated and CPU-optimized local mapping module designed to address critical computational bottlenecks in visual Simultaneous Localization and Mapping (SLAM) systems. Built atop ORB-SLAM3 and leveraging CUDA, TurboMap offloads map point triangulation, map point fusion, and local bundle adjustment to the GPU, while refining keyframe culling on the CPU. The system demonstrates substantial speedups—up to 1.6× on public benchmark datasets—without compromising mapping accuracy and is validated on both desktop- and embedded-class hardware.

1. Performance Bottlenecks in Local Mapping

Visual SLAM systems, including ORB-SLAM3, implement three tightly coupled modules: tracking, local mapping, and loop closing, all under stringent real-time requirements. Local mapping typically assumes responsibility for:

  • Map-point creation (stereo/multiview correspondence search and triangulation)
  • Map-point fusion (duplicate detection and merging)
  • Local bundle adjustment (LBA)
  • Local keyframe culling

Profiling of ORB-SLAM3 reveals that approximately 95% of the wall-clock time in local mapping is spent across these four stages. The search phases in point creation and fusion are highly parallel but compute-intensive, and LBA alone accounts for 60–75% of local mapping computation time due to the cost of solving large, sparse least-squares problems. Naive keyframe culling routines scale poorly with map size, so culling becomes an increasing bottleneck as the map grows.

2. System Architecture and Task Partitioning

TurboMap reorganizes the local mapping pipeline by explicitly partitioning computational tasks between the GPU and CPU as follows:

  • Upon insertion, each new keyframe (∼1 MB of keypoints and descriptors) is transferred once into a persistent, preallocated GPU storage buffer. An index array maintains bookkeeping over active keyframes; removal only modifies indices, deferring deallocation until shutdown.
  • GPU-executed kernels address: (i) correspondence search for triangulation (incorporating epipolar checks), (ii) descriptor-based map-point fusion across neighbors, and (iii) local bundle adjustment through a CUDA-based Schur-complement solver.
  • CPU responsibilities are minimized to light outlier filtering before mapping and a highly optimized keyframe culling module, using per-map-point observation counters.

This architecture is motivated by empirical identification of component-level bottlenecks and by the parallel nature of most time-consuming operations, as detailed in Table 1 below.

Table 1. Task partitioning in TurboMap

| Local Mapping Stage | Offloaded Component | Primary Processor |
|---|---|---|
| Keypoint/descriptor storage | Preallocated persistent buffer | GPU |
| Search for triangulation | Matching kernel with epipolar check | GPU |
| Map-point fusion | Descriptor + 3D distance search | GPU |
| Local bundle adjustment | Schur-complement solver | GPU |
| Keyframe culling | Counter-based update | CPU |

All data are drawn from (Hosseininejad et al., 3 Nov 2025).

3. GPU-Accelerated Algorithms

3.1 Persistent Keyframe Storage

To mitigate high-latency PCIe/DRAM transfers, all keyframes that may be referenced during local mapping reside in a reserved GPU buffer. A simple index array handles bookkeeping, so each keyframe requires only one consolidated host-to-device transfer, and removals are logical index updates rather than deallocations.
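
Below is a minimal sketch of this storage scheme. The names and capacities (KeyframeGpuPool, kMaxKeyframes, kMaxFeatures) are illustrative assumptions; the paper does not specify the exact buffer layout.

```cuda
// Sketch of persistent keyframe storage with logical removal. Names and
// capacities (KeyframeGpuPool, kMaxKeyframes, kMaxFeatures) are illustrative.
#include <cuda_runtime.h>
#include <cstdint>
#include <vector>

constexpr int kMaxKeyframes = 512;   // assumed pool capacity
constexpr int kMaxFeatures  = 2048;  // assumed features per keyframe
constexpr int kDescWords    = 8;     // 256-bit ORB descriptor = 8 x uint32

struct KeyframeGpuPool {
    uint32_t* d_descriptors = nullptr;   // one device buffer, allocated once
    std::vector<int> free_slots;         // host-side index bookkeeping

    KeyframeGpuPool() {
        cudaMalloc(&d_descriptors, size_t(kMaxKeyframes) * kMaxFeatures *
                                   kDescWords * sizeof(uint32_t));
        for (int i = kMaxKeyframes - 1; i >= 0; --i) free_slots.push_back(i);
    }
    ~KeyframeGpuPool() { cudaFree(d_descriptors); }  // deferred to shutdown

    // One consolidated host-to-device copy per keyframe insertion.
    int insert(const uint32_t* h_desc, int num_features) {
        if (free_slots.empty()) return -1;           // pool exhausted
        int slot = free_slots.back(); free_slots.pop_back();
        cudaMemcpy(d_descriptors + size_t(slot) * kMaxFeatures * kDescWords,
                   h_desc, size_t(num_features) * kDescWords * sizeof(uint32_t),
                   cudaMemcpyHostToDevice);
        return slot;
    }

    // Removal only recycles the slot index; no cudaFree until shutdown.
    void remove(int slot) { free_slots.push_back(slot); }
};
```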

3.2 Feature Correspondence and Triangulation

Parallel GPU kernels process each feature in a newly inserted keyframe, comparing its 256-bit ORB descriptor against descriptors of all first-order neighboring keyframes. The fundamental matrix establishes the epipolar constraint $|x_i^\top F x_j| < \varepsilon$, with reduction strategies at the CUDA block level to select optimal matches. Only one host-to-device transfer is performed per new keyframe insertion.
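
A hedged kernel sketch of this search pattern follows, assuming one 256-thread block per query feature, a row-major fundamental matrix, and illustrative thresholds (kHammingMax, kEpipolarEps) that are not taken from the paper:

```cuda
// Hedged sketch of per-feature correspondence search: one 256-thread block per
// query feature; each thread scans candidate features in a neighbor keyframe.
#include <cuda_runtime.h>
#include <cstdint>

constexpr int   kDescWords   = 8;      // 256-bit ORB descriptor = 8 x uint32
constexpr int   kHammingMax  = 50;     // assumed descriptor match threshold
constexpr float kEpipolarEps = 1.0f;   // assumed epipolar tolerance (pixels)

__device__ int hamming256(const uint32_t* a, const uint32_t* b) {
    int d = 0;
    for (int w = 0; w < kDescWords; ++w) d += __popc(a[w] ^ b[w]);
    return d;
}

// |x_i^T F x_j| with F stored row-major and homogeneous pixels (u, v, 1).
__device__ float epipolarError(const float* F, float2 xi, float2 xj) {
    float t0 = F[0] * xj.x + F[1] * xj.y + F[2];
    float t1 = F[3] * xj.x + F[4] * xj.y + F[5];
    float t2 = F[6] * xj.x + F[7] * xj.y + F[8];
    return fabsf(xi.x * t0 + xi.y * t1 + t2);
}

// Launch with one block per query feature and blockDim.x == 256.
__global__ void matchKernel(const uint32_t* queryDesc, const float2* queryPts,
                            const uint32_t* candDesc,  const float2* candPts,
                            int numCand, const float* F, int* bestIdx) {
    int q = blockIdx.x;
    const uint32_t* qd = queryDesc + q * kDescWords;

    // Pack (distance << 16) | candidate index so a single min() picks the
    // best match; index 0xFFFF marks "no candidate passed the checks".
    int best = (kHammingMax << 16) | 0xFFFF;
    for (int c = threadIdx.x; c < numCand; c += blockDim.x) {
        if (epipolarError(F, queryPts[q], candPts[c]) > kEpipolarEps) continue;
        int packed = (hamming256(qd, candDesc + c * kDescWords) << 16) | c;
        best = min(best, packed);
    }

    __shared__ int sBest[256];
    sBest[threadIdx.x] = best;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // block-level min-reduction
        if (threadIdx.x < s)
            sBest[threadIdx.x] = min(sBest[threadIdx.x], sBest[threadIdx.x + s]);
        __syncthreads();
    }
    if (threadIdx.x == 0)
        bestIdx[q] = ((sBest[0] & 0xFFFF) == 0xFFFF) ? -1 : (sBest[0] & 0xFFFF);
}
```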

3.3 Map-Point Fusion

The fusion kernel extends the correspondence algorithm to second-order neighbors, computing the minimum Hamming distance (descriptor) and spatial (3D) distance for each new map point. Decision logic is performed in-kernel, supporting asynchronous map-point merging across a broader local neighborhood.
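
The per-pair decision might look like the following device-side sketch; both thresholds (kFuseHammingMax, kFuseDist3dMax) are assumptions for illustration, not the paper's values:

```cuda
// Hedged sketch of the in-kernel fusion test: a candidate pair is merged only
// when both the 3D distance and descriptor distance fall below thresholds.
#include <cuda_runtime.h>
#include <cstdint>

constexpr int   kFuseHammingMax = 50;     // assumed descriptor threshold
constexpr float kFuseDist3dMax  = 0.05f;  // assumed metric threshold (metres)

// Hamming distance over a 256-bit (8-word) ORB descriptor pair.
__device__ int hamming256(const uint32_t* a, const uint32_t* b) {
    int d = 0;
    for (int w = 0; w < 8; ++w) d += __popc(a[w] ^ b[w]);
    return d;
}

// The cheap geometric test runs first to skip the descriptor comparison.
__device__ bool shouldFuse(const uint32_t* descA, const uint32_t* descB,
                           float3 pA, float3 pB) {
    float dx = pA.x - pB.x, dy = pA.y - pB.y, dz = pA.z - pB.z;
    if (dx*dx + dy*dy + dz*dz > kFuseDist3dMax * kFuseDist3dMax) return false;
    return hamming256(descA, descB) <= kFuseHammingMax;
}
```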

3.4 Local Bundle Adjustment

TurboMap integrates a CUDA-based Schur-complement solver (cf. Gopinath et al., ICRA 2023) to accelerate local bundle adjustment. The classic objective,

\min_{\{T_i\},\{X_j\}} \sum_{i,j} \| x_{ij} - \pi(T_i, X_j) \|^2,

is linearized into the normal equations $J^\top W J \, \Delta = -J^\top W r$, partitioned between poses and points. The Schur complement solves for the pose increments,

(H_{pp} - H_{pl} H_{ll}^{-1} H_{lp}) \, \Delta_p = b_p - H_{pl} H_{ll}^{-1} b_l,

with GPU kernels handling dense-block sparse matrix operations and asynchronous result return.
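
To make the block algebra concrete, here is a CPU-side sketch of the per-landmark Schur elimination using Eigen dense blocks; in TurboMap the same operations run as CUDA kernels backed by cuSPARSE/cuBLAS, and the structure names here are illustrative:

```cuda
// CPU-side sketch (Eigen, C++17 for aligned fixed-size members) of the
// per-landmark Schur elimination; structure names are illustrative.
#include <Eigen/Dense>
#include <vector>

struct LandmarkBlock {
    Eigen::Matrix3d Hll;                          // landmark Hessian block
    Eigen::Vector3d bl;                           // landmark gradient block
    std::vector<int> poses;                       // indices of observing poses
    std::vector<Eigen::Matrix<double, 6, 3>> Hpl; // pose-landmark couplings
};

// Accumulates S = Hpp - Hpl Hll^{-1} Hlp and r = bp - Hpl Hll^{-1} bl,
// landmark by landmark. S must be pre-filled with the Hpp blocks and r
// with bp before calling; pose i occupies rows/cols [6i, 6i+6).
void schurReduce(Eigen::MatrixXd& S, Eigen::VectorXd& r,
                 const std::vector<LandmarkBlock>& landmarks) {
    for (const auto& lm : landmarks) {
        Eigen::Matrix3d HllInv = lm.Hll.inverse();  // 3x3 inverse, cheap per point
        for (size_t a = 0; a < lm.poses.size(); ++a) {
            Eigen::Matrix<double, 6, 3> W = lm.Hpl[a] * HllInv;
            r.segment<6>(6 * lm.poses[a]) -= W * lm.bl;
            for (size_t b = 0; b < lm.poses.size(); ++b)
                S.block<6, 6>(6 * lm.poses[a], 6 * lm.poses[b]) -=
                    W * lm.Hpl[b].transpose();
        }
    }
}
```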

4. CPU-Optimized Keyframe Culling

Traditional keyframe culling incurs significant overhead due to nested iteration over keyframes and observation lists. TurboMap introduces per-map-point counters (indexed by image pyramid level), incremented only on relevant observation or fusion events. A keyframe is marked as redundant if ≥90% of observed points have counters ≥3 at equal or finer scale. This mechanism avoids redundant looping and delivers culling speedups of 1.8×–3.3×, as measured on both desktop and embedded platforms.
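
One plausible realization of the counter test is sketched below; it assumes each counter already aggregates observations at an equal or finer pyramid level, which is an interpretation of the text rather than the paper's exact data structure:

```cuda
// Sketch of the counter-based redundancy test. counts[level] is assumed to
// hold the number of keyframes observing the point at that pyramid level or
// finer, maintained incrementally on observation/fusion events; the >=3 and
// >=90% thresholds follow the text.
#include <cstdint>
#include <vector>

constexpr int   kMinObservations = 3;
constexpr float kRedundancyRatio = 0.9f;

struct MapPointCounters {
    std::vector<uint16_t> counts;   // indexed by image pyramid level
};

// A keyframe is redundant when >=90% of the map points it observes are seen
// by >=3 keyframes at an equal or finer scale. observedLevels[i] is the level
// at which this keyframe observes point i.
bool isKeyframeRedundant(const std::vector<const MapPointCounters*>& points,
                         const std::vector<int>& observedLevels) {
    if (points.empty()) return false;
    int redundant = 0;
    for (size_t i = 0; i < points.size(); ++i)
        if (points[i]->counts[observedLevels[i]] >= kMinObservations)
            ++redundant;
    return redundant >= kRedundancyRatio * points.size();
}
```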

5. Implementation Strategies

Coding and memory management in TurboMap reflect several key design principles:

  • Descriptor comparison uses 1D CUDA grids spanning features and neighbors; block-level shared memory stages descriptors to optimize memory access. Warp-synchronous reductions yield match results.
  • Fusion kernels are structurally analogous but leverage two-level neighbor iteration and support asynchronous map-point merging.
  • Key data, including keypoints, descriptors, and neighbor indices, are preallocated as global GPU memory blocks, keeping data persistent without repeated reallocation.
  • The GPU Schur-complement solver uses cuSPARSE/cuBLAS for matrix factorizations, augmented with custom CUDA routines to perform block-sparse updates.
  • CPU-side modifications to the ORB-SLAM3 pipeline replace the synchronous g2o::BlockSolver call in the LBA loop with a CUDA event-waiting scheme, reducing synchronization delays (see the sketch after this list).
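
The event-waiting scheme might be structured as in the sketch below. All function names are placeholders, and the overlap of CPU-side culling with the GPU solve is an illustrative assumption rather than a detail confirmed by the paper:

```cuda
// Sketch of a CUDA event-waiting scheme replacing a synchronous solver call.
#include <cuda_runtime.h>

__global__ void schurSolveKernel() { /* LBA block operations would run here */ }

void launchSchurSolve(cudaStream_t s) { schurSolveKernel<<<1, 32, 0, s>>>(); }
void cullKeyframes()       { /* CPU-side counter-based culling (Section 4) */ }
void applyPoseIncrements() { /* write optimized poses back into the map */ }

void runLocalBundleAdjustment() {
    cudaStream_t stream;
    cudaEvent_t  done;
    cudaStreamCreate(&stream);
    cudaEventCreate(&done);

    launchSchurSolve(stream);        // kernels are enqueued; host returns at once
    cudaEventRecord(done, stream);   // event fires when the LBA kernels finish

    cullKeyframes();                 // CPU work proceeds while the GPU solves

    cudaEventSynchronize(done);      // block only when the result is required
    applyPoseIncrements();

    cudaEventDestroy(done);
    cudaStreamDestroy(stream);
}
```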

6. Experimental Evaluation

6.1 Setup

Benchmarking was conducted on two hardware configurations: an Intel i7-12700K with NVIDIA RTX 3090 (desktop), and a Jetson Xavier NX (embedded). Datasets included EuRoC and TUM-VI, both providing stereo-inertial data; five runs per sequence were averaged with locked system frequencies.

6.2 Performance Metrics

TurboMap achieves a system-level local mapping speedup of 1.3× (EuRoC) to 1.6× (TUM-VI) on the desktop platform, with comparable ratios on the Jetson. Table 2 summarizes key metrics:

| Dataset | System | Local Mapping Time (ms) | Speedup | Triangulation (ms) | Speedup | Fusion (ms) | Speedup | LBA (ms) | Speedup | KF Culling (ms) | Speedup | ATE (m) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EuRoC (desktop) | ORB-SLAM3 | 65.7 ± 25.4 | – | 1.60 | – | 6.83 | – | 48.5 | – | 4.45 | – | 0.036 |
| EuRoC (desktop) | TurboMap | 51.6 ± 19.1 | 1.3× | 0.73 | 2.2× | 4.60 | 1.5× | 39.1 | 1.2× | 2.42 | 1.8× | 0.034 |
| TUM-VI (desktop) | ORB-SLAM3 | 109.8 ± 28.0 | – | 7.68 | – | 17.9 | – | 67.1 | – | 10.4 | – | 0.012 |
| TUM-VI (desktop) | TurboMap | 69.8 ± 15.3 | 1.6× | 2.36 | 3.3× | 7.63 | 2.3× | 49.1 | 1.4× | 3.16 | 3.3× | 0.013 |

Notably, absolute trajectory error (ATE) differences between TurboMap and ORB-SLAM3 remain within a few millimeters, confirming that optimizations preserve SLAM accuracy. Under deliberate stress scenarios (frequent keyframe insertion, disabled real-time locking), TurboMap sustains LBA and culling on >90% of keyframes, with stable ATE (∼0.04 m), whereas ORB-SLAM3 skips these routines and degrades to >1 m ATE.

7. Limitations and Future Directions

Limiting factors include:

  • Data-transfer latency: Although one-shot keyframe uploads minimize overhead, non-unified memory architectures still incur non-negligible PCIe bottlenecks.
  • GPU memory footprint: Persistent storage of all keyframes can constrain scalability in large-scale mapping; dynamic eviction or compression is a potential enhancement.
  • Expansion scope: Although TurboMap targets local mapping, tracking and loop closing modules may also benefit from similar GPU acceleration with context-aware scheduling. Work on dynamic resource allocation (such as that by Semenova et al., 2024) could further improve the trade-off between runtime and mapping fidelity.

A plausible implication is that broader adoption of persistent GPU storage and kernel-based neighbor search could generalize to other modular SLAM architectures, subject to the constraints of device memory and transfer overhead.


8. Conclusion

TurboMap demonstrates that targeted GPU offloading of correspondence search, fusion, and sparse linear algebra, combined with lightweight CPU optimizations for keyframe culling, delivers substantial and consistent local mapping speedups of 1.3×–1.6× across platforms. These improvements come without loss of trajectory accuracy, even under high-frequency keyframe insertion, supporting the feasibility of real-time and embedded visual SLAM deployments.
