Sparse SLAM Point Cloud in VINS-Mono
- Sparse SLAM point clouds in VINS-Mono are metrically scaled 3D landmarks derived from monocular visual and inertial data, enabling efficient real-time localization and mapping.
- They utilize a sliding-window optimization with feature triangulation and graph-based sparsification to retain 24–45% of landmarks while maintaining or improving trajectory accuracy.
- Post-processing methods, including statistical outlier removal, voxel grid dilation, and deep-learning enhancements like DSMNet, significantly densify maps and reduce mapping errors.
A sparse SLAM (Simultaneous Localization and Mapping) point cloud in the context of VINS-Mono refers to the metrically scaled, typically non-dense 3D landmark set generated during real-time visual-inertial estimation, which tightly fuses monocular camera data with IMU measurements. This representation encodes geometric scene information via a minimized set of triangulated landmarks, sufficient for trajectory optimization and relocalization, but does not directly capture fine-grained scene geometry. The following sections detail the generation, management, sparsification, post-processing, enhancement, and practical deployment of such point clouds in modern visual-inertial SLAM workflows.
1. Sparse Point Cloud Construction in VINS-Mono
VINS-Mono leverages a sliding-window, tightly coupled visual-inertial optimization to produce a sparse 3D point cloud representation of the environment. The process is structured as follows:
- State Parameterization: At each keyframe, the system maintains states for pose, velocity, IMU biases, and camera-IMU extrinsics; tracked feature points are anchored by their first observation and parameterized by inverse depth (Qin et al., 2017, Wu, 2019).
- Feature Initialization/Tracking: Corner features (e.g., FAST) are detected per keyframe and tracked using optical flow (KLT). New features are initialized when active tracks fall below a threshold.
- Triangulation and Optimization: Once a feature has been observed in at least two frames, its inverse depth is triangulated from the current sliding-window keyframe poses. All windowed poses, velocities, biases, extrinsics, and inverse depths are then jointly optimized via nonlinear least squares (MAP estimation) with Huber robustification, fusing IMU preintegration and visual reprojection errors.
- Marginalization: To ensure bounded computational load, old states and landmarks are marginalized using the Schur complement, propagating their effect as a Gaussian prior (Qin et al., 2017, Wu, 2019).
- Loop Closure: Four-DoF pose-graph optimization (over x, y, z, and yaw) aligns and corrects for drift in the global trajectory, rigidly transforming the accumulated sparse map accordingly (Qin et al., 2018).
Resultant point clouds are highly sparse—typically up to 500 landmarks per keyframe, but after marginalization and pruning, only a subset persist long-term. The stored map for relocalization and merging consists mainly of 2D keypoint positions, BRIEF descriptors, and pose-graph constraints; explicit global maps are reconstructed offline if required (Qin et al., 2018, Qin et al., 2017).
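The inverse-depth triangulation step above can be sketched as a minimal two-view linear solve. The full system triangulates over all windowed observations and jointly refines depths in the bundle adjustment; the two-view simplification and all names here are illustrative:

```python
import numpy as np

def triangulate_inverse_depth(bearing_a, bearing_b, T_ab):
    """Recover the inverse depth rho of a landmark from two normalized
    bearing observations. bearing_a/bearing_b are unit-plane coordinates
    [u, v, 1] in the anchor frame a and a later frame b; T_ab = (R_ab, t_ab)
    maps points from frame a into frame b. A minimal linear solve, not the
    full VINS-Mono multi-view triangulation."""
    R_ab, t_ab = T_ab
    # Landmark in frame a: p_a = (1/rho) * bearing_a. Projected into frame b,
    # p_b = d * R_ab @ bearing_a + t_ab (with depth d = 1/rho) must be
    # parallel to bearing_b, i.e. cross(bearing_b, p_b) = 0 -- linear in d.
    A = np.cross(bearing_b, R_ab @ bearing_a)   # coefficient of d
    b = -np.cross(bearing_b, t_ab)
    d = float(A @ b) / float(A @ A)             # least-squares scalar depth
    return 1.0 / d                              # inverse depth rho
```

The anchor-frame inverse-depth parameterization keeps each landmark a single scalar in the optimization, which is why marginalizing or pruning landmarks stays cheap.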
2. Management and Sparsification of Map Points
As visual-inertial SLAM maps grow, the memory and compute required to store and optimize all landmarks scale unfavorably on embedded systems. Dedicated sparsification modules mitigate this:
- Pose-Visibility and Spatial Diversity: The criteria are to retain points visible in many frames (high visibility count) and to maximize spatial coverage in each image.
- Graph-based Minimization: Points and frame pairs are modeled in a directed, layered graph. Min-cost max-flow optimization selects a subset based on capacity (visibility) and cost, which penalizes highly clustered projections and rewards wide-baseline observations. A per-frame-pair selection parameter caps how many points may be retained, while a visibility threshold further prunes under-constraining points (Park et al., 2022).
- Integration: Sparsification is integrated as a back-end step prior to each bundle adjustment, dropping unnecessary points and accelerating optimization. Experiments show that typical configurations reduce map points to 24–45% of the original set with equal or improved trajectory accuracy (as measured by ATE/RMSE) (Park et al., 2022).
A plausible implication is that for resource-constrained deployments (e.g., autonomous drones), aggressive sparsification yields substantial runtime and memory gains with minimal impact on localization accuracy.
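The min-cost max-flow formulation is involved; a simplified greedy stand-in that captures its two criteria, visibility and spatial diversity, might look as follows. All names, the occupancy grid, and the per-cell cap are illustrative assumptions, not the algorithm of Park et al.:

```python
import numpy as np

def sparsify_landmarks(landmarks, observations, keep_ratio=0.35, grid=8):
    """Greedy stand-in for graph-based map sparsification: rank landmarks
    by how many keyframes observe them, then keep the top scorers while
    enforcing spatial diversity via a per-image occupancy grid.
    landmarks: {id: (u, v)} representative projections in [0, 1)^2;
    observations: {id: number_of_observing_keyframes}."""
    budget = max(1, int(keep_ratio * len(landmarks)))
    order = sorted(landmarks, key=lambda i: -observations[i])  # visibility first
    occupancy = np.zeros((grid, grid), dtype=int)
    kept = []
    for lid in order:
        u, v = landmarks[lid]
        cell = (int(v * grid), int(u * grid))
        # Penalize clustered projections: skip cells already well covered.
        if occupancy[cell] >= 2:
            continue
        occupancy[cell] += 1
        kept.append(lid)
        if len(kept) >= budget:
            break
    return kept
```

The graph-based method additionally reasons over frame pairs and baselines, but the same trade-off applies: a smaller, well-spread landmark set constrains the bundle adjustment almost as well as the full set.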
3. Post-processing: Outlier Removal and Upsampling
Sparse VINS or feature-SLAM point clouds are notably affected by reconstruction noise, outliers, and hole artefacts. Post-processing pipelines address these via:
- Outlier Removal:
- Radius-based filtering: deletes points with fewer than a minimum number of neighbors within a fixed search radius.
- Statistical filtering: discards points whose mean distance to their k nearest neighbors exceeds the global mean by more than a chosen standard-deviation multiplier, with the multiplier tuned empirically on indoor/outdoor data (Bokovoy et al., 2018).
- Upsampling (Moving-Least-Squares Class):
- Sample Local Plane: fits local tangent planes and scatters new points within a search radius; output density is controlled by the sampling parameters.
- Random Uniform Density: fills local convex hulls up to a target point density.
- Voxel Grid Dilation: dilates occupied voxels and inserts new points at the resulting voxel centroids; with a well-tuned voxel size and dilation radius, this yields the best single-method performance in reducing deviation from ground truth.
- Pipeline: The recommended chain applies statistical outlier removal followed by voxel grid dilation, producing a roughly tenfold denser map and reducing metric error from 2.81% to 1.75% (Bokovoy et al., 2018).
These methods are implemented directly on VINS-Mono or ORB-SLAM2 outputs and are essential for generating maps usable for path planning, as shown by significant filling of scene gaps and reduction of mapping artefacts.
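The two pipeline stages can be sketched in NumPy. Neighbor search is brute-force and the parameter defaults are illustrative, not the tuned values of Bokovoy et al.:

```python
import numpy as np

def statistical_outlier_removal(pts, k=8, std_mul=1.0):
    """Drop points whose mean distance to their k nearest neighbors exceeds
    the global mean by std_mul standard deviations. Brute-force O(N^2),
    fine for sparse SLAM clouds; real pipelines use a k-d tree."""
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    knn_mean = np.sort(d, axis=1)[:, 1:k + 1].mean(axis=1)  # skip self (dist 0)
    keep = knn_mean <= knn_mean.mean() + std_mul * knn_mean.std()
    return pts[keep]

def voxel_grid_dilation(pts, voxel=0.25):
    """Occupy voxels, dilate each occupied voxel by its 6-neighborhood, and
    emit voxel centroids -- a coarse densification step in the spirit of
    the Bokovoy et al. pipeline (parameters here are illustrative)."""
    occ = {tuple(v) for v in np.floor(pts / voxel).astype(int)}
    shifts = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
              (0, -1, 0), (0, 0, 1), (0, 0, -1), (0, 0, 0)]
    dilated = {(x + dx, y + dy, z + dz)
               for (x, y, z) in occ for dx, dy, dz in shifts}
    return np.array([(np.array(v) + 0.5) * voxel for v in dilated])
```

Running removal before dilation matters: dilating first would grow outlier blobs into solid artefacts that the statistical filter can no longer distinguish from structure.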
| Step | Method/Params | Output Points (110 m path) | Error (%) |
|---|---|---|---|
| Raw VINS-Mono | - | 12,000 | 2.81 |
| Statistical outlier removal | std.-dev. multiplier | 6,000 | 2.01 |
| Voxel dilation | voxel size, dilation radius | 58,000 | 1.75 (pipeline) |
4. Deep-Learning Enhancement of Sparse Point Clouds
Traditional densification and filtering may not capture complex object surfaces or fix structural inconsistencies. Learning-based post-processing, notably DSMNet, advances this:
- DSMNet Architecture:
- Density-aware Point Cloud Registration (PCR): Refines frame-to-frame cloud alignment using attention mechanisms that downweight noisy and uneven regions.
- Geometry-aware Point Cloud Sampling (PCS): Produces a uniform, high-fidelity point cloud, guided by self-attention over multi-scale neighborhoods and optimized with Chamfer distance losses.
- Cyclically alternates registration and densification, informed by adaptive point-wise significance and density maps (Qiu et al., 2023).
- Integration with VINS-Mono: After extracting world-frame point clouds (as described above), DSMNet processes the aggregated point set over several registration–densification cycles, yielding a refined dense cloud ready for mesh reconstruction.
- Empirical Results: DSMNet outperforms non-learning-based upsampling and registration on benchmark datasets, achieving higher F-scores and lower registration errors. On KITTI and HPMB, post-processed sparse SLAM outputs exhibit crisper surface boundaries and improved completeness, e.g., higher Chamfer-based F-scores than MLS and BALM baselines (Qiu et al., 2023).
A plausible implication is that data-driven enhancement layers such as DSMNet, when combined with rigorous outlier removal, significantly elevate the applicability of monocular VINS-Mono sparse maps to high-accuracy modeling use cases.
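The Chamfer distance that the PCS stage optimizes can be written compactly. This brute-force reference assumes small point sets and is not DSMNet's training code:

```python
import numpy as np

def chamfer_distance(P, Q):
    """Symmetric Chamfer distance between point sets P (N,3) and Q (M,3):
    mean nearest-neighbor distance in both directions. This is the metric
    family DSMNet's sampling stage minimizes; large clouds would use a
    k-d tree or GPU kernel instead of a full pairwise matrix."""
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```

Because both directions are averaged, the loss penalizes missing structure (Q far from P) and hallucinated points (P far from Q) symmetrically, which is what drives the upsampled cloud toward uniform coverage.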
5. Workflow Integration and Computational Considerations
Sparse point cloud handling is tightly coupled with the SLAM estimator’s core logic, yet direct access for downstream tasks or post-processing requires careful management:
- Data Serialization: VINS-Mono does not store global explicit 3D point clouds but only pose-graph and per-keyframe features/poses. Therefore, full triangulation across optimized keyframes must be performed offline if an explicit sparse global cloud is required (Qin et al., 2018).
- Pipelining: Outlier removal and upsampling can be implemented in background or as sliding submap routines to respect real-time constraints; parallelization (e.g., k-d trees, multithreading) is necessary for larger sequences.
- Resource Usage: On typical sequences, statistical outlier removal and voxel-based upsampling complete within seconds (Bokovoy et al., 2018); deep-learning modules such as DSMNet process point sets in batches of 4,096 points efficiently on modern GPUs (Qiu et al., 2023).
- Data Formats: Inputs/outputs typically employ PCL PCD files, raw binary float buffers, and ROS camera/OctoMap messages.
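Exporting a reconstructed cloud to PCL's ASCII PCD format needs no PCL dependency; a minimal writer following the PCD v0.7 header layout (field choice illustrative, binary variants preferred for large clouds):

```python
def write_pcd(path, points):
    """Write an iterable of (x, y, z) tuples to an ASCII PCL .pcd file.
    Minimal v0.7 header with float32 x/y/z fields only."""
    points = list(points)
    header = "\n".join([
        "# .PCD v0.7 - Point Cloud Data file format",
        "VERSION 0.7",
        "FIELDS x y z",
        "SIZE 4 4 4",
        "TYPE F F F",
        "COUNT 1 1 1",
        f"WIDTH {len(points)}",
        "HEIGHT 1",                    # unorganized cloud: height 1
        "VIEWPOINT 0 0 0 1 0 0 0",
        f"POINTS {len(points)}",
        "DATA ascii",
    ])
    with open(path, "w") as f:
        f.write(header + "\n")
        for x, y, z in points:
            f.write(f"{x} {y} {z}\n")
```

Files written this way load directly in `pcl_viewer` or Open3D, which is convenient for inspecting an offline-triangulated VINS-Mono map before further processing.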
6. Applications, Limitations, and Current Research
Sparse SLAM point clouds constructed by VINS-Mono and enhanced by post-processing have proven highly valuable in:
- Navigation and Planning: Post-processed, dense voxel maps (e.g., OctoMap) exhibit drastically reduced artefacts, supporting reliable path planning and collision checking (Bokovoy et al., 2018).
- Mapping Accuracy: Aggressive sparsification and upsampling, when tuned appropriately, do not degrade localization accuracy and in some cases modestly improve it, as shown by extensive evaluations on EuRoC, TUM, KITTI, and Málaga datasets (Park et al., 2022, Bokovoy et al., 2018, Qiu et al., 2023).
- Map Reuse and Merging: VINS-Mono supports efficient pose-graph-based relocalization and merging of sequence maps, ensuring a globally consistent sparse representation without explicit point averaging (Qin et al., 2018).
However, VINS-Mono’s sparse cloud lacks surface structure by default and demands significant post-processing for use in semantics-aware or manipulation tasks. Performance for surface modeling depends on point cloud density and noise characteristics, motivating continuing research into learning-based upsampling and adaptive sparsification.
7. Summary Table: Processing Methods and Performance
| Method | Purpose | Notable Results/Advantages |
|---|---|---|
| Statistical Outlier Removal (Bokovoy et al., 2018) | Remove noise, outliers | Reduces error, halves point count |
| Voxel Grid Dilation (Bokovoy et al., 2018) | MLS upsampling | Fills gaps, increases density 10× |
| Graph-Based Sparsification (Park et al., 2022) | Min. map points, max. BA | 24–45% points, ATE unchanged/better |
| DSMNet Deep Post-process. (Qiu et al., 2023) | Registration, surface dens. | Best-in-class completeness, accuracy |
All numerical claims, algorithmic steps, and system behaviors appear explicitly in the sources cited. Together, these components define the state-of-the-art workflow for constructing, refining, and exploiting sparse SLAM point clouds in the VINS-Mono ecosystem.