Emergent Mind


Simultaneous localization and mapping is essential for position tracking and scene understanding. 3D Gaussian-based map representations enable photorealistic reconstruction and real-time rendering of scenes using multiple posed cameras. We show for the first time that using 3D Gaussians for map representation with unposed camera images and inertial measurements can enable accurate SLAM. Our method, MM3DGS, addresses the limitations of prior neural radiance field-based representations by enabling faster rendering, scale awareness, and improved trajectory tracking. Our framework enables keyframe-based mapping and tracking utilizing loss functions that incorporate relative pose transformations from pre-integrated inertial measurements, depth estimates, and measures of photometric rendering quality. We also release a multi-modal dataset, UT-MM, collected from a mobile robot equipped with a camera and an inertial measurement unit. Experimental evaluation on several scenes from the dataset shows that MM3DGS achieves 3x improvement in tracking and 5% improvement in photometric rendering quality compared to the current 3DGS SLAM state-of-the-art, while allowing real-time rendering of a high-resolution dense 3D map. Project Webpage: https://vita-group.github.io/MM3DGS-SLAM


  • Introduces a novel SLAM framework, MM3DGS, that utilizes vision, depth, and inertial inputs to enhance trajectory tracking and map rendering.

  • MM3DGS employs 3D Gaussian splatting for real-time rendering and accurate map representation, improving upon previous sparse point cloud and neural radiance field methods.

  • The system achieves significant improvements by combining photometric loss functions with depth estimates for precise localization and mapping.

  • Tested on the UT-MM dataset, MM3DGS demonstrates superior tracking accuracy and rendering quality, indicating potential across various applications.


Simultaneous Localization and Mapping (SLAM) is a core component of applications ranging from autonomous vehicle navigation to augmented reality. The choice of sensor input and map representation strongly influences a SLAM system's performance. Traditional approaches often rely on sparse visual inputs or on depth data from high-cost sensors such as LiDAR, which can limit their deployment in consumer-oriented applications. The paper introduces a SLAM framework, Multi-modal 3D Gaussian Splatting (MM3DGS), that uses vision, depth, and inertial measurements. Integrating inertial data and depth estimates with a 3D Gaussian map representation improves both trajectory tracking and map rendering.

SLAM Map Representations

Existing SLAM techniques primarily utilize sparse point clouds or neural radiance fields for environmental mapping. While the former excels in tracking precision, the latter provides detailed, photorealistic reconstructions at the cost of computational efficiency. MM3DGS bridges this gap by employing 3D Gaussian splatting for real-time rendering and accurate map representation, overcoming the limitations associated with prior methods. This approach allows for scale-aware mapping, improved trajectory alignment, and efficient rendering without extensive scene-specific training.
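To make the rendering step concrete, here is a minimal sketch of the front-to-back alpha compositing that Gaussian splatting renderers use to blend the Gaussians covering one pixel. It is an illustration rather than the MM3DGS implementation: the function name is hypothetical, and the per-Gaussian alpha values are assumed to be already evaluated at the pixel and sorted from nearest to farthest.

```python
def composite_front_to_back(colors, alphas):
    """Blend per-Gaussian colors along one pixel ray, nearest first.

    colors: list of (r, g, b) tuples, one per Gaussian covering the pixel.
    alphas: list of opacities in [0, 1], assumed already evaluated at the pixel.
    Returns the blended (r, g, b) and the transmittance left after blending.
    """
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed
    for color, alpha in zip(colors, alphas):
        weight = transmittance * alpha
        for channel in range(3):
            pixel[channel] += weight * color[channel]
        transmittance *= 1.0 - alpha
        if transmittance < 1e-4:  # stop once the pixel is effectively opaque
            break
    return tuple(pixel), transmittance


# A half-transparent red Gaussian in front of a half-transparent green one.
print(composite_front_to_back([(1, 0, 0), (0, 1, 0)], [0.5, 0.5]))
```

Because each pixel is a short weighted sum over sorted primitives rather than many network evaluations along a ray, this style of rendering lends itself to real-time use.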

Efficient 3D Representation and Multi-modal SLAM Frameworks

MM3DGS represents the scene volumetrically with explicit 3D Gaussians, which supports faster convergence and detailed scene reconstruction. Incorporating inertial measurements alongside visual and depth data compensates for the limitations of individual sensors, improving robustness and tracking accuracy in dynamic environments.
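As a rough illustration of how inertial measurements can yield a relative pose between two frames, the sketch below accumulates (pre-integrates) IMU samples into a change in heading, velocity, and position. It makes strong simplifying assumptions that are not taken from the paper: motion is planar, and sensor bias, gravity, and noise are ignored. The function name and argument layout are hypothetical.

```python
import math


def preintegrate_planar(samples, dt):
    """Accumulate relative motion between two frames from IMU samples.

    samples: list of (yaw_rate [rad/s], ax [m/s^2], ay [m/s^2]) in the body frame.
    dt: sampling period in seconds.
    Returns (dtheta, dv, dp) expressed in the frame of the first sample.
    Assumptions: planar motion, no bias, no gravity, no noise.
    """
    dtheta = 0.0
    dv = [0.0, 0.0]
    dp = [0.0, 0.0]
    for yaw_rate, ax, ay in samples:
        c, s = math.cos(dtheta), math.sin(dtheta)
        # Rotate the body-frame acceleration into the starting frame.
        accel = (c * ax - s * ay, s * ax + c * ay)
        for i in range(2):
            dp[i] += dv[i] * dt + 0.5 * accel[i] * dt * dt
            dv[i] += accel[i] * dt
        dtheta += yaw_rate * dt
    return dtheta, tuple(dv), tuple(dp)


# One second of constant forward acceleration at 1 m/s^2, sampled at 10 Hz.
dtheta, dv, dp = preintegrate_planar([(0.0, 1.0, 0.0)] * 10, dt=0.1)
print(dtheta, dv, dp)
```

A full visual-inertial system would integrate on SO(3), estimate biases, and account for gravity; the point here is only that the accumulated deltas depend on the IMU samples between the two frames and not on the absolute pose, which is what makes them usable as a relative pose constraint.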


MM3DGS integrates pose optimization, keyframe selection, Gaussian initialization, and mapping into a single framework that works with inputs from readily available, low-cost sensors. It combines photometric loss functions with depth estimates to localize the camera and build a detailed map. For depth supervision, the method uses depth priors to initialize Gaussians and a depth correlation loss to optimize map fidelity.
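The combination of a photometric term and a depth correlation term can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's exact formulation: the correlation loss is written as one minus the Pearson correlation between rendered and estimated depth, the photometric term is a plain L1 difference, and the function names and the `depth_weight` value are hypothetical. Images and depth maps are flattened to lists of floats for simplicity.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0  # correlation is undefined for a constant map; treat as none
    return cov / math.sqrt(var_x * var_y)


def depth_correlation_loss(rendered_depth, estimated_depth):
    """Zero when the two depth maps agree up to a positive scale and an offset."""
    return 1.0 - pearson(rendered_depth, estimated_depth)


def l1_photometric_loss(rendered, observed):
    """Mean absolute intensity difference between rendered and observed pixels."""
    return sum(abs(r - o) for r, o in zip(rendered, observed)) / len(rendered)


def combined_loss(rendered, observed, rendered_depth, estimated_depth,
                  depth_weight=0.05):
    """Weighted sum of both terms; depth_weight is an illustrative value."""
    return (l1_photometric_loss(rendered, observed)
            + depth_weight * depth_correlation_loss(rendered_depth, estimated_depth))


# Estimated depth that is 10x the rendered depth still gives (near) zero loss.
print(depth_correlation_loss([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]))
```

A correlation-based term is a natural fit when the depth estimates come from a monocular network whose scale is unknown, since it penalizes disagreement in relative depth structure rather than in absolute values.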

Experimental Setup and Results

Evaluated on several scenes from the custom UT Multi-modal (UT-MM) dataset, MM3DGS achieves a 3x improvement in tracking accuracy and a 5% improvement in photometric rendering quality over the current 3DGS SLAM state of the art, while rendering a high-resolution dense 3D map in real time. The UT-MM dataset, collected from a mobile robot equipped with a camera and an inertial measurement unit and covering a variety of indoor scenarios, is released as a resource for further research and benchmarking.
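The summary does not say which metrics underlie these figures. As a sketch, assume tracking is scored by the root-mean-square absolute trajectory error (ATE RMSE) and rendering by peak signal-to-noise ratio (PSNR), both common choices in SLAM and view-synthesis evaluation. The function names are hypothetical, and the trajectories are assumed to be already aligned and time-synchronized.

```python
import math


def ate_rmse(estimated, ground_truth):
    """RMSE of position error between two aligned trajectories.

    Both arguments are equal-length lists of (x, y, z) positions.
    """
    squared_errors = [
        sum((e - g) ** 2 for e, g in zip(p_est, p_gt))
        for p_est, p_gt in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two flattened images."""
    mse = sum((r - t) ** 2 for r, t in zip(rendered, reference)) / len(rendered)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# A two-pose trajectory where the second pose is off by 2 m along z.
print(ate_rmse([(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (1, 0, 2)]))
# Two pixels that each differ from the reference by 0.1 intensity.
print(psnr([0.5, 0.5], [0.4, 0.6]))
```

Lower ATE RMSE means better tracking, while higher PSNR means the rendered image is closer to the reference.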

Conclusion and Future Directions

MM3DGS represents a significant stride towards achieving robust, efficient, and scalable SLAM using multi-modal sensor data, supported by a 3D Gaussian-based map representation. The framework's superior performance in both qualitative and quantitative evaluations underscores its potential applicability across diverse domains requiring real-time localization and mapping. Future work may explore tighter integration of inertial measurements, loop closure mechanisms, and extension to outdoor environments to further enhance the system's accuracy and applicability.


