Dropping the D: RGB-D SLAM Without the Depth Sensor (2510.06216v1)
Abstract: We present DropD-SLAM, a real-time monocular SLAM system that achieves RGB-D-level accuracy without relying on depth sensors. The system replaces active depth input with three pretrained vision modules: a monocular metric depth estimator, a learned keypoint detector, and an instance segmentation network. Dynamic objects are suppressed using dilated instance masks, while static keypoints are assigned predicted depth values and backprojected into 3D to form metrically scaled features. These are processed by an unmodified RGB-D SLAM back end for tracking and mapping. On the TUM RGB-D benchmark, DropD-SLAM attains 7.4 cm mean ATE on static sequences and 1.8 cm on dynamic sequences, matching or surpassing state-of-the-art RGB-D methods while operating at 22 FPS on a single GPU. These results suggest that modern pretrained vision models can replace active depth sensors as reliable, real-time sources of metric scale, marking a step toward simpler and more cost-effective SLAM systems.
Explain it Like I'm 14
What this paper is about (in simple terms)
This paper shows how a robot (or phone, drone, etc.) can figure out where it is and build a map of its surroundings using just a regular camera, without needing a special depth sensor. This task is called SLAM, which stands for "Simultaneous Localization and Mapping." The authors' system, called DropD-SLAM, uses smart AI tools to guess distances from a single picture and to ignore moving things, so it can work almost as well as systems that use extra depth hardware.
The main questions the paper asks
- Can a single normal camera, plus modern AI, do SLAM as accurately as systems that use depth sensors?
- How can we make SLAM stable when people or other objects are moving in the scene?
- Which matters more for good results: super-accurate depth in each frame, or depth that stays consistent over time?
How the method works (with easy analogies)
Think of the camera as your eyes walking through a room. Traditional RGB-D systems have "eyes" plus a "ruler" (a depth sensor) to measure how far things are. DropD-SLAM drops the "ruler" and uses three ready-made AI "helpers" to replace it:
- Depth helper: Like a very experienced friend who can look at a single photo and guess how far away things are (predicts metric depth from one image).
- Feature helper: A spotter that picks out strong, repeatable landmarks in the image (keypoints). Think of stickers placed on corners and edges so you can recognize them again later.
- Object helper: A filter that finds moving things like people and cars and marks them so they can be ignored (instance segmentation), a bit like putting "do not use" tape on items that won't stay still.
Here's the basic step-by-step idea:
- The camera takes a picture.
- The object helper labels moving objects (like people). The system slightly grows these labels to be extra safe at the edges.
- The feature helper finds stable landmarks only in the parts of the image that aren't moving.
- The depth helper estimates how far those landmarks are from the camera.
- Using simple camera math, the system turns each 2D landmark into a 3D point with a real-world scale (imagine placing tiny flags in 3D space where each landmark should be).
- These 3D points go into a standard, off-the-shelf SLAM program (the same one used for systems with real depth sensors). That program handles tracking the camera's motion, building the map, and closing loops when you return to a place you've seen before.
Important detail: They don't rewrite the classic SLAM "back end." They just feed it clean, scaled 3D points, so it works as usual, only now it doesn't need a depth sensor.
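To make this concrete, here is a minimal sketch of the per-frame front end: grow the instance masks, keep only keypoints that fall outside them, and backproject each surviving keypoint with its predicted depth using the pinhole model. The depth map, masks, and keypoints are assumed to arrive from the three helpers as plain arrays; the function name, parameter defaults, and clipping thresholds are illustrative choices, not the authors' code.

```python
import cv2
import numpy as np

def build_static_3d_features(keypoints, depth_map, dynamic_mask, K, dilation_px=15):
    """Turn 2D keypoints into metrically scaled 3D points for an RGB-D back end.

    keypoints    : (N, 2) array of (u, v) pixel coordinates from the keypoint detector
    depth_map    : (H, W) predicted metric depth in meters
    dynamic_mask : (H, W) uint8 mask, nonzero on dynamic objects (people, etc.)
    K            : 3x3 pinhole intrinsics matrix
    dilation_px  : safety margin around dynamic objects (illustrative value)
    """
    # Grow the dynamic masks with a circular structuring element so that
    # boundary pixels, where segmentation and depth are least reliable,
    # are excluded as well.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (dilation_px, dilation_px))
    dilated = cv2.dilate(dynamic_mask, kernel)

    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]

    points_3d = []
    for u, v in keypoints.astype(int):
        if dilated[v, u]:            # keypoint lies on or near a moving object
            continue
        d = float(depth_map[v, u])
        if d < 0.1 or d > 10.0:      # illustrative range-clipping thresholds
            continue
        # Inverse pinhole projection: X = d * K^-1 [u, v, 1]^T
        points_3d.append([(u - cx) * d / fx, (v - cy) * d / fy, d])
    return np.array(points_3d)
```

The resulting 3D points, paired with their descriptors, are exactly what an RGB-D back end such as ORB-SLAM3 expects, which is why no back-end changes are needed.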
What they found and why it matters
- Strong accuracy without a depth sensor:
- On a common test set (TUM RGB-D), DropD-SLAM achieved around 7.4 cm average error on static scenes and about 1.8 cm on dynamic scenes with lots of motion. That's roughly the size of a small toy car or a big coin: very precise for just a single camera.
- It runs in real time at about 22 frames per second on one GPU.
- Ignoring moving things is crucial:
- If they don't filter out people and other movers, the system quickly gets confused and makes large errors. With filtering, it stays stable.
- Consistency beats single-frame perfection:
- Depth estimates that are steady over time help more than depth that is super-accurate in just one frame but jumps around from frame to frame. In other words, a "pretty good but steady" ruler is better than a "perfect but twitchy" one (see the small sketch after this list).
- Learned features help in tough conditions:
- The special keypoint detector (the feature helper) finds more reliable landmarks when there's motion blur or not much texture, making tracking more robust.
- Simple and compatible:
- Because they keep the classic SLAM back end unchanged, the system is easy to plug into existing robotics software. Also, as depth and object-detection AIs improve, you can swap them in to get better results without redesigning everything.
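The "steady beats twitchy" point can be made concrete with the paper's coefficient-of-variation idea: summarize each frame's metric scale with a single number and check how much it fluctuates across the sequence. The sketch below is illustrative; summarizing a frame's scale as the median ratio of ground-truth to predicted depth is an assumption made here for clarity, not necessarily the paper's exact protocol.

```python
import numpy as np

def per_frame_scale(pred_depth, gt_depth):
    """Median ratio of ground-truth to predicted depth over valid pixels
    (one simple per-frame summary of metric scale; illustrative choice)."""
    valid = (gt_depth > 0) & (pred_depth > 0)
    return np.median(gt_depth[valid] / pred_depth[valid])

def coefficient_of_variation(values):
    """CV = standard deviation / mean; lower means steadier over time."""
    values = np.asarray(values, dtype=float)
    return values.std() / values.mean()

# Usage: scales = [per_frame_scale(pred[t], gt[t]) for t in range(num_frames)]
#        cv = coefficient_of_variation(scales)
# A predictor that is slightly less accurate per frame but has a lower CV
# tends to yield lower trajectory error than an accurate-but-jittery one.
```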
Why this is important
- Fewer sensors, lower cost: You can get high-quality 3D tracking and mapping using only a regular camera, saving money, power, and complexity compared to systems that need depth sensors or LiDAR.
- Works well around people: By automatically ignoring moving objects, the system stays reliable in real-world places like homes, offices, or shops.
- Future-proof: As AI tools for depth and object detection get better, this approach will likely get even more accurateāwithout changing the SLAM core.
Limitations and whatās next
- Tricky scenes: Shiny, transparent, or outdoor scenes with unusual lighting can still confuse the AI depth and object detectors.
- Very crowded spaces: If almost everything is moving or there's not enough static background, tracking can still be hard.
- No deep "fixing" of depth: The system treats the AI-predicted depths mostly as they are, so any bias in those predictions can affect the map.
In the future, better AI models and simple ways to handle uncertainty could make this even more reliable outdoors and in complex, reflective environments.
Bottom line
DropD-SLAM shows that smart, pretrained AI tools can replace a depth sensor for SLAM. With just one camera, it achieves accuracy close to systems that use extra hardware, handles moving objects well, and runs in real time. This points toward simpler, cheaper, and more flexible robots, AR devices, and drones that still understand 3D space very well.
Knowledge Gaps, Limitations, and Open Questions
Below is a single, consolidated list of concrete gaps and open questions that remain unresolved and could guide future research:
- Generalization beyond indoor TUM RGB-D: no evaluation on outdoor, large-scale, or diverse indoor/outdoor benchmarks (e.g., KITTI/Raw+Odometry, EuRoC, 7-Scenes, ScanNet, Replica), where illumination, texture sparsity, and depth range differ significantly.
- Robustness to domain shift: depth and segmentation models' reliability under strong sunlight, HDR scenes, low light, weather, reflective/transparent materials, and novel object categories remains unquantified and unmanaged.
- Predominantly dynamic scenes: the method lacks a strategy for cases with few or no static anchors (crowds, moving platforms), where semantic filtering may remove most features and cause tracking loss.
- Depth uncertainty modeling: predicted depths are treated as fixed with a uniform variance σ_d²; there is no per-pixel, depth-dependent, or heteroscedastic uncertainty calibration, nor a way to propagate uncertainty into BA/PnP robustly (a toy depth-dependent alternative is sketched after this list).
- Systematic bias correction: no online mechanism to estimate and correct sequence-specific scale/shape biases in the depth predictions (e.g., a global or local scale correction factor refined by multi-view constraints).
- Temporal consistency enforcement: although shown to be critical, the pipeline does not explicitly regularize or smooth depth temporally (e.g., via video depth models, optical-flow-guided filtering, or temporal consistency losses).
- Hyperparameter sensitivity: range clipping thresholds, mask dilation radius, number of features, and confidence cutoffs require per-model tuning; no automatic or adaptive selection methods are provided.
- Segmentation reliability and failure handling: the approach assumes accurate semantic masks; it does not fuse motion cues (flow/epipolar violations) to catch missed movers or reduce over-masking of static structures.
- Segmentation model choice: only YOLOv11 is explored; comparative evaluation across instance/semantic segmentation models, mask quality metrics, and real-time trade-offs is missing.
- Rolling shutter and calibration effects: robustness to rolling-shutter distortions, inaccurate intrinsics/distortion, and auto-exposure/white-balance changes is not studied.
- IMU and sensor fusion: ORB-SLAM3 supports inertial (IMU) input, but this system omits it; it is unclear how lightweight IMU fusion could improve robustness during rapid motion, low texture, or depth model failures.
- Loop closure and place recognition: reliance on ORB+BoW is maintained; the impact of learned global descriptors or hybrid retrieval on long-term, large-loop scenarios and viewpoint/illumination changes is unexplored.
- Map quality evaluation: the work focuses on trajectory ATE; there is no assessment of map accuracy, completeness, consistency over time, or static/dynamic separation quality in the reconstructed 3D structure.
- Dense reconstruction: the method generates sparse maps; it remains open how to recover dense geometry (e.g., via learned densification or TSDF/fusion with predicted depth) while keeping real-time performance.
- Multi-body/dynamic-object modeling: dynamic content is only suppressed; there is no tracking or mapping of moving objects, nor evaluation of benefits from joint static-dynamic SLAM or scene-flow integration.
- Back-end compatibility vs. optimality: while the back end is unmodified, it is unclear whether SLAM optimization tailored to noisy, biased monocular depths (e.g., robust priors, bias terms, depth regularizers) would yield further gains.
- Real-world deployment constraints: results depend on an RTX 4090; power, latency, and thermals on embedded platforms (Jetson, XR2) and CPU-only feasibility are not quantified.
- Latency and synchronization: the parallel GPU front end may introduce variable latency; the impact of asynchronous module outputs on tracking stability at higher frame rates is unreported.
- Failure mode analysis: a systematic taxonomy of failures (e.g., segmentation misses, depth collapse, motion blur spikes) and recovery strategies (relocalization, dynamic reweighting, fallback heuristics) is absent.
- Feature-detector/descriptor pairing: Key.Net is paired with ORB descriptors; the potential benefits and cost of modern learned descriptors and matchers (e.g., SuperPoint/SuperGlue, DISK/LightGlue) are not evaluated.
- Feature budget scaling: the optimal number of features varies across depth models and scenes; a principled, online mechanism to adapt feature count based on scene observability and depth stability is missing.
- Depth scale across cameras: while some depth models encode intrinsics, the stability of metric scale across different lenses, focal lengths, and camerasāand the need for per-camera calibrationāremain open.
- Mask dilation tuning: dilation mitigates boundary errors but may remove valuable static features; there is no analysis of dilation-size trade-offs or adaptive per-instance/per-frame dilation.
- Long-term consistency: performance over very long trajectories with repeated revisits, map growth/maintenance, and multi-session mapping (including map reuse and drift over hours/days) is not addressed.
- Uncertainty-aware outlier rejection: PnP/RANSAC and BA do not leverage depth confidences from the predictor (if available) or learned confidence estimation; depth-aware data association and weighting remain unexplored.
- Resolution and input scaling: the effect of input image/depth/segmentation resolution on accuracy, runtime, and memory (and potential dynamic resolution scheduling) is not studied.
- OOD detection and safeguards: there is no mechanism to detect when depth or segmentation becomes unreliable and to trigger safe fallbacks (e.g., switch to pure monocular VO, increase robustness thresholds).
- Non-rigid backgrounds: semantic filtering may not catch non-rigid but "static-labeled" content (curtains, plants); combining semantic, geometric, and temporal deformation cues is an open direction.
- Comparative cost-benefit: the claim that learned models can replace depth sensors lacks a quantified total-cost-of-ownership analysis (power, compute, model size) across platforms and conditions where active depth remains superior (e.g., low-light, textureless scenes).
- Benchmark breadth: results on more dynamic datasets with varied motion (e.g., handheld with sharp rotations, fast platforms) and different intrinsics/FOV would strengthen claims of generality.
- Online self-calibration: automatic estimation of depth model scale/offset per sequence or per keyframe using multi-view constraints, without ground truth, is not attempted.
- Per-scene adaptive priors: the pipeline is zero-shot; opportunities for lightweight test-time adaptation (e.g., depth bias correction, segmentation thresholding) to stabilize hard scenes are not explored.
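As one illustration of the uncertainty-modeling gap above, the sketch below replaces the uniform depth variance with a hypothetical heteroscedastic model in which noise grows with distance, so far-away depth priors are down-weighted in PnP and bundle adjustment. The functional form and constants are assumptions for illustration, not something proposed or validated in the paper.

```python
def depth_variance(d, sigma0=0.05, slope=0.02):
    """Hypothetical depth-dependent noise model: sigma(d) = sigma0 + slope * d.

    Constants are illustrative, not calibrated values; a real system would fit
    them (or a per-pixel confidence) against held-out depth errors.
    """
    sigma = sigma0 + slope * d
    return sigma ** 2

def residual_weight(d):
    # Information (inverse-variance) weight for a depth residual at distance d,
    # replacing the single uniform sigma_d shared by every point.
    return 1.0 / depth_variance(d)
```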
Practical Applications
Immediate Applications
The following applications can leverage the paper's findings and modular pipeline today, with minimal adaptation. They map to multiple sectors and specify likely tools, products, or workflows, along with feasibility assumptions.
- Monocular SLAM retrofit for indoor mobile robots
- Sector: robotics, logistics, manufacturing
- What: Replace RGB-D or LiDAR with a single RGB camera using a DropD-style front end (monocular metric depth + learned keypoints + instance segmentation) feeding an unmodified RGB-D SLAM back end (e.g., ORB-SLAM3).
- Tools/products/workflows: ROS package integrating DepthAnythingV2/UniDepthV2 + YOLOv11 + Key.Net; deployment scripts for camera intrinsics; mask dilation configuration for people/vehicles.
- Assumptions/dependencies: Indoor lighting with enough static structure; GPU or strong edge compute; reliable camera intrinsics; segmentation classes cover common dynamic objects (e.g., people); real-time performance meets use-case latency requirements.
- Cost-down redesign of warehouse AGVs and service robots
- Sector: robotics, supply chain, retail, hospitality, healthcare logistics
- What: Remove depth hardware to cut bill of materials, power, and complexity while maintaining metric-scale localization and robustness to dynamic scenes (people, carts).
- Tools/products/workflows: DropD-SLAM SDK integrated into robot navigation stacks; tuning of depth clipping thresholds and mask dilation; confidence weighting of depth priors.
- Assumptions/dependencies: Sufficient compute per robot (e.g., embedded GPU); camera placement avoiding glare/reflections; periodic validation against fiducials for scale sanity checks.
- Smartphone-based room scanning and floor planning
- Sector: software, real estate, interior design, insurance
- What: Mobile app to capture metrically scaled 3D room models and trajectories using only RGB video, suppressing moving people via instance masks.
- Tools/products/workflows: iOS/Android app with on-device or edge inference; export to CAD/IFC/GLTF; workflow for insurance claims, property measurement, and virtual staging.
- Assumptions/dependencies: Accurate intrinsics (from EXIF/device profiles); enough static geometry; privacy-aware masking; potentially cloud inference to meet 20+ FPS where mobile SoCs are insufficient.
- AR anchoring and spatial computing without depth sensors
- Sector: AR/VR/MR, education, retail
- What: Use DropD-style features to stabilize 6-DoF tracking and persistent anchors in dynamic indoor spaces (stores, classrooms) with a single RGB camera.
- Tools/products/workflows: AR SDK plugin enabling metric scale and dynamic-object filtering; workflows for in-store navigation, interactive exhibits, and classroom spatial content.
- Assumptions/dependencies: On-device inference speed; good illumination; suppression of moving crowds via masks; device-specific calibration for consistent scale.
- Indoor drone mapping for inventory and inspection
- Sector: robotics, industrial inspection, retail
- What: Small drones perform metric-scale SLAM and mapping with only RGB; dynamic masking reduces pose corruption from moving staff.
- Tools/products/workflows: Flight control integration; DropD-SLAM ROS node; preflight auto-tuning of depth clips; loop-closure workflows for map consolidation.
- Assumptions/dependencies: Stabilized flight; adequate texture and lighting; compute budget on edge; mask coverage of dynamic classes.
- Digital twin creation from monocular video
- Sector: construction tech, facility management, smart buildings
- What: Generate metrically scaled indoor maps and point clouds from handheld or robot-captured RGB video, avoiding RGB-D sensor limitations on reflective/translucent surfaces.
- Tools/products/workflows: Capture app + mapping pipeline; export to BIM tools; periodic loop-closure runs; QA using known dimensions.
- Assumptions/dependencies: Consistent intrinsics; occasional manual anchors for global scale sanity; indoor domain; compute/latency constraints met by desktop GPU or cloud.
- Academic tooling: benchmarking and model selection for SLAM stability
- Sector: academia, research
- What: Use the pipeline to evaluate how temporal consistency of depth (vs. per-frame accuracy) affects SLAM error; drive model choice and parameter tuning (e.g., depth clipping, feature budgets).
- Tools/products/workflows: Open-source evaluation suite; metrics for CV of scale/accuracy over time; ablation scripts; reproducible configs for TUM RGB-D and similar datasets.
- Assumptions/dependencies: Access to multiple depth models; standardized datasets; reproducible compute environment.
- Drop-in front end for existing RGB-D back ends
- Sector: software, robotics platforms
- What: Provide a "virtual depth" front end that yields metrically scaled 3D features into unmodified RGB-D SLAM systems, preserving compatibility with existing mapping/tracking stacks.
- Tools/products/workflows: Library/ROS node that outputs PnP-ready 3D points + descriptors; configurable uncertainty (variance) on depth priors; diagnostics for mask coverage and feature distribution.
- Assumptions/dependencies: Integration with ORB-SLAM3 or similar; tuning of depth prior variance; mask dilation and feature budgets tuned for scene dynamics.
- Safer robot-human cohabitation via dynamic-object suppression
- Sector: robotics, workplace safety
- What: Use instance segmentation with dilation to filter out human-related features, reducing tracking failures and erratic robot behavior in crowded environments.
- Tools/products/workflows: Standardized "dynamic class list" policies (people, carts, forklifts); runtime monitoring of dynamic feature ratios; safety audits combining localization logs with segmentation outputs.
- Assumptions/dependencies: Segmentation model accuracy on in-domain classes; compliance with privacy policies; consistent illumination; fallback behaviors if static scene is insufficient.
Long-Term Applications
The following applications are promising but need further research, scaling, or hardening (e.g., uncertainty-aware optimization, outdoor generalization, certification). They identify sectors, possible products/workflows, and key feasibility factors.
- Outdoor monocular SLAM across adverse conditions
- Sector: robotics, autonomous vehicles, smart city
- What: Robust single-camera SLAM in challenging outdoor scenes (variable lighting, weather, reflections), replacing or complementing LiDAR/RGB-D.
- Tools/products/workflows: Outdoor-trained depth and segmentation models; domain adaptation; active uncertainty modeling; sensor fusion fallback.
- Assumptions/dependencies: Improved generalization and temporal stability outdoors; explicit uncertainty in optimization; regulatory acceptance for navigation.
- Safety-critical AR and surgical/navigation support
- Sector: healthcare, industrial AR
- What: Metric-scale monocular SLAM for surgical guidance or factory precision tasks where errors are unacceptable.
- Tools/products/workflows: Certified pipeline with uncertainty bounds; runtime health monitoring of scale drift; redundant sensing (IMU, fiducials).
- Assumptions/dependencies: Formal verification and certification; enhanced uncertainty-aware back ends; extensive clinical/industrial validation.
- Collaborative, crowdsourced indoor mapping at city scale
- Sector: smart buildings, public safety, urban planning
- What: Aggregate monocular maps from many devices to maintain up-to-date digital twins of public buildings (malls, transit hubs).
- Tools/products/workflows: Federated mapping platform; privacy-preserving instance masking; map merging with loop closures; governance for data ownership.
- Assumptions/dependencies: Privacy/legal frameworks; scalable map reconciliation; robust cross-device intrinsics handling; standardized data formats.
- Low-power edge deployment on consumer devices and micro-robots
- Sector: consumer electronics, robotics, wearables
- What: Run DropD-like pipelines on phones, AR glasses, micro-robots with tight energy budgets.
- Tools/products/workflows: Model compression/distillation; hardware accelerators (NPU/DSP); adaptive frame-rate and feature budgets; on-device auto-tuning of depth clips.
- Assumptions/dependencies: Efficient models with strong temporal stability; hardware acceleration; acceptable latency and battery impact.
- Joint optimization and uncertainty-aware SLAM back ends
- Sector: robotics/software
- What: Replace fixed depth priors with uncertainty-aware, jointly optimized depth/pose frameworks to correct systematic biases and improve global consistency.
- Tools/products/workflows: Dense or semi-dense bundle adjustment with depth uncertainty; scale regularization; online depth refinement with temporal priors.
- Assumptions/dependencies: Robust optimization that preserves real-time performance; reliable uncertainty estimation from depth networks; stable feature tracking under refinement.
- Multi-agent monocular SLAM and map sharing
- Sector: robotics, logistics, emergency response
- What: Teams of robots or wearables build and share metrically consistent maps without depth sensors, operating in dynamic spaces.
- Tools/products/workflows: Distributed pose graph optimization; map versioning and conflict resolution; semantic-aware dynamic suppression across agents.
- Assumptions/dependencies: Reliable inter-agent communication; standardized descriptors and map formats; dynamic-class harmonization.
- Regulatory and procurement policies for cost- and energy-efficient spatial perception
- Sector: policy, public sector, enterprise IT procurement
- What: Guidelines that recognize monocular AI-based SLAM as a viable alternative to depth sensors for many indoor tasks, reducing cost and power consumption.
- Tools/products/workflows: Evaluation standards emphasizing temporal consistency; recommended dynamic-class lists; compliance checklists.
- Assumptions/dependencies: Broad validation across domains; clear safety envelopes; transparency around data processing (segmentation, depth inference).
- Smart home and consumer robotics: next-gen mapping
- Sector: consumer robotics, smart home
- What: Robot vacuums, assistants, and AR experiences using robust monocular mapping with dynamic masking (ignoring pets/people).
- Tools/products/workflows: Embedded pipeline optimized for home lighting; user-friendly calibration; periodic map maintenance with loop closure.
- Assumptions/dependencies: High segmentation accuracy for household objects; reliable performance on low-cost hardware; resilience to mirrors and glass.
- Fallback localization for autonomous vehicles and heavy machinery
- Sector: transportation, construction
- What: Monocular SLAM as a fallback when LiDAR/Depth sensor performance degrades (e.g., rain, dust, glare).
- Tools/products/workflows: Sensor fusion modes that elevate monocular SLAM under failure conditions; health monitoring; conservative motion planning when operating on monocular input.
- Assumptions/dependencies: Outdoor robustness; tight integration with IMU, wheel odometry; certification for fallback behavior.
- Construction progress tracking and compliance at scale
- Sector: construction tech
- What: Routine monocular captures to produce metrically scaled site updates and detect deviations against BIM, without special sensors.
- Tools/products/workflows: Scheduled capture workflows; automated alignment to BIM; change detection pipelines; QA dashboards.
- Assumptions/dependencies: Handling of large, partially reflective spaces; improved segmentation of construction equipment; domain-adapted depth models.
Glossary
- Ablation studies: Systematic experiments removing or altering components to assess their impact. "Through ablation studies we identify temporal depth consistency rather than per-frame accuracy as the dominant factor for monocular SLAM performance,"
- Absolute Trajectory Error (ATE): A metric that measures the global pose error between an estimated and ground-truth trajectory. "7.4\,cm mean ATE on static sequences"
- AbsRel (Absolute Relative error): A depth evaluation metric measuring relative error between predicted and ground-truth depths. "AbsRel CV preserves relative depth ratios."
- Active sensing modalities: Sensors that emit energy and measure its return (e.g., RGB-D, LiDAR) to obtain depth directly. "Active sensing modalities such as RGB-D cameras and LiDAR provide metrically scaled depth and improved robustness to scene dynamics."
- Back end (SLAM back end): The optimization component of SLAM responsible for tracking, mapping, and loop closure. "These are processed by an unmodified RGB-D SLAM back end for tracking and mapping."
- Backprojection: Converting 2D pixels with depth into 3D points using camera intrinsics. "backprojected into 3D to form metrically scaled features."
- Bag-of-words retrieval: A place-recognition technique that uses quantized visual features to find loop closures. "Loop closures are identified via bag-of-words retrieval over ORB descriptors"
- Bundle adjustment: Joint optimization of camera poses and 3D points to minimize reprojection error. "global pose graph optimization and bundle adjustment to refine the trajectory and structure."
- Camera intrinsics matrix: Parameters describing the camera's internal geometry (focal lengths, principal point). "given the camera intrinsics matrix K,"
- Coefficient of variation (CV): A normalized measure of dispersion (standard deviation divided by mean) used here to assess temporal stability. "their coefficient of variation (CV) across frames quantifies stability:"
- CUDA streams: Parallel execution contexts on NVIDIA GPUs enabling concurrent kernel launches. "Vision modules execute in parallel on CUDA streams, while the SLAM back end runs on the CPU."
- Direct methods: SLAM approaches that optimize photometric consistency over image intensities rather than sparse features. "Direct and semi-direct methods such as LSD-SLAM~\cite{engel2014lsd}, DSO~\cite{engel2017direct}, and SVO~\cite{forster2014svo} minimize photometric error over pixel intensities,"
- Dynamic object filtering: Removing features on moving objects to uphold static-scene assumptions in SLAM. "We introduce a dynamic object filtering strategy based on instance-level segmentation with morphological dilation"
- End-to-end optimization: Training or optimizing the entire SLAM pipeline jointly with learned components. "Unlike prior learned SLAM approaches that rely on end-to-end optimization, scene-specific adaptation, or custom back ends,"
- Feature-based pipelines: SLAM methods that detect and match sparse keypoints across frames. "Feature-based pipelines, exemplified by the ORB-SLAM series~\cite{mur2015orb, mur2017orb, campos2021orb}, rely on sparse keypoint detection and matching."
- Feature budget: The chosen number of features per frame used for tracking and mapping. "The feature budget controls the number of keypoints retained per frame."
- Gaussian splatting: A scene representation using 3D Gaussian primitives for rendering/optimization. "representations like Gaussian splatting~\cite{matsuki2024gaussian, sandstrom2025splat}"
- Geometric back end: A SLAM back end that relies on geometric optimization rather than learned components. "with an unmodified geometric back end."
- Instance segmentation: Pixel-level detection of individual object instances with class labels. "instance segmentation networks such as YOLOv11~\cite{khanam2024yolov11} provide efficient localization of dynamic objects."
- Intrinsics encoding: Incorporating camera intrinsic parameters into model inputs or architectures. "by leveraging large-scale training and explicit intrinsics encoding."
- Landmark initialization: The process of creating new 3D map points from image measurements. "monocular pipelines suffer from scale drift and unstable landmark initialization,"
- Learned depth priors: Depth cues provided by trained models that inform geometric estimation. "where learned depth priors increasingly rival dedicated hardware sensors,"
- Learned keypoints: Neural networkādetected feature points optimized for robustness and repeatability. "we show that learned keypoints improve robustness under motion blur and texture-poor conditions."
- LiDAR: A laser-based active ranging sensor producing depth measurements. "Active sensing modalities such as RGB-D cameras and LiDAR provide metrically scaled depth"
- Loop closure: Detecting revisited places to correct accumulated drift in the map and trajectory. "for tracking, mapping, and loop closure in real time."
- MAE (Mean Absolute Error): A depth evaluation metric averaging absolute differences between predicted and true depths. "MAE CV stabilizes residuals for outlier rejection,"
- Map points: 3D landmarks maintained by the SLAM map for localization and structure. "Mapping proceeds by instantiating new 3D map points when parallax and visibility conditions are met."
- Metric scale: Absolute scaling of the reconstruction so distances correspond to real-world units. "which allows metric scale without the need for depth sensors."
- Monocular depth estimation: Predicting scene depth from a single RGB image. "a monocular depth estimator such as DepthAnythingV2~\cite{yang2024depth} produces a dense depth map"
- Morphological dilation: Expanding binary masks to cover uncertain boundaries and small gaps. "instance-level segmentation with morphological dilation"
- Multi-view constraints: Geometric constraints arising from observing the same 3D points across multiple views. "refine structure via multi-view constraints, without requiring any modification to the back end."
- Oracle scaling: Using ground-truth information to rescale predictions for best-case performance analysis. "per-frame oracle scaling with ground-truth depth,"
- ORB descriptor: A binary feature descriptor used for efficient matching in SLAM. "a 256-bit ORB descriptor~\cite{rublee2011orb},"
- Parallax: Apparent motion of scene points due to camera movement, enabling triangulation and mapping. "when parallax and visibility conditions are met."
- Photometric error: The intensity difference used as an optimization objective in direct SLAM methods. "minimize photometric error over pixel intensities,"
- Pinhole model: A camera projection model that maps 3D points to 2D pixels using perspective geometry. "using the pinhole model:"
- Perspective-n-Point (PnP): Estimating camera pose from 2Dā3D correspondences. "Camera tracking is formulated as a Perspective-n-Point (PnP) problem within a RANSAC loop,"
- Pose graph optimization: Global optimization over camera poses connected by constraints (edges) to reduce drift. "global pose graph optimization and bundle adjustment"
- Pretrained vision models: Models trained on large datasets and used without task-specific fine-tuning. "These results suggest that modern pretrained vision models can replace active depth sensors"
- RANSAC: A robust estimation method that iteratively fits models while rejecting outliers. "within a RANSAC loop,"
- RMSE (Root-Mean-Square Error): A metric measuring the square root of mean squared errors, used for depth and trajectory evaluation. "root-mean-square error (RMSE) of Absolute Trajectory Error (ATE) in meters"
- RGB-D: RGB images paired with per-pixel depth measurements. "These are passed to an unmodified RGB-D SLAM back end"
- Scale ambiguity: The inability of monocular systems to determine absolute scale without additional cues. "Monocular SLAM remains attractive ... yet it continues to face two persistent limitations: scale ambiguity and sensitivity to dynamic environments."
- Scale drift: Gradual change in scale over time in monocular SLAM due to lacking metric constraints. "monocular pipelines suffer from scale drift and unstable landmark initialization,"
- SE(3): The Lie group of 3D rigid-body transformations (rotation and translation). "allow direct SE(3) pose estimation without additional calibration,"
- Semantic filtering: Using semantic labels (e.g., object classes) to exclude unreliable regions. "confirming that semantic filtering is indispensable in dynamic environments."
- Semi-direct methods: SLAM approaches blending direct photometric tracking with sparse feature selection. "Direct and semi-direct methods such as LSD-SLAM~\cite{engel2014lsd}, DSO~\cite{engel2017direct}, and SVO~\cite{forster2014svo}"
- Static-world assumption: The modeling assumption that the scene is static, enabling consistent feature matching. "dynamic objects violate the static-world assumption underlying most SLAM formulations,"
- Structuring element: The morphological kernel used to dilate or erode masks. "a circular structuring element"
- Structured-light depth sensor: A depth sensor projecting patterns to infer depth, often noisy on reflective/textureless surfaces. "the structured-light depth sensor used in TUM"
- Temporal consistency: Stability of predictions over time, crucial to reduce drift in sequential estimation. "Temporal consistency analysis of monocular depth estimators on fr3_walking_xyz."
- Unary residual: A single-variable error term in optimization, here encoding depth priors on points. "Each depth introduces a unary residual with variance σ_d² in bundle adjustment,"
- Uncertainty modeling: Representing and propagating the confidence of measurements within optimization. "Depth is currently treated as a fixed observation without explicit uncertainty modeling,"
- Zero-shot: Deploying a model/system in new environments without additional training or fine-tuning. "enabling zero-shot deployment across diverse environments."