DSO: Direct Sparse Odometry Overview

Updated 12 May 2026

Direct Sparse Odometry (DSO) is a visual odometry framework that directly minimizes photometric error over sparse, high-gradient image patches for accurate pose and depth estimation.
It employs a fixed-lag sliding window and joint optimization of camera poses, inverse depths, and photometric parameters to ensure temporal consistency.
Efficient marginalization, robust calibration, and gradient-aware pixel sampling enable DSO to achieve real-time performance on standard hardware.

Direct Sparse Odometry (DSO) is a direct, keyframe-based visual odometry framework that formulates pose and 3D structure estimation as a nonlinear, windowed photometric bundle adjustment problem over a sparse set of image points with significant intensity gradient. Unlike indirect methods reliant on feature extraction and matching, DSO jointly optimizes camera intrinsics, affine brightness transfer, camera poses, and per-point inverse depths by directly minimizing the photometric error measured over small image patches. Through robust photometric calibration, gradient-aware pixel sampling, a fixed-lag sliding window, and efficient marginalization, DSO achieves high accuracy and real-time performance on contemporary multi-core CPUs, and forms the substrate of a large ecosystem of direct, sparse visual SLAM, visual-inertial odometry, and related systems.

1. Formulation and Core Pipeline

The core DSO objective is a robust, gradient-weighted photometric consistency cost evaluated over a sliding window of $N_f$ active keyframes and $N_p$ sparse points per frame. For a pixel $p$ in reference frame $i$ with inverse depth $d_p$, the photometric error on a target frame $j$ is

$E_{pj} = \sum_{q \in \mathcal{N}_p} w_q \left\|\, \bigl(I_j[q'] - b_j\bigr) - \frac{t_j e^{a_j}}{t_i e^{a_i}}(I_i[q] - b_i) \right\|_\gamma,$

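where $q$ runs over a small stencil $\mathcal{N}_p$ around $p$, $q'$ is the projection of $q$ into frame $j$ under the inverse depth $d_p$, $I_i, I_j$ are photometrically corrected intensities, $t_i, t_j$ are exposure times, $a_i, b_i, a_j, b_j$ are affine brightness parameters, $w_q$ is a gradient-dependent weight that down-weights high-gradient pixels, and $\|\cdot\|_\gamma$ is the Huber norm. The total windowed cost is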

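$E_{\text{photo}} = \sum_{i \in \mathcal{F}} \sum_{p \in \mathcal{P}_i} \sum_{j \in \mathrm{obs}(p)} E_{pj},$

where $i$ runs over the active keyframes $\mathcal{F}$, $p$ over the points $\mathcal{P}_i$ hosted in frame $i$, and $j$ over the frames $\mathrm{obs}(p)$ in which $p$ is observed. The oldest keyframes and points are marginalized automatically to keep the system bounded in complexity. Joint Gauss–Newton optimization solves for all active poses (on $\mathrm{SE}(3)$), inverse depths, affine exposure parameters, and optionally intrinsics. The pixel sampling scheme selects high-gradient pixels in a spatially uniform manner, which avoids restricting the system to corners and keeps tracking robust even in low-texture scenes (Engel et al., 2016).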
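The per-point energy can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: it assumes the target intensities have already been sampled at the reprojected stencil positions, the function names `huber` and `patch_energy` are hypothetical, and the Huber threshold `sigma=9.0` is an assumed default.

```python
import numpy as np

def huber(r, sigma):
    """Huber cost: quadratic for |r| <= sigma, linear beyond it."""
    a = np.abs(r)
    return np.where(a <= sigma, 0.5 * r**2, sigma * (a - 0.5 * sigma))

def patch_energy(I_i, I_j_warped, w, t_i, t_j, a_i, a_j, b_i, b_j, sigma=9.0):
    """Photometric energy of one point observed in one target frame.

    I_i        : reference intensities over the stencil (1-D array)
    I_j_warped : target intensities sampled at the reprojected stencil
                 positions (assumed to be computed elsewhere)
    w          : per-pixel weights over the stencil
    t_*, a_*, b_* : exposure times and affine brightness parameters
    sigma      : Huber threshold (assumed value, for illustration only)
    """
    # Brightness transfer from frame i to frame j.
    scale = (t_j * np.exp(a_j)) / (t_i * np.exp(a_i))
    r = (I_j_warped - b_j) - scale * (I_i - b_i)
    return float(np.sum(w * huber(r, sigma)))
```

If the target patch is exactly the brightness-transferred reference patch, the residual vanishes and the energy is zero, which is a quick sanity check for sign and ordering mistakes in the affine terms.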

Key Properties

Direct minimization of photometric error increases accuracy, leverages subpixel alignment, and is not reliant on keypoint repeatability.
Sparse selection supports real-time operation by capping memory and computational cost while retaining spatial coverage and highly informative points.
Joint windowed bundle adjustment ensures temporal consistency and enables rapid convergence as new views arrive.
Affine brightness modeling and full photometric calibration increase robustness to exposure change and camera non-linearities.

2. Algorithmic Advances and System Extensions

A range of extensions has been developed on top of the DSO core, enhancing data association, accuracy, robustness to real-world effects, or application domain.

a) Loop Closure and Relocalization

LDSO: Introduces a feature-based loop detection pipeline by biasing point selection toward Shi–Tomasi corners, computing ORB descriptors on these repeatable points, and integrating a bag-of-words appearance database for fast candidate retrieval. Loop constraints are formulated in Sim(3) via pose-graph optimization after geometric verification, reducing long-term drift (Gao et al., 2018).
Tight Integration of Feature-based Relocalization: Further tightens map-based global pose priors by adding relative pose constraints to both front-end direct alignment and back-end sliding window BA, with online pose-graph fusion of global and local information for improved drift correction (Gladkova et al., 2021).

b) Deep Learning and Semantic Integration

SalientDSO: Incorporates visual saliency prediction (via SalGAN) and semantic scene parsing (via PSPNet) to bias point selection toward human-attentive, semantically informative regions. The core photometric pipeline is unmodified, but the spatial distribution and informativeness of the sampled points improve robustness, especially in cluttered or low-feature scenes. SalientDSO achieves ATE reductions versus vanilla DSO and even outperforms ORB-SLAM on challenging indoor datasets under sparse settings (Liang et al., 2018).
Deep Direct Visual Odometry (DDSO): Integrates a convolutional neural network pose prior (TrajNet), trained with improved geometric losses for scale consistency, as an initial pose hypothesis for DSO. This robustifies initialization and tracking, yielding lower translation/rotation errors and increased initialization rate on KITTI benchmarks (Zhao et al., 2019).

c) Robustness to Real-World Sensors and Environments

Omnidirectional DSO: Generalizes DSO to wide FOV fisheye cameras using the unified omnidirectional model, allowing full use of spatial information, improved spatial point distribution, and increased overlap between frames, resulting in lower trajectory drift and better performance under reduced window sizes (Matsuki et al., 2018).
Rolling Shutter DSO: Incorporates continuous-time rolling-shutter trajectory modeling and a constant-velocity prior per keyframe, solving for both poses and velocities, and enforcing the rolling-shutter constraint at reprojection. This yields superior results over global-shutter DSO and prior direct methods in rolling-shutter sequences (Schubert et al., 2018).
Event-aided DSO (EDS): Fuses asynchronous event camera brightness increments with the DSO photometric BA, adding brightness-increment error factors for sparse points between frames. This enables robust, high-frequency tracking in the blind intervals between images, increases accuracy under high dynamics and low-light, and maintains robust tracking at very low camera frame-rates (Hidalgo-Carrió et al., 2022).
Ceiling-DSO: Adapts DSO for upward-facing, ceiling-vision robots in industrial environments with parameter tuning for robustness to large static background, confirming accuracy under slow motion and robustness to diverse ceiling patterns (Bougouffa et al., 2024).

d) Stereo Constraints and Learned Depth Priors

Stereo DSO: Incorporates static stereo photometric constraints (left/right) as well as monocular temporal ones, with a coupling parameter for weighting. This enforces metric scale, strongly anchors point depths, and virtually eliminates scale drift, outperforming other direct and indirect stereo approaches (Wang et al., 2017).
DSOL (Direct Sparse Odometry Lite): Introduces a series of algorithmic and implementation advances (frame-to-window alignment, inverse-compositional tracking, improved outlier and depth-initialization logic) that accelerate stereo DSO by 5–7× while improving robustness in high-speed scenarios. DSOL reaches up to 800 Hz for tracking and 100 Hz for keyframes on a laptop (Qu et al., 2022).
Sparse2Dense (S2D): Uses a jointly trained CNN for depth and normal estimation to provide a monocular depth prior and perform sparse-to-dense 3D reconstruction. The CNN predictions are scale-aligned online using optimized DSO depths; surface normals enable rapid propagation of sparse depths to a dense map, yielding superior trajectory and dense reconstruction accuracy compared to monocular and stereo baselines (Tang et al., 2019).
Deep Virtual Stereo Odometry (DVSO): Leverages a semi-supervised stacked network to predict left/right monocular disparities, uses predicted depths for initialization, and introduces direct virtual-stereo photometric residuals into the DSO windowed BA, thereby recovering metric scale in monocular VO and matching stereo-DSO in trajectory accuracy (Yang et al., 2018).

e) Spatial and Semantic Regularization

PVI-DSO: Detects planar regularities in the sparse DSO 3D mesh and enforces coplanarity constraints for all points lying on detected planes. The coplanar points are eliminated from the state and their inverse depths are determined analytically from the plane parameters, which are then jointly optimized in bundle adjustment. This produces sharper 3D maps and notably improves pose accuracy and convergence rate (Xu et al., 2022).
Gaussian Map DSO: Uses dense Gaussian maps (learned from LiDAR and images) as a continuous, differentiable prior that assigns depth to all high-gradient pixels, avoiding the interpolation errors of discrete point clouds. Depths from the map are held fixed, and only camera poses are jointly optimized, resulting in low ATE and stable, robust pose estimation in indoor mapping contexts (Deng et al., 2025).
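To make the idea of an analytically determined inverse depth concrete, the sketch below computes the inverse depth of a pixel constrained to a plane. It is a simplified illustration under stated assumptions, not the PVI-DSO parameterization: the plane is written as $n^\top X = d$ in the host camera frame, the camera is a pinhole with intrinsic matrix `K`, and the function name is hypothetical.

```python
import numpy as np

def inverse_depth_on_plane(u, v, K, n, d):
    """Inverse depth of pixel (u, v) constrained to the plane n . X = d.

    Assumes a pinhole camera K and a plane expressed in the host camera
    frame (an assumption of this sketch). With X = ray / rho and the ray
    normalized to z = 1, the plane equation gives rho = (n . ray) / d.
    """
    ray = np.linalg.solve(K, np.array([u, v, 1.0]))  # back-projected ray, z = 1
    return float(n @ ray) / d

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
# Fronto-parallel plane at depth 2: every pixel has inverse depth 0.5.
rho = inverse_depth_on_plane(100.0, 50.0, K, np.array([0.0, 0.0, 1.0]), 2.0)
```

Because the inverse depth becomes a function of the plane parameters, the per-point depth variables on a plane can be dropped from the state, which is the reduction in state size described above.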

3. Detailed Photometric and Optimization Model

A distinguishing feature of DSO and its variants is the fully probabilistic, robustified photometric modeling. The photometric error for each point is derived from a local image patch, robustified by the Huber norm, and further stabilized by per-residual gradient-based weighting:

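$w_q = \frac{c^2}{c^2 + \|\nabla I_i[q]\|_2^2},$

with $c$ a small constant acting as an outlier bound. After stacking all residuals, the normal equations for the joint optimization (poses, depths, brightness parameters, intrinsics) are

$\mathbf{H}\,\delta = -\mathbf{b}, \qquad \mathbf{H} = \mathbf{J}^\top \mathbf{W} \mathbf{J}, \quad \mathbf{b} = \mathbf{J}^\top \mathbf{W} \mathbf{r},$

where $\mathbf{r}$ stacks the residuals, $\mathbf{J}$ is their Jacobian, and $\mathbf{W}$ is the diagonal weight matrix,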

using the appropriate Lie-group updates for pose variables and consistent marginalization (Schur complement) to manage the sliding window. Jacobians are computed via chain rule, differentiating through the left/right pose, point inverse depth, brightness parameters, and camera intrinsics. Analytical Jacobians are carefully derived for specific extensions, e.g., coplanar constraints in PVI-DSO (Xu et al., 2022), rolling shutter time-warped points (Schubert et al., 2018), or virtual stereo terms (Yang et al., 2018).

In all cases, the optimization exploits the sparsity induced by each residual impacting only a small subset of variables (two poses, one depth, four brightness parameters), allowing for real-time computation.
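The Schur complement mentioned above can be illustrated with a small dense example. This is a sketch under stated assumptions: the system is written as $H\delta = -b$, the index lists and function names are hypothetical, and dense `numpy.linalg.solve` calls stand in for the structure-exploiting solvers a real system would use (for instance, treating the depth block as diagonal so that its inverse is cheap).

```python
import numpy as np

def schur_reduce(H, b, keep, drop):
    """Eliminate the `drop` variables from the system H * delta = -b.

    Returns the reduced (H_red, b_red) acting on the `keep` variables only.
    The same reduction serves two purposes: solving for poses first, and
    marginalizing old variables into a quadratic prior.
    """
    H_kk = H[np.ix_(keep, keep)]
    H_kd = H[np.ix_(keep, drop)]
    H_dd = H[np.ix_(drop, drop)]
    # X = H_dd^{-1} [H_dk | b_d], computed in one solve.
    X = np.linalg.solve(H_dd, np.column_stack([H_kd.T, b[drop]]))
    H_red = H_kk - H_kd @ X[:, :-1]
    b_red = b[keep] - H_kd @ X[:, -1]
    return H_red, b_red

def solve_two_stage(H, b, pose_idx, depth_idx):
    """Solve for pose-like variables first, then back-substitute depths."""
    H_red, b_red = schur_reduce(H, b, pose_idx, depth_idx)
    d_pose = np.linalg.solve(H_red, -b_red)
    H_dd = H[np.ix_(depth_idx, depth_idx)]
    H_dk = H[np.ix_(depth_idx, pose_idx)]
    d_depth = np.linalg.solve(H_dd, -(b[depth_idx] + H_dk @ d_pose))
    delta = np.zeros_like(b)
    delta[pose_idx] = d_pose
    delta[depth_idx] = d_depth
    return delta
```

The two-stage result matches a direct solve of the full system; the benefit in practice comes from the reduced system being much smaller than the full one when there are many more points than poses.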

4. Sampling, Calibration, and Practical Considerations

Pixel selection and photometric calibration play a central role in DSO robustness and scalability:

Uniform Block Sampling: Keyframes are divided into grids, with the highest-gradient pixel in each cell selected above threshold. Recursive block size reduction extends coverage to weaker gradients.
Photometric Calibration: The camera response function ($G$), vignetting map ($V$), per-frame exposure times ($t_i$), and affine brightness factors ($a_i, b_i$) are either measured, estimated, or included as optimization variables. All input images are pre-corrected for photometric inconsistencies prior to BA.
Marginalization: Old keyframes and points are efficiently marginalized out using the Schur complement, converting their information into quadratic priors on the surviving variables. This ensures constant-time operation regardless of trajectory length.
Real-Time Performance: DSO achieves roughly 25–30 Hz on standard hardware with sensible settings of the window size $N_f$ and the per-frame point count $N_p$, and streamlined variants (DSOL) scale to hundreds of Hz with multi-threaded or embedded implementations (Qu et al., 2022).
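The block-based selection described in the first item can be sketched as follows. This is a simplified illustration rather than DSO's exact selection logic: the `passes` parameter (pairs of block size and gradient threshold) and its default values are assumptions, and the gradient is a plain central difference.

```python
import numpy as np

def select_pixels(img, passes=((32, 12.0), (16, 6.0))):
    """Pick at most one high-gradient pixel per grid cell.

    img    : 2-D grayscale image
    passes : (block_size, gradient_threshold) pairs, applied in order;
             later passes only fill cells that contain no selected pixel yet
             (values are illustrative assumptions)
    Returns an (N, 2) integer array of (x, y) pixel coordinates.
    """
    gy, gx = np.gradient(img.astype(np.float64))
    mag = np.hypot(gx, gy)
    h, w = mag.shape
    selected = np.zeros((h, w), dtype=bool)
    for block, thresh in passes:
        for y0 in range(0, h, block):
            for x0 in range(0, w, block):
                if selected[y0:y0 + block, x0:x0 + block].any():
                    continue  # cell already covered by an earlier pass
                cell = mag[y0:y0 + block, x0:x0 + block]
                dy, dx = np.unravel_index(np.argmax(cell), cell.shape)
                if cell[dy, dx] > thresh:
                    selected[y0 + dy, x0 + dx] = True
    ys, xs = np.nonzero(selected)
    return np.stack([xs, ys], axis=1)
```

Limiting each cell to one pixel is what spreads the points across the image: a strongly textured region cannot absorb the whole point budget, and weakly textured cells still get a candidate in the later, more permissive passes.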

5. Empirical Results and Quantitative Performance

DSO and its descendants have been quantitatively benchmarked on a wide range of standard datasets:

Dataset	DSO-type	ATE / Trajectory Error	Comment
TUM monoVO	DSO	1.5 cm (median […]) (Engel et al., 2016)	Outperforms ORB-SLAM, robust to calibration error
KITTI	Stereo DSO	[…] t/RMSE (Wang et al., 2017)	Competitive with ORB-SLAM2, no loop closure
ICL-NUIM	SalientDSO	0.031–0.126 m (ATE) (Liang et al., 2018)	Nearly halves error vs. vanilla DSO
EuRoC MAV	PVI-DSO	[…] m RMSE (Xu et al., 2022)	7% lower than VI-DSO baseline
KITTI	DVSO (mono)	[…] RMSE (Yang et al., 2018)	Matches stereo DSO, superior to monocular DSO/ORB-SLAM
Indoor Dense	Gaussian Map DSO	[…] m avg (ATE) (Deng et al., 2025)	Outperforms prior direct/indirect map fusion methods

Results indicate that DSO and its variants reduce drift and increase robustness relative to their respective baselines, and in many cases outperform both direct and indirect SLAM systems on standard benchmarks across indoor and outdoor, monocular, stereo, and wide-FOV settings.
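For readers unfamiliar with the metric, absolute trajectory error (ATE) is commonly reported as the RMSE of position differences after aligning the estimated trajectory to ground truth. The sketch below uses a closed-form Umeyama alignment; it is an illustration with hypothetical function names, and individual benchmarks may use different alignment or error definitions. Scale is typically estimated for monocular trajectories and fixed to 1 for metric ones.

```python
import numpy as np

def align_umeyama(est, gt, with_scale=True):
    """Least-squares similarity transform (s, R, t) mapping est onto gt.

    est, gt : (N, 3) arrays of time-associated positions.
    """
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    E, G = est - mu_e, gt - mu_g
    cov = G.T @ E / len(est)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0  # avoid returning a reflection
    R = U @ S @ Vt
    var_e = (E ** 2).sum() / len(est)
    s = np.trace(np.diag(D) @ S) / var_e if with_scale else 1.0
    t = mu_g - s * R @ mu_e
    return s, R, t

def ate_rmse(est, gt, with_scale=True):
    """RMSE of position error after aligning est to gt."""
    s, R, t = align_umeyama(est, gt, with_scale)
    aligned = s * (est @ R.T) + t
    return float(np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1))))
```

A trajectory that differs from ground truth only by a similarity transform yields an ATE of numerically zero when `with_scale=True`, so the metric isolates drift and local error from the arbitrary gauge of a monocular estimate.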

6. Limitations and Future Prospects

While DSO delivers state-of-the-art pure visual odometry, several limitations remain:

Scale drift is unavoidable in monocular DSO without external metric information or robust loop closure; stereo or deep-learning-based priors partially resolve this (Wang et al., 2017, Yang et al., 2018, Zhao et al., 2019).
Time-varying illumination and photometric inconsistency, especially in rolling-shutter and high-dynamic-range settings, require specialized modeling (Schubert et al., 2018).
Motion degeneracy (planarity, pure rotation), occlusion, and textureless regions pose challenges for standard pixel selection and depth initialization, though semantic and normal priors (SalientDSO, S2D) can mitigate these (Liang et al., 2018, Tang et al., 2019).
Computational overhead from deep inference or semantic processing may limit real-time application, though algorithmic and data structure improvements continue to expand the hardware applicability (Qu et al., 2022).

Research directions include tighter integration of semantics and geometry for robust and informative sampling, hardware-accelerated or learning-based data association, further improvement of dense mapping via continuous or learned priors, and cross-modal integration (event cameras, inertial, LiDAR) for robust odometry under the full range of real-world operating conditions. The DSO codebase and its major extensions are widely available, enabling rapid adoption and benchmarking throughout the visual SLAM community.

References:

"Direct Sparse Odometry" (Engel et al., 2016)
"SalientDSO: Bringing Attention to Direct Sparse Odometry" (Liang et al., 2018)
"Omnidirectional DSO: Direct Sparse Odometry with Fisheye Cameras" (Matsuki et al., 2018)
"Direct Sparse Odometry with Rolling Shutter" (Schubert et al., 2018)
"Stereo DSO: Large-Scale Direct Sparse Visual Odometry with Stereo Cameras" (Wang et al., 2017)
"Deep Direct Visual Odometry" (Zhao et al., 2019)
"Deep Virtual Stereo Odometry" (Yang et al., 2018)
"Sparse2Dense: From direct sparse odometry to dense 3D reconstruction" (Tang et al., 2019)
"Event-aided Direct Sparse Odometry" (Hidalgo-Carrió et al., 2022)
"PVI-DSO: Leveraging Planar Regularities for Direct Sparse Visual-Inertial Odometry" (Xu et al., 2022)
"DSOL: A Fast Direct Sparse Odometry Scheme" (Qu et al., 2022)
"Direct Sparse Odometry with Continuous 3D Gaussian Maps for Indoor Environments" (Deng et al., 2025)
"An indoor DSO-based ceiling-vision odometry system for indoor industrial environments" (Bougouffa et al., 2024)
"LDSO: Direct Sparse Odometry with Loop Closure" (Gao et al., 2018)
"Tight Integration of Feature-based Relocalization in Monocular Direct Visual Odometry" (Gladkova et al., 2021)