DPV-SLAM: Patch-based Sliding-Window BA

Updated 11 December 2025

DPV-SLAM is a visual SLAM method that leverages learned patch representations and sliding-window BA for globally consistent pose estimation.
It integrates efficient patch extraction, deep feature tracking, and GPU-optimized sparse BA to reduce memory usage and computational complexity.
Empirical results demonstrate high frame rates, low absolute trajectory errors, and robust performance across diverse environments.

A patch-based sliding-window bundle adjustment (BA) pipeline is a visual SLAM methodology built on sparse, learnable patch representations, deep-network feature encoding, and highly optimized block-sparse optimization backends. DPV-SLAM (Deep Patch Visual SLAM) exemplifies this paradigm: it delivers real-time, accurate monocular SLAM under stringent hardware constraints by integrating efficient patch tracking, sliding-window BA, proximity-based loop closure, and fully GPU-parallelized kernels. As an extension of DPVO (Deep Patch Visual Odometry), DPV-SLAM provides globally consistent pose estimation and mapping at high frame rates with reduced VRAM usage, overcoming many limitations of dense deep SLAM systems (Lipson et al., 3 Aug 2024).

1. Architectural Overview

DPV-SLAM divides the process into two principal modules: the frontend (patch extraction, feature encoding, and patch-graph maintenance) and the backend (sliding-window BA and keyframe management).

Frontend: For each monocular input image $I_i$, a fixed set of $K_i$ patches is detected, positioned at uniformly sampled grid locations. Each patch $\mathbf{P}_{ik} = (x_{ik}, y_{ik}, 1, d_{ik})$ encodes its pixel centroid and a current inverse-depth estimate. A deep feature backbone, shared across all input images, extracts a per-frame feature map $\mathbf{f}_i$. A directed patch-graph $F \subset \{(i,k,j)\}$ connects tracked patches across frame pairs, where an edge $(i,k,j)$ tracks patch $k$ of source frame $i$ into destination frame $j$, maintaining correspondences throughout the window.
Backend: A sliding window $\mathcal{W}$ of $N$ keyframes is maintained; each keyframe permanently contributes its patches to the patch-graph. Whenever $\lvert\mathcal{W}\rvert > N$, the oldest keyframe is marginalized using the Schur complement, replacing its contributions with a compact pose prior. The BA optimization jointly solves for all window poses $G_i$ and patch depths $d_{ik}$, sometimes including camera intrinsics.

This design ensures independence between patch-edges, which allows embarrassingly parallel evaluation and maximizes modern GPU utilization. All components, from feature extraction to Cholesky solves, are executed in a single consolidated CUDA/PyTorch process (Lipson et al., 3 Aug 2024).
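The frontend/backend split described above can be pictured with a small bookkeeping structure. The following is a minimal sketch, not the authors' implementation: the class names `Patch` and `PatchGraph` and all method names are hypothetical, and the real system stores these quantities as GPU tensors rather than Python objects.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Patch:
    x: float          # pixel centroid, horizontal
    y: float          # pixel centroid, vertical
    inv_depth: float  # current inverse-depth estimate d_ik


@dataclass
class PatchGraph:
    """Hypothetical container: frames hold patches, edges are (i, k, j)."""
    window_size: int                               # N in the text
    patches: dict = field(default_factory=dict)    # frame id -> list of Patch
    edges: set = field(default_factory=set)        # directed (i, k, j) triples
    window: deque = field(default_factory=deque)   # active keyframe ids

    def add_keyframe(self, frame_id, patches):
        """Register a keyframe; return the id that left the window, if any."""
        self.patches[frame_id] = list(patches)
        self.window.append(frame_id)
        if len(self.window) > self.window_size:
            return self.window.popleft()  # candidate for marginalization
        return None

    def connect(self, src, dst):
        """Add one edge per patch of frame `src`, pointing at frame `dst`."""
        for k in range(len(self.patches[src])):
            self.edges.add((src, k, dst))

    def destination_frames(self):
        """Frames that appear as the destination j of at least one edge."""
        return {j for (_, _, j) in self.edges}
```

In this sketch a frame that leaves the window keeps its patches in `patches`, mirroring the statement that keyframes contribute their patches permanently, while `destination_frames` lists the frames whose features an optimizer would still need.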

2. Patch Representation, Tracking, and Correspondence

Patch-based correspondence underlies the efficiency of DPV-SLAM:

Patch Extraction: Uniformly sampled patches are initialized with their centroids and an inverse-depth estimate. As each new frame arrives, every patch is tracked forward using deep features from the learned backbone (Teed et al., 2022).
Tracking: For each patch edge $(i, k, j)$, the source patch is reprojected into the destination frame as $\mathbf{P}'_{ikj} = \Pi(G_j^{-1} G_i \Pi^{-1}(\mathbf{P}_{ik}))$. A small local correlation volume is computed via dot products between source and destination feature patches. The network predicts a 2D residual offset $\Delta_{ikj}$ and a confidence vector $w_{ikj} \in \mathbb{R}^2$, yielding the corrected reprojection target $\mathcal{I}_{ikj} = \mathbf{P}'_{ikj} + \Delta_{ikj}$.
Graph Structure: The directed nature of patch-edges facilitates selective memory retention: only features in current "destination" frames must be kept alive for optimization, which significantly reduces memory usage during global adjustment and loop-closure sweeps (Lipson et al., 3 Aug 2024).
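The reprojection formula in the Tracking item can be made concrete with a pinhole camera. This is a sketch under stated assumptions: poses are taken to be $4 \times 4$ camera-to-world matrices (which is what the form $G_j^{-1} G_i$ suggests), the intrinsics tuple `K = (fx, fy, cx, cy)` is hypothetical, only the patch centroid is reprojected, and the function names are illustrative.

```python
import numpy as np


def backproject(x, y, inv_depth, K):
    """Pi^{-1}: pixel plus inverse depth -> homogeneous point (X, Y, 1, d)."""
    fx, fy, cx, cy = K
    return np.array([(x - cx) / fx, (y - cy) / fy, 1.0, inv_depth])


def project(point_h, K):
    """Pi: homogeneous point -> pixel coordinates."""
    fx, fy, cx, cy = K
    X, Y, Z, _ = point_h
    return np.array([fx * X / Z + cx, fy * Y / Z + cy])


def reproject(patch, G_i, G_j, K):
    """P'_ikj = Pi(G_j^{-1} G_i Pi^{-1}(P_ik)) for one patch centroid."""
    x, y, inv_depth = patch
    T_ji = np.linalg.inv(G_j) @ G_i  # maps frame-i coordinates into frame j
    return project(T_ji @ backproject(x, y, inv_depth, K), K)


def corrected_target(reprojection, delta):
    """I_ikj = P'_ikj + Delta_ikj, with Delta supplied by the network."""
    return reprojection + delta
```

Keeping the point in the form $(X, Y, 1, d)$ avoids dividing by the inverse depth: the translation is scaled by $d$ during the matrix product, and the final division by the third component yields the same pixel as transforming the metric 3D point.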

This patch-centric strategy, as opposed to dense correlation volumes, reduces computational complexity without compromising localization accuracy (Lipson et al., 3 Aug 2024, Teed et al., 2022).
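To illustrate why a local correlation window is cheaper than a dense volume, the sketch below correlates one patch feature against a small neighborhood of the destination feature map. It is a simplification and not the DPVO kernel: it uses a single feature vector per patch, nearest-pixel rounding instead of interpolation, and a hypothetical default `radius`.

```python
import numpy as np


def local_correlation(patch_feat, dest_map, center, radius=3):
    """Dot products between one patch feature and a window of dest_map.

    patch_feat: (C,) feature vector of the source patch
    dest_map:   (C, H, W) feature map of the destination frame
    center:     (x, y) reprojected pixel location in the destination frame
    returns:    (2r+1, 2r+1) correlation window, zero outside the image
    """
    _, H, W = dest_map.shape
    # Zero-pad the spatial axes so windows near the border stay full-sized.
    padded = np.pad(dest_map, ((0, 0), (radius, radius), (radius, radius)))
    cx = int(np.clip(round(center[0]), 0, W - 1))
    cy = int(np.clip(round(center[1]), 0, H - 1))
    # Original pixel (cy, cx) sits at (cy + radius, cx + radius) after padding,
    # so this slice is centered on it.
    window = padded[:, cy:cy + 2 * radius + 1, cx:cx + 2 * radius + 1]
    return np.einsum("c,chw->hw", patch_feat, window)
```

For an $H \times W$ feature map, an exhaustive search would evaluate $H \cdot W$ dot products per patch, whereas this window evaluates $(2r+1)^2$ regardless of image size.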

3. Bundle Adjustment Formulation and Optimization

The sliding-window BA in DPV-SLAM solves for $6N$ pose parameters and all patch inverse depths within the window:

Residuals: For each edge $(i, k, j)$, a geometric residual is defined as $r_{ikj}(G_i, G_j, d_{ik}) = \Pi(G_j^{-1} G_i \Pi^{-1}(x_{ik}, y_{ik}, d_{ik}; \Theta)) - \mathcal{I}_{ikj}$, where $\Theta$ denotes the camera intrinsics. Each residual is weighted by the network-predicted confidences $\Sigma_{ikj} = \operatorname{diag}(w_{ikj})$ and robustified by a loss $\rho$.
Objective: The cost function aggregates all active correspondence factors, with $\Sigma_{ikj}$ acting as a per-edge weight matrix so that higher confidence increases an edge's influence:

$\min_{\{G, d\}} \sum_{(i, k, j) \in F} \rho\big(r_{ikj}^\top \Sigma_{ikj} r_{ikj}\big)$

Sparsity and Schur Complement: Each residual links two poses and one depth, leading to a block-sparse normal matrix $J^\top W J$, where $W$ stacks the per-edge weights $\Sigma_{ikj}$. The depth variables are eliminated efficiently with the Schur complement, and fast sparse Cholesky factorization further accelerates the reduced solve. When a frame exits the window, its pose and patch depths are marginalized, and the resulting prior (a dense $6 \times 6$ block per removed frame) is added to the next Hessian (Lipson et al., 3 Aug 2024).
Differentiable Implementation: In DPV-SLAM (and its DPVO ancestor), BA is implemented as differentiable routines, enabling end-to-end gradient flow if desired during learning (Teed et al., 2022).
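The Schur-complement step can be sketched on a generic two-block system. Because each residual touches exactly one inverse depth, the depth-depth block of $J^\top W J$ is diagonal, which is what makes the elimination cheap. The function below is a dense NumPy illustration under that assumption; the name `schur_solve` and the block names `B`, `E`, `C_diag` are hypothetical, and the real system uses block-sparse CUDA kernels.

```python
import numpy as np


def schur_solve(B, E, C_diag, v, w):
    """Solve [[B, E], [E^T, C]] [dx; dz] = [v; w] with diagonal C.

    B:      (p, p) pose-pose block
    E:      (p, m) pose-depth block
    C_diag: (m,)   diagonal of the depth-depth block
    v, w:   right-hand sides for poses (p,) and depths (m,)
    """
    E_Cinv = E / C_diag                # E @ diag(C)^{-1} via column scaling
    S = B - E_Cinv @ E.T               # reduced pose system (Schur complement)
    rhs = v - E_Cinv @ w
    L = np.linalg.cholesky(S)          # S is symmetric positive definite
    dx = np.linalg.solve(L.T, np.linalg.solve(L, rhs))  # pose update
    dz = (w - E.T @ dx) / C_diag       # back-substituted depth update
    return dx, dz
```

The same reduced matrix `S` is what remains when the eliminated variables are dropped for good, which is how marginalizing a departing frame condenses its information into a prior on the variables that stay.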

4. Keyframe and Sliding-Window Policy

Keyframe insertion and window management are adaptive:

Insertion: New keyframes are triggered every $5$–$10$ frames, or whenever the motion baseline (rotation or translation) exceeds specified thresholds.
Marginalization: If the window exceeds $N_\text{max}$ keyframes (the window size $N$ above), the oldest keyframe is marginalized by forming the Schur complement. The resulting prior condenses the information from the marginalized variables into a compact form usable in subsequent optimization passes.
Loop Closure (Proximity): Periodically (e.g., every $200$ frames), a proximity loop-closure search identifies earlier keyframes within a spatial Euclidean radius. For each, new patch-edges are created bridging old patches to the current frame, enabling global BA sweeps (over the expanded window only). The directionality of edges again ensures that only destination-frame features are required, keeping VRAM usage low even for global optimizations (Lipson et al., 3 Aug 2024).
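The insertion rule in the list above amounts to a simple disjunction of a frame-count test and two motion tests. The sketch below shows one way to write it; the function name and every threshold value are hypothetical placeholders, since the text only states that a keyframe is triggered every 5 to 10 frames or when the baseline exceeds unspecified thresholds.

```python
import math


def should_insert_keyframe(frames_since_keyframe, translation, rotation_rad,
                           max_gap=10, trans_thresh=0.25,
                           rot_thresh=math.radians(10.0)):
    """Return True when a new keyframe should be created.

    frames_since_keyframe: frames elapsed since the last keyframe
    translation:           (tx, ty, tz) displacement since the last keyframe
    rotation_rad:          rotation angle since the last keyframe, in radians
    """
    if frames_since_keyframe >= max_gap:
        return True  # periodic insertion
    baseline = math.sqrt(sum(c * c for c in translation))
    return baseline > trans_thresh or abs(rotation_rad) > rot_thresh
```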

Such flexible window management, augmented by memory-efficient marginalization policies, ensures both real-time speed and minimal memory footprint even in extended sequences.
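The proximity search behind loop closure can be sketched as a radius query over past keyframe positions, followed by the creation of directed edges from old patches to the current frame. The helper names, the `min_gap` parameter, and the brute-force scan are assumptions made for illustration; the source does not specify how the search is implemented.

```python
import math


def proximity_candidates(positions, current_idx, radius, min_gap):
    """Indices of earlier keyframes within `radius` of the current one.

    positions: list of (x, y, z) keyframe camera centers
    min_gap:   keyframes at most this many steps back are skipped,
               since they are assumed to be in the sliding window already
    """
    here = positions[current_idx]
    return [i for i in range(max(0, current_idx - min_gap))
            if math.dist(positions[i], here) <= radius]


def loop_edges(candidates, patches_per_frame, current_idx):
    """Directed edges (i, k, j) from each old patch to the current frame."""
    return [(i, k, current_idx)
            for i in candidates
            for k in range(patches_per_frame)]
```

Because every new edge points at the current frame, only that frame's features are needed to evaluate them, which matches the memory argument made for directed edges.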

5. Performance, Implementation, and Empirical Results

DPV-SLAM delivers competitive benchmark performance on commodity hardware:

Efficiency: Sustained real-time operation (about $1$–$4\times$ real time) is maintained. On EuRoC, DPV-SLAM runs at $50$ FPS using $5$ GB of VRAM with a mean ATE of $0.024$ m, while DROID-SLAM reaches a comparable $0.022$ m ATE but at only $20$ FPS and $20$ GB of VRAM. On TartanAir, DPV-SLAM achieves $0.016$ m ATE at $50$ FPS with $5$ GB of VRAM (Lipson et al., 3 Aug 2024).
Stability: Instantaneous FPS remains tightly distributed around the nominal odometry rate, with brief, infrequent dips during proximity-triggered global BA. Depth estimates and overall accuracy remain robust, with low drift and consistent trajectory estimates.
Hardware Utilization: The method is implemented with a unified CUDA process (PyTorch as host). Feature maps are retained for only the last $M$ frames (e.g., $M=10$); patch centroids and depths are kept for all keyframes; marginalized priors are highly compact.
Parallelization: All correlation and Jacobian assembly operations are batched across all patch-graph edges. Block-sparse BA and Cholesky routines are realized with custom CUDA kernels, with dynamic switching between dense and sparse solvers based on the current window size (Lipson et al., 3 Aug 2024).
Empirical Validation: On UAV imagery, adaptive patch-based sliding-window BA achieves accuracy similar to global BA, with mean reprojection errors of $0.710$ px (vs. $0.727$ px for incremental BA) and real-time solve times ($<1.2$ s per $50$ MP frame cluster) (Iz et al., 8 Nov 2025). ATE and attitude errors are kept in the sub-decimeter and sub-tenth-degree regimes, matching rigorous geodetic requirements.
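The ATE figures quoted above are root-mean-square position errors between an estimated and a reference trajectory. The sketch below computes that quantity under the assumption that the two trajectories are already time-associated and aligned; benchmark evaluations normally perform an alignment step first (including scale for monocular systems), which is omitted here. The function name is illustrative.

```python
import math


def ate_rmse(estimated, ground_truth):
    """RMSE of position errors between two pre-aligned trajectories.

    estimated, ground_truth: equal-length lists of (x, y, z) positions
    """
    if not estimated or len(estimated) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and of equal length")
    squared = [math.dist(e, g) ** 2 for e, g in zip(estimated, ground_truth)]
    return math.sqrt(sum(squared) / len(squared))
```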

6. Broader Context, Extensions, and Related Work

Patch-based sliding-window BA, as exemplified by DPV-SLAM, is a direct response to the inefficiencies and hardware demands of dense optical flow or dense-correspondence SLAM systems. By replacing full-image correlation and feature volume retention with sparse, learned patches and directed edge-graphs, the system achieves superior scalability.

Relation to DPVO: DPV-SLAM extends DPVO, adding global consistency and loop-closure while preserving the differentiable, patch-centric architecture (Teed et al., 2022).
Compatibility with High-Resolution Imagery: On ultra-high-resolution UAV platforms, patch-based sliding-window BA adapts by fixing patch region sizes (e.g., $150 \times 150$ px), dynamically updating with GNSS/IMU priors, and leveraging overlap/distance clustering for scalable optimization (Iz et al., 8 Nov 2025).
Design Rationale: Sparse patches confine feature matching and reprojection to locally rigid regions, while overlapping temporal clusters propagate corrections and enforce global consistency. Marginalization and robust loss penalization stabilize optimization against outliers and degeneracies (Iz et al., 8 Nov 2025).
Supplementary Techniques: Classical loop closure (image-retrieval-based 7-DoF scale-graph optimization), adaptive keyframe selection, and proximity-based loop closure further enhance global trajectory fidelity.

A plausible implication is that these strategies represent a general blueprint for scalable, learning-aided SLAM on both terrestrial and aerial visual datasets, supporting downstream dense mapping at minimal resource cost (Lipson et al., 3 Aug 2024, Iz et al., 8 Nov 2025, Teed et al., 2022).

References

Lipson et al., Deep Patch Visual SLAM (2024)

Teed et al., Deep Patch Visual Odometry (2022)

Iz et al., Real-Time Bundle Adjustment for Ultra-High-Resolution UAV Imagery Using Adaptive Patch-Based Feature Tracking (2025)
