Lightweight Depth Adaptation Module

Updated 3 May 2026

Lightweight depth adaptation modules are compact neural architectures that refine depth cues with minimal computational overhead.
They use efficient mechanisms such as attention, low-rank adaptation, and sparse fusion to propagate and correct sparse depth information.
Empirical evaluations report lower error (for example, reduced RMSE) while keeping parameter counts and FLOP budgets low.

A lightweight depth adaptation module is a compact neural architecture integrated into depth completion or monocular depth estimation networks to enable spatial or domain adaptation with minimal computational and memory overhead. Such modules are engineered to propagate, refine, or adapt depth-related cues—especially when sensor coverage is incomplete, data is sparse, or test-time conditions differ from supervised training regimes. Key principles include minimizing parameter count, constraining FLOPs, and targeting only subspaces or bottleneck points in the network for adaptation, often enabling rapid online or test-time optimization without sacrificing real-time deployability.

1. Architectural Taxonomy and Major Design Patterns

Lightweight depth adaptation modules are most commonly found in architectures focused on efficient depth completion or depth estimation, such as CFPNet, PRV2, PuriLight, SelfToF*, and self-supervised refiner/decorator strategies (Ding et al., 2024, Li et al., 2 Jan 2025, Chen et al., 11 Feb 2026, Ding et al., 16 Jun 2025, Seo et al., 2 Mar 2026, Ji et al., 16 Apr 2025, Park et al., 2024). These can be grouped by purpose and operational mode as follows:

Feature Propagation and Fusion: Modules for propagating sparse or localized depth cues into unmeasured regions (e.g., DAPM and LKPM in CFPNet; Guided Feature Fusion and Submanifold encoders in SelfToF*).
Parameter-Efficient Test-Time Adaptation: Adapter modules (e.g., LoRA adapters (Seo et al., 2 Mar 2026, Ji et al., 16 Apr 2025)) or small convolutional layers that adapt only a limited set of weights or subspaces.
Attention and Context Aggregation: Attention blocks (Selective Depth Attention (Guo et al., 2022), channel-spatial adapters, Frequency Signal Purification (Chen et al., 11 Feb 2026)) to aggregate or align multi-scale/semantic cues.
Boundary Refinement and Denoising: Denoising units in refiner branches (PatchRefiner V2 (Li et al., 2 Jan 2025)) to recover fine structure from noisy lightweight encoders.
Cross-Modal Prompting: Plugin modules injecting geometry-informed prompts in multimodal networks for tasks such as semantic segmentation (GeomPrompt (Jaganathan et al., 13 Apr 2026)).

Many modules operate as “plug-ins” or bottleneck layers—either at fusion points, as residual adapters, or at decoder heads—rather than modifying the full backbone.

2. Representative Modules and Core Mechanisms

Cross-Zone Propagation in Depth Completion

Direct-Attention-Propagation Module (DAPM) (Ding et al., 2024):

Allows outside-zone (camera FOV, not ToF-covered) pixels to attend directly to in-zone (valid ToF) pixels with a multi-head cross-attention pattern:

$Q = X_{\rm out} W_Q, \quad K = X_{\rm in} W_K, \quad V = X_{\rm in} W_V$

$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^\top}{\sqrt{d}}\right)V_i$

The attention output is fused through $1\times1$ and $3\times3$ convolutions with a residual skip; the attention itself incurs $O(N_{\rm out} N_{\rm in} d)$ compute.
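The cross-zone attention pattern can be sketched in NumPy as follows. This is a minimal illustration rather than the CFPNet implementation: the function name `cross_zone_attention`, the toy sizes, and the random projection matrices are assumptions, and the convolutional fusion is omitted so that only the attention and the residual skip are shown.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # stabilise the exponentials
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_zone_attention(x_out, x_in, w_q, w_k, w_v, num_heads):
    """Out-of-zone tokens (queries) attend to in-zone tokens (keys/values).

    x_out: (N_out, C), x_in: (N_in, C), w_q/w_k/w_v: (C, C).
    Returns features of shape (N_out, C) and weights of shape (heads, N_out, N_in).
    """
    n_out, c = x_out.shape
    n_in = x_in.shape[0]
    d = c // num_heads
    # Project, then split channels into heads: (heads, N, d)
    q = (x_out @ w_q).reshape(n_out, num_heads, d).transpose(1, 0, 2)
    k = (x_in @ w_k).reshape(n_in, num_heads, d).transpose(1, 0, 2)
    v = (x_in @ w_v).reshape(n_in, num_heads, d).transpose(1, 0, 2)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d), axis=-1)
    heads = attn @ v                                   # (heads, N_out, d)
    return heads.transpose(1, 0, 2).reshape(n_out, c), attn

# Hypothetical toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
C, N_OUT, N_IN, HEADS = 16, 40, 24, 4
x_out = rng.normal(size=(N_OUT, C))
x_in = rng.normal(size=(N_IN, C))
w_q, w_k, w_v = (rng.normal(size=(C, C)) / np.sqrt(C) for _ in range(3))
attended, attn = cross_zone_attention(x_out, x_in, w_q, w_k, w_v, HEADS)
fused = x_out + attended  # residual skip
```

The attention tensor has one row per outside-zone token and one column per in-zone token, which is where the $O(N_{\rm out} N_{\rm in} d)$ cost comes from.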

Large-Kernel-Propagation Module (LKPM):

A ConvNeXt-style block with a single deep $s\times s$ depthwise convolution ($s\in\{7,15,31\}$) followed by pointwise $1\times1$ convs, achieving receptive fields up to $51\times51$ with a modest $O(C s^2)$ parameter count and negligible FLOP increase.
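The $O(C s^2)$ parameter count is easy to check with a little arithmetic. The sketch below compares a depthwise $s\times s$ kernel with a dense convolution of the same kernel size; the helper names and the channel width of 64 are assumptions for illustration, and biases and the pointwise layers are ignored.

```python
def depthwise_params(channels, kernel_size):
    # One s x s filter per channel: C * s^2 weights.
    return channels * kernel_size * kernel_size

def dense_params(channels, kernel_size):
    # Every output channel sees every input channel: C^2 * s^2 weights.
    return channels * channels * kernel_size * kernel_size

C = 64  # hypothetical channel width
for s in (7, 15, 31):
    dw, full = depthwise_params(C, s), dense_params(C, s)
    print(f"s={s:2d}: depthwise={dw:,} dense={full:,} ratio={full // dw}")
```

For a fixed kernel size the dense layer is exactly `C` times larger, which is why very large kernels remain affordable in depthwise form.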

Guided Sparse-Depth Fusion and Correction

Guided Feature Fusion (GFF, SelfToF*) (Ding et al., 16 Jun 2025):

At low spatial resolution, GFF applies RGB-to-depth affinity propagation using learned query/key projections and element-wise multiplications. Because these operations are restricted to small feature maps, the added FLOP overhead is negligible.

Submanifold Sparsity Encoders:

Sparse convolutions mask out invalid (missing) ToF zones, which prevents features from “bleeding” across zone boundaries and saves FLOPs because computation runs only at valid locations.
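The masking behaviour can be illustrated with a small NumPy sketch of a submanifold-style $3\times3$ convolution. It is not the SelfToF* encoder: the function name, tensor layout, and toy sizes are assumptions, and a real implementation would use a sparse-tensor library rather than a Python loop.

```python
import numpy as np

def submanifold_conv3x3(feat, valid, weight):
    """Compute a 3x3 convolution only at valid sites.

    feat: (H, W, C_in), valid: (H, W) bool, weight: (3, 3, C_in, C_out).
    Invalid sites contribute nothing and receive a zero output.
    """
    h, w, _ = feat.shape
    out = np.zeros((h, w, weight.shape[-1]))
    masked = np.where(valid[..., None], feat, 0.0)    # drop invalid inputs
    padded = np.pad(masked, ((1, 1), (1, 1), (0, 0)))
    for y, x in zip(*np.nonzero(valid)):              # visit active sites only
        patch = padded[y:y + 3, x:x + 3]              # (3, 3, C_in) neighbourhood
        out[y, x] = np.einsum("ijc,ijco->o", patch, weight)
    return out

# Hypothetical example: only the left half of a 6x6 map has ToF coverage.
rng = np.random.default_rng(0)
feat = rng.normal(size=(6, 6, 3))
valid = np.zeros((6, 6), dtype=bool)
valid[:, :3] = True
weight = rng.normal(size=(3, 3, 3, 4))
out = submanifold_conv3x3(feat, valid, weight)
```

Work scales with the number of valid sites rather than with the full image, and whatever values sit in the invalid zone cannot leak into the output.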

Lightweight Adapter Branches for Test-Time Optimization

Low-Rank Decoder LoRA Modules (Seo et al., 2 Mar 2026, Ji et al., 16 Apr 2025):

Only decoder weights are adapted at test time, through a low-rank update in the standard LoRA form:

$W' = W + BA, \quad B \in \mathbb{R}^{d_{\rm out}\times r}, \quad A \in \mathbb{R}^{r\times d_{\rm in}}$

The rank $r$ is kept small, so the adapters represent only a small fraction of the model parameters; adaptation is restricted to a few tens of steps for sub-3 s adaptation on VGA.
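A low-rank adapter on a single frozen weight matrix can be sketched as below, using the standard LoRA form in which the effective weight is the frozen weight plus a product of two thin matrices. The class name `LoRALinear`, the layer size, and the rank are assumptions for illustration; the cited works apply the idea to decoder layers of full depth networks.

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer with a trainable low-rank update (illustrative)."""

    def __init__(self, weight, rank, rng):
        self.weight = weight                        # frozen, shape (d_out, d_in)
        d_out, d_in = weight.shape
        self.a = rng.normal(scale=0.01, size=(rank, d_in))  # small Gaussian init
        self.b = np.zeros((d_out, rank))            # zero init: update starts at 0

    def __call__(self, x):
        return x @ (self.weight + self.b @ self.a).T

    def trainable_fraction(self):
        # Adapter parameters relative to this layer's frozen parameters.
        return (self.a.size + self.b.size) / self.weight.size

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256))                     # hypothetical decoder weight
layer = LoRALinear(w, rank=4, rng=rng)
x = rng.normal(size=(10, 256))
y = layer(x)
print(layer.trainable_fraction())                   # 2 * 4 * 256 / 256**2
```

Because `b` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and only `a` and `b` would be updated during test-time optimization.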

Single-Layer Convolutional Adaptation (Park et al., 2024):

An auxiliary convolution $m_\phi$ is inserted in the RGB branch before RGB-depth fusion; it is typically a single $3\times3$ conv with ReLU, and its output is residually added to the features.
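A residual adapter of this kind can be sketched in NumPy as a $3\times3$ convolution with ReLU whose output is added back to the input features. The function names, the channel-last layout, and the zero initialization used in the example are assumptions, not details taken from Park et al.

```python
import numpy as np

def conv3x3(x, weight, bias):
    """Zero-padded 3x3 convolution. x: (H, W, C_in), weight: (3, 3, C_in, C_out)."""
    h, w, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((h, w, weight.shape[-1])) + bias
    for i in range(3):
        for j in range(3):
            out += padded[i:i + h, j:j + w] @ weight[i, j]  # shifted-window sum
    return out

def rgb_adapter(feat, weight, bias):
    # Residual form: the backbone features pass through unchanged,
    # and the adapter only adds a learned correction.
    return feat + np.maximum(conv3x3(feat, weight, bias), 0.0)

def adapter_params(channels):
    return 9 * channels * channels + channels       # 3x3 weights plus bias

rng = np.random.default_rng(0)
feat = rng.normal(size=(5, 5, 4))
zero_out = rgb_adapter(feat, np.zeros((3, 3, 4, 4)), np.zeros(4))
rand_out = rgb_adapter(feat, rng.normal(size=(3, 3, 4, 4)), np.zeros(4))
```

With all-zero adapter weights the layer is an identity mapping, so inserting it does not disturb the pre-trained network before adaptation begins.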

Detail Enhancement for Boundary Recovery

Guided Denoising Units (GDU, PatchRefiner V2) (Li et al., 2 Jan 2025):

At each scale, a channel-wise gating weight is computed by fusing the refiner and coarse guide features via sigmoid(conv(cat(refiner, guide))); the gated features are then passed through a residual conv stack.

Plug-and-Play Attention and Frequency Modules

Shuffle-Dilation Convolution (SDC, PuriLight) (Chen et al., 11 Feb 2026):

Convolutions with group/channel shuffle interleaved with distinct, small-dilation kernels are used to preserve local detail in compact models.
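Two ingredients named here, channel shuffle and dilation, can be illustrated independently of the PuriLight code. The sketch below uses the usual reshape-transpose-reshape channel shuffle and the standard effective-kernel-size formula for dilated convolutions; the function names are assumptions.

```python
import numpy as np

def channel_shuffle(x, groups):
    """Interleave channels across groups. x: (C, H, W), C divisible by groups."""
    c, h, w = x.shape
    return (x.reshape(groups, c // groups, h, w)
             .transpose(1, 0, 2, 3)
             .reshape(c, h, w))

def effective_kernel(kernel_size, dilation):
    # A dilated k x k kernel spans dilation * (k - 1) + 1 pixels per side.
    return dilation * (kernel_size - 1) + 1

# Channels labelled 0..5, split into 2 groups: [0, 1, 2] and [3, 4, 5].
x = np.arange(6, dtype=float).reshape(6, 1, 1)
shuffled = channel_shuffle(x, groups=2)
order = shuffled[:, 0, 0].astype(int).tolist()   # [0, 3, 1, 4, 2, 5]
spans = [effective_kernel(3, d) for d in (1, 2, 3)]  # [3, 5, 7]
```

Shuffling lets information cross group boundaries between grouped convolutions, while small dilations widen the receptive field without adding weights.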

Rotation-Adaptive Kernel Attention (RAKA) and DFSP:

RAKA applies triplet attention over rotated axes with only a small number of parameters per block. DFSP achieves global, denoising context aggregation via learnable Fourier masks, at $O(N \log N)$ complexity and with a small parameter footprint across the model.
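Filtering features with a frequency-domain mask can be sketched with NumPy's real FFT. This only illustrates the general mechanism (transform, multiply by a mask, transform back), not the DFSP module itself; the function name and the hand-set masks stand in for weights that would be learned.

```python
import numpy as np

def fourier_mask_filter(feat, mask):
    """Apply a frequency mask to each channel. feat: (C, H, W), mask: (H, W//2 + 1)."""
    spec = np.fft.rfft2(feat)                       # FFT over the last two axes
    return np.fft.irfft2(spec * mask, s=feat.shape[-2:])

rng = np.random.default_rng(0)
feat = rng.normal(size=(2, 8, 8))

all_pass = np.ones((8, 5))                          # keeps every frequency
dc_only = np.zeros((8, 5))
dc_only[0, 0] = 1.0                                 # keeps only the mean component

same = fourier_mask_filter(feat, all_pass)
smooth = fourier_mask_filter(feat, dc_only)
```

Every output pixel depends on every input pixel through the transform, which gives global context at FFT cost, and the mask size does not grow with the number of spatial interactions.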

3. Training Regimes and Loss Functions

Supervision is often provided via scale-invariant losses, sparse-depth consistency, or proxy-alignment objectives:

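Scale-Invariant (SI) Loss: Widely used for dense regression (Ding et al., 2024, Manghotay et al., 1 Apr 2026). The exact variant differs between works; a commonly used log-difference form, with $d_i = \log \hat{y}_i - \log y_i$ over $n$ valid pixels and a weighting factor $\lambda$, is:

$L_{\rm SI} = \frac{1}{n}\sum_i d_i^2 - \frac{\lambda}{n^2}\left(\sum_i d_i\right)^2$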

Test-Time Optimization: Loss on sparse pixels, augmented by scale/shift alignment, with all updates restricted to lightweight adapter parameters (Seo et al., 2 Mar 2026).
Proxy Losses for Domain Adaptation (Park et al., 2024): Cosine similarity between proxy embedding predicted from sparse depth and the target joint feature.
Auxiliary Smoothness and Denoising Losses: Edge-aware or gradient-matching objectives regulate structure transfer without overfitting adapters (Li et al., 2 Jan 2025).
Task-Driven Losses: In plug-in prompting modules (GeomPrompt), no metric depth supervision is required; the only loss is on downstream task output, with weak regularization on the prompt signal (Jaganathan et al., 13 Apr 2026).
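Two of the objectives listed above lend themselves to a short sketch: a scale-invariant loss in a commonly used log-difference form, and a least-squares scale/shift alignment of a dense prediction to sparse measurements. Both functions are illustrative assumptions (names, the default `lam`, and the toy data included); the cited works may use different variants.

```python
import numpy as np

def scale_invariant_loss(pred, target, lam=0.5):
    """Log-difference SI loss; pred and target must be strictly positive."""
    d = np.log(pred) - np.log(target)
    return float(np.mean(d ** 2) - lam * np.mean(d) ** 2)

def align_scale_shift(pred, sparse_depth, valid):
    """Least-squares (scale, shift) so that scale * pred + shift fits the valid pixels."""
    p, t = pred[valid], sparse_depth[valid]
    design = np.stack([p, np.ones_like(p)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(design, t, rcond=None)
    return float(scale), float(shift)

rng = np.random.default_rng(0)
target = rng.uniform(1.0, 10.0, size=(8, 8))

# With lam = 1 the loss ignores a global scale error entirely.
loss_scaled = scale_invariant_loss(2.0 * target, target, lam=1.0)

# Recover a known scale and shift from roughly 20% of the pixels.
pred = rng.uniform(0.1, 1.0, size=(8, 8))
sparse = 3.0 * pred + 0.5
valid = rng.random((8, 8)) < 0.2
valid[0, :2] = True                                 # guarantee at least two samples
scale, shift = align_scale_shift(pred, sparse, valid)
```

In a test-time setting, the aligned prediction `scale * pred + shift` would be compared against the sparse pixels, with gradients flowing only into the adapter parameters.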

4. Empirical Performance and Efficiency

Lightweight depth adaptation modules demonstrate robust performance at low parameter and FLOP budgets:

CFPNet (Ding et al., 2024):

DAPM and LKPM add 1.29 M and 0.45 M parameters, respectively, on top of the DELTAR* baseline; inference stays under 60 ms, and the reported error is reduced by 0.03–0.04.

GFF+SDE (SelfToF*) (Ding et al., 16 Jun 2025):

The fusion stage accounts for only part of the total parameter count, and error growth under ToF dropout (abs_rel) is kept small even with 40% of zones missing.

PatchRefiner V2 (Li et al., 2 Jan 2025):

With a MobileNet refiner branch, RMSE on 4K images improves over the coarse prediction; the ConvNeXt-L variant achieves state-of-the-art boundaries.

PuriLight (Chen et al., 11 Feb 2026):

2.7 M parameters for the full model and a lower FLOP count than prior state-of-the-art lightweight models, with competitive AbsRel and generalization across KITTI and Make3D.

Low-Rank Adapters (LoRA) (Seo et al., 2 Mar 2026, Ji et al., 16 Apr 2025):

Typically only a small fraction of the model is adapted (under 0.7 M parameters); adaptation takes under 3 s, less than full optimization requires, and yields a relative error reduction.

Segmentation Plugins (Jaganathan et al., 13 Apr 2026):

GeomPrompt achieves a +6.1 mIoU improvement (DFormer backbone) over RGB-only input on SUN RGB-D at 7.8 ms latency, lower than the latency of full monocular depth estimation.

5. Module Integration and Application Scenarios

Table: Selected Lightweight Depth Adaptation Modules

Module/Strategy	Core Mechanism	Model/Task	Param/FLOP Cost
DAPM	Out→in attention, local conv	CFPNet (ToF completion)	+1.29 M
LKPM	Depthwise $s\times s$ conv	CFPNet	+0.45 M, small
GFF + SDE	Sparse fusion, submanifold	SelfToF* (ToF/RGB)	-
Low-Rank Adapter	LoRA, only decoder/adapter	UniDepthV2-L, R-DepthNet	<0.7 M
GDU (C2F)	Spatial gating, denoising	PatchRefiner V2 (Monocular)	-
SDC + RAKA + DFSP	Shuffle, attention, FFT	PuriLight (Mono-Depth)	Full: 2.7 M
m_φ RGB Adapter	$3\times3$ conv + ReLU, residual	ProxyTTA-fast/DepthComplete	-
GeomPrompt(-Recovery)	Task-driven prompt generator	Segmentation plugins	7.8 ms

Integration is context-driven: propagation modules are crucial in sparse-depth completion, denoising adapters for boundary-focused monocular refinement, and plug-and-play attention/fusion modules for cross-modal or few-shot adaptation.

6. Best Practices, Limitations, and Extensions

A prominent best practice is constraining adaptation strictly to modules with inherent bottlenecks (e.g., adapters at decoder heads, middle-layer fusion, or prompt generators), which enhances efficiency, avoids catastrophic drift, and supports rapid convergence. Parameter initialization strategies (e.g., zero or small Gaussian for LoRA, noise pretraining for refiners) further stabilize adaptation without harming pre-trained weights.
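The practice of freezing the backbone and updating only a zero-initialized adapter can be illustrated on a toy linear model. The sketch below runs plain gradient descent on low-rank factors against sparsely supervised targets; the function names, sizes, learning rate, and step count are assumptions chosen for the example, not settings from the cited works.

```python
import numpy as np

def masked_mse(pred, target, mask):
    return float(((pred - target)[mask] ** 2).mean())

def adapt_low_rank(x, y, mask, w_frozen, rank=2, steps=200, lr=0.1, seed=0):
    """Fit y on masked entries by updating only the low-rank factors a and b."""
    rng = np.random.default_rng(seed)
    d_out, d_in = w_frozen.shape
    a = rng.normal(scale=0.3, size=(rank, d_in))    # small Gaussian init
    b = np.zeros((d_out, rank))                     # zero init: no change at step 0
    history = []
    for _ in range(steps):
        pred = x @ (w_frozen + b @ a).T
        history.append(masked_mse(pred, y, mask))
        err = np.where(mask, pred - y, 0.0) * (2.0 / mask.sum())  # dL/dpred
        grad_delta = err.T @ x                      # gradient w.r.t. (b @ a)
        grad_b, grad_a = grad_delta @ a.T, b.T @ grad_delta
        b -= lr * grad_b                            # w_frozen is never touched
        a -= lr * grad_a
    final = masked_mse(x @ (w_frozen + b @ a).T, y, mask)
    return a, b, history, final

rng = np.random.default_rng(1)
x = rng.normal(size=(64, 8))
w = rng.normal(size=(6, 8))                         # stands in for pre-trained weights
shift = rng.normal(scale=0.5, size=(6, 2)) @ rng.normal(scale=0.5, size=(2, 8))
y = x @ (w + shift).T                               # target domain differs by a low-rank shift
mask = rng.random(y.shape) < 0.3                    # sparse supervision
w_before = w.copy()
a, b, history, final = adapt_low_rank(x, y, mask, w)
```

The first entry of `history` equals the error of the unadapted model, since the zero-initialized factor leaves the output unchanged, and the pre-trained matrix `w` is identical before and after adaptation.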

A notable limitation is the inability of most modules—outside self-supervised or proxy losses—to adapt to truly out-of-distribution geometric failures if no in-situ supervision or sufficient sparse guidance is available. Future directions include expanding prompt-guided adaptation to 3D-centric or open-set scenes, integrating low-rank or sparse tensor adaptation into transformer blocks, and fusing self-supervised modules for multi-modal global-local consistency.

7. References

"CFPNet: Improving Lightweight ToF Depth Completion via Cross-zone Feature Propagation" (Ding et al., 2024)
"SDA-$x$Net: Selective Depth Attention Networks for Adaptive Multi-scale Feature Representation" (Guo et al., 2022)
"Efficient Test-Time Optimization for Depth Completion via Low-Rank Decoder Adaptation" (Seo et al., 2 Mar 2026)
"Test-Time Adaptation for Depth Completion" (Park et al., 2024)
"GeomPrompt: Geometric Prompt Learning for RGB-D Semantic Segmentation Under Missing and Degraded Depth" (Jaganathan et al., 13 Apr 2026)
"Self-Supervised Enhancement for Depth from a Lightweight ToF Sensor with Monocular Images" (Ding et al., 16 Jun 2025)
"PatchRefiner V2: Fast and Lightweight Real-Domain High-Resolution Metric Depth Estimation" (Li et al., 2 Jan 2025)
"PuriLight: A Lightweight Shuffle and Purification Framework for Monocular Depth Estimation" (Chen et al., 11 Feb 2026)
"An Online Adaptation Method for Robust Depth Estimation and Visual Odometry in the Open World" (Ji et al., 16 Apr 2025)
"Lightweight Prompt-Guided CLIP Adaptation for Monocular Depth Estimation" (Manghotay et al., 1 Apr 2026)