
Deformable Sparse Kernel Module

Updated 25 February 2026
  • The DSK module is an adaptive filtering method that learns spatially-varying kernel positions and weights to improve aggregation over sparse data.
  • It employs compact neural networks to predict per-location offsets and coefficients, achieving state-of-the-art results in depth upsampling, video interpolation, and point cloud processing.
  • Key limitations include high parameter counts and challenges in handling large spatial displacements, which require careful regularization and architectural tuning.

The Deformable Sparse Kernel (DSK) module is a class of adaptive, learnable filtering and aggregation mechanisms that generalize fixed-grid convolution by enabling both the sampling positions and kernel weights to be dynamically predicted for each output location. DSK modules are designed to address the limitations of standard convolutions and spatially-invariant filtering, including inadequate adaptability to local structure, inefficiency in aggregating information in sparse data, and the inability to handle complex spatially-varying degradation. The DSK formulation has been instantiated in multiple settings: depth map upsampling, joint filtering, denoising, video frame interpolation, motion/depth blur handling in neural representation learning, and point cloud processing, among others. Key characteristics are their sparse support (sampling a small set of K locations per output), deformable sampling geometries (offsets learned w.r.t. a canonical grid), and explicit spatially-variant interpolation weights, all predicted by compact neural networks adapted to the task context.

1. Mathematical Formulation and Core Principles

The canonical DSK operation represents each output as a (possibly residual) weighted sum of input values sampled at spatially or spatio-temporally offset locations, with both the offsets and weights being location- and context-dependent:

y(p) = \sum_{i=1}^{K} w_i(p)\, x\bigl(p + \Delta p_i(p)\bigr)

where x(·) is the input (an image, depth map, video, or point cloud), K is the support size (usually 9 or 16), w_i(p) are per-sample weights, and Δp_i(p) are predicted offsets (fractional for images/videos, real-valued for point clouds). Bilinear or trilinear interpolation is used for fractional coordinates to preserve differentiability.
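A minimal NumPy sketch of this aggregation at a single output location follows. In real DSK modules the offsets and weights are regressed by a network and the sampling is vectorized on GPU; here they are passed in directly to make the formula concrete.

```python
import numpy as np

def bilinear_sample(x, py, px):
    """Sample 2D array x at fractional coordinates (py, px) with
    bilinear interpolation, clamping indices to the image border."""
    H, W = x.shape
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    dy, dx = py - y0, px - x0
    def at(i, j):
        return x[np.clip(i, 0, H - 1), np.clip(j, 0, W - 1)]
    return ((1 - dy) * (1 - dx) * at(y0, x0) + (1 - dy) * dx * at(y0, x0 + 1)
            + dy * (1 - dx) * at(y0 + 1, x0) + dy * dx * at(y0 + 1, x0 + 1))

def dsk_aggregate(x, offsets, weights, p):
    """DSK output at location p = (py, px):
        y(p) = sum_i w_i(p) * x(p + Δp_i(p)).
    offsets: (K, 2) fractional displacements; weights: (K,) coefficients."""
    return sum(w * bilinear_sample(x, p[0] + oy, p[1] + ox)
               for (oy, ox), w in zip(offsets, weights))
```

With all offsets zero and a one-hot weight vector this reduces to reading x at p, which is a quick sanity check for any implementation.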

For guided filtering or upsampling, residual forms are common:

y(p) = x(p) + \sum_{i=1}^{K} w_i(p)\,\bigl[x\bigl(p + \Delta p_i(p)\bigr) - x(p)\bigr]

This high-pass constraint (sum-to-zero of weights) sharpens results and prevents color shift or global bias (Kim et al., 2019).
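The sum-to-zero constraint can be enforced simply by mean-subtracting the raw regressor outputs. The sketch below (an illustration, not any specific paper's code) shows the residual step at one location and why it cannot introduce a global bias: on a constant input the output equals the input exactly.

```python
import numpy as np

def residual_dsk(x_p, x_samples, raw_w):
    """Residual DSK at one location:
        y(p) = x(p) + sum_i w_i [x(p + Δp_i) - x(p)],
    where w is obtained from the raw regressed weights by
    mean-subtraction, so that sum_i w_i = 0 (high-pass constraint)."""
    w = raw_w - raw_w.mean()               # enforce sum-to-zero
    return x_p + np.dot(w, x_samples - x_p)
```

Because the centered weights sum to zero, flat regions pass through unchanged, which is exactly the "no color shift / no global bias" property noted above.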

Offsets and weights are predicted by small, fully-convolutional neural networks, typically acting on local or global context extracted from guidance images, intermediate features, or other modalities. Architectures differ by domain: two-stream CNNs for guided filtering (Kim et al., 2019), encoder-decoders or U-Nets for offset/weight prediction in video or denoising (Tian et al., 2022, Xu et al., 2019), and lightweight MLPs in ray-based representations (Ma et al., 2021). In point clouds, kernel point offsets are estimated by a rigid point-wise convolution on local geometric features (Thomas et al., 2019).

2. DSK Module Architectures Across Application Domains

DSK modules exhibit application-dependent network architectures but share the following fundamental structure:

  • Feature Extraction and Offset/Weight Regression: Feature encoders process the input domain (RGB/depth, frame stacks, grids or point clouds) and regress K offsets and K weights per output location (or per point), with normalization (e.g., sigmoid, mean-subtraction, L1-normalization).
  • Deformable Weighted Aggregation: The input is sampled at locations displaced from a canonical grid/ball by the learned offsets, with each sample weighted by predicted kernel coefficients. Bilinear/trilinear sampling or distance-weighted kernel point influence is employed.
  • Auxiliary Regularization and Constraints: For stability, DSK modules may include regularization terms to keep offsets within reasonable domains (e.g., range constraints, fitting and repulsive losses in point cloud DSKs (Thomas et al., 2019); annealing in video denoising (Xu et al., 2019); explicit alignment losses for NeRF (Ma et al., 2021)).
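As a concrete illustration of the auxiliary constraints, the sketch below implements two common offset penalties in NumPy: a range penalty that activates once an offset leaves an allowed box, and a repulsive penalty that activates when sample points collapse onto each other. The exact forms and thresholds here are assumptions for illustration, loosely following the fitting/repulsive losses cited in the text.

```python
import numpy as np

def offset_regularizers(offsets, max_range=2.0, min_sep=0.5):
    """Illustrative offset regularizers (forms/thresholds are assumptions):
      - range penalty: quadratic cost once |offset| exceeds max_range,
      - repulsive penalty: quadratic cost when two sample points come
        closer than min_sep, discouraging kernel collapse.
    offsets: (K, 2) array of predicted displacements."""
    over = np.maximum(np.abs(offsets) - max_range, 0.0)
    range_loss = float(np.sum(over ** 2))
    # pairwise distances between the K sample points
    d = np.linalg.norm(offsets[:, None, :] - offsets[None, :, :], axis=-1)
    iu = np.triu_indices(len(offsets), k=1)
    rep_loss = float(np.sum(np.maximum(min_sep - d[iu], 0.0) ** 2))
    return range_loss, rep_loss
```

Both terms are zero for a well-spread, in-range kernel and grow smoothly as the predicted geometry degenerates, so they can be added to the task loss without disrupting training.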

Table: Representative DSK architectures and predictions

| Domain | Offset/Weight Regression | Aggregation Mechanism |
|---|---|---|
| Image/Depth | Two-stream conv nets | Bilinear sum over K samples |
| Video (interpolation) | U-Net (shared) | Bilinear sum + occlusion |
| Point Cloud | Point conv (KPConv) | Kernel points, influence fn |
| NeRF / blur modeling | MLP (per-point, per-ray) | Weighted rays, sum & blend |

3. Task-Specific Instantiations and Benchmarks

The DSK module has demonstrated state-of-the-art or highly competitive performance in a range of vision and geometry tasks:

  • Depth Completion: A two-branch encoder-decoder first fuses sparse LiDAR and RGB cues to produce a coarse (∼16% dense) depth prediction. A single DSK refinement step improves RMSE by 20–30 mm on KITTI, achieving ∼728 mm test RMSE with ∼20 ms inference time. Kernel size k=3 (K=9) is optimal; denser or sparser support degrades results. Ablation shows DSK must be applied on sufficiently dense, guidance-fused input rather than raw sparse depth (Sun et al., 2023).
  • Guided Upsampling and Joint Filtering: DSKs as in DKN/FDKN explicitly learn sparse 3×3 or 4×4 kernels for each pixel, predicting both offsets and weights via two-stream CNNs. For upsampling ×4/×8/×16 on NYU v2 and Middlebury, DSK methods reduced RMSE by 30–50% compared to classical and deep networks. The FDKN variant reduces inference time by 17× without loss of accuracy via shift-and-stack architectural modifications (Kim et al., 2019, Kim et al., 2019).
  • Video Frame Interpolation and Denoising: In frame interpolation, DSKs enable spatially-adaptive kernel regions, outperforming baseline methods in PSNR/SSIM and handling motion and shape boundaries. For denoising, spatio-temporal DSKs (2D/3D kernels) reduce oversmoothing and handle large motion by dynamically allocating sample points across frames. Quantitative gains over non-deformable or fixed-kernel baselines are reported in PSNR/SSIM (e.g., 36.91 dB vs. 34–36 dB on standard denoising benchmarks) (Tian et al., 2022, Xu et al., 2019).
  • Point Cloud Processing: KPConv with deformable sparse kernel points adapts the support to local geometry, improving mIoU in scene/part segmentation and providing resistance to sparsity. Regularization terms prevent kernel drift and ensure robust coverage (Thomas et al., 2019).
  • Radiance Field Deblurring: In Deblur-NeRF, a DSK MLP predicts, for each ray and canonical kernel point, both 2D pixel offsets, 3D origin shifts, and blending weights, simulating spatially-varying motion and defocus blurs. The module enables learning a sharp, view-consistent NeRF from severely blurred image inputs via joint optimization (Ma et al., 2021).
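For the point cloud case, the distance-weighted influence that spreads each kernel point's weight over nearby neighbors can be sketched as follows, in the spirit of KPConv's linear correlation (Thomas et al., 2019); the radius value is chosen here purely for illustration.

```python
import numpy as np

def kp_influence(neighbors, kernel_points, sigma=1.0):
    """Linear influence of (possibly deformed) kernel points on neighbors.
    neighbors: (N, 3) neighbor coordinates relative to the center point.
    kernel_points: (K, 3) kernel positions (canonical points + offsets).
    Returns (N, K) weights h(n, k) = max(0, 1 - ||n - k|| / sigma),
    i.e. full influence at the kernel point, none beyond radius sigma."""
    d = np.linalg.norm(neighbors[:, None, :] - kernel_points[None, :, :],
                       axis=-1)
    return np.maximum(1.0 - d / sigma, 0.0)
```

These influence weights play the role of the bilinear coefficients in the grid case: they make the aggregation differentiable with respect to the kernel point offsets, so the deformation can be learned end to end.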

4. Key Implementation Details and Hyperparameters

DSK modules exhibit the following typical properties and hyperparameters (as empirically validated in the respective papers):

  • Support Size: K = 9 (3×3) or 16 (4×4) is standard; marginal gains are observed beyond these.
  • Kernel Prediction: Single 1×1 convolutions or shallow MLP branches suffice for regressing weights/offsets in high-dimensional feature space.
  • Interpolation: Bilinear for images/videos, trilinear for spatio-temporal DSKs, linear distance-weighted for point clouds.
  • Loss Functions: Combination of L1, L2, and task-specific penalties (e.g., residual sum-to-zero for sharpening, regularization for offset stability).
  • Optimization: Adam is standard, learning rates 1e-3 to 5e-4, cosine annealing or step decay, batch sizes variable by task.
  • Inference Speed: DKN/FDKN-style modules can reach 0.01 s per HR image (640×480); ReDC's DSK refinement runs in ∼20 ms per KITTI frame; Deblur-NeRF reports per-batch update speeds depending on scene/MLP size and GPU.
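The kernel-prediction head itself is lightweight: a 1×1 convolution is just a per-pixel matrix multiply mapping C feature channels to 3K outputs (2K offset components plus K raw weights). The following sketch uses illustrative parameter shapes, not those of any specific paper, and applies the mean-subtraction normalization mentioned above.

```python
import numpy as np

def kernel_prediction_head(features, W, b, K=9):
    """1x1-conv head regressing per-pixel kernels (illustrative shapes).
    features: (H, W_img, C); W: (C, 3K); b: (3K,).
    Returns offsets (H, W_img, K, 2) and zero-mean weights (H, W_img, K)."""
    out = features @ W + b                       # per-pixel matmul = 1x1 conv
    offsets = out[..., : 2 * K]                  # 2K fractional displacements
    raw_w = out[..., 2 * K:]                     # K raw weights
    weights = raw_w - raw_w.mean(axis=-1, keepdims=True)  # high-pass constraint
    return offsets.reshape(*features.shape[:2], K, 2), weights
```

Because the head is a single matrix multiply per location, its cost is negligible next to the feature encoder, which is consistent with the reported low runtime overhead of DSK refinement.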

5. Analytical Insights, Ablations, and Limitations

Thorough ablation studies and analysis validate the DSK design:

  • Effectiveness of Deformable Sampling: Static fixed-grid or rigid kernels invariably underperform. Allowing dynamic offset prediction (with or without explicit guidance fusion) robustly improves fine detail recovery, edge sharpness, alignment under motion, and resilience to sparse or missing data.
  • Support Density and Guidance: Performance degrades if DSK is applied too early (to very sparse input) or without strong structure guidance (e.g., only on raw sparse depth). Intermediate, moderately dense fused inputs are optimal (Sun et al., 2023).
  • Residual/High-Pass Constraints: For depth and upsampling, enforcing residual high-pass (weights sum to zero) improves sharpening and avoids color shift or bias (Kim et al., 2019, Kim et al., 2019).
  • Regularization: Deformable kernels can collapse or drift if unconstrained—addressed via range regularization, repulsive/fitting losses (KPConv), or annealing schedule for 3D DSK in videos. Without these, network stability and coverage degrade significantly (Thomas et al., 2019, Xu et al., 2019).

Limitations include high parameter counts in multi-branch architectures (e.g., U-Nets for per-pixel kernel/offset, flow estimation in frame interpolation), potential computational cost for very large supports, and limited ability to handle extremely large displacements unless the base grid or window is expanded (Tian et al., 2022).

6. Generalizations and Cross-Domain Applicability

The unifying abstraction of DSK is the explicit, per-output prediction of where and how much to aggregate from a sparse support, making it a flexible tool across disparate data modalities:

  • Structured Grid Data: Images, depth, video—DSK generalizes spatially-invariant filtering to location-adaptive, edge-aware processing.
  • Unstructured Point Clouds: DSK via KPConv adapts support to geometry, supporting segmentation and classification on non-uniform, unordered data.
  • Renderer/Inverse Graphics: DSK MLPs model blur PSFs by deforming rays in 2D/3D, simulating physical acquisition processes and enhancing robustness in neural representations (NeRF).

DSK modules are fully differentiable, implementable via standard deep-learning frameworks, and compatible with back-propagation for joint optimization with upstream/downstream encoders or volumetric predictors.

7. Comparative Performance and Prospects

Across vision and geometry tasks, DSK modules achieve state-of-the-art or highly competitive performance:

  • Depth completion/upsampling: RMSE improvement of 20–44% over prior art, with minimal runtime overhead (Sun et al., 2023, Kim et al., 2019, Kim et al., 2019).
  • Video tasks: Quantitative (PSNR/SSIM) and visual quality improvements in motion-rich/occluded scenes; order-of-magnitude runtime reduction in optimized DKN/FDKN variants (Tian et al., 2022).
  • Point clouds: Less than 1.5% absolute mIoU drop (vs. rigid) under aggressive kernel sparsification, confirming the value of dynamic support (Thomas et al., 2019).
  • NeRF/deblurring: Reconstruction of sharp, view-consistent radiance fields from extremely blurred imagery by fully absorbing spatially-varying PSFs into a learnable, sparse kernel (Ma et al., 2021).

A plausible implication is that future DSK work will explore lighter-weight architectures for deployment, richer support parameterizations (beyond grids/balls), and joint modulation strategies (merging amplitude and sampling in unified heads). The approach continues to generalize well as a module for plug-in refinement or structural adaptation in many modalities, with demonstrated downstream gains in semantic segmentation and other postprocessing tasks (Kim et al., 2019).
