Learnable Spacing & Interpolation (DCLS)
- Learnable Spacing and Interpolation (DCLS) is a neural technique that parameterizes non-zero convolutional tap positions as continuous, learnable variables.
- It employs differentiable interpolation methods, like bilinear and Gaussian kernels, to optimize tap positions via gradient descent for adaptive receptive fields.
- Empirical results show that DCLS boosts performance and interpretability in vision, audio, and spiking neural networks, while modestly increasing computational cost.
Learnable Spacing and Interpolation (DCLS; the acronym comes from Dilated Convolution with Learnable Spacings) refers to a family of neural architectural techniques in which the positions of the non-zero elements (“taps”) in a convolutional or delay kernel are parameterized as continuous, learnable variables rather than fixed integer positions on a regular grid. Because a differentiable interpolation operator maps these positions onto the grid, they can be optimized via gradient descent, enabling the network to adapt receptive fields, temporal alignments, or context aggregation to the task and data distribution. DCLS was initially introduced in vision and later extended to audio, spiking neural networks, and various deep learning modalities, where it has shown consistent improvements over standard and dilated convolutions across a range of supervised learning benchmarks (Khalfaoui-Hassani et al., 2021, Khalfaoui-Hassani et al., 2023, Khalfaoui-Hassani, 2024, Khalfaoui-Hassani et al., 2023, Chamas et al., 2024, Hammouamri et al., 2023).
1. Mathematical Formulation and Core Principles
For a standard $n$-dimensional dilated convolution, the kernel samples the input at fixed, integer-offset positions determined by a dilation factor. DCLS generalizes this by introducing a set of learnable (real-valued) offsets for each kernel tap. In 2D, a DCLS kernel $K$ with $m$ nonzero elements is parameterized by learnable weights $w_k$ and real-valued positions $(p^x_k, p^y_k)$, $k = 1, \dots, m$:
$$K = \sum_{k=1}^{m} w_k \, \mathcal{I}(p^x_k, p^y_k),$$
where $\mathcal{I}(p^x, p^y)$ denotes the interpolated unit impulse at position $(p^x, p^y)$ on the integer kernel grid. For non-integer positions $(p^x_k, p^y_k)$, DCLS employs differentiable interpolation schemes to place each tap on that grid (e.g., bilinear/triangle or Gaussian kernels). These interpolations guarantee that gradients with respect to both kernel weights and positional parameters are well-defined, ensuring effective end-to-end training (Khalfaoui-Hassani et al., 2021, Khalfaoui-Hassani et al., 2023, Khalfaoui-Hassani, 2024).
The $n$-dimensional generalization is immediate: each kernel tap is parameterized by an $n$-vector of continuous offsets, and task-adaptive sampling is performed via an appropriate, typically separable, interpolation kernel.
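To make the contrast with a fixed grid concrete, the sketch below enumerates the tap offsets of an ordinary dilated kernel and then perturbs them into real-valued positions, as DCLS would allow. The function names and the perturbation values are hypothetical and chosen only for illustration.

```python
from itertools import product

def dilated_kernel_size(k: int, d: int) -> int:
    """Extent along one axis of a k-tap kernel with dilation d."""
    return d * (k - 1) + 1

def dilated_positions(k: int, d: int, ndim: int = 2):
    """Centered tap offsets of a standard dilated kernel (a fixed regular grid)."""
    axis = [d * (i - (k - 1) / 2) for i in range(k)]
    return list(product(axis, repeat=ndim))

# A 3x3 kernel with dilation 8 spans 17x17 and has 9 taps on a fixed grid.
fixed = dilated_positions(k=3, d=8)

# DCLS keeps a small number of taps but treats every coordinate as a
# real-valued parameter; the shifts below are arbitrary illustrative values.
learned = [(x + 0.37, y - 1.2) for x, y in fixed]
```

A standard dilated convolution is thus the special case in which the positions stay frozen on the regular grid.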
2. Interpolation Schemes and Learnability
DCLS achieves differentiability with respect to position parameters by using interpolation kernels that are continuous in the tap position and differentiable almost everywhere (triangle) or everywhere (Gaussian). Two primary classes are used:
- Bilinear (triangle) interpolation: Each tap contributes mass to the four nearest grid points, weighted by the area of overlap. The interpolation function is $\Lambda(x) = \max(0, 1 - |x|)$ applied along each axis, yielding the classic bilinear 2×2 stencil.
- Gaussian interpolation: Each tap spreads its mass over an entire neighborhood according to a normalized Gaussian kernel $G_\sigma(x) \propto \exp(-x^2 / (2\sigma^2))$. The spread $\sigma$ can itself be a learnable parameter or scheduled during training.
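The forward map for a single “impulse” of weight $w$ at position $(p^x, p^y)$ with scale $\sigma$ is given by:
$$K_{ij} = w \, \frac{g_\sigma(i - p^x)\, g_\sigma(j - p^y)}{\sum_{i', j'} g_\sigma(i' - p^x)\, g_\sigma(j' - p^y)},$$
where $g_\sigma$ is either the triangle or the Gaussian kernel, and normalization ensures the total mass equals $w$ (Khalfaoui-Hassani et al., 2023).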
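A minimal 1D sketch of this forward map, written in plain Python for readability, is given below. It assumes positions are expressed as 0-based grid indices and adds a small `eps` to the normalizer; both choices, and the function names, are illustrative assumptions rather than details taken from the cited papers.

```python
import math

def triangle(x: float) -> float:
    """Triangle (linear interpolation) kernel: max(0, 1 - |x|)."""
    return max(0.0, 1.0 - abs(x))

def gaussian(x: float, sigma: float) -> float:
    """Unnormalized Gaussian kernel with spread sigma."""
    return math.exp(-0.5 * (x / sigma) ** 2)

def splat_1d(w, p, size, kernel=triangle, eps=1e-7):
    """Spread one impulse of weight w at real position p over `size` grid cells,
    normalizing so that the cells sum to (approximately) w."""
    raw = [kernel(i - p) for i in range(size)]
    total = sum(raw) + eps
    return [w * r / total for r in raw]

# Triangle: only the two neighbours of p = 3.25 receive mass (about 1.5 and 0.5).
tri = splat_1d(w=2.0, p=3.25, size=7)

# Gaussian: every cell receives some mass, concentrated around p.
gau = splat_1d(w=2.0, p=3.25, size=7, kernel=lambda x: gaussian(x, sigma=0.8))
```

The 2D case applies the same kernel separably along each axis.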
This flexible interpolation provides smooth, informative gradients for weights and positions, enabling arbitrary positioning of kernel taps and, for Gaussian, modulation of their local receptive field extent.
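For intuition about the position gradient, the following worked derivation covers the simplest case, 1D triangle interpolation of a single tap. The symbols $i_0$, $r$, and $g_i$ are introduced here for illustration only.

```latex
\begin{align}
  % tap at real position p, with i_0 = \lfloor p \rfloor and r = p - i_0 \in [0, 1)
  K_{i_0} &= w\,(1 - r), & K_{i_0+1} &= w\,r \\
  % derivatives with respect to the position
  \frac{\partial K_{i_0}}{\partial p} &= -w, &
  \frac{\partial K_{i_0+1}}{\partial p} &= +w \\
  % chain rule with upstream gradients g_i = \partial L / \partial K_i
  \frac{\partial L}{\partial p} &= w\,\bigl(g_{i_0+1} - g_{i_0}\bigr), &
  \frac{\partial L}{\partial w} &= (1 - r)\,g_{i_0} + r\,g_{i_0+1}
\end{align}
```

The position gradient is therefore a finite difference of the upstream gradient across the two neighbouring cells, scaled by the tap weight. With the triangle kernel it is piecewise constant in $p$, whereas a Gaussian kernel draws on a wider neighbourhood and varies smoothly.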
3. Implementation and Optimization
The canonical DCLS implementation is a two-step process:
- Kernel Construction: For each convolutional group or channel, synthesize a sparse, large spatial kernel of size $s \times s$ by “splat-and-sum” of the learned weighted impulses via the chosen interpolant. The process can be vectorized over batches, channels, and kernel count for memory and compute efficiency.
- Convolution: Apply any standard convolution routine to perform the actual filtering with the synthesized kernel over the input feature map.
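The two steps can be sketched in plain Python as follows. This is an unvectorized illustration for a single channel with bilinear interpolation, assuming positions are offsets from the kernel center; the function names are hypothetical, and a practical implementation would build the kernel with tensor operations and call a framework convolution instead.

```python
import math

def build_dcls_kernel(weights, positions, s):
    """Step 1: splat m weighted taps into a dense s x s kernel (bilinear).
    positions are real-valued (row, col) offsets from the kernel center."""
    K = [[0.0] * s for _ in range(s)]
    c = s // 2
    for w, (py, px) in zip(weights, positions):
        y, x = py + c, px + c                    # shift to 0-based grid coordinates
        y0, x0 = math.floor(y), math.floor(x)
        ry, rx = y - y0, x - x0                  # fractional parts
        for dy, wy in ((0, 1 - ry), (1, ry)):
            for dx, wx in ((0, 1 - rx), (1, rx)):
                i, j = y0 + dy, x0 + dx
                if 0 <= i < s and 0 <= j < s:    # drop mass that falls outside
                    K[i][j] += w * wy * wx
    return K

def conv2d_valid(image, K):
    """Step 2: ordinary 'valid' cross-correlation with the dense kernel."""
    s = len(K)
    H, W = len(image), len(image[0])
    return [[sum(K[a][b] * image[i + a][j + b] for a in range(s) for b in range(s))
             for j in range(W - s + 1)]
            for i in range(H - s + 1)]

# Two taps (m = 2) in a 7 x 7 dilated kernel, one at a non-integer position.
K = build_dcls_kernel(weights=[1.0, -2.0], positions=[(0.0, 0.0), (-1.5, 2.25)], s=7)
image = [[float(i * 9 + j) for j in range(9)] for i in range(9)]
out = conv2d_valid(image, K)
```

Because the convolution itself is unchanged, all of the DCLS-specific work is confined to the kernel construction step.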
Gradient propagation is automatic under standard deep learning frameworks since the interpolation operators are differentiable. Careful scheduling of learning rates, typically increasing those for position variables by a factor of ≈5 relative to weights, yields stable convergence. No weight decay is applied to position or $\sigma$ parameters, only to weights (Khalfaoui-Hassani, 2024, Khalfaoui-Hassani et al., 2021).
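The learning-rate and weight-decay split can be illustrated with a toy SGD step over two parameter groups. The values of `base_lr`, the decay coefficient, and the scalar parameters are hypothetical; in a framework such as PyTorch the same idea is usually expressed through per-group `lr` and `weight_decay` options of the optimizer.

```python
def sgd_step(groups):
    """One SGD step over parameter groups, each with its own lr and weight decay."""
    for g in groups:
        for name, value in list(g["params"].items()):
            grad = g["grads"][name] + g["weight_decay"] * value
            g["params"][name] = value - g["lr"] * grad

base_lr = 1e-3  # hypothetical base learning rate
groups = [
    # kernel weights: base learning rate, weight decay applied
    {"params": {"w": 0.8}, "grads": {"w": 0.1},
     "lr": base_lr, "weight_decay": 0.05},
    # positions and sigma: about 5x the learning rate, no weight decay
    {"params": {"p_x": 1.3, "sigma": 0.5}, "grads": {"p_x": 0.1, "sigma": 0.0},
     "lr": 5 * base_lr, "weight_decay": 0.0},
]
sgd_step(groups)
```

Excluding positions from weight decay avoids pulling every tap toward the kernel center, which is where a decay toward zero would send them under a center-relative parameterization.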
Position and scale parameters are initialized uniformly or near a regular grid, with optional clamping at each step to respect valid kernel support. In practice, position sharing across layers or repulsive regularization to avoid tap overlap can further stabilize training (Khalfaoui-Hassani et al., 2023, Khalfaoui-Hassani et al., 2021).
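A small sketch of the initialization and clamping options is shown below. It assumes positions are offsets from the kernel center and that the valid support is `[-s//2, s//2]` along each axis; the function names and the jitter amplitude are illustrative assumptions.

```python
import random

def init_uniform(m, s, seed=0):
    """m (row, col) offsets drawn uniformly over the s x s kernel support."""
    rng = random.Random(seed)
    half = s // 2
    return [(rng.uniform(-half, half), rng.uniform(-half, half)) for _ in range(m)]

def init_grid(side, s, jitter=0.1, seed=0):
    """side * side offsets on a regular lattice spanning the kernel, plus small jitter."""
    rng = random.Random(seed)
    half = s // 2
    if side > 1:
        coords = [-half + i * (2 * half / (side - 1)) for i in range(side)]
    else:
        coords = [0.0]
    return [(y + rng.uniform(-jitter, jitter), x + rng.uniform(-jitter, jitter))
            for y in coords for x in coords]

def clamp_positions(positions, s):
    """Project offsets back into [-s//2, s//2], e.g. after each optimizer step."""
    half = s // 2
    return [(min(max(y, -half), half), min(max(x, -half), half)) for y, x in positions]

positions = clamp_positions(init_grid(side=3, s=17), s=17)
```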
For the 1D case, used in temporal processing and SNNs, a Gaussian bump is constructed as a kernel tap with learnable delay; in the limit $\sigma \to 0$, this approaches a delta function at an integer delay (Hammouamri et al., 2023).
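The 1D delay case can be sketched as follows: a normalized Gaussian bump over discrete delay taps is applied causally to a spike train, and shrinking the spread turns it into a pure shift. The function names, the spike train, and the two spread values are hypothetical.

```python
import math

def delay_kernel(d, sigma, max_delay):
    """Normalized Gaussian bump centered at real-valued delay d over taps 0..max_delay."""
    raw = [math.exp(-0.5 * ((t - d) / sigma) ** 2) for t in range(max_delay + 1)]
    total = sum(raw)
    return [r / total for r in raw]

def apply_delay(spikes, kernel):
    """Causal filtering: out[t] = sum_k kernel[k] * spikes[t - k]."""
    return [sum(kernel[k] * spikes[t - k] for k in range(len(kernel)) if t - k >= 0)
            for t in range(len(spikes))]

spikes = [0, 1, 0, 0, 0, 0, 0, 0]   # a single spike at t = 1

# Wide bump: the spike is smeared over several time steps around t = 1 + 3.
wide = apply_delay(spikes, delay_kernel(d=3.0, sigma=2.0, max_delay=5))

# Narrow bump: the kernel is close to one-hot, so the spike is simply shifted to t = 4.
sharp = apply_delay(spikes, delay_kernel(d=3.0, sigma=0.1, max_delay=5))
```

A wide bump early in training gives the delay parameter gradient signal from distant time steps, while a narrow bump late in training yields a discrete delay.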
4. Integration Across Domains
DCLS is a drop-in replacement for classical 2D/1D convolutions and dilated convolutions in deep architectures:
- CNNs: Depthwise and pointwise convolutions in ResNet, ConvNeXt, ConvFormer, FastViT, and related vision models can be replaced by DCLS, e.g., using a dilated kernel size $s$ (such as $s = 17$) and kernel count $m$ (often ≈34 for ConvNeXt) at fixed parameter budget (Khalfaoui-Hassani et al., 2021, Khalfaoui-Hassani, 2024, Khalfaoui-Hassani et al., 2023, Chamas et al., 2024).
- Hybrid CNN-attention: DCLS layers can be integrated into hybrid modules, replacing only the convolutional parts to exploit data-adaptive spatial filtering. Attention heads remain untouched (Khalfaoui-Hassani, 2024).
- Spiking Neural Networks: DCLS-1D is used for learnable axonal delays, transforming each synapse into a delay-line with position and width optimized for spatiotemporal pattern detection. The Gaussian interpolant yields both smooth gradients and sharp spike-timing alignment (Hammouamri et al., 2023, Khalfaoui-Hassani, 2024).
- Audio and Speech: In time-frequency CNNs for audio tagging, DCLS enables data-driven learning of spectrotemporal receptive fields, improving context aggregation for features extracted from log-Mel spectrograms (Khalfaoui-Hassani et al., 2023, Khalfaoui-Hassani, 2024).
5. Empirical Evaluation and Performance
Experiments across vision, audio, and spiking benchmarks uniformly demonstrate that DCLS-based models outperform or match their fixed-grid counterparts at iso-parameter count, with modest throughput reductions due to larger effective receptive fields. Key results:
| Task/Model | Baseline | DCLS | Δ (points) |
|---|---|---|---|
| ConvNeXt-T ImageNet top-1 | 82.1% | 82.5% (m=34,s=17) | +0.4 |
| ConvNeXt-B ImageNet top-1 | 83.8% | 84.1% (m=34,s=17) | +0.3 |
| ADE20K Segmentation mIoU | 46.0 – 49.1 | 47.1 – 49.3 | +0.2 to +1.1 |
| SHD (SNN, 20-class) | 94.62% | 95.07% ± 0.24 | +0.45 |
| AudioSet mAP (ConvNeXt-T) | 44.83% | 45.52% | +0.7 |
| Speech Command (SNN) | 77.4% | 80.7% | +3.3 |
Throughput reductions are minor for depthwise-separable settings (e.g., a 6% throughput drop for ConvNeXt-DCLS), and the parameter overhead is negligible (per kernel: 2D positions and an optional $\sigma$). Ablations confirm that DCLS’s gain is not replicated by “just” more layers or isolated learnable delays; the key is interpolated, flexible context adaptation (Hammouamri et al., 2023, Khalfaoui-Hassani et al., 2023, Khalfaoui-Hassani et al., 2023, Chamas et al., 2024, Khalfaoui-Hassani, 2024).
6. Interpretability and Alignment
Recent Grad-CAM studies show that models equipped with DCLS not only outperform in accuracy but also exhibit increased interpretability as measured by alignment with human attention heatmaps (ClickMe): Spearman correlations for DCLS-augmented ConvNeXt models are 4–5 points higher; ResNet50 increases from 0.6135 to 0.6252 (standard Grad-CAM) and 0.7125 to 0.7261 (Threshold-Grad-CAM). Seven of eight tested architectures saw improvements, with only specialized kernel reparametrizations (FastViT_sa24) showing slight degradation (Chamas et al., 2024). This suggests DCLS adaptively concentrates model attention on task-relevant spatial regions.
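As an illustration of the alignment metric, the sketch below computes a Spearman rank correlation between two heatmaps in plain Python. The two small maps are made-up values, and the exact preprocessing used in the cited study (resizing, thresholding, averaging over images) is not reproduced here.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(map_a, map_b):
    """Spearman correlation between two heatmaps given as 2D lists of equal shape."""
    a = average_ranks([v for row in map_a for v in row])
    b = average_ranks([v for row in map_b for v in row])
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

model_map = [[0.1, 0.4], [0.7, 0.9]]   # hypothetical Grad-CAM heatmap
human_map = [[0.0, 0.2], [0.3, 0.8]]   # hypothetical human attention map
score = spearman(model_map, human_map)
```

Because the metric depends only on ranks, it rewards a model for ordering image regions the way humans do, regardless of the absolute scale of the heatmap.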
7. Limitations, Extensions, and Future Work
The main limitations stem from increased FLOPs and memory when the dilated kernel size $s$ is large, and from the use of separable interpolation (triangle or Gaussian), which may not fully exploit non-axis-aligned patterns. Gains typically saturate at moderate kernel sizes, which already achieve ≥95% of the possible improvement; much larger $s$ yields diminishing returns (Khalfaoui-Hassani, 2024).
Current implementations lack sparse matrix optimization and efficient custom CUDA kernels for extremely large or sparse DCLS kernels, and DCLS integration in multi-dimensional (3D, video) settings is underexplored. Further directions include the study of adaptive $\sigma$ learning versus scheduled decay, dynamic kernel count, integration into local-attention modules, quantization for neuromorphic deployment, and hardware-aware DCLS variants (Khalfaoui-Hassani et al., 2023, Khalfaoui-Hassani, 2024, Hammouamri et al., 2023, Khalfaoui-Hassani et al., 2023).
References:
- (Khalfaoui-Hassani et al., 2021): Dilated convolution with learnable spacings
- (Khalfaoui-Hassani et al., 2023): Dilated Convolution with Learnable Spacings: beyond bilinear interpolation
- (Khalfaoui-Hassani, 2024): Dilated Convolution with Learnable Spacings
- (Khalfaoui-Hassani et al., 2023): Audio classification with Dilated Convolution with Learnable Spacings
- (Chamas et al., 2024): Dilated Convolution with Learnable Spacings makes visual models more aligned with humans: a Grad-CAM study
- (Hammouamri et al., 2023): Learning Delays in Spiking Neural Networks using Dilated Convolutions with Learnable Spacings