CLIDD: Cross-Layer Independent Deformable Description
- The method employs per-keypoint, cross-layer deformable sampling with learnable offsets to capture fine-grained structural details and boost matching accuracy.
- Its hardware-aware kernel fusion minimizes memory bottlenecks, doubling throughput and enabling efficient real-time performance for applications like robotics and augmented reality.
- CLIDD offers scalable model variants, balancing computational load and performance via adjustable channel depths, residual blocks, and offset counts.
Cross-Layer Independent Deformable Description (CLIDD) is a local feature representation methodology designed to optimize the balance between discriminative power and computational efficiency for real-time applications, such as robot navigation and augmented reality. Unlike traditional approaches that generate unified dense feature maps, CLIDD introduces a framework that samples sparse, per-keypoint descriptors directly across independent, multi-scale feature hierarchies. This is achieved through the use of learnable offsets for each keypoint and feature pyramid level, enabling fine-grained structural capture across scales. CLIDD is distinguished by its hardware-aware implementation, a scalable design accommodating a wide spectrum of deployment constraints, and empirically validated state-of-the-art matching accuracy and throughput (Yao et al., 14 Jan 2026).
1. Formal Definition and Feature Sampling
Let an input image be transformed into a feature pyramid with $L$ levels, producing feature maps $F_l \in \mathbb{R}^{C_l \times H_l \times W_l}$ for $l = 1, \dots, L$, where each $F_l$ is defined over a (potentially continuous, via interpolation) spatial domain at its scale (normalized for typical resolutions such as $1/2$, $1/8$, $1/32$). For every detected keypoint at image coordinate $\mathbf{p}$, CLIDD predicts, for each layer $l$, a set of $K$ learnable offsets $\{\Delta\mathbf{p}_{l,k}\}_{k=1}^{K}$. These offsets specify $K$ sampling points per keypoint per level:

$$\mathbf{s}_{l,k} = \frac{\mathbf{p}}{r_l} + \Delta\mathbf{p}_{l,k},$$

where $r_l$ is the downsampling factor of level $l$. Because the offsets are predicted independently across layers, the method is termed "cross-layer independent." The sampled features from all layers are concatenated across the channel dimension to yield:

$$\mathbf{v} = \big\Vert_{l=1}^{L} \big\Vert_{k=1}^{K} F_l(\mathbf{s}_{l,k}) \in \mathbb{R}^{K \sum_l C_l}.$$

A learned linear projection (or a $1 \times 1$ convolution) reduces this vector to the final descriptor $\mathbf{d} \in \mathbb{R}^{C_{\text{desc}}}$.
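The sampling-and-projection step above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the function names, shapes, and the fixed (rather than predicted) offsets are all assumptions for the sake of a runnable example.

```python
import numpy as np

def bilinear_sample(fmap, x, y):
    """Bilinearly interpolate a C x H x W feature map at continuous (x, y)."""
    C, H, W = fmap.shape
    x = float(np.clip(x, 0, W - 1))
    y = float(np.clip(y, 0, H - 1))
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    top = (1 - wx) * fmap[:, y0, x0] + wx * fmap[:, y0, x1]
    bot = (1 - wx) * fmap[:, y1, x0] + wx * fmap[:, y1, x1]
    return (1 - wy) * top + wy * bot

def clidd_descriptor(pyramid, strides, keypoint, offsets, proj):
    """Sample K offsets per pyramid level around one keypoint,
    concatenate across levels, and project to the descriptor.

    pyramid:  list of C_l x H_l x W_l maps
    strides:  downsampling factor r_l per level, e.g. [2, 8, 32]
    offsets:  list of (K, 2) arrays per level (learned in the real model)
    proj:     (C_desc, K * sum_l C_l) projection matrix
    """
    samples = []
    for fmap, s, offs in zip(pyramid, strides, offsets):
        base = np.asarray(keypoint, dtype=float) / s  # keypoint in level coords
        for dx, dy in offs:
            samples.append(bilinear_sample(fmap, base[0] + dx, base[1] + dy))
    v = np.concatenate(samples)            # cross-layer concatenation
    d = proj @ v                           # linear projection to C_desc
    return d / (np.linalg.norm(d) + 1e-8)  # L2-normalized descriptor
```

Because each level is sampled with its own offsets, refining one scale's sampling pattern never perturbs another scale, which is the "independent" part of the design.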
2. Network Architecture and Data Flow
The CLIDD backbone comprises a lightweight CNN that constructs a three-level feature hierarchy:
- Stage 1: $1/2$ spatial downsampling; a stride-$2$ convolution producing $C_1$ channels, followed by a single ResNet block.
- Stage 2: Average pooling, followed by $N_2$ residual blocks, producing $C_2$ channels at $1/8$ resolution.
- Stage 3: Further average pooling, $N_3$ residual blocks, and $C_3$ channels at $1/32$ scale.
ReLU activation is used system-wide for throughput maximization. Parallel to this feature extraction, a detection head operates on the $1/2$ scale, using convolutions for dimensionality reduction and PixelShuffle for upsampling, producing a dense keypoint heatmap. Non-maximum suppression extracts keypoint positions $\{\mathbf{p}_i\}$.
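The final heatmap-to-keypoints step can be illustrated with a simple greedy non-maximum suppression. This is a generic NMS sketch under assumed parameters (suppression radius, score threshold), not the specific routine used by CLIDD.

```python
import numpy as np

def nms_keypoints(heatmap, radius=2, threshold=0.5, max_kpts=1000):
    """Greedy NMS: keep local maxima above threshold, strongest first,
    suppressing everything within `radius` of an accepted keypoint."""
    H, W = heatmap.shape
    ys, xs = np.where(heatmap > threshold)
    order = np.argsort(-heatmap[ys, xs])          # strongest response first
    keep = []
    suppressed = np.zeros((H, W), dtype=bool)
    for i in order:
        y, x = int(ys[i]), int(xs[i])
        if suppressed[y, x]:
            continue
        keep.append((x, y))                       # (x, y) image coordinates
        y0, y1 = max(0, y - radius), min(H, y + radius + 1)
        x0, x1 = max(0, x - radius), min(W, x + radius + 1)
        suppressed[y0:y1, x0:x1] = True           # blank out the neighborhood
        if len(keep) >= max_kpts:
            break
    return keep
```

The accepted positions would then be handed to the description head as the keypoint set.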
The CLIDD description head consists of:
- Cross-Layer Predictor (CL-Predictor): For each keypoint $\mathbf{p}_i$, extracts a feature via bilinear interpolation at each layer, concatenates these into a cross-layer embedding of $C_1 + C_2 + C_3$ dimensions, then applies a pointwise ($1 \times 1$) convolution to output all offsets, which are reshaped into $K$ offset pairs per layer.
- Layer-Independent Sampler (LI-Sampler): For each layer $l$ and offset index $k$, samples $F_l$ at position $\mathbf{s}_{l,k}$, producing $K$ samples per layer.
- Aggregation: Concatenates all layer samples into a single vector, projected by a pointwise operator to $\mathbf{d} \in \mathbb{R}^{C_{\text{desc}}}$.
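The CL-Predictor step, in particular, amounts to a single matrix multiply per keypoint once the cross-layer embedding is assembled. The sketch below is an assumption-laden illustration: names and shapes are hypothetical, and nearest-neighbor lookup stands in for the bilinear interpolation the article describes, to keep the example short.

```python
import numpy as np

def predict_offsets(pyramid, strides, keypoint, W_off, K):
    """CL-Predictor sketch: one cross-layer embedding -> all offsets at once.

    W_off: (L * K * 2, sum_l C_l) weight of the pointwise projection
    (a 1x1 conv applied at a single keypoint is just a matrix multiply).
    """
    feats = []
    for f, s in zip(pyramid, strides):
        C, H, W = f.shape
        # Nearest-neighbor lookup for brevity; CLIDD uses bilinear interpolation.
        x = int(round(min(max(keypoint[0] / s, 0), W - 1)))
        y = int(round(min(max(keypoint[1] / s, 0), H - 1)))
        feats.append(f[:, y, x])
    embed = np.concatenate(feats)                 # (sum_l C_l,) embedding
    offsets = W_off @ embed                       # all layers' offsets at once
    return offsets.reshape(len(pyramid), K, 2)    # K (dx, dy) pairs per layer
```

Predicting every layer's offsets from one shared embedding, but applying them independently per layer, is what lets each scale specialize its sampling pattern.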
Key configuration parameters for model scaling (see Section 5) include the channel widths $(C_1, C_2, C_3)$, residual block counts $(N_2, N_3)$, offset count $K$, and output descriptor dimension $C_{\text{desc}}$.
3. Hardware-Aware Kernel Fusion
Traditional sampling and aggregation routines for multi-scale features incur high latency when every sampled feature is written to global memory independently. CLIDD minimizes such bottlenecks by fusing these operations into a single custom GPU kernel. For each block of keypoints:
- Compute all offsets and positions in shared memory.
- Sample features with bilinear interpolation.
- Partially project sampled features into intermediate registers or fast on-chip SRAM.
- Aggregate per-layer projections on the fly.
- Write only the aggregated descriptors to global memory.
This fusion minimizes main-memory interaction. Empirical results indicate that kernel fusion doubles inference throughput at high keypoint counts and sustains over 80% of peak FPS up to 16,000 keypoints.
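The principle behind the fusion can be demonstrated numerically, independent of GPU specifics. In the staged version below, every sample is materialized in a large intermediate buffer before projection; in the fused-style version, each sample's partial projection is accumulated immediately, standing in for the on-chip accumulation the kernel performs. Both functions are illustrative sketches, not CLIDD's actual kernel, and they compute identical results.

```python
import numpy as np

def describe_staged(samples, proj):
    """Unfused: materialize the full N x (S*C) sample buffer, then project.
    On a GPU this buffer would round-trip through global memory."""
    flat = samples.reshape(samples.shape[0], -1)   # large intermediate buffer
    return flat @ proj.T

def describe_fused(samples, proj):
    """Fused-style: accumulate each sample's partial projection on the fly,
    never storing the flattened buffer (only the C_desc accumulator lives)."""
    N, S, C = samples.shape
    out = np.zeros((N, proj.shape[0]))
    for s in range(S):                             # one sampled point at a time
        out += samples[:, s, :] @ proj[:, s * C:(s + 1) * C].T
    return out
```

The arithmetic is unchanged; only the intermediate storage differs, which is exactly why fusion pays off when keypoint counts (and hence buffer sizes) grow large.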
4. Training Paradigm: Loss Functions and Knowledge Distillation
CLIDD utilizes a compound loss for its training objective:
- DualSoftmax Loss $\mathcal{L}_{\text{ds}}$: For metric learning on matching image pairs, applies a temperature-scaled, row- and column-normalized softmax to the descriptor similarity matrix $S$, yielding a probability matrix $P = \operatorname{softmax}_{\text{row}}(S/\tau) \odot \operatorname{softmax}_{\text{col}}(S/\tau)$. The loss is:

$$\mathcal{L}_{\text{ds}} = -\frac{1}{|\mathcal{M}|} \sum_{(i,j) \in \mathcal{M}} V_{ij} \log P_{ij},$$

where $\mathcal{M}$ is the set of ground-truth correspondences and $V$ is a keypoint visibility mask.
- Orthogonal-Procrustes Distillation Loss $\mathcal{L}_{\text{op}}$: For small model variants, aligns learned student descriptors $D_s$ to a teacher's principal subspace $D_t$, seeking the optimal orthogonal map $R^\star = \arg\min_{R^\top R = I} \Vert D_s R - D_t \Vert_F^2$, with:

$$\mathcal{L}_{\text{op}} = \Vert D_s R^\star - D_t \Vert_F^2.$$

- UnfoldSoftmax Loss $\mathcal{L}_{\text{us}}$: For detector distillation, transferring heatmaps predicted by teacher models (e.g. ALIKED-N32).
The complete objective combines the three terms, $\mathcal{L} = \lambda_{\text{ds}}\mathcal{L}_{\text{ds}} + \lambda_{\text{op}}\mathcal{L}_{\text{op}} + \lambda_{\text{us}}\mathcal{L}_{\text{us}}$, with model-type-specific weighting.
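The dual-softmax and Procrustes terms both have compact closed forms. The sketch below assumes the visibility mask has already been applied when selecting the match set, and uses an SVD for the Procrustes map; the function names and signatures are illustrative, not from the article.

```python
import numpy as np

def dual_softmax_loss(S, matches, tau=0.1):
    """Temperature-scaled dual-softmax loss on a similarity matrix S.

    S: (N, M) descriptor similarities; matches: list of (i, j) ground-truth
    pairs (visibility-masked upstream).
    """
    def softmax(x, axis):
        e = np.exp(x - x.max(axis=axis, keepdims=True))  # stable softmax
        return e / e.sum(axis=axis, keepdims=True)
    P = softmax(S / tau, axis=1) * softmax(S / tau, axis=0)  # row * column
    return -np.mean([np.log(P[i, j] + 1e-12) for i, j in matches])

def procrustes_distill_loss(D_student, D_teacher):
    """Orthogonal-Procrustes alignment: the rotation R minimizing
    ||D_s R - D_t||_F has the closed form R = U V^T from the SVD of
    D_s^T D_t (assumes matching descriptor dimensions for simplicity)."""
    U, _, Vt = np.linalg.svd(D_student.T @ D_teacher)
    R = U @ Vt
    return np.mean((D_student @ R - D_teacher) ** 2)
```

Because $R$ is constrained to be orthogonal, the distillation loss penalizes only geometry the student cannot reproduce up to rotation, not arbitrary basis differences.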
5. Model Variants and Scalability
CLIDD supports a family of nine models tailored for diverse deployment scenarios. The models, ranging from "Atom" (A48) to "Ultra" (U128), trade off descriptor dimensionality and network capacity for throughput and accuracy. Table 1 below summarizes representative architectures:
| Model | Channels | Offsets | C_desc | Params (M) |
|---|---|---|---|---|
| A48 | 4,4,4 | 4 | 48 | 0.004 |
| N64 | 8,8,8 | 8 | 64 | 0.019 |
| T64 | 8,16,24 | 8 | 64 | 0.043 |
| S64 | 8,24,32 | 16 | 64 | 0.100 |
| U128 | 32,128,256 | 32 | 128 | 4.400 |
Scaling channel widths in the deeper layers and increasing offset count enables smooth adjustment between computational budget and accuracy.
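Choosing a variant for a given deployment budget reduces to a table lookup. The helper below is a hypothetical convenience, but the configuration values are taken directly from Table 1.

```python
# (channels, K offsets, C_desc, params in M) -- values from Table 1 above.
CLIDD_VARIANTS = {
    "A48":  ((4, 4, 4),      4,  48,  0.004),
    "N64":  ((8, 8, 8),      8,  64,  0.019),
    "T64":  ((8, 16, 24),    8,  64,  0.043),
    "S64":  ((8, 24, 32),   16,  64,  0.100),
    "U128": ((32, 128, 256), 32, 128, 4.400),
}

def largest_variant_under(param_budget_m):
    """Pick the highest-capacity variant fitting a parameter budget (in M)."""
    fitting = [(params, name)
               for name, (_, _, _, params) in CLIDD_VARIANTS.items()
               if params <= param_budget_m]
    return max(fitting)[1] if fitting else None
```

Note how capacity is scaled mainly in the deeper stages (e.g. T64 vs. N64 widens only Stages 2-3), which tracks the article's observation that deep-layer channel width and offset count are the primary accuracy knobs.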
6. Empirical Results and Benchmarks
CLIDD demonstrates state-of-the-art accuracy and exceptional efficiency across local feature matching tasks. On an NVIDIA Jetson Orin-NX (TensorRT, 480×640, 1024 points), CLIDD models achieve unmatched speed-accuracy trade-offs. Notably, the A48 (Atom) model achieves 881.1 FPS at 0.004M parameters, matching SuperPoint's precision with a 99.7% reduction in parameter count. The U128 (Ultra) model outperforms DINOv2-based and other SOTA methods while exceeding 200 FPS.
Representative benchmarking:
| Method | FPS | Dim | Params (M) |
|---|---|---|---|
| SuperPoint | 124.2 | 256 | 1.301 |
| AWDesc-T16 | 180.4 | 128 | 0.172 |
| EdgePoint2-E64 | 375.7 | 64 | 0.155 |
| Ours-A48 | 881.1 | 48 | 0.004 |
| Ours-U128 | 281.4 | 128 | 4.400 |
Accuracy assessments include:
- Homography estimation (HPatches MHA @1px/@3px/@5px): Ours-M64 achieves (56.11/84.81/91.11) vs. SuperPoint (49.81/81.48/88.89).
- Relative pose AUC@10° (MegaDepth-1500): Ours-U128 77.15%, outperforming DeDoDe-G 76.13% and ALIKED-N32 73.56%.
- Visual localization (Aachen, 0.5 m/5°): Ours-U128 achieves 95.8% (day) and 91.1% (night).
Ablation studies confirm that the CL-Predictor plus LI-Sampler provide a ~2 AUC gain over single-layer or shared-offset baselines and that kernel fusion doubles throughput at high keypoint densities. Descriptor discriminativeness is maintained even when densely sampling up to 10,000 points without NMS.
7. Significance and Impact
CLIDD represents a rethinking of feature descriptor architecture for real-time visual correspondence. Its use of per-keypoint, cross-layer deformable sampling sidesteps the memory and computational inefficiencies of dense, unified maps and allows model scaling to new levels of compactness and speed. The adoption of hardware-aware kernel fusion further enables high-throughput inference, particularly critical for embedded and edge scenarios.
The framework’s empirical dominance in speed and accuracy benchmarks, combined with its flexible scaling and training strategies, positions it as a robust solution for real-time spatial intelligence on resource-constrained platforms. Comparative results demonstrate consistent improvements over SuperPoint, DISK, ALIKED, AWDesc, EdgePoint2, DeDoDe, XFeat, and DINOv2-based frameworks in key metrics across standard local feature evaluation tasks (Yao et al., 14 Jan 2026).