
CLIDD: Cross-Layer Independent Deformable Description

Updated 15 January 2026
  • The method employs per-keypoint, cross-layer deformable sampling with learnable offsets to capture fine-grained structural details and boost matching accuracy.
  • Its hardware-aware kernel fusion minimizes memory bottlenecks, doubling throughput and enabling efficient real-time performance for applications like robotics and augmented reality.
  • CLIDD offers scalable model variants, balancing computational load and performance via adjustable channel depths, residual blocks, and offset counts.

Cross-Layer Independent Deformable Description (CLIDD) is a local feature representation methodology designed to optimize the balance between discriminative power and computational efficiency for real-time applications, such as robot navigation and augmented reality. Unlike traditional approaches that generate unified dense feature maps, CLIDD introduces a framework that samples sparse, per-keypoint descriptors directly across independent, multi-scale feature hierarchies. This is achieved through the use of learnable offsets for each keypoint and feature pyramid level, enabling fine-grained structural capture across scales. CLIDD is distinguished by its hardware-aware implementation, scalable design accommodating a wide spectrum of deployment constraints, and an empirically validated capacity for state-of-the-art matching accuracy and throughput (Yao et al., 14 Jan 2026).

1. Formal Definition and Feature Sampling

Let an input image be transformed into a feature pyramid with $L$ levels, producing feature maps $F^l: \Omega^l \rightarrow \mathbb{R}^{C_l}$, where $\Omega^l$ denotes the (potentially continuous) spatial domain at each scale (normalized to $[0,1]^2$ for typical resolutions such as $1/2$, $1/8$, $1/32$). For every detected keypoint at image coordinate $x$, CLIDD predicts, for each layer $l$, a set of $M$ learnable offsets $\Delta^l(x) \in \mathbb{R}^{M \times 2}$. These offsets specify $M$ sampling points per keypoint per level:

$$D^l(x) = \{\, F^l(x + \Delta^l_k(x)) \mid k = 1, \ldots, M \,\} \in \mathbb{R}^{M \times C_l}$$

Because the offsets are predicted independently across layers, the method is termed "cross-layer independent." The outputs from all layers are concatenated along the channel dimension to yield:

$$D(x) = \mathrm{Concat}_{l=1}^{L}\, D^l(x) \in \mathbb{R}^{M \cdot (C_1 + \ldots + C_L)}$$

A learned linear projection $W \in \mathbb{R}^{C_{\text{desc}} \times (M \cdot \sum_l C_l)}$ (equivalently, a $1\times1$ convolution) reduces this vector to the final descriptor $d(x) = W\,D(x) \in \mathbb{R}^{C_{\text{desc}}}$.
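The sampling scheme above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation; the function and variable names are hypothetical, and offsets are taken as given rather than predicted by a network:

```python
import numpy as np

def bilinear_sample(fmap, pts):
    """Bilinearly sample a feature map (H, W, C) at continuous
    pixel coordinates pts of shape (N, 2), given as (x, y)."""
    H, W, _ = fmap.shape
    x = np.clip(pts[:, 0], 0, W - 1)
    y = np.clip(pts[:, 1], 0, H - 1)
    x0 = np.floor(x).astype(int); x1 = np.minimum(x0 + 1, W - 1)
    y0 = np.floor(y).astype(int); y1 = np.minimum(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    top = fmap[y0, x0] * (1 - wx)[:, None] + fmap[y0, x1] * wx[:, None]
    bot = fmap[y1, x0] * (1 - wx)[:, None] + fmap[y1, x1] * wx[:, None]
    return top * (1 - wy)[:, None] + bot * wy[:, None]

def clidd_descriptor(pyramid, scales, keypoints, offsets, W_proj):
    """
    pyramid  : list of L feature maps, level l of shape (H_l, W_l, C_l)
    scales   : list of L downsampling factors, e.g. [2, 8, 32]
    keypoints: (N, 2) image-space (x, y) coordinates
    offsets  : list of L arrays, level l of shape (N, M, 2) -- the
               per-keypoint, per-level offsets Delta^l(x)
    W_proj   : (C_desc, M * sum_l C_l) learned projection
    """
    per_level = []
    for fmap, s, off in zip(pyramid, scales, offsets):
        base = keypoints / s                       # keypoint in level-l coords
        pts = base[:, None, :] + off               # (N, M, 2) sampling points
        N, M, _ = pts.shape
        samples = bilinear_sample(fmap, pts.reshape(-1, 2))  # (N*M, C_l)
        per_level.append(samples.reshape(N, -1))   # (N, M*C_l)
    D = np.concatenate(per_level, axis=1)          # (N, M * sum_l C_l)
    return D @ W_proj.T                            # (N, C_desc)
```

Note that each level uses its own offsets, which is exactly the "cross-layer independent" property: no offset is shared or rescaled across pyramid levels.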

2. Network Architecture and Data Flow

The CLIDD backbone comprises a lightweight CNN that constructs a three-level feature hierarchy:

  • Stage 1: $1/2$ spatial downsampling; $4\times4$ convolution with stride $2$, $C_1$ channels, a single ResNet block.
  • Stage 2: Average pooling, followed by $r_2$ residual blocks, producing $C_2$ channels at $1/8$ resolution.
  • Stage 3: Further average pooling, $r_3$ residual blocks, $C_3$ channels at $1/32$ scale.

ReLU activations are used throughout the network to maximize throughput. In parallel with feature extraction, a detection head operates at the $1/2$ scale, using $1\times1$ convolutions for dimensionality reduction and PixelShuffle for upsampling to produce a dense keypoint heatmap. Non-maximum suppression then extracts keypoint positions $x_i$.

The CLIDD description head consists of:

  • Cross-Layer Predictor (CL-Predictor): For each $x_i$, extracts $F^l(x_i)$ via bilinear interpolation at each layer, concatenates these into a cross-layer embedding of $\sum_l C_l$ dimensions, then applies a pointwise convolution to output all $L M \times 2$ offsets, reshaped into $M$ offsets per layer.
  • Layer-Independent Sampler (LI-Sampler): For each $l$ and $k$, samples $F^l$ at position $x_i + \Delta^l_k(x_i)$, producing $M$ samples per layer.
  • Aggregation: Concatenates (flattens) all layer samples into an $M \cdot \sum_l C_l$ vector, which a pointwise operator projects to $C_{\text{desc}}$ dimensions.

Key configuration parameters for model scaling (see Section 5) include the channel widths $(C_1, C_2, C_3)$, residual block counts $(r_2, r_3)$, offset count $M$, and output descriptor dimension $C_{\text{desc}}$.
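These scaling knobs can be gathered into a small configuration object. The sketch below uses hypothetical names, and the `res_blocks` values assigned to each variant are illustrative guesses rather than figures from the paper:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CLIDDConfig:
    channels: tuple      # (C1, C2, C3) widths of the 1/2, 1/8, 1/32 stages
    res_blocks: tuple    # (r2, r3) residual blocks in stages 2 and 3
    num_offsets: int     # M sampling offsets per keypoint per level
    desc_dim: int        # C_desc, final descriptor dimensionality

    @property
    def sampler_input_dim(self):
        # LI-Sampler output size before projection: M * (C1 + C2 + C3)
        return self.num_offsets * sum(self.channels)

# channel/offset/descriptor figures follow Table 1 (Section 5);
# the res_blocks tuples are hypothetical placeholders
A48  = CLIDDConfig(channels=(4, 4, 4),      res_blocks=(1, 1), num_offsets=4,  desc_dim=48)
U128 = CLIDDConfig(channels=(32, 128, 256), res_blocks=(2, 2), num_offsets=32, desc_dim=128)
```

Grouping the knobs this way makes the smooth accuracy/throughput trade-off of Section 5 a matter of swapping one immutable config object.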

3. Hardware-Aware Kernel Fusion

Traditional sampling and aggregation routines for multi-scale features incur high latency when every sampled feature is written to global memory independently. CLIDD minimizes such bottlenecks by fusing these operations into a single custom GPU kernel. For each block of $B$ keypoints:

  • Compute all offsets and positions in shared memory.
  • Sample features with bilinear interpolation.
  • Partially project sampled features into intermediate registers or fast on-chip SRAM.
  • Aggregate per-layer projections on the fly.
  • Write only the aggregated descriptors to global memory.

This fusion minimizes main-memory interaction. Empirical results indicate that kernel fusion doubles inference throughput at high keypoint counts and sustains over 80% of peak FPS up to 16,000 keypoints.
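The per-block control flow of the fused kernel can be mirrored in plain Python for illustration. This is purely conceptual (the actual implementation is a single custom GPU kernel); the function names and signatures here are hypothetical:

```python
import numpy as np

def fused_describe_block(sample_fn, keypoints, predict_offsets, W_parts, B=128):
    """Conceptual mirror of the fused kernel: process keypoints in blocks of B,
    keep all intermediates local (standing in for shared memory / registers),
    and write only the final descriptors to the output buffer (global memory).

    W_parts: per-level slices of the projection, level l of shape (C_desc, M*C_l),
    so the full projection is applied as a sum of per-level partial products."""
    N = len(keypoints)
    C_desc = W_parts[0].shape[0]
    out = np.zeros((N, C_desc))              # the only "global memory" write target
    for start in range(0, N, B):
        kps = keypoints[start:start + B]
        acc = np.zeros((len(kps), C_desc))   # on-chip accumulator
        for l, W_l in enumerate(W_parts):
            offs = predict_offsets(kps, l)   # offsets computed in "shared memory"
            samp = sample_fn(kps, offs, l)   # (B, M*C_l) bilinear samples
            acc += samp @ W_l.T              # partial projection, aggregated on the fly
        out[start:start + B] = acc           # single write per block
    return out
```

The point of the structure is that the `(B, M*C_l)` sample tensors never leave the inner loop, which is precisely the global-memory traffic the fused kernel eliminates.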

4. Training Paradigm: Loss Functions and Knowledge Distillation

CLIDD utilizes a compound loss for its training objective:

  • DualSoftmax Loss $L_{\text{DS}}$: For metric learning on matching image pairs, applies a temperature-scaled, row- and column-normalized softmax to the descriptor similarity matrix $S = D_A D_B^T$, yielding a probability matrix $P$. The loss is:

$$L_{\text{DS}} = -\frac{1}{\mathbf{1}^T m} \sum_i m_i \log P_{ii}$$

where $m$ is a keypoint visibility mask.

  • Orthogonal-Procrustes Distillation Loss $L_{\text{OP}}$: For small model variants, aligns learned student descriptors $D_A$ with a teacher's principal subspace $D_{\text{tchr}}$, seeking the optimal orthogonal map $\Omega^*$ that minimizes the squared error:

$$L_{\text{OP}} = \left\Vert 1 - \mathrm{diag}\!\left(D_{\text{tchr}}\, \Omega^* D_A^T\right) \right\Vert_F^2$$

  • UnfoldSoftmax Loss $L_{\text{US}}$: For detector distillation, transferring heatmaps predicted by teacher models (e.g., ALIKED-N32).

The complete objective is $L_{\text{total}} = w_{\text{DS}} L_{\text{DS}} + w_{\text{OP}} L_{\text{OP}} + w_{\text{US}} L_{\text{US}}$, with model-type-specific weighting.
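A NumPy sketch of the DualSoftmax term is given below. It assumes the common dual-softmax formulation (row and column softmaxes combined by elementwise product, as popularized by LoFTR); the exact normalization CLIDD uses may differ, and the function name is hypothetical:

```python
import numpy as np

def dual_softmax_loss(DA, DB, mask, temperature=0.1):
    """DualSoftmax loss L_DS for matched descriptor sets.
    DA, DB : (N, C_desc) descriptors; row i of DA is assumed to be the
             ground-truth match of row i of DB.
    mask   : (N,) keypoint visibility mask m (1 = visible in both views).
    """
    S = DA @ DB.T / temperature                    # similarity matrix
    # row-wise softmax (stabilized by max subtraction)
    P_row = np.exp(S - S.max(axis=1, keepdims=True))
    P_row /= P_row.sum(axis=1, keepdims=True)
    # column-wise softmax
    P_col = np.exp(S - S.max(axis=0, keepdims=True))
    P_col /= P_col.sum(axis=0, keepdims=True)
    P = P_row * P_col                              # joint match probabilities
    diag = np.clip(np.diag(P), 1e-12, None)        # P_ii for true matches
    return -(mask * np.log(diag)).sum() / mask.sum()
```

Perfectly matched, well-separated descriptors drive every $P_{ii}$ toward 1 and the loss toward 0, while the mask excludes keypoints that fall outside the second view.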

5. Model Variants and Scalability

CLIDD supports a family of nine models tailored for diverse deployment scenarios. The models, ranging from "Atom" (A48) to "Ultra" (U128), trade off descriptor dimensionality and network capacity for throughput and accuracy. Table 1 below summarizes representative architectures:

| Model | Channels $(C_1, C_2, C_3)$ | Offsets $M$ | $C_{\text{desc}}$ | Params (M) |
|-------|----------------------------|-------------|-------------------|------------|
| A48   | 4, 4, 4                    | 4           | 48                | 0.004      |
| N64   | 8, 8, 8                    | 8           | 64                | 0.019      |
| T64   | 8, 16, 24                  | 8           | 64                | 0.043      |
| S64   | 8, 24, 32                  | 16          | 64                | 0.100      |
| U128  | 32, 128, 256               | 32          | 128               | 4.400      |

Scaling the channel widths of the deeper layers and increasing the offset count $M$ enables smooth adjustment between computational budget and accuracy.
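As a rough sanity check on the parameter figures in Table 1, the final projection alone accounts for $C_{\text{desc}} \cdot M \cdot \sum_l C_l$ weights; this hypothetical estimate ignores biases, the backbone, and the CL-Predictor, so it is a lower bound on each variant's total:

```python
def proj_params(c_desc, m, channels):
    """Weight count of the final linear projection:
    C_desc * M * (C1 + C2 + C3), biases excluded."""
    return c_desc * m * sum(channels)

a48  = proj_params(48, 4, (4, 4, 4))         # 2,304 weights (~0.002M)
u128 = proj_params(128, 32, (32, 128, 256))  # 1,703,936 weights (~1.7M)
```

The A48 estimate (~0.002M) sits comfortably under the reported 0.004M total, and the U128 estimate shows the projection head becoming a substantial fraction of the 4.4M budget as $M$ and the channel widths grow.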

6. Empirical Results and Benchmarks

CLIDD demonstrates state-of-the-art accuracy and exceptional efficiency across local feature matching tasks. On an NVIDIA Jetson Orin-NX (TensorRT, 480×640, 1024 points), CLIDD models achieve leading speed-accuracy trade-offs. Notably, the A48 (Atom) model achieves 881.1 FPS at 0.004M parameters, matching SuperPoint's precision with a 99.7% reduction in parameter count. The U128 (Ultra) model outperforms DINOv2-based and other SOTA methods while exceeding 200 FPS.

Representative benchmarking:

| Method          | FPS   | Dim | Params (M) |
|-----------------|-------|-----|------------|
| SuperPoint      | 124.2 | 256 | 1.301      |
| AWDesc-T16      | 180.4 | 128 | 0.172      |
| EdgePoint2-E64  | 375.7 | 64  | 0.155      |
| Ours-A48        | 881.1 | 48  | 0.004      |
| Ours-U128       | 281.4 | 128 | 4.400      |

Accuracy assessments include:

  • Homography estimation (HPatches MHA @1px/@3px/@5px): Ours-M64 achieves (56.11/84.81/91.11) vs. SuperPoint (49.81/81.48/88.89).
  • Relative pose AUC@10° (MegaDepth-1500): Ours-U128 77.15%, outperforming DeDoDe-G 76.13% and ALIKED-N32 73.56%.
  • Visual localization (Aachen, 0.5 m/5°): Ours-U128 achieves 95.8% (day) and 91.1% (night).

Ablation studies confirm that the CL-Predictor plus LI-Sampler provide a ~2 AUC gain over single-layer or shared-offset baselines and that kernel fusion doubles throughput at high keypoint densities. Descriptor discriminativeness is maintained even when densely sampling up to 10,000 points without NMS.

7. Significance and Impact

CLIDD represents a rethinking of feature descriptor architecture for real-time visual correspondence. Its use of per-keypoint, cross-layer deformable sampling sidesteps the memory and computational inefficiencies of dense, unified maps and allows model scaling to new levels of compactness and speed. The adoption of hardware-aware kernel fusion further enables high-throughput inference, particularly critical for embedded and edge scenarios.

The framework’s empirical dominance in speed and accuracy benchmarks, combined with its flexible scaling and training strategies, positions it as a robust solution for real-time spatial intelligence on resource-constrained platforms. Comparative results demonstrate consistent improvements over SuperPoint, DISK, ALIKED, AWDesc, EdgePoint2, DeDoDe, XFeat, and DINOv2-based frameworks in key metrics across standard local feature evaluation tasks (Yao et al., 14 Jan 2026).
