CLIDD: Cross-Layer Independent Deformable Description
- The method employs per-keypoint, cross-layer deformable sampling with learnable offsets to capture fine-grained structural details and boost matching accuracy.
- Its hardware-aware kernel fusion minimizes memory bottlenecks, doubling throughput and enabling efficient real-time performance for applications like robotics and augmented reality.
- CLIDD offers scalable model variants, balancing computational load and performance via adjustable channel depths, residual blocks, and offset counts.
Cross-Layer Independent Deformable Description (CLIDD) is a local feature representation methodology designed to optimize the balance between discriminative power and computational efficiency for real-time applications, such as robot navigation and augmented reality. Unlike traditional approaches that generate unified dense feature maps, CLIDD introduces a framework that samples sparse, per-keypoint descriptors directly across independent, multi-scale feature hierarchies. This is achieved through the use of learnable offsets for each keypoint and feature pyramid level, enabling fine-grained structural capture across scales. CLIDD is distinguished by its hardware-aware implementation, a scalable design accommodating a wide spectrum of deployment constraints, and empirically validated state-of-the-art matching accuracy and throughput (Yao et al., 14 Jan 2026).
1. Formal Definition and Feature Sampling
Let an input image be transformed into a feature pyramid with $L$ levels, producing feature maps $F_l \in \mathbb{R}^{C_l \times H_l \times W_l}$ for $l = 1, \dots, L$, where each $F_l$ is defined over a (potentially continuous, via interpolation) spatial domain at its scale (normalized for typical resolutions such as $1/2$, $1/8$, $1/32$). For every detected keypoint at image coordinate $\mathbf{p}$, CLIDD predicts, for each layer $l$, a set of $K$ learnable offsets $\{\Delta\mathbf{p}_{l,k}\}_{k=1}^{K}$. These offsets specify $K$ sampling points per keypoint per level:

$$\mathbf{s}_{l,k} = \frac{\mathbf{p}}{r_l} + \Delta\mathbf{p}_{l,k},$$

where $r_l$ is the downsampling factor of level $l$. Because the offsets are predicted independently across layers, the method is termed "cross-layer independent." The sampled features from all layers are concatenated across the channel dimension to yield:

$$\mathbf{v} = \big\Vert_{l=1}^{L} \big\Vert_{k=1}^{K} F_l(\mathbf{s}_{l,k}) \in \mathbb{R}^{K \sum_l C_l}.$$

A learned linear projection (or a $1 \times 1$ convolution) reduces this vector to the final descriptor $\mathbf{d} \in \mathbb{R}^{C_{\text{desc}}}$.
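The sampling-and-projection step above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the function names, shapes, and the fixed (rather than predicted) offsets are all assumptions for the sake of a runnable example.

```python
import numpy as np

def bilinear_sample(fmap, x, y):
    """Bilinearly interpolate a C x H x W feature map at continuous (x, y)."""
    C, H, W = fmap.shape
    x = float(np.clip(x, 0, W - 1))
    y = float(np.clip(y, 0, H - 1))
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    top = (1 - wx) * fmap[:, y0, x0] + wx * fmap[:, y0, x1]
    bot = (1 - wx) * fmap[:, y1, x0] + wx * fmap[:, y1, x1]
    return (1 - wy) * top + wy * bot

def clidd_descriptor(pyramid, strides, keypoint, offsets, proj):
    """Sample K offsets per pyramid level around one keypoint,
    concatenate across levels, and project to the descriptor.

    pyramid:  list of C_l x H_l x W_l maps
    strides:  downsampling factor r_l per level, e.g. [2, 8, 32]
    offsets:  list of (K, 2) arrays per level (learned in the real model)
    proj:     (C_desc, K * sum_l C_l) projection matrix
    """
    samples = []
    for fmap, s, offs in zip(pyramid, strides, offsets):
        base = np.asarray(keypoint, dtype=float) / s  # keypoint in level coords
        for dx, dy in offs:
            samples.append(bilinear_sample(fmap, base[0] + dx, base[1] + dy))
    v = np.concatenate(samples)            # cross-layer concatenation
    d = proj @ v                           # linear projection to C_desc
    return d / (np.linalg.norm(d) + 1e-8)  # L2-normalized descriptor
```

Because each level is sampled with its own offsets, refining one scale's sampling pattern never perturbs another scale, which is the "independent" part of the design.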
2. Network Architecture and Data Flow
The CLIDD backbone comprises a lightweight CNN that constructs a three-level feature hierarchy:
- Stage 1: $1/2$ spatial downsampling; a stride-$2$ convolution producing $C_1$ channels, followed by a single ResNet block.
- Stage 2: Average pooling, followed by $N_2$ residual blocks, producing $C_2$ channels at $1/8$ resolution.
- Stage 3: Further average pooling, $N_3$ residual blocks, and $C_3$ channels at $1/32$ scale.
ReLU activation is used system-wide for throughput maximization. Parallel to this feature extraction, a detection head operates on the $1/2$ scale, using convolutions for dimensionality reduction and PixelShuffle for upsampling, producing a dense keypoint heatmap. Non-maximum suppression extracts keypoint positions $\{\mathbf{p}_i\}$.
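The final heatmap-to-keypoints step can be illustrated with a simple greedy non-maximum suppression. This is a generic NMS sketch under assumed parameters (suppression radius, score threshold), not the specific routine used by CLIDD.

```python
import numpy as np

def nms_keypoints(heatmap, radius=2, threshold=0.5, max_kpts=1000):
    """Greedy NMS: keep local maxima above threshold, strongest first,
    suppressing everything within `radius` of an accepted keypoint."""
    H, W = heatmap.shape
    ys, xs = np.where(heatmap > threshold)
    order = np.argsort(-heatmap[ys, xs])          # strongest response first
    keep = []
    suppressed = np.zeros((H, W), dtype=bool)
    for i in order:
        y, x = int(ys[i]), int(xs[i])
        if suppressed[y, x]:
            continue
        keep.append((x, y))                       # (x, y) image coordinates
        y0, y1 = max(0, y - radius), min(H, y + radius + 1)
        x0, x1 = max(0, x - radius), min(W, x + radius + 1)
        suppressed[y0:y1, x0:x1] = True           # blank out the neighborhood
        if len(keep) >= max_kpts:
            break
    return keep
```

The accepted positions would then be handed to the description head as the keypoint set.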
The CLIDD description head consists of:
- Cross-Layer Predictor (CL-Predictor): For each keypoint $\mathbf{p}_i$, extracts a feature via bilinear interpolation at each layer, concatenates these into a cross-layer embedding of $C_1 + C_2 + C_3$ dimensions, then applies a pointwise ($1 \times 1$) convolution to output all offsets, which are reshaped into $K$ offset pairs per layer.
- Layer-Independent Sampler (LI-Sampler): For each layer $l$ and offset index $k$, samples $F_l$ at position $\mathbf{s}_{l,k}$, producing $K$ samples per layer.
- Aggregation: Concatenates all layer samples into a single vector, projected by a pointwise operator to $\mathbf{d} \in \mathbb{R}^{C_{\text{desc}}}$.
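The CL-Predictor step, in particular, amounts to a single matrix multiply per keypoint once the cross-layer embedding is assembled. The sketch below is an assumption-laden illustration: names and shapes are hypothetical, and nearest-neighbor lookup stands in for the bilinear interpolation the article describes, to keep the example short.

```python
import numpy as np

def predict_offsets(pyramid, strides, keypoint, W_off, K):
    """CL-Predictor sketch: one cross-layer embedding -> all offsets at once.

    W_off: (L * K * 2, sum_l C_l) weight of the pointwise projection
    (a 1x1 conv applied at a single keypoint is just a matrix multiply).
    """
    feats = []
    for f, s in zip(pyramid, strides):
        C, H, W = f.shape
        # Nearest-neighbor lookup for brevity; CLIDD uses bilinear interpolation.
        x = int(round(min(max(keypoint[0] / s, 0), W - 1)))
        y = int(round(min(max(keypoint[1] / s, 0), H - 1)))
        feats.append(f[:, y, x])
    embed = np.concatenate(feats)                 # (sum_l C_l,) embedding
    offsets = W_off @ embed                       # all layers' offsets at once
    return offsets.reshape(len(pyramid), K, 2)    # K (dx, dy) pairs per layer
```

Predicting every layer's offsets from one shared embedding, but applying them independently per layer, is what lets each scale specialize its sampling pattern.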
Key configuration parameters for model scaling (see Section 5) include the channel widths $(C_1, C_2, C_3)$, residual block counts $(N_2, N_3)$, offset count $K$, and output descriptor dimension $C_{\text{desc}}$.
3. Hardware-Aware Kernel Fusion
Traditional sampling and aggregation routines for multi-scale features incur high latency when every sampled feature is written to global memory independently. CLIDD minimizes such bottlenecks by fusing these operations into a single custom GPU kernel. For each block of keypoints:
- Compute all offsets and positions in shared memory.
- Sample features with bilinear interpolation.
- Partially project sampled features into intermediate registers or fast on-chip SRAM.
- Aggregate per-layer projections on the fly.
- Write only the aggregated descriptors to global memory.
This fusion minimizes main-memory interaction. Empirical results indicate that kernel fusion doubles inference throughput at high keypoint counts and sustains over 80% of peak FPS up to 16,000 keypoints.
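The principle behind the fusion can be demonstrated numerically, independent of GPU specifics. In the staged version below, every sample is materialized in a large intermediate buffer before projection; in the fused-style version, each sample's partial projection is accumulated immediately, standing in for the on-chip accumulation the kernel performs. Both functions are illustrative sketches, not CLIDD's actual kernel, and they compute identical results.

```python
import numpy as np

def describe_staged(samples, proj):
    """Unfused: materialize the full N x (S*C) sample buffer, then project.
    On a GPU this buffer would round-trip through global memory."""
    flat = samples.reshape(samples.shape[0], -1)   # large intermediate buffer
    return flat @ proj.T

def describe_fused(samples, proj):
    """Fused-style: accumulate each sample's partial projection on the fly,
    never storing the flattened buffer (only the C_desc accumulator lives)."""
    N, S, C = samples.shape
    out = np.zeros((N, proj.shape[0]))
    for s in range(S):                             # one sampled point at a time
        out += samples[:, s, :] @ proj[:, s * C:(s + 1) * C].T
    return out
```

The arithmetic is unchanged; only the intermediate storage differs, which is exactly why fusion pays off when keypoint counts (and hence buffer sizes) grow large.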
4. Training Paradigm: Loss Functions and Knowledge Distillation
CLIDD utilizes a compound loss for its training objective:
- DualSoftmax Loss $\mathcal{L}_{\text{ds}}$: For metric learning on matching image pairs, applies a temperature-scaled, row- and column-normalized softmax to the descriptor similarity matrix $S$, yielding a probability matrix $P = \operatorname{softmax}_{\text{row}}(S/\tau) \odot \operatorname{softmax}_{\text{col}}(S/\tau)$. The loss is:

$$\mathcal{L}_{\text{ds}} = -\frac{1}{|\mathcal{M}|} \sum_{(i,j) \in \mathcal{M}} V_{ij} \log P_{ij},$$

where $\mathcal{M}$ is the set of ground-truth correspondences and $V$ is a keypoint visibility mask.
- Orthogonal-Procrustes Distillation Loss $\mathcal{L}_{\text{op}}$: For small model variants, aligns learned student descriptors $D_s$ to a teacher's principal subspace $D_t$, seeking the optimal orthogonal map $R^\star = \arg\min_{R^\top R = I} \Vert D_s R - D_t \Vert_F^2$, with:

$$\mathcal{L}_{\text{op}} = \Vert D_s R^\star - D_t \Vert_F^2.$$

- UnfoldSoftmax Loss $\mathcal{L}_{\text{us}}$: For detector distillation, transferring heatmaps predicted by teacher models (e.g. ALIKED-N32).
The complete objective combines the three terms, $\mathcal{L} = \lambda_{\text{ds}}\mathcal{L}_{\text{ds}} + \lambda_{\text{op}}\mathcal{L}_{\text{op}} + \lambda_{\text{us}}\mathcal{L}_{\text{us}}$, with model-type-specific weighting.
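The dual-softmax and Procrustes terms both have compact closed forms. The sketch below assumes the visibility mask has already been applied when selecting the match set, and uses an SVD for the Procrustes map; the function names and signatures are illustrative, not from the article.

```python
import numpy as np

def dual_softmax_loss(S, matches, tau=0.1):
    """Temperature-scaled dual-softmax loss on a similarity matrix S.

    S: (N, M) descriptor similarities; matches: list of (i, j) ground-truth
    pairs (visibility-masked upstream).
    """
    def softmax(x, axis):
        e = np.exp(x - x.max(axis=axis, keepdims=True))  # stable softmax
        return e / e.sum(axis=axis, keepdims=True)
    P = softmax(S / tau, axis=1) * softmax(S / tau, axis=0)  # row * column
    return -np.mean([np.log(P[i, j] + 1e-12) for i, j in matches])

def procrustes_distill_loss(D_student, D_teacher):
    """Orthogonal-Procrustes alignment: the rotation R minimizing
    ||D_s R - D_t||_F has the closed form R = U V^T from the SVD of
    D_s^T D_t (assumes matching descriptor dimensions for simplicity)."""
    U, _, Vt = np.linalg.svd(D_student.T @ D_teacher)
    R = U @ Vt
    return np.mean((D_student @ R - D_teacher) ** 2)
```

Because $R$ is constrained to be orthogonal, the distillation loss penalizes only geometry the student cannot reproduce up to rotation, not arbitrary basis differences.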
5. Model Variants and Scalability
CLIDD supports a family of nine models tailored for diverse deployment scenarios. The models, ranging from "Atom" (A48) to "Ultra" (U128), trade off descriptor dimensionality and network capacity for throughput and accuracy. Table 1 below summarizes representative architectures:
| Model | Channels | Offsets | C_desc | Params (M) |
|---|---|---|---|---|
| A48 | 4,4,4 | 4 | 48 | 0.004 |
| N64 | 8,8,8 | 8 | 64 | 0.019 |
| T64 | 8,16,24 | 8 | 64 | 0.043 |
| S64 | 8,24,32 | 16 | 64 | 0.100 |
| U128 | 32,128,256 | 32 | 128 | 4.400 |
Scaling channel widths in the deeper layers and increasing offset count enables smooth adjustment between computational budget and accuracy.
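Choosing a variant for a given deployment budget reduces to a table lookup. The helper below is a hypothetical convenience, but the configuration values are taken directly from Table 1.

```python
# (channels, K offsets, C_desc, params in M) -- values from Table 1 above.
CLIDD_VARIANTS = {
    "A48":  ((4, 4, 4),      4,  48,  0.004),
    "N64":  ((8, 8, 8),      8,  64,  0.019),
    "T64":  ((8, 16, 24),    8,  64,  0.043),
    "S64":  ((8, 24, 32),   16,  64,  0.100),
    "U128": ((32, 128, 256), 32, 128, 4.400),
}

def largest_variant_under(param_budget_m):
    """Pick the highest-capacity variant fitting a parameter budget (in M)."""
    fitting = [(params, name)
               for name, (_, _, _, params) in CLIDD_VARIANTS.items()
               if params <= param_budget_m]
    return max(fitting)[1] if fitting else None
```

Note how capacity is scaled mainly in the deeper stages (e.g. T64 vs. N64 widens only Stages 2-3), which tracks the article's observation that deep-layer channel width and offset count are the primary accuracy knobs.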
6. Empirical Results and Benchmarks
CLIDD demonstrates state-of-the-art accuracy and exceptional efficiency across local feature matching tasks. On an NVIDIA Jetson Orin-NX (TensorRT, 480×640, 1024 points), CLIDD models achieve unmatched speed-accuracy trade-offs. Notably, the A48 (Atom) model achieves 881.1 FPS at 0.004M parameters, matching SuperPoint's precision with a 99.7% reduction in parameter count. The U128 (Ultra) model outperforms DINOv2-based and other SOTA methods while exceeding 200 FPS.
Representative benchmarking:
| Method | FPS | Dim | Params (M) |
|---|---|---|---|
| SuperPoint | 124.2 | 256 | 1.301 |
| AWDesc-T16 | 180.4 | 128 | 0.172 |
| EdgePoint2-E64 | 375.7 | 64 | 0.155 |
| Ours-A48 | 881.1 | 48 | 0.004 |
| Ours-U128 | 281.4 | 128 | 4.400 |
Accuracy assessments include:
- Homography estimation (HPatches MHA @1px/@3px/@5px): Ours-M64 achieves (56.11/84.81/91.11) vs. SuperPoint (49.81/81.48/88.89).
- Relative pose AUC@10° (MegaDepth-1500): Ours-U128 77.15%, outperforming DeDoDe-G 76.13% and ALIKED-N32 73.56%.
- Visual localization (Aachen, 0.5 m/5°): Ours-U128 achieves 95.8% (day) and 91.1% (night).
Ablation studies confirm that the CL-Predictor plus LI-Sampler provide a ~2 AUC gain over single-layer or shared-offset baselines and that kernel fusion doubles throughput at high keypoint densities. Descriptor discriminativeness is maintained even when densely sampling up to 10,000 points without NMS.
7. Significance and Impact
CLIDD represents a rethinking of feature descriptor architecture for real-time visual correspondence. Its use of per-keypoint, cross-layer deformable sampling sidesteps the memory and computational inefficiencies of dense, unified maps and allows model scaling to new levels of compactness and speed. The adoption of hardware-aware kernel fusion further enables high-throughput inference, particularly critical for embedded and edge scenarios.
The framework’s empirical dominance in speed and accuracy benchmarks, combined with its flexible scaling and training strategies, positions it as a robust solution for real-time spatial intelligence on resource-constrained platforms. Comparative results demonstrate consistent improvements over SuperPoint, DISK, ALIKED, AWDesc, EdgePoint2, DeDoDe, XFeat, and DINOv2-based frameworks in key metrics across standard local feature evaluation tasks (Yao et al., 14 Jan 2026).