Dense Cosine Similarity Maps Overview
- Dense Cosine Similarity Maps are a dense, detector-free representation that computes pixel-wise cosine similarities using ℓ₂-normalized descriptors.
- The method employs a fully convolutional ResNet-based architecture to extract per-pixel descriptors, enabling precise correspondence estimation between images.
- Contrastive training with synthetic augmentations ensures robust matching under geometric and photometric distortions, outperforming traditional keypoint-based methods.
Dense Cosine Similarity Maps (DCSMs) provide a fully dense, detector-free representation of pixelwise correspondence between images by leveraging local descriptors and the cosine similarity measure. DCSMs support robust dense image matching under diverse geometric and photometric distortions, obviating the need for explicit keypoint detection. They are constructed by extracting ℓ₂-normalized descriptors at every pixel using a convolutional neural network and computing the cosine similarity for every possible pair of spatial locations across source and target images, enabling fine-grained pixel-level correspondence estimation in challenging visual conditions (Kwiatkowski et al., 2024).
1. Descriptor Network Architecture
The central component for generating DCSMs is a fully convolutional deep network with the following structure:
- Input: RGB image $I \in \mathbb{R}^{H \times W \times 3}$.
- Backbone: A compact ResNet-style architecture with:
  - Initial convolution (stride 1).
  - Ten residual blocks (each comprising two convolutions, batch normalization, and ReLU, all with stride 1).
  - Final convolution to yield $C = 128$ channels.
- Output: Dense feature map $F \in \mathbb{R}^{H \times W \times C}$, preserving the full spatial resolution.
- Receptive field: a local neighborhood whose extent is dictated by the convolutional stack.
- Per-pixel descriptors: for a spatial location $(x, y)$ in image $I$, the descriptor is $f_{x,y} = F[x, y] \in \mathbb{R}^{C}$, $\ell_2$-normalized so that $\|f_{x,y}\|_2 = 1$.
This architecture ensures that each pixel in the input image is associated with a descriptor that summarizes information from its local context but maintains spatial correspondence with the input (Kwiatkowski et al., 2024).
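The architecture above can be sketched in PyTorch. This is a minimal reconstruction from the description (fully convolutional, stride 1 throughout, ten residual blocks, 128-channel output with ℓ₂-normalized per-pixel descriptors); kernel sizes for the initial and final convolutions are assumptions, as the source does not specify them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Two 3x3 convs with batch norm and ReLU, stride 1 (resolution preserved)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)

class DescriptorNet(nn.Module):
    """Fully convolutional descriptor network sketch: initial conv, ten
    residual blocks, final conv to C = 128 channels, all at stride 1, so the
    output feature map keeps the input's spatial resolution."""
    def __init__(self, channels: int = 128):
        super().__init__()
        self.stem = nn.Conv2d(3, channels, 3, padding=1)   # 3x3 kernel assumed
        self.blocks = nn.Sequential(*[ResidualBlock(channels) for _ in range(10)])
        self.head = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        f = self.head(self.blocks(self.stem(x)))
        # l2-normalize each per-pixel descriptor so dot products are cosines
        return F.normalize(f, dim=1)
```

Because every layer has stride 1, an input of shape `(B, 3, H, W)` yields descriptors of shape `(B, 128, H, W)` with unit-norm channel vectors at every pixel.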
2. Construction and Definition of DCSMs
Given two images $I^A$ and $I^B$ and their extracted dense feature maps $F^A$ and $F^B$, the DCSM assigns to each pair of pixels $(p, q)$ the cosine similarity between their descriptors:

$$S(p, q) = \langle f^A_p,\, f^B_q \rangle$$

All descriptors are $\ell_2$-normalized prior to the dot product, so $S(p, q) \in [-1, 1]$. The resulting similarity tensor, with one entry for every location pair across the two images, encodes a dense correspondence likelihood between them. The DCSM supports fully detector-free matching, relying solely on per-pixel features (Kwiatkowski et al., 2024).
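The all-pairs construction can be written in a few lines of NumPy; `dcsm` here is an illustrative helper name, not part of the original implementation.

```python
import numpy as np

def dcsm(feat_a: np.ndarray, feat_b: np.ndarray) -> np.ndarray:
    """Dense cosine similarity map between two feature maps of shape (H, W, C).

    Returns a tensor of shape (Ha, Wa, Hb, Wb) whose entry [y, x, v, u] is the
    cosine similarity between descriptor (y, x) in A and descriptor (v, u) in B.
    """
    # l2-normalize descriptors so the dot product equals cosine similarity
    a = feat_a / np.linalg.norm(feat_a, axis=-1, keepdims=True)
    b = feat_b / np.linalg.norm(feat_b, axis=-1, keepdims=True)
    return np.einsum("ywc,vuc->ywvu", a, b)
```

Since every entry is an inner product of unit vectors, all values lie in $[-1, 1]$, and the similarity of any descriptor with itself is exactly 1.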
3. Contrastive Training of Dense Descriptors
Dense descriptors are optimized by contrastive learning under strong geometric perturbations:
- Positive pair sampling: for each training batch, a uniform $16 \times 16$ grid of $N = 256$ points is sampled in image $I^A$, projected to image $I^B$ under the ground-truth homography $H$, followed by small random spatial jitter.
- Descriptor sampling: features at the corresponding (possibly non-integer) grid locations are extracted via differentiable bilinear sampling (Spatial Transformer mechanism).
- Similarity matrix: for the descriptor sets $\{f^A_i\}_{i=1}^{N}$ from image $I^A$ and $\{f^B_j\}_{j=1}^{N}$ from image $I^B$, the similarity matrix is $S_{ij} = \langle f^A_i,\, f^B_j \rangle$.
- Bi-directional InfoNCE loss (CLIP-style): compute softmax distributions over the rows and columns of $S$:

$$P^{A \to B}_{ij} = \frac{\exp(S_{ij})}{\sum_{k} \exp(S_{ik})}, \qquad P^{B \to A}_{ij} = \frac{\exp(S_{ij})}{\sum_{k} \exp(S_{kj})}$$

Define primary and dual cross-entropy losses over the matching (diagonal) pairs:

$$\mathcal{L}_{A \to B} = -\frac{1}{N} \sum_{i=1}^{N} \log P^{A \to B}_{ii}, \qquad \mathcal{L}_{B \to A} = -\frac{1}{N} \sum_{i=1}^{N} \log P^{B \to A}_{ii}$$

Total loss:

$$\mathcal{L} = \tfrac{1}{2}\left(\mathcal{L}_{A \to B} + \mathcal{L}_{B \to A}\right)$$

Minimizing $\mathcal{L}$ jointly promotes high cosine similarity for true correspondences and low similarity for all other pairs (Kwiatkowski et al., 2024).
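The bi-directional loss can be sketched in NumPy as follows. The temperature value below is a placeholder (CLIP-style losses typically scale similarities by a temperature, but the source does not state one), and the function names are illustrative.

```python
import numpy as np

def _logsumexp(x: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-sum-exp along an axis, keeping dims."""
    m = np.max(x, axis=axis, keepdims=True)
    return m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))

def info_nce_bidirectional(desc_a, desc_b, temperature=0.07):
    """Symmetric (CLIP-style) InfoNCE over N corresponding descriptors.

    desc_a, desc_b: (N, C) l2-normalized descriptors; row i of A corresponds
    to row i of B. All off-diagonal pairs act as in-batch negatives.
    """
    s = desc_a @ desc_b.T / temperature           # (N, N) similarity matrix
    log_p_row = s - _logsumexp(s, axis=1)         # softmax over rows (A -> B)
    log_p_col = s - _logsumexp(s, axis=0)         # softmax over columns (B -> A)
    loss_ab = -np.mean(np.diag(log_p_row))        # true matches on the diagonal
    loss_ba = -np.mean(np.diag(log_p_col))
    return 0.5 * (loss_ab + loss_ba)
```

When corresponding descriptors coincide and non-corresponding ones are orthogonal, the loss approaches zero; permuting one side so the diagonal no longer holds the true matches drives it up.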
4. Synthetic Augmentations and Regularization
Training utilizes a synthetic data pipeline (SIDAR) to maximize descriptor invariance:
- Augmentation types: Perspective warps, occlusions, shadows, specular reflections, and complex illumination changes are extensively applied to image pairs.
- Grid jitter: Random offset is added to grid positions, mitigating overfitting to fixed pixel locations and injecting spatial regularization.
- Negative sampling: Explicit mining is unnecessary; all non-corresponding grid locations in each batch serve as negatives and are included via the InfoNCE setup.
This approach enforces robustness to a wide range of appearance and geometric transformations during matching, setting the method apart from earlier approaches that rely on real-world pairs or limited augmentation strategies (Kwiatkowski et al., 2024).
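The grid sampling and jitter described above can be sketched as follows. The function name, grid size, and jitter magnitude are illustrative defaults, not values confirmed by the source.

```python
import numpy as np

def sample_positive_pairs(H, img_shape, grid_size=16, jitter=2.0, seed=0):
    """Sample a uniform grid in image A, project it to image B under the
    ground-truth homography H, and add small random jitter (in pixels)."""
    rng = np.random.default_rng(seed)
    h, w = img_shape
    ys = np.linspace(0, h - 1, grid_size)
    xs = np.linspace(0, w - 1, grid_size)
    gy, gx = np.meshgrid(ys, xs, indexing="ij")
    pts_a = np.stack([gx.ravel(), gy.ravel()], axis=1)   # (N, 2), (x, y) order
    # project through H in homogeneous coordinates, then divide by w-component
    ones = np.ones((pts_a.shape[0], 1))
    proj = (H @ np.concatenate([pts_a, ones], axis=1).T).T
    pts_b = proj[:, :2] / proj[:, 2:3]
    # jitter projected positions to regularize against fixed pixel locations
    pts_b = pts_b + rng.uniform(-jitter, jitter, pts_b.shape)
    return pts_a, pts_b
```

The resulting (possibly non-integer) positions in image B are then fed to differentiable bilinear sampling to read out descriptors, as in section 3.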
5. Quantitative Evaluation and Comparative Results
Performance is evaluated on 4,000 image pairs with ground-truth homographies spanning undeformed and strongly distorted scenarios. The following protocols and metrics are employed:
- Procedure:
  - Estimate dense correspondences from the DCSM, then run RANSAC to recover the estimated homography $\hat{H}$.
  - Quantify error using the Mean Corner Error (MCE):

$$\mathrm{MCE} = \frac{1}{4} \sum_{i=1}^{4} \left\| \hat{H} c_i - H c_i \right\|_2$$

where the $c_i$ are the four image corners (mapped in homogeneous coordinates, with perspective division applied).
  - Measure pointwise reprojection error and count inliers at fixed pixel thresholds.
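The Mean Corner Error metric can be computed with a short NumPy helper; the function name is illustrative.

```python
import numpy as np

def mean_corner_error(H_est, H_gt, img_shape):
    """Mean Corner Error: average Euclidean distance between the four image
    corners mapped by the estimated vs. the ground-truth homography."""
    h, w = img_shape
    corners = np.array([[0, 0], [w - 1, 0], [w - 1, h - 1], [0, h - 1]], float)

    def warp(H, pts):
        # apply a 3x3 homography to (x, y) points via homogeneous coordinates
        hom = np.concatenate([pts, np.ones((len(pts), 1))], axis=1)
        out = (H @ hom.T).T
        return out[:, :2] / out[:, 2:3]

    diff = warp(H_est, corners) - warp(H_gt, corners)
    return float(np.mean(np.linalg.norm(diff, axis=1)))
```

For example, an estimated homography that is a pure translation of (3, 4) pixels relative to the ground truth yields an MCE of exactly 5 pixels.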
Outcomes:
- At 4 px grid sampling, ConDL achieves sub-pixel accuracy on a larger fraction of pairs than SuperGlue or LoFTR under strong distortions.
- Increasing sampling density to every 2 px yields more potential matches, but introduces more outliers and hence requires additional RANSAC iterations.
- Even under heavy distortion, $\ell_2$-normalized descriptors trained with SIDAR augmentations outperform classical descriptors such as SIFT, although SIFT remains competitive (Kwiatkowski et al., 2024).
6. Implementation Details and Practical Considerations
Key practical factors underlying the deployment and training of DCSMs in ConDL are summarized below:
| Component | Setting/value | Notes |
|---|---|---|
| Descriptor Network | ResNet-10 (128 channels, batch-norm, ReLU) | Fully convolutional |
| Sampling Grid | $16 \times 16$ ($N = 256$) per image during training | Adjustable at inference for spatial coverage/control |
| Optimizer | Adam | |
| Training Schedule | 500 epochs (60 h on NVIDIA RTX A6000, 48 GB) | |
| Batch Size | 16 image pairs | |
| Negative Pairs | All non-diagonal grid pairs within each batch | No explicit negative mining |
The all-pairs similarity is highly parallelizable and suited for modern GPU computation. The spatial invariance properties stem in part from aggressive synthetic augmentations and the differentiable spatial sampling process (Kwiatkowski et al., 2024).
7. Relationship to Broader Dense Matching and Descriptor Frameworks
DCSMs, as implemented in ConDL, represent a shift from detector-based and sparse-matching regimes to fully dense, learning-based correspondence estimation in the presence of extreme geometric and appearance variability. Unlike previous matching frameworks that depend on keypoint detectors (e.g., SIFT, SuperGlue), DCSMs enable per-pixel correspondence without explicit detection or pre-filtering, drawing on lessons from contrastive learning and synthetic augmentation pipelines. The adoption of a bi-directional InfoNCE loss parallels CLIP and related contrastive frameworks. The result is dense feature maps with strong invariance properties, enabling robust matching even under photometric and geometric disruptions typically challenging for traditional approaches (Kwiatkowski et al., 2024).