Modality-Aware Jaccard Distance

Updated 15 December 2025

Modality-Aware Jaccard Distance is a similarity measure that integrates modality information into set-based calculations to mitigate biases in cross-modal tasks.
It adjusts neighbor retrieval by ensuring a balanced mix of intra- and inter-modality samples, leading to better clustering and contrastive learning outcomes.
The method has been shown to improve performance metrics in tasks like visible-infrared person re-identification and medical image representation learning by reducing modality-induced embedding gaps.

A modality-aware Jaccard distance is a set-based or graph-based similarity measure explicitly modified to account for data modality (most prominently, visible versus infrared or textual versus visual) during neighbor retrieval, overlap calculation, or loss weighting. By integrating knowledge of sample modality (or other metadata encodings) into the classical Jaccard distance framework, this family of methods aims to mitigate the performance degradation and identity confusion that arise from strong modality-induced embedding biases in unsupervised and multimodal learning tasks.

1. Mathematical Foundations and Definitions

The standard Jaccard distance for sample pairs $(i, j)$ is typically defined as

$d_J(i, j) = 1 - \frac{|S(i) \cap S(j)|}{|S(i) \cup S(j)|}$

where $S(i)$ and $S(j)$ are sets associated with samples, such as $k_1$-reciprocal neighbor sets, attribute sets, or label vectors. In a modality-aware context, these sets or their construction procedures are systematically altered to enforce balanced participation from distinct modalities or attributes.
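
As a minimal sketch of the standard definition, the distance can be computed directly on Python sets. The function name and the handling of two empty sets (treated here as sharing nothing, so distance 1) are assumptions of this example, not part of the cited work.

```python
def jaccard_distance(s_i: set, s_j: set) -> float:
    """Return d_J = 1 - |S(i) & S(j)| / |S(i) | S(j)|."""
    union = s_i | s_j
    if not union:
        # Assumed convention: two empty sets share nothing, so they are
        # treated as maximally distant.
        return 1.0
    return 1.0 - len(s_i & s_j) / len(union)


# Two neighbor sets sharing 2 of 4 distinct elements.
example_distance = jaccard_distance({1, 2, 3}, {2, 3, 4})
print(example_distance)  # 0.5
```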

Given visible and infrared images $x_i$, $x_j$, let $N^{\mathrm{intra}}(i, m)$ denote the $m$ nearest neighbors of $x_i$ from the same modality, and $N^{\mathrm{inter}}(i, m)$ the $m$ nearest neighbors from the opposite modality under a feature-space metric $D(i, j)$. For $k_1 = 2m$,

$N^*(i, k_1) = N^{\mathrm{intra}}(i, m) \cup N^{\mathrm{inter}}(i, m)$

Sorting $N^*(i, k_1)$ by $D(i, \cdot)$ produces $\overline{N}^*(i, k_1)$. The modality-aware reciprocal set is defined as

$R^*(i, k_1) = \big\{j \in \overline{N}^*(i, k_1) \mid i \in \overline{N}^*(j, k_1)\big\}$

The modality-aware Jaccard distance is then

$d_{\mathrm{MAJ}}(i, j) = 1 - \frac{\big|R^*(i, k_1) \cap R^*(j, k_1)\big|}{\big|R^*(i, k_1) \cup R^*(j, k_1)\big|}$

This adjustment is further propagated to local query expansion and distance smoothing, so that reciprocal and expansion sets are built from a prescribed balance of intra- and inter-modality neighbors, correcting the bias present in vanilla KNN-Jaccard approaches (Wang et al., 8 Dec 2025).
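
The definitions above can be sketched in plain Python, assuming a precomputed distance matrix `dist` and a per-sample modality list `modality`. These names and the helper functions are illustrative and are not taken from the cited work. The sketch also adds each sample to its own reciprocal set, an assumption borrowed from common k-reciprocal re-ranking practice, and it omits the query-expansion and smoothing steps.

```python
def balanced_neighbors(i, dist, modality, m):
    """N*(i, 2m): the m nearest same-modality and m nearest other-modality
    samples of i, sorted by distance to i (i itself is excluded)."""
    ranked = sorted((j for j in range(len(dist)) if j != i),
                    key=lambda j: dist[i][j])
    intra = [j for j in ranked if modality[j] == modality[i]][:m]
    inter = [j for j in ranked if modality[j] != modality[i]][:m]
    return sorted(intra + inter, key=lambda j: dist[i][j])


def maj_distance(dist, modality, m):
    """Pairwise modality-aware Jaccard distances over reciprocal sets."""
    n = len(dist)
    nbrs = [set(balanced_neighbors(i, dist, modality, m)) for i in range(n)]
    # R*(i): neighbors j of i that also list i as a neighbor.
    # Assumption of this sketch: i is a member of its own reciprocal set.
    recip = [{i} | {j for j in nbrs[i] if i in nbrs[j]} for i in range(n)]
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            shared = len(recip[i] & recip[j])
            d[i][j] = 1.0 - shared / len(recip[i] | recip[j])
    return d


# Toy 1-D features with a visible/infrared split (hypothetical data).
feats = [0.0, 0.1, 0.9, 1.0, 2.0, 2.1, 2.9, 3.0]
modality = ["vis", "vis", "ir", "ir", "vis", "vis", "ir", "ir"]
dist = [[abs(a - b) for b in feats] for a in feats]
d_maj = maj_distance(dist, modality, m=1)
```

With `m=1`, every neighbor list holds exactly one same-modality and one cross-modality sample, regardless of how far apart the two modalities sit in feature space.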

Jaccard in Multi-Label and Metadata Contexts

For multi-hot label vectors encoding modality and anatomy (or other attributes), as in medical imaging,

$y_i, y_j \in \{0, 1\}^{k} \implies S(i) = \{m \mid y_{i,m} = 1\}$

and the Jaccard similarity $J(y_i, y_j)$ is used either directly as a sample weight in contrastive learning or for positive pair selection; the distance is $1 - J(y_i, y_j)$ (Takaya et al., 26 Aug 2025).
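
A small sketch of the multi-hot case follows. The label layout `[CT, MRI, head, chest]` is a hypothetical encoding chosen for illustration, and returning 0 for two all-zero vectors is an assumption of this example.

```python
def multi_hot_jaccard(y_i, y_j):
    """Jaccard similarity J(y_i, y_j) between two multi-hot label vectors."""
    s_i = {m for m, v in enumerate(y_i) if v == 1}
    s_j = {m for m, v in enumerate(y_j) if v == 1}
    union = s_i | s_j
    # Assumed convention: vectors with no active labels share nothing.
    return len(s_i & s_j) / len(union) if union else 0.0


# Hypothetical label layout: [CT, MRI, head, chest]
ct_head = [1, 0, 1, 0]
ct_chest = [1, 0, 0, 1]
mri_chest = [0, 1, 0, 1]

similarity = multi_hot_jaccard(ct_head, ct_chest)   # shares only "CT"
distance = 1.0 - similarity
```

Here `ct_head` and `ct_chest` share one of three active labels, so their similarity is 1/3, while `ct_head` and `mri_chest` share none and have similarity 0.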

2. Algorithmic Workflow and Variants

Core algorithmic steps for modality-aware Jaccard distance computation, as instantiated in unsupervised visible-infrared person re-identification (USVI-ReID), are summarized as follows (Wang et al., 8 Dec 2025):

Feature Extraction: Compute representations $f_\theta(x)$ for each sample.
Distance Matrix: Calculate pairwise distances $D(i, j)$.
Modality-Balanced Neighbor Retrieval: For each $i$, determine $N^*(i, k_1)$ with exactly $k_1/2$ intra- and $k_1/2$ inter-modality neighbors.
Reciprocal Set Formation: Construct $R^*(i, k_1)$ from $N^*(i, k_1)$ and reciprocal inclusion.
Local Query Expansion: Analogously define expanded sets with $k_2$ neighbors for denoised distance estimates.
Jaccard Calculation: Compute $d_{\mathrm{MAJ}}(i, j)$ on these reciprocal sets.
Global Clustering: Apply DBSCAN on the resulting $d_{\mathrm{MAJ}}$ distance matrix to derive mixed-modality clusters.

This pipeline eliminates the spurious modality clustering prevalent in traditional, modality-agnostic schemes.
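
The final clustering step can be illustrated with a textbook DBSCAN over a precomputed distance matrix. This is a minimal sketch for exposition, not the cited implementation; the toy matrix, `eps`, and `min_samples` values are assumptions. In practice a library routine would normally be used, for example scikit-learn's `DBSCAN` with `metric="precomputed"`.

```python
def dbscan_precomputed(dist, eps, min_samples):
    """Textbook DBSCAN on a square distance matrix.
    Returns one label per sample; -1 marks noise."""
    n = len(dist)
    labels = [None] * n          # None = not yet visited
    cluster = -1
    for p in range(n):
        if labels[p] is not None:
            continue
        # The eps-neighborhood includes p itself (dist[p][p] == 0).
        neighbors = [q for q in range(n) if dist[p][q] <= eps]
        if len(neighbors) < min_samples:
            labels[p] = -1       # provisional noise; may become a border point
            continue
        cluster += 1
        labels[p] = cluster
        queue = [q for q in neighbors if q != p]
        while queue:
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster      # noise reached from a core point
            if labels[q] is not None:
                continue
            labels[q] = cluster
            q_neighbors = [r for r in range(n) if dist[q][r] <= eps]
            if len(q_neighbors) >= min_samples:
                queue.extend(q_neighbors)  # q is a core point: keep expanding
    return labels


# Toy distance matrix: two tight groups plus one isolated sample.
groups = [0, 0, 0, 1, 1, 1, 2]
toy = [[0.0 if a == b else (0.1 if groups[a] == groups[b] else 0.9)
        for b in range(7)] for a in range(7)]
labels = dbscan_precomputed(toy, eps=0.3, min_samples=2)
```

In the USVI-ReID setting described above, the input matrix would be $d_{\mathrm{MAJ}}$, so that each resulting cluster can contain both visible and infrared samples of the same identity.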

For metadata-aware or attribute-driven approaches (e.g., multimodal entity linking (Nguyen et al., 24 Jan 2025), medical image pretraining (Takaya et al., 26 Aug 2025)), the algorithm involves:

Encoding metadata as multi-hot or set-valued vectors.
Computing exact or thresholded Jaccard similarity/distance matrices between all pairs.
Using these matrices for hard negative selection in contrastive loss (entity linking) or as sample weights for positive pairs in multi-label supervised contrastive learning.

3. Applications

Visible-Infrared Person Re-Identification

The modality-aware Jaccard distance enables bias-mitigated clustering, overcoming the tendency of standard Jaccard reranking to form clusters aligned to single modalities rather than true identities. Modality-aware sets compress the inter-modal distance gap, producing clusters that mix modalities within each identity. Replacing standard clustering with bias-mitigated global clustering (BMGC) raises Rank-1 accuracy on SYSU-MM01 (All search) by 10 percentage points (Wang et al., 8 Dec 2025).

Medical Imaging Representation Learning

In "ModAn-MulSupCon," the Jaccard similarity between multi-hot vectors encoding modality and anatomy is employed as a soft weighting factor for the supervised contrastive loss:

$\mathcal{L}_{\mathrm{MulSupCon}}(a) = - \frac{1}{|P_\tau(a)|} \sum_{p \in P_\tau(a)} w_{ap} \log \frac{\exp(z_a^\top z_p / T)}{\sum_k \exp(z_a^\top z_k / T)}$

where $w_{ap} = J(y_a, y_p)$, $T$ is the temperature, and $p$ is included in the positive set $P_\tau(a)$ if $w_{ap} \geq \tau$ (Takaya et al., 26 Aug 2025). This approach enables effective use of metadata even in the absence of disease-level labels.
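
The per-anchor loss can be sketched in plain Python as follows. The function names are illustrative, and two details are assumptions of this sketch rather than statements from the source: the denominator sums over all samples other than the anchor, and an anchor with no qualifying positives contributes a loss of 0.

```python
import math


def jaccard_similarity(y_a, y_p):
    """Jaccard similarity between two multi-hot label vectors."""
    inter = sum(1 for u, v in zip(y_a, y_p) if u == 1 and v == 1)
    union = sum(1 for u, v in zip(y_a, y_p) if u == 1 or v == 1)
    return inter / union if union else 0.0


def weighted_supcon_loss(a, z, y, tau, temperature):
    """Jaccard-weighted supervised contrastive loss for anchor index a."""
    others = [k for k in range(len(z)) if k != a]   # assumed: exclude anchor
    logits = {k: sum(u * v for u, v in zip(z[a], z[k])) / temperature
              for k in others}
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits.values())
    log_denom = peak + math.log(sum(math.exp(v - peak)
                                    for v in logits.values()))
    weights = {p: jaccard_similarity(y[a], y[p]) for p in others}
    positives = [p for p in others if weights[p] >= tau]
    if not positives:
        return 0.0                                   # assumed convention
    total = sum(weights[p] * (logits[p] - log_denom) for p in positives)
    return -total / len(positives)


# Three identical embeddings; only sample 1 shares the anchor's labels.
z = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
y = [[1, 0, 1, 0], [1, 0, 1, 0], [0, 1, 0, 1]]
loss = weighted_supcon_loss(0, z, y, tau=0.5, temperature=0.1)
```

Because all three embeddings coincide in this toy case, the single positive receives probability 1/2 and the loss equals $\log 2$.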

Multimodal Entity Linking

The JD-CCL framework in entity linking selects hard negatives for contrastive loss based on the Jaccard similarity of attribute sets (potentially including vision- and text-derived tokens), producing more semantically and structurally challenging negatives. By fine-tuning the balance between attribute sources, researchers can directly control the degree of modality awareness during hard negative mining (Nguyen et al., 24 Jan 2025).
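
A hedged sketch of similarity-based hard negative selection follows. It assumes that candidates with higher attribute-set Jaccard similarity to the target count as harder negatives; the entity names, attribute sets, and function names are hypothetical and do not come from the cited paper.

```python
def jaccard_similarity(a: set, b: set) -> float:
    """Jaccard similarity between two attribute sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def hardest_negatives(target, attributes, k):
    """Return the k entities whose attribute sets are most similar to the
    target's (ties broken alphabetically), excluding the target itself."""
    scored = [(jaccard_similarity(attributes[target], attrs), name)
              for name, attrs in attributes.items() if name != target]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


# Hypothetical knowledge-base entries with attribute sets.
attributes = {
    "Paris (city)": {"city", "france", "capital"},
    "Paris (Texas)": {"city", "usa", "texas"},
    "Lyon": {"city", "france"},
    "Paris Hilton": {"person", "usa", "celebrity"},
}
negatives = hardest_negatives("Paris (city)", attributes, k=2)
print(negatives)  # ['Lyon', 'Paris (Texas)']
```

Restricting or reweighting which attributes enter each set (for instance, vision-derived versus text-derived tokens) is one way the balance between attribute sources could be adjusted in such a scheme.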

4. Distinctions Relative to Standard Jaccard Methods

Standard Jaccard approaches compute neighbor sets from global feature distances without regard to modality. In environments with strong modality-, anatomy-, or metadata-induced clustering, such approaches yield biased nearest-neighbor sets dominated by the majority modality or attribute, which suppresses cross-modal associations (e.g., visible–visible pairs dominating visible–infrared pairs). Modality-aware variants explicitly correct this by:

Forcing neighbor sets to be balanced across modalities, attributes, or tags.
Ensuring that positive set construction and reciprocal-graph linkage cannot be dominated by any single modality.
Introducing set-based similarity into weighting schemes, as in contrastive learning, so that the definitions of both "hard positive" and "hard negative" are dynamic and metadata-aware.

In contrast, single-label or modality-ignorant variants are implicitly dependent on the bulk feature distribution, inheriting its biases and failing to link across modality-induced embedding gaps (Wang et al., 8 Dec 2025, Takaya et al., 26 Aug 2025).

5. Integration with Downstream Learning and Empirical Observations

The modality-aware Jaccard distance has proven effective in both clustering-centric and contrastive-centric pipelines:

Clustering: In unsupervised Re-ID, clusters derived via $d_{\mathrm{MAJ}}$ exhibit improved identity mixing and are less susceptible to local cluster errors propagated by optimal transport-based cross-modal association (Wang et al., 8 Dec 2025).
Contrastive Learning: By weighting positive pairs with Jaccard similarity and restricting to samples above a set threshold, learned representations reflect the semantic continuity encoded in multi-label metadata, outperforming instance-level and label-agnostic baselines when fine-tuned on downstream tasks (Takaya et al., 26 Aug 2025).
Negative Selection: Entity linking experiments confirm that hard negative selection via attribute-based Jaccard similarity yields higher Hit@1 and MRR compared to random or batch-based negative selection, with optimal performance at $k=6$ hard negatives (Nguyen et al., 24 Jan 2025).

No additional hyperparameters beyond those controlling neighbor-set sizes or Jaccard-positive thresholds are introduced; only modality or metadata labels are required.

Representative Empirical Results

Application	Baseline	Modality-Aware Jaccard	Gain
USVI-ReID Rank-1 (SYSU-MM01, All search)	54.9%	64.9% (BMGC)	+10.0 pp
MEL Hit@1 (JD-CCL)	66.55	70.25 (k=6)	+3.70

6. Generalization and Implementation Guidance

Modality-aware Jaccard distance techniques generalize readily to any domain characterized by:

Multiple data modalities (e.g., vision, text, audio),
Multi-label or attribute-annotated datasets (e.g., device, view, tissue type),
Set-valued or multi-hot metadata.

Extensions of the core scheme may involve:

Alternative set-based similarity measures (e.g., Dice, overlap coefficient) in place of Jaccard, to reflect domain-specific priorities or to incorporate continuous metadata via analogous weighting schemes (Takaya et al., 26 Aug 2025).
Attribute weighting or pruning strategies to manage noise arising from generic or high-frequency meta-tags (Nguyen et al., 24 Jan 2025).
Efficient approximation techniques (MinHash, LSH) for negative sampling or neighbor-set construction when dealing with large-scale or high-cardinality attribute spaces.
Post-hoc tuning of modality-balancing weights or positive-set thresholds to optimize task-specific performance.
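
Two of these extensions can be sketched briefly: the Dice and overlap coefficients as drop-in alternatives to Jaccard, and a basic MinHash estimator of Jaccard similarity for large attribute sets. All function names are illustrative, the use of salted BLAKE2b hashes is a choice made for this example, and the cited works do not prescribe any of these details.

```python
import hashlib


def dice_coefficient(a: set, b: set) -> float:
    """Dice similarity: 2|A & B| / (|A| + |B|)."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def overlap_coefficient(a: set, b: set) -> float:
    """Overlap coefficient: |A & B| / min(|A|, |B|)."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def minhash_signature(items, num_hashes=128):
    """One minimum per salted hash function; items must be non-empty."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{x}".encode(), digest_size=8).digest(),
                "big")
            for x in items))
    return signature


def minhash_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(u == v for u, v in zip(sig_a, sig_b)) / len(sig_a)


a, b = {1, 2, 3, 4}, {3, 4, 5}
dice_ab = dice_coefficient(a, b)        # 4/7
overlap_ab = overlap_coefficient(a, b)  # 2/3
estimate = minhash_jaccard(minhash_signature({"ct", "head"}),
                           minhash_signature({"ct", "head"}))
```

The MinHash estimate is approximate: its variance shrinks as `num_hashes` grows, so the signature length trades accuracy against memory and compute.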

These methods are particularly favored in situations with rich, high-quality metadata and in architectures or tasks where cross-modal association and invariance are paramount objectives.

7. Summary and Impact

Modality-aware Jaccard distance constitutes a precise, metadata-driven extension of the classical Jaccard index, systematically integrating modality or attribute awareness at the structural or set-theoretic level in neighbor retrieval, clustering, and loss weighting. By directly addressing the modality-induced distance biases that limit cross-modal and multimodal machine learning, these techniques achieve substantial performance improvements—demonstrated in unsupervised visible-infrared Re-ID, medical image representation learning, and multimodal entity linking. The underlying principles and implementation recipes support straightforward adaptation to new domains characterized by discrete, multi-label, or modality-rich metadata (Wang et al., 8 Dec 2025, Takaya et al., 26 Aug 2025, Nguyen et al., 24 Jan 2025).