Similarity-Based Feature Engineering

Updated 16 January 2026
  • Similarity-based feature engineering is a methodology that designs adaptive feature spaces where learned similarity functions capture semantic and geometric data structures.
  • It employs techniques like manifold mining, low-rank decomposition, and adversarial strategies to overcome the limitations of traditional Euclidean metrics.
  • By integrating structure recovery with adaptive similarity learning, these methods boost model accuracy and interpretability across domains such as vision, chemistry, and multi-modal analysis.

Similarity-based feature engineering encompasses a family of methodologies for constructing, selecting, or transforming feature representations such that the resulting similarities in feature space reflect meaningful relationships for the task at hand. Rather than relying solely on raw distance metrics or handcrafted attributes, these approaches systematically seek to induce feature spaces where the geometric structure—often measured by learned or task-driven similarity functions—aligns with domain-specific or semantic structure.

1. Foundational Motivation and Limitations of Naive Similarity

The central premise of similarity-based feature engineering is that standard distance functions (often Euclidean) are limited in complex, high-dimensional, or heterogeneous data regimes. In vision, for example, the ambient Euclidean geometry of deep features is poorly aligned with semantic relationships: manifold “folding” can place instances of the same class far apart, while heterogeneity in local densities leads to failures in neighborhood-based supervision and hard-sample mining (Wang et al., 2021, Huang et al., 2016). In chemistry, naive Euclidean distances between descriptors may obscure relevant physicochemical groupings or transitional invariants (Thygesen et al., 2022).

Failures of global metrics in high-variance or imbalanced density regions are widely documented:

  • Large-k nearest neighbor sets under global distance tend to include numerous false positives.
  • Restricting neighborhoods to very small k maintains higher semantic precision but yields weak, uninformative pseudo-labeling and poorly calibrated learning signals (Wang et al., 2021).
  • In local metric learning and clustering, the lack of structural adaptivity leads to collapse of intra-class cohesion or fluidity across cluster boundaries (Huang et al., 2016, Wang et al., 2011).
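
The first two failure modes can be reproduced on synthetic data. The sketch below (illustrative geometry, not from the cited papers) places two classes on nearby arcs and measures neighborhood precision under a global Euclidean metric: small k stays precise but uninformative, while large k admits many false positives.

```python
import math

# Toy illustration (synthetic data, not from the cited papers): two classes
# lie on nearby arcs, and Euclidean k-NN neighborhoods mix the classes as k
# grows, while small k stays precise but uninformative.
def make_arc(radius, n, label):
    return [((radius * math.cos(t), radius * math.sin(t)), label)
            for t in (i * math.pi / (n - 1) for i in range(n))]

data = make_arc(1.0, 40, 0) + make_arc(1.3, 40, 1)

def knn_precision(k):
    """Mean fraction of same-class points among each point's k nearest."""
    precisions = []
    for p, y in data:
        neighbors = sorted((math.dist(p, q), yq) for q, yq in data if q != p)
        same = sum(1 for _, yq in neighbors[:k] if yq == y)
        precisions.append(same / k)
    return sum(precisions) / len(data)

small_k, large_k = knn_precision(3), knn_precision(25)
```

On this toy geometry the small-k neighborhoods are nearly pure while the large-k neighborhoods are heavily contaminated, mirroring the trade-off described above.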

Similarity-based feature engineering frameworks are motivated by the need to (i) model the intrinsic manifold or graph structure underlying sample distributions, (ii) perform adaptive metric or similarity learning—often in conjunction with feature projection—and (iii) construct feature spaces supporting rich, high-precision pseudo-supervision for self-supervised, unsupervised, or hybrid supervised learning.

2. Similarity Mining and Structure Recovery on the Feature Manifold

A key innovation is the explicit mining or estimation of latent geometric or semantic structure, typically through manifold learning, low-rank decomposition, or adversarial proxy generation.

Instance Similarity Learning (ISL) (Wang et al., 2021) deploys a generative adversarial network in embedding space to mine the local feature manifold. A memory bank is maintained over all embeddings, with an accompanying binary similarity matrix. For each anchor, ISL synthesizes “proxy” features on the manifold boundary between current positives and negatives, using a generator-discriminator game over sampled triplets. The optimal proxy—judged by the discriminator’s confidence and an $\ell_2$ proximity threshold—enables dynamic, reliable enlargement of positive sets. This adversarial mining directly counteracts the Euclidean failure by exploring local geodesics and boundaries.
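
A minimal sketch of the proximity-thresholded proxy step follows, with the generator replaced by simple linear interpolation (an assumption for illustration; ISL's actual proxies come from the trained GAN and its discriminator):

```python
import math

# Illustrative sketch (not the ISL implementation): a candidate proxy is an
# interpolation between an anchor's positive and a nearby negative, accepted
# only if it stays within an l2 radius of the anchor, mimicking ISL's
# proximity-thresholded proxy selection.
def propose_proxy(anchor, positive, negative, alpha=0.5):
    """Linear interpolation standing in for the generator G."""
    return tuple(alpha * p + (1 - alpha) * n for p, n in zip(positive, negative))

def accept_proxy(anchor, proxy, radius):
    """l2 proximity threshold: keep proxies near the anchor's neighborhood."""
    return math.dist(anchor, proxy) <= radius

anchor, pos, neg = (0.0, 0.0), (0.2, 0.0), (1.0, 0.0)
proxy = propose_proxy(anchor, pos, neg)  # midpoint of pos and neg
keep = accept_proxy(anchor, proxy, radius=0.7)
```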

Closed-form non-linear low-rank representations (KLRR/LRR) (Wang et al., 2011) recover block-diagonal structure in the affinity matrix, revealing latent manifold affiliations. By decomposing the kernel matrix and shrinking its spectrum, a “structural” similarity measure is obtained that captures both global data geometry and class-manifold separability.

In both paradigms, the process of manifold mining replaces static, distance-based neighborhoods with adaptively learned, semantically aligned proximity sets, yielding more effective supervised signals or unlabelled embedding anchors.

3. Formulations of Learned or Adaptive Similarity

A core class of methods directly learns local or global similarity metrics, often in tandem with feature selection or embedding.

  • Local Similarity-Aware Embedding (Huang et al., 2016): The Position-Dependent Deep Metric (PDDM) learns a score $S_{ij}$ for each feature pair using a deep, position-aware transformation that incorporates both absolute (mean) and relative (difference) statistics between normalized embeddings. PDDM serves as a learned similarity head plugged into the main network, whose outputs guide hard-mining and large-margin learning.
  • Adaptive Similarity Graphs in Multi-Modal Selection (Shi et al., 2020): Here, both the feature-selection projection $W$ (across multiple data modalities) and the similarity matrix $S$ (subject to sparsity and class-consistency constraints) are learned jointly, with $S$ dynamically induced by class-conditional local proximity in the projected space. This iterative updating sharpens both the selected feature set and the affinity structure toward maximal discriminability.
  • Functional Equivalence and Feature Complexity (Chen et al., 2023): Beyond numeric similarity, this approach defines functional equivalence among neural features via the existence of invertible transformations yielding identical outputs. A metric on “feature complexity” at the level of equivalence posets captures the true number of independent features, with closed-form, data-free iterative merging (IFM) to eliminate redundant representations.
  • Hybrid Feature-Kernel Models (HFSM) (Kueper et al., 2022): By combining explicit feature vectors and kernel (similarity) terms, penalized for kernel sparsity and fit via convex optimization, HFSMs allocate predictive power to either standard features or learned similarity blocks. In EHR applications, structure-aware Jaccard kernels (sensitive to rare-presence and common-absence events) are constructed and combined with interpretable linear terms.
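
As a concrete illustration of the first item, a PDDM-style score can be sketched from the mean and absolute-difference statistics of a normalized pair; the learned fusion layers are replaced here by a fixed linear combination (the weights w_mean and w_diff are illustrative, not from the paper):

```python
import math

# Hedged sketch of a PDDM-style similarity head (Huang et al., 2016): the
# score combines the elementwise mean and absolute difference of two
# l2-normalized embeddings; the learned layers are replaced by fixed
# illustrative weights.
def normalize(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def pddm_score(u, v, w_mean=0.5, w_diff=-1.0):
    """Position-dependent score from mean ('absolute') and difference
    ('relative') statistics of the embedding pair."""
    u, v = normalize(u), normalize(v)
    mean = [(a + b) / 2 for a, b in zip(u, v)]
    diff = [abs(a - b) for a, b in zip(u, v)]
    # Stand-in for the learned fusion layers: a fixed linear combination.
    return w_mean * sum(mean) + w_diff * sum(diff)

close = pddm_score([1.0, 0.1], [0.9, 0.2])   # similar pair
far = pddm_score([1.0, 0.1], [-0.2, 1.0])    # dissimilar pair
```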

4. Similarity-Based Feature Construction, Embedding, and Selection

Feature engineering via similarity typically encompasses:

  • Landmark-based Embeddings (Kar et al., 2011, Wang et al., 2011): A canonical map $\phi(x) = [f(K(x,\ell_1^+) - K(x,\ell_1^-)), \dots]$ embeds each sample relative to a selected set of diverse landmark pairs (or single points in the structural case), with the transfer function $f$ optimized for task adequacy. The process includes diversity-driven (DSELECT) heuristics to maximize coverage of the input space and reduce redundancy among landmarks.
  • Similarity-described Descriptors in Science (Thygesen et al., 2022): In computational chemistry, feature vectors are constructed from physically motivated descriptors (transition fluxes, topological invariants, histogrammed scatterplots) and assessed via pairwise distance distributions and alignment to known labels or clusters. The empirical clustering and rank-order correlation of distances across competing representations inform engineering choices.
  • Adaptive Similarity Matrices with Feature Sparsity (Shi et al., 2020): In multi-modality analysis, feature selectors and similarity graphs are optimized jointly. Similarity $S_{ik}$ is defined for the $K$ nearest class-neighbors (by learned projected metric), and achievability is ensured via convex alternating minimization and explicit sparsification by an $\ell_{2,1}$ penalty.
  • Similarity-derived Statistical Aggregates in Forecasting (Wu et al., 2022): For large-scale, sparse prediction (traffic), KNN-based similarity is computed over a masking-corrected or imputed global state vector; for each query, statistics (mean, std, percentiles) are extracted from the $k$-nearest-neighbor set (based on the chosen metric) and concatenated with standard features, conveying latent global structure into the learner.
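
The last item can be sketched concretely; the history table, distance metric, and chosen statistics below are illustrative stand-ins for the paper's traffic setup:

```python
import math
import statistics

# Sketch of similarity-derived aggregate features in the spirit of
# Wu et al., 2022: for a query, summary statistics of its k nearest
# neighbors' targets are appended to the standard feature vector.
history = [((0.1, 0.2), 10.0), ((0.2, 0.1), 12.0),
           ((0.9, 0.8), 50.0), ((1.0, 0.9), 55.0), ((0.15, 0.15), 11.0)]

def knn_aggregates(query, k=3):
    """Mean/std/median of the k nearest neighbors' targets (Euclidean)."""
    ranked = sorted(history, key=lambda s: math.dist(query, s[0]))
    targets = [y for _, y in ranked[:k]]
    return (statistics.mean(targets),
            statistics.pstdev(targets),
            statistics.median(targets))

base_features = (0.12, 0.18)
augmented = base_features + knn_aggregates(base_features)
```

The aggregates carry neighborhood-level structure that a pointwise feature vector cannot, which is what lets them rank highly by gain in the boosted-tree learner.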

5. Methodological Implementation and Algorithmic Schemes

The operational pipelines follow similar stages: similarity definition or learning, construction/mining of neighborhoods or affinity graphs, embedding/feature construction, and integration into training. Representative high-level pseudocode is given for key approaches:

ISL (Manifold Mining + Feature Learning) (Wang et al., 2021):

for round = 1 to R:
    for each anchor i:
        sample triplets (f_i, f_i^p, f_i^n)
        generate proxies f_i^g with G
        optimize GAN loss for (G, D)
        update positives with best f_i^{g*}
    for each minibatch:
        extract features, update memory
        compute similarity probabilities (cosine-softmax)
        compute total loss L_sim, backpropagate

Adaptive Similarity + Sparse Selection (ASMFS) (Shi et al., 2020):

Repeat until convergence (at most T iterations):
    Update W via closed-form solution, fixing S
    Update group-sparsity (l2,1) weights
    For each sample i, update S[i,:]:
        Compute projected distances and K nearest neighbors
        Apply the closed-form quadratic update for S_i
Return nonzero rows of W and the learned S
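
The per-row update admits a closed form in the adaptive-neighbors literature (a simplex-constrained quadratic program with the regularizer chosen so exactly K entries stay nonzero); whether ASMFS uses exactly this expression is an assumption, but it illustrates the "closed-form quadratic update for S_i" step:

```python
# Hedged sketch of the per-sample similarity update: closed-form solution of
# min_s sum_j (d_j s_j + gamma s_j^2) over the simplex, with gamma set so
# exactly K entries are nonzero (adaptive-neighbors closed form; whether
# ASMFS uses precisely this expression is an assumption).
def update_similarity_row(distances, K):
    """Return one row of S: the K nearest get positive weight, the rest zero."""
    order = sorted(range(len(distances)), key=lambda j: distances[j])
    d = [distances[j] for j in order]
    denom = K * d[K] - sum(d[:K])
    row = [0.0] * len(distances)
    for rank in range(K):
        row[order[rank]] = (d[K] - d[rank]) / denom if denom > 0 else 1.0 / K
    return row

row = update_similarity_row([0.1, 0.4, 0.2, 0.9], K=2)
```

The resulting row is nonnegative, sums to one, and is K-sparse, which is exactly the structure the alternating minimization needs for the affinity graph.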

Landmark Embedding + Diversity Heuristic (Kar et al., 2011):

Initialize L = {random sample}
For j = 2 to d:
    Select z minimizing average similarity to L
    L.append(z)
Construct embedding phi(x) via chosen landmarks and transfer f
Train linear classifier on phi(x)
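
The landmark pseudocode above can be made runnable; an RBF similarity standing in for the task-specific K and the identity transfer f are illustrative choices, not those of the paper:

```python
import math
import random

# Runnable sketch of the landmark pipeline, with RBF similarity as K and
# identity as the transfer f (both choices are illustrative).
def rbf(x, z, gamma=1.0):
    return math.exp(-gamma * math.dist(x, z) ** 2)

def dselect(points, d, seed=0):
    """Greedy diversity heuristic: grow the landmark set with the point
    least similar, on average, to the landmarks chosen so far."""
    rng = random.Random(seed)
    landmarks = [rng.choice(points)]
    while len(landmarks) < d:
        z = min((p for p in points if p not in landmarks),
                key=lambda p: sum(rbf(p, l) for l in landmarks) / len(landmarks))
        landmarks.append(z)
    return landmarks

def embed(x, landmarks):
    """phi(x): similarity of x to each landmark (transfer f = identity)."""
    return [rbf(x, l) for l in landmarks]

points = [(0.0, 0.0), (0.1, 0.0), (2.0, 2.0), (2.1, 2.0), (5.0, 0.0)]
L = dselect(points, d=3)
phi = embed((0.05, 0.0), L)
```

A linear classifier trained on phi then operates in the landmark-relative feature space rather than on raw coordinates.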

Best-practice guidelines encompass batch-size, margin, and loss hyperparameters (Huang et al., 2016), alternating minimization tolerances, neighborhood size for affinity graphs (Shi et al., 2020), and landmark-set size in embedding methods (Kar et al., 2011).

6. Empirical Outcomes, Diagnostics, and Selection Criteria

Similarity-based feature engineering routinely yields substantial gains over naively engineered features and even over black-box deep models in settings where geometric structure or task-driven similarity is misaligned with fixed metrics.

  • In unsupervised vision, ISL increases large-neighborhood positive precision from ~50% (raw Euclidean) to >80% (learned manifold), yielding +6–8% absolute top-1 accuracy on CIFAR and ImageNet (Wang et al., 2021).
  • In deep metric learning, PDDM-based local similarity delivers +10–15 points in retrieval R@K and enables state-of-the-art transfer/zero-shot performance (Huang et al., 2016).
  • In multi-modality medical analysis, ASMFS realizes +2–3% accuracy over classical multi-kernel baselines by enforcing geometric alignment and joint sparsification (Shi et al., 2020).
  • In traffic forecasting, similarity-based aggregate features account for 14 of the top 20 features by total gain and reduce MAE by >4% (Wu et al., 2022).
  • In interpretability-focused domains (e.g., binary code analysis), purely feature-engineered similarity achieves AUC >0.95, matching or exceeding deep methods while affording insight into feature-source impact and cross-variant transferability (Kim et al., 2020).

Diagnostics highlighted in these frameworks include the shape of pairwise distance distributions (e.g., bi-modality for strong clusters), rank-order correlation and Jaccard overlap of neighbor sets, and ablation studies on the contribution of similarity-derived features versus hand-crafted or static ones (Thygesen et al., 2022).
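
One of these diagnostics, Jaccard overlap of neighbor sets, can be sketched directly; the two toy representations below are illustrative:

```python
import math

# Sketch of a neighbor-set diagnostic: mean Jaccard overlap of k-nearest-
# neighbor sets under two competing feature representations. High overlap
# means the representations induce similar neighborhood structure.
def knn_set(idx, features, k):
    others = sorted((j for j in range(len(features)) if j != idx),
                    key=lambda j: math.dist(features[idx], features[j]))
    return set(others[:k])

def mean_jaccard(feats_a, feats_b, k=2):
    scores = []
    for i in range(len(feats_a)):
        a, b = knn_set(i, feats_a, k), knn_set(i, feats_b, k)
        scores.append(len(a & b) / len(a | b))
    return sum(scores) / len(feats_a)

rep_a = [(0.0,), (0.1,), (1.0,), (1.1,)]
rep_b = [(0.0,), (0.2,), (2.0,), (2.2,)]  # rescaled: same neighbor structure
overlap = mean_jaccard(rep_a, rep_b, k=2)
```

Because rep_b is a monotone rescaling of rep_a, their kNN sets coincide; a representation change that reshuffles neighborhoods would drive the overlap toward zero.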

7. Extensions, Generalization, and Open Directions

Similarity-based feature engineering is extensible beyond computer vision or raw data domains. Generalizations and adaptation points include:

  • Non-vision domains: Domains with high-dimensional, sparse, or multi-modal data—such as EHR, traffic, or scientific ensemble data—benefit from both explicit domain-informed and similarity-mined features (Kueper et al., 2022, Thygesen et al., 2022).
  • Integration in neural architectures: Paired feature similarity modules, learned proxies, or even transformer architectures can include local or global similarity heads to drive training (Wang et al., 2021, Zhou et al., 2024).
  • Unsupervised/weakly-supervised adaptation: Methods that mine structure or learn similarity locally can be embedded in self-supervised or weakly supervised pipelines, retaining the benefits of robust pseudo-labels and improved data efficiency (Wang et al., 2021, Wang et al., 2011).
  • Open problems: Incorporating domain priors on similarity, learning cross-modal similarity functions, and scaling similarity mining to very large or graph-structured datasets remain open challenges.

Empirical evidence confirms that explicitly aligning feature engineering with the geometry of task-relevant similarity, whether via adaptive graphs, manifold mining, hybrid models, or embedding constructions, yields substantial improvements in learning, robustness, and interpretability (Wang et al., 2021, Huang et al., 2016, Shi et al., 2020, Thygesen et al., 2022, Kar et al., 2011, Wu et al., 2022).
