Joint Generalized Cosine Similarity (JGCS)
- JGCS is a family of similarity measures that extends classical cosine similarity through convex cost functions, learned metric tensors, and hypervolume angles for multi-vector and multi-modal comparisons.
- JGCS is applied in semantic alignment, cross-domain matching, and contrastive learning, enhancing tasks like word similarity, visual matching, and multi-modal classification.
- JGCS improves task adaptation and computational efficiency by incorporating context-sensitive formulations that recover classical cosine similarity as a special case under specific conditions.
Joint Generalized Cosine Similarity (JGCS) is a family of similarity measures that generalize the classical cosine similarity to accommodate a broad range of data structures, learning contexts, and multi-vector comparisons. JGCS encompasses formulations based on convex cost functions (“Bregman cosine”), metric tensors, affine-quadratic forms for cross-domain matching, and the extension to joint similarity over $k$ vectors via hypervolume angles. These generalizations enable task-adapted, context-sensitive, and multi-modal alignment, surpassing the limitations of standard pairwise cosine similarity in both expressive power and computational efficiency.
1. Mathematical Formulations of JGCS
At its core, JGCS refers to any of several extensions of cosine similarity designed to capture richer geometric, statistical, and algebraic relationships.
1.1. Bregman (Convex Cost Function) JGCS
Given a convex cost function $\varphi$, the similarity between two points $x$ and $y$ is defined as
$$\mathrm{JGCS}_\varphi(x, y) = \frac{\langle g_x, g_y \rangle}{\|g_x\|\,\|g_y\|},$$
where $g_x = \nabla\varphi(x)$ if $\varphi$ is differentiable at $x$, and otherwise any subgradient $g_x \in \partial\varphi(x)$, with practical schemes choosing the subgradient so as to maximize the similarity (Gunay et al., 2014); a minimal sketch of this construction follows the special cases below. Special cases include:
- Negative entropy: $\varphi(x) = \sum_i x_i \log x_i$, with $\nabla\varphi(x)_i = 1 + \log x_i$
- Total variation: $\varphi(x) = \sum_i |x_{i+1} - x_i|$, with the subgradient given by the signed finite differences
- Euclidean: $\varphi(x) = \tfrac{1}{2}\|x\|^2$, $\nabla\varphi(x) = x$, recovering ordinary cosine similarity
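The following minimal NumPy sketch illustrates this construction; the function names and the specific gradient choices are illustrative assumptions rather than code from Gunay et al. (2014).

```python
import numpy as np

def convex_cost_cosine(x, y, grad_phi):
    """Cosine similarity between the (sub)gradients of a convex cost phi."""
    gx, gy = grad_phi(x), grad_phi(y)
    return float(gx @ gy / (np.linalg.norm(gx) * np.linalg.norm(gy)))

# Euclidean cost phi(x) = 0.5 * ||x||^2: gradient is x, so this is plain cosine.
grad_euclidean = lambda x: x

# Negative entropy phi(x) = sum_i x_i log x_i (x > 0): gradient is 1 + log x.
grad_neg_entropy = lambda x: 1.0 + np.log(x)

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.1, 0.4, 0.5])
print(convex_cost_cosine(x, y, grad_euclidean))    # ordinary cosine similarity
print(convex_cost_cosine(x, y, grad_neg_entropy))  # entropy-weighted variant
```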
1.2. Metric Tensor (Learned Inner Product) JGCS
Let $M$ be a learned symmetric positive semi-definite matrix. For $x, y \in \mathbb{R}^d$,
$$\mathrm{JGCS}_M(x, y) = \frac{x^\top M y}{\sqrt{x^\top M x}\,\sqrt{y^\top M y}},$$
with $M$ parameterized via a learned linear map $W$ (typically $M = W^\top W$) to ensure $M$ is PSD (Vos et al., 2022). This enables context-sensitive and data-driven geometric adaptation.
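A minimal sketch of this learned-metric similarity, assuming the $M = W^\top W$ parameterization described above (dimensions and initialization are illustrative):

```python
import numpy as np

def metric_cosine(x, y, W):
    """Cosine similarity under the learned metric M = W^T W (PSD by construction)."""
    Wx, Wy = W @ x, W @ y                          # x^T M y = (W x) . (W y)
    num = Wx @ Wy
    den = np.sqrt(Wx @ Wx) * np.sqrt(Wy @ Wy)      # sqrt(x^T M x) * sqrt(y^T M y)
    return float(num / den)

d, k = 768, 64                                     # embedding dim, projection dim (illustrative)
rng = np.random.default_rng(0)
W = rng.normal(size=(k, d)) / np.sqrt(d)           # stand-in for a learned linear map
x, y = rng.normal(size=d), rng.normal(size=d)
print(metric_cosine(x, y, W))                      # W = identity recovers standard cosine
```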
1.3. Affine-Quadratic Form JGCS for Cross-Domain Matching
For $x \in \mathbb{R}^{d_x}$, $y \in \mathbb{R}^{d_y}$ (possibly high-level features from different domains), JGCS is defined as a quadratic form over the augmented vector $[x; y; 1]$:
$$S(x, y) = \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}^\top \begin{bmatrix} A & C & d \\ C^\top & B & e \\ d^\top & e^\top & f \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix},$$
where $A, B$ are PSD, $C$ is the cross-domain matrix, $d, e$ are affine terms, and $f$ is a scalar bias (Lin et al., 2016). This can be decomposed into a weighted sum of affine Mahalanobis distance and affine cosine terms, subsuming both as special cases.
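Expanded, the block quadratic form contributes Mahalanobis-like, cross-domain, and affine components. The sketch below assumes the $[x; y; 1]$ augmentation given above and uses placeholder parameters rather than the learned network weights of Lin et al. (2016).

```python
import numpy as np

def affine_quadratic_similarity(x, y, A, B, C, d_vec, e_vec, f):
    """S(x, y) = [x; y; 1]^T [[A, C, d], [C^T, B, e], [d^T, e^T, f]] [x; y; 1]."""
    return float(x @ A @ x + y @ B @ y                        # Mahalanobis-like terms
                 + 2.0 * (x @ C @ y)                          # cross-domain (cosine-like) term
                 + 2.0 * (d_vec @ x) + 2.0 * (e_vec @ y) + f) # affine terms and bias

dx, dy = 128, 64
rng = np.random.default_rng(1)
Ra, Rb = rng.normal(size=(dx, dx)), rng.normal(size=(dy, dy))
A, B = Ra @ Ra.T, Rb @ Rb.T                                   # PSD blocks
C = rng.normal(size=(dx, dy))                                 # cross-domain matrix
d_vec, e_vec, f = rng.normal(size=dx), rng.normal(size=dy), 0.0
x, y = rng.normal(size=dx), rng.normal(size=dy)
print(affine_quadratic_similarity(x, y, A, B, C, d_vec, e_vec, f))
```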
1.4. Multi-Vector/Modal JGCS via Gram Determinant Hypervolume
For vectors $v_1, \dots, v_k \in \mathbb{R}^d$, the JGCS is defined through the normalized Gram matrix $\hat G$ with entries $\hat G_{ij} = \langle v_i, v_j \rangle / (\|v_i\|\,\|v_j\|)$, whose determinant is the squared hypervolume of the parallelotope spanned by the unit-normalized vectors:
$$\mathrm{JGCS}(v_1, \dots, v_k) = \cos\theta_k = \sqrt{1 - \det \hat G}.$$
This recovers standard cosine similarity for $k = 2$ but enables well-defined joint similarity over any number of vectors (Chen et al., 6 May 2025).
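The sketch below computes this joint similarity directly from the normalized Gram matrix; it follows the formulation stated above, and the exact sign and normalization conventions of Chen et al. (6 May 2025) may differ.

```python
import numpy as np

def hypervolume_jgcs(vectors):
    """Joint similarity of k vectors via the normalized Gram determinant.

    det(G_hat) is the squared hypervolume of the parallelotope spanned by the
    unit-normalized vectors; sqrt(1 - det(G_hat)) equals |cos(theta)| for k = 2,
    1 for linearly dependent inputs, and 0 for mutually orthogonal ones.
    """
    V = np.stack([v / np.linalg.norm(v) for v in vectors])   # (k, d), unit rows
    G_hat = V @ V.T                                           # k x k normalized Gram matrix
    return float(np.sqrt(max(0.0, 1.0 - np.linalg.det(G_hat))))
```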
2. Theoretical Properties and Special Cases
JGCS possesses key invariances and compatibility with geometric intuition:
- Rotation invariance: Under orthogonal transformations, both the Bregman and hypervolume-based JGCS remain unchanged.
- Downward compatibility: The hypervolume formulation yields the ordinary cosine for $k = 2$; the metric-tensor and convex cost function approaches recover standard cosine when $M = I$ or $\varphi(x) = \tfrac{1}{2}\|x\|^2$.
- Permutation symmetry: Joint similarity is invariant to the order of vectors up to sign (hypervolume angle) (Chen et al., 6 May 2025).
- Sensitivity to data semantics: By choosing $\varphi$, $M$, or the affine maps appropriately, JGCS can emphasize distribution shape (via negative entropy), edge/variation structure (total variation), or task/context-specific axes (learned $M$ / $W$) (Gunay et al., 2014, Vos et al., 2022).
- Degeneracy/extrema: For linearly dependent vectors (in hypervolume JGCS), $\det \hat G = 0$ and JGCS $= 1$; for mutually orthogonal vectors, $\det \hat G = 1$ and JGCS $= 0$ (both checked numerically in the sketch below).
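These extrema and the $k = 2$ compatibility can be verified numerically; the snippet restates the Section 1.4 sketch in compact form so that it is self-contained.

```python
import numpy as np

def hypervolume_jgcs(vectors):
    V = np.stack([v / np.linalg.norm(v) for v in vectors])
    return float(np.sqrt(max(0.0, 1.0 - np.linalg.det(V @ V.T))))

u, v = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
w = np.array([1.0, 1.0, 0.0])
assert abs(hypervolume_jgcs([u, v]) - 0.0) < 1e-9                 # orthogonal pair -> 0
assert abs(hypervolume_jgcs([u, w]) - np.cos(np.pi / 4)) < 1e-9   # k = 2 recovers |cos 45 deg|
assert abs(hypervolume_jgcs([u, v, w]) - 1.0) < 1e-9              # dependent triple -> 1
```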
3. Algorithmic and Computational Considerations
Most JGCS variants introduce computational overhead compared to plain cosine, but retain tractability and favorable scaling.
| Variant | Main Computational Cost | Key Notes |
|---|---|---|
| Bregman/convex cost (Gunay et al., 2014) | Gradient/subgradient evaluation | Subgradient selection needed for nondifferentiable $\varphi$ |
| Metric tensor (Vos et al., 2022) | $O(d^2)$ for $x^\top M y$ | PSD enforced via $M = W^\top W$ parameterization |
| Affine-quadratic form (Lin et al., 2016) | Quadratic form in $[x; y; 1]$; $A$/$B$/$C$ learned as weights in deep net | End-to-end learning via hinge loss |
| Hypervolume/k-modal (Chen et al., 6 May 2025) | $O(k^2 d + k^3)$ per $k$-tuple | Gram det via Cholesky; efficient for moderate $k$ |
The hypervolume JGCS's $O(k^2 d + k^3)$ scaling remains practical for typical $k$, while avoiding the combinatorial redundancy of computing all $\binom{k}{2}$ pairwise cosines as $k$ increases (Chen et al., 6 May 2025). For nondifferentiable convex $\varphi$, subgradient maximization may add further cost, though the similarity reduces to standard norms and inner products in smooth cases (Gunay et al., 2014).
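For nearly singular Gram matrices, a Cholesky factorization of a slightly shifted Gram matrix is one plausible way to keep the determinant computation stable; the shift value below is an illustrative assumption.

```python
import numpy as np

def regularized_gram_logdet(V, eps=1e-6):
    """log det of the normalized, diagonally shifted Gram matrix of the rows of V.

    Cost: O(k^2 d) for the Gram matrix plus O(k^3) for the Cholesky factor.
    """
    Vn = V / np.linalg.norm(V, axis=1, keepdims=True)   # unit-normalize each of the k rows
    G = Vn @ Vn.T + eps * np.eye(V.shape[0])            # regularized k x k Gram matrix
    L = np.linalg.cholesky(G)                           # G = L L^T, L lower-triangular
    return 2.0 * float(np.sum(np.log(np.diag(L))))      # log det G = 2 * sum log diag(L)

# The joint similarity then follows as sqrt(max(0, 1 - exp(logdet))).
```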
4. Applications
JGCS has been applied across embedding evaluation, visual domain-matching, and multi-modal contrastive learning:
- Word/semantic similarity: Metric-tensor JGCS (learned $M$) achieves higher correlation with human ratings than standard cosine, especially with contextualized BERT/GPT-2 embeddings and task-adapted context metrics (Vos et al., 2022).
- Cross-domain visual matching: The affine-quadratic JGCS enables domain-robust matching (e.g., matching ID photos to surveillance images) within a single end-to-end learned deep network, blending Mahalanobis and cosine terms (Lin et al., 2016).
- N-way semantic alignment and contrastive learning: The hypervolume-based JGCS admits direct joint alignment of an arbitrary number of modalities, reducing the need for all pairwise losses and improving convergence, resistance to semantic collapse, and noise robustness on multi-modal datasets (e.g., clinical/dermatoscopic/image metadata); a simplified loss sketch follows this list (Chen et al., 6 May 2025).
- Kernel and manifold learning: The Gram-hypervolume angle has been suggested for multi-vector kernels, regularization, and sensor fusion (Chen et al., 6 May 2025).
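A simplified PyTorch sketch of such a joint objective is shown below: it drives the normalized Gram determinant of matched $k$-tuples toward zero (i.e., joint similarity toward 1). This is an illustration of the idea only, not the exact GHA Loss of Chen et al. (6 May 2025), and it omits the treatment of mismatched (negative) tuples.

```python
import torch
import torch.nn.functional as F

def joint_alignment_loss(embeddings, eps=1e-6):
    """embeddings: list of k tensors, each of shape (batch, dim), one per modality."""
    V = torch.stack(embeddings, dim=1)                     # (batch, k, dim)
    V = F.normalize(V, dim=-1)                             # unit-normalize each embedding
    G = V @ V.transpose(1, 2)                              # (batch, k, k) normalized Gram matrices
    G = G + eps * torch.eye(G.shape[-1], device=G.device)  # diagonal shift for stability
    logdet = torch.logdet(G)                               # log squared hypervolume per tuple
    return torch.exp(logdet).mean()                        # small det <=> matched tuple nearly collinear
```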
5. Empirical Performance and Practical Impact
Empirical evaluations document consistent performance improvements from JGCS-based methods:
- Contextual word similarity: Relative correlation gains (Spearman/Pearson) of +150–652% over plain cosine similarity on SimLex-999 and task-derived datasets with BERT-class embeddings (Vos et al., 2022).
- Cross-modal vision tasks: Superior retrieval/identification accuracy over existing state-of-the-art on re-identification and multimodal face verification tasks (Lin et al., 2016).
- Multi-modal medical classification: On the Derm7pt tri-modal skin lesion dataset, GHA Loss (based on hypervolume JGCS) outperforms dual/pairwise-objective baselines by +2–5% accuracy and up to +5.4 points macro-F1, with enhanced robustness to injected Gaussian noise (mean error grows approximately linearly with the noise standard deviation over the tested range) (Chen et al., 6 May 2025).
- Computational efficiency: For practical $k$, GHA Loss remains faster than explicit pairwise InfoNCE, with $O(k^2 d + k^3)$ scaling and minimal additional cost for the $k \times k$ determinants (Chen et al., 6 May 2025).
6. Limitations and Possible Extensions
Limitations observed include:
- Metric tensor (contextual) JGCS: Linear transformations may be insufficient for highly nonlinear context adaptation; parameter counts can grow rapidly with dimensionality (Vos et al., 2022).
- Bregman/convex JGCS: Subgradient choice is nontrivial for nonsmooth $\varphi$ and may have ambiguous optimality (Gunay et al., 2014).
- Hypervolume (N-modal) JGCS: The Gram determinant is numerically sensitive for nearly collinear vectors but can be regularized by a small diagonal shift $\hat G + \epsilon I$ (Chen et al., 6 May 2025).
Suggested extensions:
- Nonlinear context metrics for capturing more complex data dependencies (Vos et al., 2022).
- Low-rank or sparse $M$ for computational/regularization benefits.
- Kernelization for multi-modal learning, e.g., replacing the inner products in the Gram matrix with kernel evaluations $k(v_i, v_j)$; see the sketch after this list (Chen et al., 6 May 2025).
- Eigendecomposition for interpretability of learned metrics (Vos et al., 2022).
- Application to multi-target tracking, sensor fusion, and attention gating (Chen et al., 6 May 2025).
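As an illustration of the kernelization idea mentioned above, the sketch below replaces the inner products in the normalized Gram matrix with kernel evaluations; the RBF kernel and function names are illustrative assumptions, not constructs from the cited work.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    return float(np.exp(-gamma * np.sum((a - b) ** 2)))

def kernelized_jgcs(vectors, kernel=rbf_kernel):
    """Hypervolume-style joint similarity with kernel evaluations in place of inner products."""
    K = np.array([[kernel(vi, vj) for vj in vectors] for vi in vectors])
    d = np.sqrt(np.diag(K))
    K_hat = K / np.outer(d, d)                       # normalized kernel Gram matrix
    return float(np.sqrt(max(0.0, 1.0 - np.linalg.det(K_hat))))
```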
7. Relationship to Classical Cosine and Other Similarities
JGCS constitutes a strict generalization of cosine similarity. For the Euclidean cost $\varphi(x) = \tfrac{1}{2}\|x\|^2$, identity metric $M = I$, or $k = 2$ in the hypervolume angle, all JGCS variants reduce exactly to the ordinary cosine. For more structured or context-sensitive data, the measure embodies rich task- and data-dependent inductive biases, strictly increasing discriminative and alignment capacity over traditional cosine, Euclidean, or Mahalanobis comparisons (Gunay et al., 2014, Vos et al., 2022, Lin et al., 2016, Chen et al., 6 May 2025).
A plausible implication is that as the complexity and heterogeneity of data modalities increase, JGCS's joint, geometry-aware, and context-adaptive formalism will continue to supplant pairwise similarity techniques in both supervised and unsupervised settings.