Alignment-Only Contrastive Learning
- Alignment-only contrastive learning is a paradigm that exclusively minimizes the distance between aligned positive pairs across different modalities, omitting repulsive terms.
- Recent work in this setting introduces formulations such as the Joint Generalized Cosine Similarity (JGCS) and the associated GHA loss to jointly optimize multi-modal embeddings efficiently.
- Empirical results demonstrate improved metrics in multi-modal tasks, though challenges like feature collapse and limited inter-class separation remain critical considerations.
Alignment-only contrastive learning is a paradigm in which the optimization objective is restricted to bringing together positive (aligned or semantically related) samples, typically across modalities or augmented views, without explicit inclusion of separation (repulsive) terms or auxiliary tasks. This approach centers the learning process exclusively on minimizing alignment distances or maximizing similarity between matching pairs or sets, discarding generative, reconstruction, or class-discriminative supervision. The recent literature elaborates formulations, theoretical underpinnings, extensions, empirical advantages, and notable limitations of this pure alignment regime in a variety of domains and modalities (Chen et al., 6 May 2025, Wang et al., 2023, Liu et al., 2022).
1. Mathematical Foundations and Novel Formulations
Traditional contrastive objectives, such as InfoNCE, average pairwise similarities (typically dot products or cosine similarities) between representations from two views or modalities, balancing an alignment term (attract positives) and a uniformity or repulsion term (separate negatives). Alignment-only contrastive learning operates by restricting to the alignment component, omitting explicit negative-pair repulsion (Chen et al., 6 May 2025, Wang et al., 2023).
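As a point of reference, the sketch below contrasts a standard InfoNCE objective with its alignment-only counterpart for a batch of paired embeddings from two views or modalities; the function names and implementation details are illustrative and not taken from the cited papers.

```python
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.1):
    """Standard InfoNCE: attract matched pairs, repel every mismatched pair."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature                # (N, N) cosine similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, targets)             # alignment + repulsion

def alignment_only(z_a, z_b):
    """Alignment-only objective: maximize cosine similarity of matched pairs;
    no negative-pair (uniformity/repulsion) term at all."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    return 1.0 - (z_a * z_b).sum(dim=-1).mean()         # = mean squared distance / 2
```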
A pivotal innovation is the Joint Generalized Cosine Similarity (JGCS), defined for an arbitrary collection of modality embeddings via the Gram matrix of the matrix that stacks the normalized embeddings row-wise; the Gram determinant quantifies the hypervolume spanned by the set, so aligned sets (small hypervolume) receive high similarity. JGCS reduces to cosine similarity in the two-modality case, is invariant to rotations and reflections, and is symmetric across modalities. The JGCS-based loss, termed GHA Loss, aligns all modalities jointly in a single forward computation, providing gradient signals based on their collective geometry and eliminating the combinatorial cost of all pairwise interactions (Chen et al., 6 May 2025).
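A minimal sketch of such a Gram-determinant-based joint similarity is given below. It exhibits the stated properties (modal symmetry, invariance to rotations and reflections, reduction to the magnitude of cosine similarity in the two-modality case), but it is an illustrative construction whose exact normalization may differ from JGCS as defined in the cited paper.

```python
import torch
import torch.nn.functional as F

def joint_similarity(embeddings):
    """Gram-determinant-based joint similarity for a set of modality embeddings.

    `embeddings`: list of m tensors of shape (d,), one per modality.
    The Gram determinant of the row-wise stacked, normalized embeddings is the
    squared hypervolume they span: 0 when all vectors are collinear, 1 when
    they are mutually orthogonal.  Mapping det -> sqrt(1 - det) therefore gives
    a similarity that is symmetric in the modalities, invariant to joint
    rotations/reflections, and equal to |cos(theta)| for m = 2.  Illustrative
    construction only; JGCS's exact normalization may differ.
    """
    V = torch.stack([F.normalize(v, dim=-1) for v in embeddings])   # (m, d)
    gram = V @ V.t()                                                # (m, m)
    det = torch.linalg.det(gram).clamp(min=0.0, max=1.0)
    return torch.sqrt(1.0 - det)
```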
The alignment-only loss in this context takes the schematic form

$$\mathcal{L}_{\mathrm{align}} = \mathbb{E}\big[\,1 - \mathrm{JGCS}\big(z^{(1)}, \dots, z^{(m)}\big)\big],$$

where $z^{(k)}$ denotes the embedding of modality $k$ for a given sample, i.e., the joint similarity of each aligned set of modality embeddings is maximized directly. This term is augmented by an angular-equilibrium regularization $\mathcal{L}_{\mathrm{ang}}$ that enforces uniform convergence across all induced pairwise similarities within the set, yielding the total loss

$$\mathcal{L}_{\mathrm{GHA}} = \mathcal{L}_{\mathrm{align}} + \lambda\,\mathcal{L}_{\mathrm{ang}},$$

where all gradient flow derives from the alignment signal (Chen et al., 6 May 2025).
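The sketch below assembles such a total loss from the joint similarity above together with a simple equilibrium penalty, here taken as the variance of the induced pairwise cosine similarities; both the form of the regularizer and the weight `lam` are illustrative assumptions rather than the cited paper's exact terms.

```python
import torch
import torch.nn.functional as F

def gha_style_loss(embeddings, lam=0.1):
    """Alignment-only term plus an angular-equilibrium-style penalty.

    The alignment term pushes the joint (Gram-determinant-based) similarity of
    the modality set toward 1; the penalty discourages some of the induced
    pairwise similarities from converging much faster than others.  The
    variance-based penalty and the weight `lam` are illustrative assumptions.
    """
    V = torch.stack([F.normalize(v, dim=-1) for v in embeddings])   # (m, d)
    gram = V @ V.t()                                                # pairwise cosines
    det = torch.linalg.det(gram).clamp(min=0.0, max=1.0)
    align = 1.0 - torch.sqrt(1.0 - det)                             # alignment-only term
    m = V.size(0)
    off_diag = gram[~torch.eye(m, dtype=torch.bool, device=gram.device)]
    equilibrium = off_diag.var()                                    # uniform-convergence proxy
    return align + lam * equilibrium
```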
2. Theoretical Properties and Dynamics
Alignment-only losses can be cast as graph-Laplacian regularizers. Let $L$ denote the normalized Laplacian of a positive-pair adjacency graph (e.g., induced by augmentations or cross-modal pairings) and $Z$ the embedding matrix. The alignment loss

$$\mathcal{L}_{\mathrm{align}} = \mathrm{tr}\big(Z^{\top} L Z\big),$$

under gradient descent, propagates as a message-passing operator on the graph, homogenizing the features of all connected positive pairs. Absent a uniformity or negative-pair repulsion term, this "synchronizes" all positives onto low-dimensional subspaces, potentially collapsing them to a trivial point or class manifold. Rigorous analysis shows that, for class-connected subgraphs, repeated steps exponentially concentrate intra-class features but fail to create inter-class separation without further constraints (Wang et al., 2023).
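A toy numerical illustration of this dynamic (using the unnormalized Laplacian for simplicity, not the cited analysis itself): gradient descent on $\mathrm{tr}(Z^{\top} L Z)$ repeatedly applies the smoothing operator $I - 2\eta L$, which collapses each connected component of positives while leaving the gap between components untouched.

```python
import numpy as np

# Toy positive-pair graph: two connected components ("classes"), no negative pairs.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [0, 0, 0, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A          # graph Laplacian of the positive-pair graph

rng = np.random.default_rng(0)
Z = rng.normal(size=(6, 2))             # random 2-D embeddings, one row per sample
eta = 0.1

for _ in range(200):
    # Gradient step on tr(Z^T L Z): equivalent to applying the smoothing operator (I - 2*eta*L).
    Z = Z - 2 * eta * (L @ Z)

print(np.linalg.norm(Z[0] - Z[1]))      # ~0: connected positives are synchronized
print(np.linalg.norm(Z[0] - Z[3]))      # gap between components (their initial means): never enlarged
```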
Similarly, recent works reinterpret contrastive learning as an entropic optimal transport (OT) alignment: InfoNCE emerges as a one-step proximal projection onto the row-normalized simplex, minimizing KL-divergence between the empirical matching plan and the identity coupling; multistep generalizations (GCA-INCE) and unbalanced relaxations (GCA-UOT) yield finer alignment controls and robustness, but pure alignment without separation may not capture global structure (Chen et al., 27 Feb 2025).
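This view can be made concrete with a small sketch: a single row normalization of the exponentiated similarity matrix recovers the coupling implicit in InfoNCE, while alternating row and column normalizations (Sinkhorn iterations) approximate the balanced entropic-OT plan. The code below illustrates the idea and is not the GCA-INCE/GCA-UOT implementation.

```python
import torch
import torch.nn.functional as F

def matching_plan(z_a, z_b, temperature=0.1, sinkhorn_iters=0):
    """Alignment plan between two batches of embeddings.

    With sinkhorn_iters=0 this is the single row normalization implicit in
    InfoNCE (each row of the plan is a softmax over candidates); additional
    iterations alternate row/column normalization, approximating a balanced
    entropic-OT coupling with uniform marginals (up to a constant scale).
    Illustrative sketch, not the GCA-INCE/GCA-UOT implementations.
    """
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    K = torch.exp(z_a @ z_b.t() / temperature)     # Gibbs kernel of the cost matrix
    P = K / K.sum(dim=1, keepdim=True)             # one-step projection: row-normalized simplex
    for _ in range(sinkhorn_iters):
        P = P / P.sum(dim=0, keepdim=True)         # match column marginals
        P = P / P.sum(dim=1, keepdim=True)         # match row marginals
    return P

# InfoNCE corresponds to penalizing deviation of the one-step plan from the
# identity coupling, e.g.:
# loss = -torch.log(torch.diagonal(matching_plan(z_a, z_b)) + 1e-12).mean()
```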
3. Empirical Performance, Applications, and Extensions
In scenarios with more than two modalities, JGCS with the GHA loss consistently outperforms pairwise InfoNCE and a no-alignment baseline ("Normal") across a range of metrics (accuracy, AUC, F1) and model architectures. For example, on Derm7pt with ResNet-50+Concat, top-1 accuracy increases from 58.07% (Normal) and 60.94% (Dual) to 62.76% (GHA); macro F1 similarly improves from 0.3346 to 0.3934 (Chen et al., 6 May 2025).
Alignment-only contrastive paradigms have been successfully instantiated in:
- Multi-modal alignment: Dermoscopic images, clinical photos, metadata on medical datasets (Chen et al., 6 May 2025).
- Cross-modal retrieval and alignment: Audio-to-lyrics alignment, achieving sub-0.2 second average absolute error and 92–94% correct word onsets, outperforming CTC and hybrid models, and showing strong cross-lingual generalization (Durand et al., 2023).
- Human preference alignment for LLMs: Margin-based, pairwise contrastive objectives enforce preference ordering with stability and simplicity, surpassing reward-learning RLHF baselines in performance (Fang et al., 25 Mar 2024); a minimal sketch of such an objective follows this list.
- Debiasing and safety: Contrastive fine-tuning on LLMs reduces toxicity while improving factual accuracy, uniquely avoiding the common trade-off of degraded knowledge (Korkmaz et al., 25 May 2025).
- Self-supervised entity alignment: An explicit pseudo-pair pulling loss for entity-entity alignment across unpaired knowledge graphs raises Hits@1 above the previous self-supervised state of the art (Zeng et al., 2022).
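For the LLM preference-alignment case, the following is a minimal sketch of a margin-based pairwise objective over scores of preferred versus rejected responses; the scoring function and the fixed margin are assumptions for illustration and may differ from the objective of Fang et al. (25 Mar 2024).

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(score_preferred, score_rejected, margin=1.0):
    """Margin-based pairwise contrastive preference loss.

    `score_preferred` / `score_rejected` are scalar scores per example (for
    instance, length-normalized sequence log-probabilities under the policy)
    for the human-preferred and rejected responses.  The objective only asks
    the preferred response to outscore the rejected one by a margin; no reward
    model or RL rollout is involved.  Hypothetical sketch for illustration.
    """
    return F.relu(margin - (score_preferred - score_rejected)).mean()

# Usage (hypothetical):
# loss = pairwise_preference_loss(logp_chosen, logp_rejected)
```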
Key computational and practical properties include linear scaling in the number of modalities, robust convergence under moderate noise, and the ability to process weak annotations (e.g., bag-of-symbols for lyrics alignment, or pseudo-labels for segmentation) without fragile path constraints or large memory banks (Chen et al., 6 May 2025, Durand et al., 2023, Tang et al., 2021).
4. Limitations and Known Pathologies
Despite the efficiency and focus of alignment-only training, several fundamental limitations arise:
- Feature collapse: Absent a uniformity term, alignment-only methods can collapse representations, failing to create discriminative or well-separated clusters in embedding space (Wang et al., 2023, Liu et al., 2023). Empirically, in graph domains, perfect alignment degrades inter-class margins, reducing classification accuracy—even as alignment loss itself improves.
- Reduced feature diversity: Strict element-wise or neuron-wise alignment curtails activation diversity and can limit generalization. Over-alignment suppresses multiple "redundant" concept detectors, leading to narrow pathways through the network. This has measurable effects on hyperspherical energy and neuron coverage, addressed by higher-level concept clustering techniques (Liu et al., 2022).
- Absence of global structure: Alignment-only objectives do not directly enforce between-class or between-sample repulsion, which is essential to maintain semantic separation and avoid oversmoothing.
- Asymptotic suboptimality: Tight alignment of positives may conflict with downstream tasks that depend more on inter-class separation than intra-class collapse; optimizing for perfect alignment is "poisonous" for generalization (Liu et al., 2023).
5. Advances, Remedies, and Hybrid Extensions
Recent work addresses these alignment-only pitfalls by several extensions:
- Concept-level alignment: Aggregating neurons into concept clusters (as in CoCo) allows alignment at a higher semantic level, enhancing diversity, neuron utilization, and performance across domain generalization and synthetic-to-real tasks (Liu et al., 2022).
- Approximate optimal transport: Multi-step or unbalanced entropic OT yields more flexible alignment plans, allows for partial matching, robustness to noise, and explicit domain or structure control (Chen et al., 27 Feb 2025).
- Regularization and hybrid supervision: Adding angular equilibrium penalties, or fusing alignment with focal/classification losses, enforces balanced convergence while still enabling class discrimination (Chen et al., 6 May 2025).
- Importance- and spectrum-aware augmentation: In graph contrastive learning, augmentations designed to preserve mutual information while permitting substantive alignment distances avoid the deleterious effects of over-alignment. Adaptively dropping low-importance features or smoothing the Laplacian spectrum facilitates larger inter-class gaps and improved downstream performance (Liu et al., 2023); a sketch of the feature-dropping idea follows this list.
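As an illustration of the feature-dropping idea from the last item above, the sketch below removes feature dimensions with probability inversely tied to an importance score; the importance measure and drop schedule are hypothetical choices, not the exact scheme of Liu et al. (2023).

```python
import torch

def importance_aware_feature_drop(x, importance, max_drop=0.5):
    """Drop feature dimensions with probability inversely tied to importance.

    `x`: (num_nodes, num_feats) node-feature matrix; `importance`: (num_feats,)
    non-negative scores (e.g., magnitudes of feature weights or gradients).
    Low-importance dimensions are dropped more aggressively, so the augmented
    view can sit far from the anchor along unimportant directions while the
    informative ones are preserved.  Hypothetical sketch; the cited method's
    exact importance measure and schedule may differ.
    """
    w = importance / (importance.max() + 1e-12)
    drop_prob = max_drop * (1.0 - w)                       # important -> rarely dropped
    mask = (torch.rand_like(drop_prob) > drop_prob).float()
    return x * mask.unsqueeze(0)                           # same mask for all nodes
```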
6. Broader Implications and Prospects
Alignment-only contrastive learning reveals the centrality of the alignment process in various domains—multi-modal perception, LLM preference correction, self-supervised knowledge graph matching, and cross-modal retrieval. Its modularity enables plug-and-play redesign of loss functions in existing architectures, and the theoretical interpretations via OT and Laplacian regularization provide a principled foundation for future refinements.
Nevertheless, best practices now combine alignment with controlled separation, higher-level abstraction, or learned constraints on permissible matching, leveraging the strengths of both paradigms. Alignment-only objectives serve as a strong analytical baseline and, via innovations like JGCS and OT-based formulations, anchor the development of scalable, robust, and theoretically grounded contrastive frameworks (Chen et al., 6 May 2025, Chen et al., 27 Feb 2025).