Rebalanced Contrastive Loss (RCL)

Updated 19 April 2026

The paper introduces RCL, which rebalances supervised contrastive learning by incorporating explicit class-prior weighting, balanced sampling, and margin-regularized structures.
RCL counters tail-class collapse by giving every class equitable representation in the embedding space, leading to a 3–5 point improvement in macro-F1 and balanced accuracy on long-tailed benchmarks.
Empirical studies demonstrate that RCL, utilizing prototype augmentation and hard pair mining, achieves robust and efficient performance in both vision and NLP tasks.

Rebalanced Contrastive Loss (RCL) is a family of loss formulations that address the limitations of conventional supervised contrastive learning under severe class imbalance, particularly for long-tailed classification. RCL generalizes standard contrastive loss by incorporating explicit reweighting, balanced sampling, and margin-regularized structures to ensure that all classes—including tail classes—achieve equitable representation and discriminative power in the embedding space. The approach has produced substantial advances for long-tailed vision and text classification by correcting the tendency of vanilla contrastive learning to collapse tail-class clusters and bias decision boundaries in favor of head classes (Li et al., 2024, Alvis et al., 2023, Zhong et al., 2022, Sors et al., 2021).

1. Motivation: Contrastive Learning Under Imbalance

Standard supervised contrastive losses weight every observed pair in a batch equally, which amplifies class imbalance quadratically: because head classes contribute far more samples, they also contribute vastly more positive and negative pairs than tail classes. For a dataset with class frequencies $\pi_k$, the contrastive pair imbalance reaches $\gamma \approx \left(\frac{\pi_{\max}}{\pi_{\min}}\right)^2$ (Zhong et al., 2022). This imbalance skews the embedding geometry, producing diffuse or collapsed tail-class clusters, unstable gradients, and poor tail-class accuracy (Alvis et al., 2023, Zhong et al., 2022).
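The quadratic amplification can be checked with a small count. The sketch below uses hypothetical class sizes (1000, 100, and 10 samples; they are illustrative and not taken from any cited benchmark) and counts unordered same-class pairs per class.

```python
from math import comb

def positive_pair_count(n: int) -> int:
    """Number of unordered same-class pairs among n samples."""
    return comb(n, 2)

# Hypothetical class counts for a long-tailed dataset (illustrative only).
class_counts = {"head": 1000, "medium": 100, "tail": 10}

pairs = {name: positive_pair_count(n) for name, n in class_counts.items()}
sample_ratio = class_counts["head"] / class_counts["tail"]
pair_ratio = pairs["head"] / pairs["tail"]

print(pairs)
print(f"sample imbalance: {sample_ratio:.0f}x, pair imbalance: {pair_ratio:.0f}x")
```

With these counts, a 100x imbalance in samples becomes an imbalance of roughly $10^4$ in positive pairs, in line with the squared ratio above.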

Classical remedies, such as weighted cross-entropy or balancing in the classifier’s output space, are insufficient within the contrastive regime, because they do not address the underlying pairwise sample distribution or the vanilla loss’s implicit bias in the embedding space (Alvis et al., 2023).

2. Core Principles and Mathematical Formulation

RCL modifies supervised contrastive loss by three main mechanisms: explicit class-prior weighting in the softmax, adaptive (often prototype-based) anchor/target sampling, and—in some formulations—embedding regularization or margin-based adjustments.

Given a batch $\mathcal{B}$, let $z_i$ be the embedding for sample $i$, $y_i$ its label, and $\mathcal{B}_c$ the indices in $\mathcal{B}$ belonging to class $c$. For an anchor $i$, let $\mathcal{P}_i$ denote its positives (the other samples in $\mathcal{B}$ with label $y_i$), $\mathcal{A}_i$ all samples in $\mathcal{B}$ other than $i$, and $\tau$ the temperature. RCL augments the standard per-anchor supervised contrastive loss $\mathcal{L}_\text{SCL} = -\frac{1}{|\mathcal{P}_i|} \sum_{p \in \mathcal{P}_i} \log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{a \in \mathcal{A}_i} \exp(z_i \cdot z_a / \tau)}$ by introducing class-frequency-modulated terms and/or modifying the sampling of positive and negative pairs. For example, the RCL in (Alvis et al., 2023) combines class-frequency weighting with feature compression: the weighting uses the global count of each class $c$, the loss includes a prototype for each class $c$, and the embedding is scaled by a class-dependent scalar for compactness.
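For reference, the baseline $\mathcal{L}_\text{SCL}$ can be sketched in NumPy as follows. This is a minimal sketch of the unweighted loss only, not of any RCL variant; the function name, the L2 normalisation of embeddings, the default temperature, and the choice to skip anchors without positives are assumptions made for illustration.

```python
import numpy as np

def supervised_contrastive_loss(z, y, tau=0.1):
    """Mean per-anchor supervised contrastive loss (baseline, no rebalancing).

    z: (N, D) array of embeddings, y: (N,) integer labels, N >= 2.
    Anchors with no positive in the batch are skipped (an assumption here).
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # L2-normalise (assumed)
    sim = z @ z.T / tau                                # pairwise z_i . z_a / tau
    n = len(y)
    not_self = ~np.eye(n, dtype=bool)                  # A_i: every index except i
    positives = (y[:, None] == y[None, :]) & not_self  # P_i: same label, not i

    # Log of the denominator over A_i, computed with the max-shift trick.
    sim_masked = np.where(not_self, sim, -np.inf)
    row_max = sim_masked.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(np.exp(sim_masked - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - log_denom

    n_pos = positives.sum(axis=1)
    has_pos = n_pos > 0
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    per_anchor = -pos_log_prob[has_pos] / n_pos[has_pos]
    return per_anchor.mean()

# Two tight clusters; labels either agree with the clusters or cut across them.
z = np.array([[1.0, 0.0], [1.0, 0.01], [-1.0, 0.0], [-1.0, 0.01]])
loss_clustered = supervised_contrastive_loss(z, np.array([0, 0, 1, 1]))
loss_mixed = supervised_contrastive_loss(z, np.array([0, 1, 0, 1]))
print(loss_clustered, loss_mixed)
```

The loss is small when same-class embeddings already sit together and large when the labels cut across the clusters. Every anchor and pair counts equally here, which is the property the rebalanced variants change.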

Beyond weighting, other RCL variants incorporate balanced or queue-based sampling (Zhong et al., 2022), logit-adjustments (Li et al., 2024), and two-term decompositions with separate positive/negative loss weights (Sors et al., 2021).
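To illustrate what a two-term decomposition can look like, the logarithm of the ratio in $\mathcal{L}_\text{SCL}$ splits exactly into an alignment term over positives and a log-sum-exp term over all other samples. The weights $\alpha$ and $\beta$ and the names $\mathcal{L}_\text{pos}$ and $\mathcal{L}_\text{neg}$ below are hypothetical notation; whether this matches the exact parameterization in (Sors et al., 2021) is an assumption.

```latex
\mathcal{L}_\text{SCL}
  = \underbrace{-\frac{1}{\tau\,|\mathcal{P}_i|} \sum_{p \in \mathcal{P}_i} z_i \cdot z_p}_{\mathcal{L}_\text{pos}}
  + \underbrace{\log \sum_{a \in \mathcal{A}_i} \exp(z_i \cdot z_a / \tau)}_{\mathcal{L}_\text{neg}},
\qquad
\mathcal{L}_{\alpha,\beta} = \alpha\,\mathcal{L}_\text{pos} + \beta\,\mathcal{L}_\text{neg}
```

Setting $\alpha = \beta = 1$ recovers the original loss; other settings trade off pulling positives together against spreading the anchor away from the rest of the batch.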

3. Architectural and Sampling Strategies

RCL is typically implemented in multi-branch architectures, coupling a classification branch (with cross-entropy or its logit-adjusted variant) to a contrastive branch sharing class-level information via learnable prototypes (Li et al., 2024, Alvis et al., 2023). Key strategies include:

Prototype Augmentation: Each class is assigned a learnable prototype vector, ensuring that even if no batch samples are present for a tail class, it is still represented in contrastive computations (Li et al., 2024, Alvis et al., 2023).
Balanced Anchor and Target Sampling: Each class contributes an equal number of positive and negative samples, either by uniform sampling, prototype injection, or using class-balanced queues (Li et al., 2024, Zhong et al., 2022).
Hard Pair Mining and Synthetic Augmentation: Hard positive/negative pairs with low similarity (for same class) or high similarity (for different classes) are identified to combat gradient vanishing, and in some cases are further augmented via Mixup interpolation (Li et al., 2024, Zhong et al., 2022).
Class-Prior Weighted Softmax: The logits in the contrastive softmax are adjusted with class frequency terms, directly correcting for data imbalance at the pairwise similarity level (Alvis et al., 2023).
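Of the strategies above, balanced sampling is the simplest to sketch. The example below draws the same number of indices from every class and oversamples classes that are too small; the function name, the with-replacement fallback, and the label distribution are assumptions for illustration, not the sampling scheme of any cited method.

```python
import numpy as np

def class_balanced_batch(labels, per_class, rng):
    """Return indices containing `per_class` samples from every class.

    Classes with fewer than `per_class` samples are drawn with replacement,
    so tail classes are oversampled rather than dropped.
    """
    labels = np.asarray(labels)
    batch = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        replace = len(idx) < per_class
        batch.append(rng.choice(idx, size=per_class, replace=replace))
    return np.concatenate(batch)

rng = np.random.default_rng(0)
labels = np.array([0] * 90 + [1] * 8 + [2] * 2)   # hypothetical long-tailed labels
batch = class_balanced_batch(labels, per_class=4, rng=rng)
counts = np.bincount(labels[batch], minlength=3)
print(counts)  # each class contributes 4 samples
```

Because every class contributes equally to the batch, the number of positive pairs per class is equalised as well. Prototypes or class-balanced queues address the same goal when a batch cannot hold samples from every class.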

4. Optimization, Hyperparameters, and Implementation

Effective application of RCL depends on suitable tuning of weighting, margin, and sampling hyperparameters. Key settings include:

Weighting Coefficients: Separate coefficients or implicit logit-adjustments for each class or sub-loss component (positive, negative/entropy) (Alvis et al., 2023, Sors et al., 2021).
Temperature $\tau$: Governs the scale of pairwise similarities in the softmax. Tuning is generally orthogonal to rebalancing (Li et al., 2024, Alvis et al., 2023).
Batch Size and Sampling Scheme: Batch size interacts with the inherent imbalance in pairwise combinations; explicit balancing or HPO over batch size is recommended (Sors et al., 2021).
Prototype and Hard-Mixup Pool Sizes: The number of hard positives/negatives and the Mixup interpolation strength impact gradient diversity and convergence stability (Li et al., 2024).
Joint Loss Formulation: The overall objective combines classification (logit-adjusted cross-entropy) and RCL, typically with equal or tuned weights (Li et al., 2024, Alvis et al., 2023).
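The joint objective can be sketched as a weighted sum of a logit-adjusted cross-entropy and a contrastive term. The version below uses the common form of logit adjustment, which adds the log class prior to the logits during training; whether the cited papers use exactly this form is an assumption, and the names `strength` and `weight` are hypothetical.

```python
import numpy as np

def logit_adjusted_cross_entropy(logits, y, class_counts, strength=1.0):
    """Cross-entropy on logits shifted by strength * log(class prior)."""
    class_counts = np.asarray(class_counts, dtype=float)
    prior = class_counts / class_counts.sum()
    adjusted = logits + strength * np.log(prior)
    adjusted = adjusted - adjusted.max(axis=1, keepdims=True)   # numerical stability
    log_prob = adjusted - np.log(np.exp(adjusted).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(y)), y].mean()

def joint_loss(ce_loss, contrastive_loss, weight=1.0):
    """Classification loss plus a weighted contrastive loss."""
    return ce_loss + weight * contrastive_loss

# With uninformative (all-zero) logits, the adjusted softmax equals the prior,
# so a tail-class sample incurs a larger loss than a head-class sample.
counts = np.array([90, 10])                       # hypothetical class counts
logits = np.zeros((1, 2))
head_loss = logit_adjusted_cross_entropy(logits, np.array([0]), counts)
tail_loss = logit_adjusted_cross_entropy(logits, np.array([1]), counts)
print(head_loss, tail_loss)
```

The adjustment forces the model to produce a larger raw logit for rare classes before their loss drops, which is the classifier-side counterpart of rebalancing the contrastive branch.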

A typical implementation uses a BERT-based encoder for text or a ResNet/ViT backbone for vision, with multi-layer MLP projection heads and a separate learned prototype per class (Li et al., 2024, Alvis et al., 2023, Zhong et al., 2022).

5. Empirical Findings and Comparative Analysis

Experiments on both vision and NLP datasets with extreme imbalance ratios (e.g., R52, CIFAR100-IF=100) consistently show that RCL-equipped methods outperform standard contrastive and cross-entropy baselines by 3–5 points in macro-F1 and top-1 balanced accuracy, particularly benefiting tail and medium-frequency classes (Li et al., 2024, Alvis et al., 2023, Zhong et al., 2022).

Ablation studies confirm that omitting balanced sampling, hard pair augmentation, or class-prior weighting leads to significant drops (up to 3 points macro-F1 or 2% top-1) (Li et al., 2024, Alvis et al., 2023). Computational cost remains low, with RCL models processing datasets such as Ohsumed (roughly 3K samples) in 21 seconds per epoch on a single high-end GPU, in contrast to LLM baselines with three orders of magnitude greater resource demand (Li et al., 2024).

6. Connections and Variants in the Literature

Multiple independent lines of research have arrived at RCL or closely related objectives, including:

SharpReCL (Li et al., 2024): Focuses on rebalancing supervised contrastive loss for NLP by sharing learnable prototypes and employing “Simple Sampling” and “Hard-Mixup” pools to ensure balanced and challenging pair constructions.
ResCom (Zhong et al., 2022): Highlights the dual imbalance at batch and memory (queue) levels, introducing class-balanced queues and effective pair mining to recover stable gradients.
Softmax Weighted RCL (Alvis et al., 2023): Adopts global class-prior weighting in the contrastive softmax, explicit embedding compression for underperforming classes, and an implicit margin to widen separation for rare classes.
Two-term RCL with HPO (Sors et al., 2021): Decomposes the loss into positive and negative contributions with separate tunable weights, optimized efficiently using coordinate descent.

These frameworks often share a two-branch network structure: a rebalanced contrastive learning head tightly integrated with a class-prior-aware classification head, facilitating bidirectional information flow via shared prototypes and representation spaces (Li et al., 2024, Alvis et al., 2023).

7. Implications, Best Practices, and Future Directions

The adoption of RCL has established new state-of-the-art results on multiple long-tailed benchmarks, for both vision and NLP tasks, while being orders of magnitude more efficient than large-scale LLM fine-tuning for comparable accuracy (Li et al., 2024). RCL designs are reported to be robust to the choice of hyperparameters: the RCL loss weight and the temperature both work reliably across a range of values.

Recommended practices include:

Exposing loss weighting and sampling hyperparameters for direct tuning, not burying them in aggregation schemes (Sors et al., 2021).
Ensuring each class, especially rare classes, appears as both anchor and target in contrastive computations, preferably via prototypes or memory banks (Li et al., 2024, Alvis et al., 2023, Zhong et al., 2022).
Employing hard pair mining and synthetic augmentation to prevent gradient collapse on trivial (“easy”) pairs (Li et al., 2024, Zhong et al., 2022).
Jointly tuning classification and contrastive components—especially class-prior adjustment factors—to optimize balanced test metrics.
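The hard pair mining and synthetic augmentation practice can be sketched for a single anchor as follows. The selection rule (lowest cosine similarity among positives, highest among negatives) follows the description of hard pairs given earlier; the function names, the value of `k`, and the choice to interpolate two hard positives are assumptions, not the procedure of any cited method.

```python
import numpy as np

def hardest_pairs(z, y, anchor, k=2):
    """For one anchor, return indices of the k hardest positives and negatives.

    Hard positives: same class, lowest cosine similarity to the anchor.
    Hard negatives: different class, highest cosine similarity to the anchor.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z[anchor]
    idx = np.arange(len(y))
    pos = idx[(y == y[anchor]) & (idx != anchor)]
    neg = idx[y != y[anchor]]
    hard_pos = pos[np.argsort(sim[pos])[:k]]
    hard_neg = neg[np.argsort(-sim[neg])[:k]]
    return hard_pos, hard_neg

def mixup(a, b, lam):
    """Convex combination of two vectors with interpolation strength lam."""
    return lam * a + (1.0 - lam) * b

# Hypothetical embeddings: index 0 is the anchor (class 0).
z = np.array([[1.0, 0.0], [1.0, 0.1], [0.2, 1.0], [0.9, 0.5], [-1.0, 0.0]])
y = np.array([0, 0, 0, 1, 1])
hard_pos, hard_neg = hardest_pairs(z, y, anchor=0, k=2)
synthetic_pos = mixup(z[hard_pos[0]], z[hard_pos[1]], lam=0.5)
print(hard_pos, hard_neg, synthetic_pos)
```

Here the positive at index 2 is the hardest because it lies far from the anchor, and the negative at index 3 is the hardest because it lies close to it; easy pairs such as index 4 contribute little gradient and are ranked last.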

Future work may extend RCL to unsupervised, semi-supervised, or federated scenarios, increase the sophistication of synthetic augmentation (beyond Mixup), and further explore the theoretical underpinnings of margin-based regularization for rare class generalization (Alvis et al., 2023, Li et al., 2024).