Contrastive Learning with MLM
- Contrastive Learning with MLM is a hybrid self-supervised approach that fuses global discriminative objectives with token-level reconstruction to enhance representation quality.
- It leverages a combined loss—using contrastive objectives like InfoNCE and MLM’s negative log-likelihood—to balance holistic alignment and detailed context recovery.
- Empirical results demonstrate improved performance in retrieval, classification, and robustness across domains such as language, speech, code, and multimodal tasks.
Contrastive learning with masked language modeling (MLM) refers to a family of self-supervised or semi-supervised frameworks that combine two core representation learning objectives: 1) a contrastive objective that pulls together representations of related (“positive”) inputs and pushes apart unrelated (“negative”) inputs, and 2) token-level MLM, in which the model reconstructs randomly masked portions of the input sequence from unmasked context. Recent developments reveal that this hybrid approach consistently improves the quality, informativeness, and robustness of learned representations across domains including language, speech, code, protein sequences, and multimodal tasks. Architectures and training procedures vary, reflecting domain-specific requirements and nuances, but certain core design patterns and empirical findings are broadly shared.
1. Motivation and Theoretical Basis
The primary motivation for combining contrastive learning and MLM is to exploit complementary inductive biases at the sequence and token level. Contrastive learning (CL) enforces global or “holistic” discrimination between examples by aligning representations of positive pairs and separating negatives. This structure injects desirable geometric properties—such as alignment and uniformity—into the embedding space. However, it may overlook or underutilize fine-grained input details, leading to superficial similarity collapse or insensitivity to information not included in the contrastive construction (Wu et al., 2022, Chuang et al., 2022).
MLM, by contrast, operates at the token level, encouraging the contextual composition and information aggregation required for masked-token reconstruction. Pure MLM, however, may yield representations that are not globally discriminative and is prone to overfitting, codebook collapse (in discrete quantization), or domain drift when used alone (Wu et al., 2022, Chung et al., 2021, Pavlova et al., 19 Oct 2025).
Joint optimization of both objectives is motivated by the need for representations that are at once information-dense, generative (enabling reconstruction), and highly discriminative for retrieval, classification, and transfer learning. A further rationale is that simultaneous sequence- and token-level supervision allows for stable end-to-end training and circumvents the optimization issues encountered with either objective in isolation (Wu et al., 2022, Chung et al., 2021).
2. Core Methodological Paradigms
2.1 Loss Formulations and Joint Optimization
Most approaches employ an additive or weighted combination of contrastive and MLM losses, commonly expressed as $\mathcal{L} = \mathcal{L}_{\mathrm{CL}} + \lambda \, \mathcal{L}_{\mathrm{MLM}}$, where $\mathcal{L}_{\mathrm{CL}}$ is typically a batch- or queue-based InfoNCE or margin-ranking loss, and $\mathcal{L}_{\mathrm{MLM}}$ is the negative log-likelihood of masked-token reconstruction over selected positions. The balance factor $\lambda$ is tuned empirically or by ablation (Wu et al., 2022, Chung et al., 2021, Pavlova et al., 19 Oct 2025).
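The additive formulation can be made concrete with a short sketch. The snippet below is a minimal PyTorch-style illustration of $\mathcal{L}_{\mathrm{CL}} + \lambda \, \mathcal{L}_{\mathrm{MLM}}$ with in-batch InfoNCE negatives; the function name, the hyperparameter values, and the convention of marking non-masked positions with -100 are illustrative assumptions rather than details from any single cited paper.

```python
import torch
import torch.nn.functional as F

def combined_loss(z1, z2, mlm_logits, mlm_labels, lam=0.1, tau=0.05):
    """Additive contrastive + MLM objective: L = L_CL + lam * L_MLM.

    z1, z2     : (B, d) sentence embeddings of two views of the same batch
    mlm_logits : (B, T, V) token predictions over the vocabulary
    mlm_labels : (B, T) original ids at masked positions, -100 elsewhere
    """
    # In-batch InfoNCE: the positive sits on the diagonal of the similarity
    # matrix; every other row in the batch serves as a negative.
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    sim = z1 @ z2.t() / tau                          # (B, B) similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    l_cl = F.cross_entropy(sim, targets)

    # MLM negative log-likelihood, computed only over masked positions.
    l_mlm = F.cross_entropy(
        mlm_logits.reshape(-1, mlm_logits.size(-1)),
        mlm_labels.reshape(-1),
        ignore_index=-100,
    )
    return l_cl + lam * l_mlm
```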
Architecturally, some frameworks use a single backbone with two heads (e.g., MLM decoder and contrastive embedding projector), while others employ auxiliary networks to mediate gradient flow and prevent MLM-induced perturbation of contrastive representations (Wu et al., 2022, Zhang et al., 2023). Careful gradient routing—such as freezing shared layers or restricting MLM computation to shallow auxiliary modules—is crucial to prevent adverse interference (Wu et al., 2022, Zhang et al., 2023).
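The gradient-routing pattern can be illustrated with a stop-gradient into the shared encoder: MLM updates flow only through a shallow auxiliary head, while the global sentence signal is still injected into the token-level prediction. This is a simplified sketch of the general idea, not the exact routing used by InfoCSE or CMLM-CSE; the fusion-by-addition step and the `mlm_head` module are assumptions.

```python
import torch

def routed_mlm_logits(token_states, cls_embedding, mlm_head):
    """Compute MLM logits without letting MLM gradients reach the encoder.

    token_states  : (B, T, d) hidden states from the shared encoder
    cls_embedding : (B, d) sentence embedding used by the contrastive head
    mlm_head      : shallow auxiliary module mapping (B, T, d) -> (B, T, V)
    """
    # Stop-gradient: the MLM loss then updates only mlm_head, so token-level
    # reconstruction cannot perturb the contrastive representation.
    token_states = token_states.detach()
    # Inject the global sentence signal into every token position.
    fused = token_states + cls_embedding.unsqueeze(1)
    return mlm_head(fused)
```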
2.2 Construction of Positive/Negative Pairs and "Views"
Different domains admit distinct augmentation strategies for view generation:
- In language, views may be generated by applying different dropout masks (Wu et al., 2022), paraphrasing via prompts/demonstrations (Jian et al., 2022), or generating MLM-edited sentences (Chuang et al., 2022).
- For code, explicit program transformations (renaming, reordering, dead code insertion) and natural language paraphrasing serve as augmentations (Liu et al., 2023).
- In cross-modal regimes (e.g., speech, audio-text), views may correspond to paired latent or quantized representations produced by different modules within the network (Chung et al., 2021, Borodin et al., 31 Mar 2026).
Negatives are sampled from other in-batch or memory-queue examples as in MoCo (Liu et al., 2023), or all non-paired examples in supervised settings (Jian et al., 2022). The supervision or “label” varies: pseudo-labels (unsupervised), task-defined classes, or hard negatives (matched by semantic similarity or structured mining).
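For queue-based negatives, a MoCo-style buffer of previously encoded keys augments the in-batch negatives. The sketch below is a hypothetical minimal version (fixed queue size, momentum-encoder update omitted) and assumes all tensors live on the same device.

```python
import torch
import torch.nn.functional as F

class NegativeQueue:
    """FIFO buffer of encoded keys reused as extra negatives (MoCo-style)."""

    def __init__(self, dim, size=4096):
        self.keys = F.normalize(torch.randn(size, dim), dim=-1)
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, new_keys):                 # new_keys: (B, dim), normalized
        b = new_keys.size(0)
        idx = (self.ptr + torch.arange(b, device=new_keys.device)) % self.keys.size(0)
        self.keys[idx] = new_keys
        self.ptr = int((self.ptr + b) % self.keys.size(0))

def queue_infonce(q, k_pos, queue, tau=0.07):
    """InfoNCE with one positive key per query and all queued keys as negatives."""
    q, k_pos = F.normalize(q, dim=-1), F.normalize(k_pos, dim=-1)
    l_pos = (q * k_pos).sum(-1, keepdim=True)    # (B, 1) positive logits
    l_neg = q @ queue.keys.t()                   # (B, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)
```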
2.3 Domain-Specific Adaptations
Hybrid contrastive + MLM objectives are tailored for different data types:
- Text/sentence embeddings: InfoCSE (Wu et al., 2022), DiffCSE (Chuang et al., 2022), CMLM-CSE (Zhang et al., 2023), Auto-MLM (Xu et al., 2022), and MOSAIC (Pavlova et al., 19 Oct 2025) all demonstrate variants of this paradigm, integrating modifications to loss structure, view construction, or auxiliary networks.
- Speech: w2v-BERT (Chung et al., 2021) discretizes latent speech features via a quantizer for contrastive learning, and predicts masked discrete targets with a context-aware MLM head.
- Code: ContraBERT (Liu et al., 2023) combines code- and text-based augmentations, supporting both MLM on corrupted code snippets and contrastive learning between original and augmented pairs.
- Protein/biomedical sequences: SCEPTR (Nagano et al., 2024) employs joint autocontrastive and masked-learning on structured TCR inputs.
- Multimodal TTS: Methods such as (Borodin et al., 31 Mar 2026) first pretrain text and phoneme encoders with MLM, then apply cross-modal contrastive objectives with intrinsic retrieval and generative downstream assessment.
3. Representative Algorithms and Implementation Strategies
The following table summarizes architectural and training characteristics of selected frameworks:
| Method | Domains | View Gen. | Loss Structure |
|---|---|---|---|
| InfoCSE (Wu et al., 2022) | Text | Dropout masks | Contrastive + aux MLM (aux head) |
| DiffCSE (Chuang et al., 2022) | Text | Dropout & MLM-edits | Contrastive + RTD loss |
| ContraBERT (Liu et al., 2023) | Code/Text | Code+text augment | MoCo-based contrast + MLM |
| KECP (Wang et al., 2022) | QA | Prompted masking | Span-level contrast + MLM |
| w2v-BERT (Chung et al., 2021) | Speech | Audio masking/quant | Audio contrastive + discrete MLM |
| SCEPTR (Nagano et al., 2024) | Protein | Mask/chain drop + dropout | Autocontrastive + MLM |
Auxiliary modules are employed to restrict MLM gradients, preserve discriminative power, or inject sequence-level features into token-level predictions. For instance, InfoCSE freezes lower encoder layers for MLM computation, combining outputs with [CLS] via aggregation before passing through specialized reconstruction heads (Wu et al., 2022). CMLM-CSE concatenates sentence embeddings with local token features prior to MLM prediction by a lightweight fusion Transformer (Zhang et al., 2023). Auto-MLM sums sentence vectors with token positions in masked sentences, reflecting tight integration of global and local signals (Xu et al., 2022).
Optimization and regularization choices, including temperature settings for InfoNCE, the objective weighting $\lambda$, masking rates, and auxiliary network depth, are consistently established via ablation, with best values often diverging between contrastive-only, MLM-only, and hybrid models (Wu et al., 2022, Zhang et al., 2023, Pavlova et al., 19 Oct 2025).
4. Empirical Findings and Domain Impact
Simultaneous optimization of contrastive and MLM losses yields:
- Substantial improvements in downstream performance for retrieval, clustering, similarity, and classification tasks, notably in under-resourced and few-shot regimes (Wu et al., 2022, Liu et al., 2023, Wang et al., 2022, Pavlova et al., 19 Oct 2025).
- Greater robustness against spurious or adversarial augmentations, e.g., variable renaming attacks in code (Liu et al., 2023).
- Measurable gains in transfer metrics (STS, nDCG@10, AUROC) versus baseline models trained with only one objective (Wu et al., 2022, Pavlova et al., 19 Oct 2025, Nagano et al., 2024).
Specific observations:
- In sentence encoding, InfoCSE achieves +2.6 pts Spearman’s correlation over SimCSE, with ablations showing that removal of either objective significantly reduces performance (Wu et al., 2022).
- In speech, w2v-BERT reduces WER by 30–40% relative to prior models on LibriSpeech “test-other” and yields >30% relative gain on Voice Search traffic when compared to conformer wav2vec 2.0 (Chung et al., 2021).
- In code, ContraBERT achieves 90.46 MAP@R on clone detection, remains robust to adversarial edits, and outperforms CodeBERT on all core tasks (Liu et al., 2023).
- Domain-adapted models (MOSAIC) trained with joint objectives and domain-restricted masking outperform naive transfer and pure-contrastive re-training by up to +13.4 NDCG@10 in biomedical and low-resource text retrieval (Pavlova et al., 19 Oct 2025).
- In TCR representation, SCEPTR’s joint AC+MLM pretraining consistently beats general PLMs and alignment-based methods across all tested metrics (Nagano et al., 2024).
Ablation studies systematically demonstrate that discarding contrastive or MLM components reduces overall performance or leads to collapsed/degenerate representations, validating the necessity of combined objectives for effective representation learning.
5. Analysis of Failure Modes, Limitations, and Trade-Offs
The efficacy of hybrid contrastive+MLM frameworks is contingent on:
- Careful balance of loss weights; over- or under-weighting MLM can respectively wash out the discriminative structure or collapse token-level information (Wu et al., 2022, Zhang et al., 2023, Pavlova et al., 19 Oct 2025).
- Design of view and negative sampling strategies; poorly chosen augmentations may distort semantic alignment or introduce artifacts, especially in structured or multimodal domains (Liu et al., 2023, Borodin et al., 31 Mar 2026).
- Domain adaptation: methods like MOSAIC demonstrate that restricting MLM supervision to domain-specific vocabulary is critical for achieving domain-relevant adaptation without corrupting base semantic structures (Pavlova et al., 19 Oct 2025).
A trade-off frequently encountered is that gains in intrinsic embedding quality or retrieval metrics (alignment, recall) do not always translate to improved downstream generative performance, as shown in prosody-aware TTS systems (Borodin et al., 31 Mar 2026). Over-optimization towards prosodic or fine-grained discriminative targets can erode the encoder’s ability to reliably encode core linguistic or structural attributes, e.g., phoneme identity.
A plausible implication is that continued progress requires adaptive weighting or curriculum strategies that dynamically balance global discrimination and local generative constraints—in some cases, staged or interleaved training appears optimal (Pavlova et al., 19 Oct 2025, Borodin et al., 31 Mar 2026).
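One concrete instance of the adaptive-weighting idea is a schedule that anneals the MLM weight $\lambda$ over training so that token-level supervision dominates early and the contrastive objective dominates late. This is a hypothetical curriculum, not a schedule prescribed by any of the cited works.

```python
def mlm_weight(step, total_steps, lam_start=1.0, lam_end=0.1):
    """Linearly anneal the MLM weight lambda from lam_start to lam_end."""
    frac = min(step / max(total_steps, 1), 1.0)
    return lam_start + frac * (lam_end - lam_start)

# Schematic usage inside a training loop:
#   lam = mlm_weight(step, total_steps)
#   loss = l_cl + lam * l_mlm
```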
6. Extensions, Variations, and Future Research Directions
Several methodological extensions and open problems are evident:
- Auxiliary Networks: Use of specialized auxiliary or conditional MLM heads (as in InfoCSE and CMLM-CSE) to control gradient interference and infuse coverage of global sentence features is increasingly standard, but optimal architectural design, depth, and pretraining regimes remain to be characterized (Wu et al., 2022, Zhang et al., 2023).
- Equivariant Objectives: The introduction of equivariant contrastive learning, where particular augmentations are treated as “harmful” and the encoder is encouraged to be sensitive to such edits (e.g., through RTD losses in DiffCSE), generalizes the contrastive+MLM paradigm (Chuang et al., 2022).
- Domain-Selective Masking: Selectively masking only domain tokens (MOSAIC) or information-critical regions suggests a general pattern for controlled adaptation without generalization collapse, but requires robust identification/extraction pipelines (Pavlova et al., 19 Oct 2025); a masking sketch follows this list.
- Span-Level and Task-Specific Contrastive Losses: In extractive QA and tasks involving structured outputs, MLM-based span generation and ranking (KECP) introduces task-oriented contrastive supervision, directly aligning the pretraining and downstream objectives (Wang et al., 2022).
- Multimodal Hybridization: Joint text-audio pretraining in speech and TTS, as with w2v-BERT and recent dual-stream alignments, suggests a general route for contrastive+MLM approaches to bridge distinct modalities (Chung et al., 2021, Borodin et al., 31 Mar 2026).
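As a sketch of the domain-selective masking pattern referenced above, the function below masks only tokens that fall inside a pre-extracted domain vocabulary; the vocabulary-extraction step, the 30% masking rate, and the -100 label convention are illustrative assumptions, not details of MOSAIC.

```python
import torch

def domain_selective_mask(input_ids, domain_vocab_ids, mask_token_id, rate=0.3):
    """Corrupt only domain-vocabulary tokens and build MLM labels.

    input_ids        : (B, T) token ids
    domain_vocab_ids : 1-D tensor of ids belonging to the target domain
    """
    is_domain = torch.isin(input_ids, domain_vocab_ids)        # candidate positions
    mask = is_domain & (torch.rand(input_ids.shape, device=input_ids.device) < rate)
    labels = input_ids.masked_fill(~mask, -100)                 # loss only on masked slots
    corrupted = input_ids.masked_fill(mask, mask_token_id)
    return corrupted, labels
```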
Promising research directions include dynamic objective scheduling, adaptive augmentation selection, and fine-grained introspection/evaluation of embedding geometries under combined supervision.
References:
- "W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training" (Chung et al., 2021)
- "Contrastive learning of T cell receptor representations" (Nagano et al., 2024)
- "Contrastive Learning for Prompt-Based Few-Shot Language Learners" (Jian et al., 2022)
- "InfoCSE: Information-aggregated Contrastive Learning of Sentence Embeddings" (Wu et al., 2022)
- "DiffCSE: Difference-based Contrastive Learning for Sentence Embeddings" (Chuang et al., 2022)
- "KECP: Knowledge Enhanced Contrastive Prompting for Few-shot Extractive Question Answering" (Wang et al., 2022)
- "ContraBERT: Enhancing Code Pre-trained Models via Contrastive Learning" (Liu et al., 2023)
- "MOSAIC: Masked Objective with Selective Adaptation for In-domain Contrastive Learning" (Pavlova et al., 19 Oct 2025)
- "CMLM-CSE: Based on Conditional MLM Contrastive Learning for Sentence Embeddings" (Zhang et al., 2023)
- "Auto-MLM: Improved Contrastive Learning for Self-supervised Multi-lingual Knowledge Retrieval" (Xu et al., 2022)
- "Combining Masked Language Modeling and Cross-Modal Contrastive Learning for Prosody-Aware TTS" (Borodin et al., 31 Mar 2026)