Contrastive Learning with Matched Essay Pairs
- The paper introduces matched essay pairs to leverage contrastive loss functions, such as triplet margin loss and supervised contrastive objectives, to reduce bias in automated essay scoring.
- The method utilizes transformer-based encoders with LoRA adaptation and hierarchical aggregation to capture both fine-grained and holistic textual features.
- Empirical results demonstrate improved fairness metrics and interpretability over baselines, supported by effective triplet construction and balanced sampling.
Contrastive learning with matched essay pairs refers to a family of methods in which neural encoders are trained to recognize equivalence or similarity between essay texts, typically by leveraging explicit pairings of essays that respond to the same prompt, were scored similarly by human raters, or share other controlled properties. In educational NLP, this approach is increasingly used to mitigate undesired biases such as demographic artifacts and to support interpretable essay or document matching. In contrast to purely regression-based automated essay scoring (AES), these methods employ specialized loss functions (including triplet margin loss and supervised contrastive objectives) so that essays deemed similar by expert judgment are mapped to nearby regions of the embedding space, while dissimilar essays are pushed apart. Significant recent work in this area includes the Triplet Margin Loss technique for fairness in automated grading systems (Fan et al., 23 Jan 2026) and the CoLDE (Contrastive Long Document Encoder) model for interpretable long-form document matching (Jha et al., 2021).
1. Matched Essay Pair Construction
The foundation of contrastive learning in this context is the construction of matched essay pairs or triplets, balancing both positive (matched) and negative (mismatched) examples. In the bias mitigation setting for AES (Fan et al., 23 Jan 2026), this involves:
- Corpus Merging and Score Normalization: Combining diverse essay corpora such as ASAP 2.0 (native and ESL essays) and ELLIPSE (ESL writing), normalizing rubric scores into the [0, 1] range to allow fair comparison.
- Stratified Sampling: Splitting the data into train/test splits while preserving the ESL/native essay ratio.
- Triplet Construction: For each native essay with normalized human score s, selecting:
- Positive: an ESL essay with score s_p satisfying |s_p − s| ≤ 0.01 (within ±1% of the normalized range).
- Negative: any essay (native or ESL) with score s_n satisfying |s_n − s| ≥ 0.20 (at least 20% away).
- This yields a large set of essay triplets (17,161 in (Fan et al., 23 Jan 2026)) suitable for triplet margin learning.
By matching essays on prompt and score, these pairings explicitly control for confounders and permit the model to focus on underlying essay quality rather than superficial linguistic traits, a key consideration for reducing demographic bias (Fan et al., 23 Jan 2026).
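The matching rules above can be sketched as a simple sampling routine. This is a minimal illustration, assuming each essay is an (id, score) pair with scores already normalized to [0, 1]; the function name and data layout are hypothetical, not from the paper:

```python
import random

def build_triplets(native, esl, all_essays, pos_tol=0.01, neg_gap=0.20, seed=0):
    """Construct (anchor, positive, negative) triplets from score-normalized essays.

    native, esl, all_essays are lists of (essay_id, score) pairs with scores in [0, 1].
    Positives are ESL essays within +/- pos_tol of the anchor's score; negatives are
    any essays at least neg_gap away, mirroring the matching rules described above.
    """
    rng = random.Random(seed)
    triplets = []
    for a_id, a_score in native:
        positives = [e_id for e_id, s in esl if abs(s - a_score) <= pos_tol]
        negatives = [e_id for e_id, s in all_essays if abs(s - a_score) >= neg_gap]
        if positives and negatives:  # skip anchors without a valid match
            triplets.append((a_id, rng.choice(positives), rng.choice(negatives)))
    return triplets
```

Anchors lacking any in-tolerance positive or sufficiently distant negative are simply dropped, which is one reason stratified sampling of the merged corpus matters for triplet yield.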
2. Encoder Architectures and Embedding Strategies
Contrastive learning with matched essay pairs places unique demands on encoder design, particularly to handle long-form and linguistically diverse documents.
- Transformer Backbone with Lightweight Adaptation: The DeBERTa-v3-base transformer is used with low-rank LoRA adaptation applied selectively to the query and value projections, enhancing efficiency while retaining representation quality (Fan et al., 23 Jan 2026).
- Representation Extraction: The [CLS] token's hidden state, a 768-dimensional vector, serves as the essay embedding.
- Structured Aggregation for Long Documents: The CoLDE framework decomposes essays into sections and further into 512-token chunks, with unique positional embeddings for section, chunk, and within-chunk position to address the heterogeneity and length limitation of standard transformers (Jha et al., 2021).
- Hierarchical Aggregation: Chunk embeddings are processed through Bi-LSTM and multi-head attention over sections, supporting multi-level interpretability and managing context across the essay.
These choices enable learning both fine-grained and holistic notions of essay similarity, crucial when matched pairings are based not solely on surface linguistic overlap but also on underlying human-rated quality or semantic intent.
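The CoLDE-style decomposition into sections and fixed-size chunks, each tagged with hierarchical positional indices, can be illustrated with a small segmentation routine. This is a sketch assuming uniform sectioning of an already-tokenized document; the function name and return format are illustrative, not CoLDE's actual API:

```python
def segment_document(tokens, chunk_size=512, n_sections=4):
    """Split a token sequence into sections, then fixed-size chunks, recording
    (section, chunk, within-chunk) positional indices for each token.

    Uniform segmentation is used when the document lacks explicit structure;
    returns a list of (section_idx, chunk_idx, chunk_tokens, positions) records.
    """
    section_len = max(1, -(-len(tokens) // n_sections))  # ceiling division
    records = []
    for s_idx in range(n_sections):
        section = tokens[s_idx * section_len:(s_idx + 1) * section_len]
        for start in range(0, len(section), chunk_size):
            chunk = section[start:start + chunk_size]
            c_idx = start // chunk_size
            # one (section, chunk, within-chunk) triple per token
            positions = [(s_idx, c_idx, w) for w in range(len(chunk))]
            records.append((s_idx, c_idx, chunk, positions))
    return records
```

Each record can then be embedded independently by the transformer backbone, with the positional triples feeding the section/chunk/position embeddings before hierarchical aggregation.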
3. Contrastive Objectives and Loss Functions
The central optimization in these methods utilizes contrastive objectives that directly supervise the geometry of the learned embedding space.
- Triplet Margin Loss: Used to align native and ESL essays of matched quality:
  L(e_a, e_p, e_n) = max(d(e_a, e_p) − d(e_a, e_n) + α, 0)
where e_a, e_p, e_n denote the anchor (native), positive (matched ESL), and negative essay embeddings, d is Euclidean distance, and α is the margin (set to 1.0 after ablation) (Fan et al., 23 Jan 2026). This setup forces essays of identical human-rated quality (across demographics) to be embedded closer together than those of diverging quality.
- Supervised Contrastive Loss: Structured for document sections in CoLDE:
  L_i = −(1/|P(i)|) Σ_{p∈P(i)} log [ exp(z_i · z_p / τ) / Σ_{a∈A(i)} exp(z_i · z_a / τ) ]
where z_i is the embedding of section i, P(i) comprises all positive sections for anchor i, A(i) the remaining sections in the batch, and τ is the temperature scaling (Jha et al., 2021).
The use of such losses explicitly sculpts the embedding space to respect human judgments of equivalence rather than statistical artifacts, directly addressing issues such as demographic bias or semantic drift in document matching.
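Both objectives are compact to implement. The following NumPy sketch (not the papers' code) shows a batched triplet margin loss and a standard supervised contrastive loss, with P(i) taken to be the other in-batch samples sharing anchor i's label:

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """max(d(a, p) - d(a, n) + margin, 0) with Euclidean d, averaged over the batch."""
    d_pos = np.linalg.norm(anchor - positive, axis=-1)
    d_neg = np.linalg.norm(anchor - negative, axis=-1)
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()

def supervised_contrastive_loss(z, labels, tau=0.5):
    """Supervised contrastive loss over embeddings z of shape (n, d).

    Embeddings are L2-normalized; for each anchor i, positives are the other
    samples with the same label, and the denominator ranges over all samples
    except i itself (anchors with no positives contribute zero).
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / tau
    n = len(z)
    not_self = ~np.eye(n, dtype=bool)
    # log softmax over all other samples in the batch
    log_prob = sim - np.log((np.exp(sim) * not_self).sum(axis=1, keepdims=True))
    pos_mask = (labels[:, None] == labels[None, :]) & not_self
    per_anchor = -(log_prob * pos_mask).sum(axis=1) / np.maximum(pos_mask.sum(axis=1), 1)
    return per_anchor.mean()
```

With margin 1.0, the triplet loss vanishes once each negative is at least one unit farther from the anchor than its positive, which is exactly the geometric constraint the matched pairs are meant to enforce.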
4. Training Regimes and Hyperparameter Choices
Careful orchestration of training is essential for stability and fairness.
- Contrastive Pretraining: Only LoRA parameters are updated during the contrastive phase (learning rate 1e-4; batch size 32; 2 epochs) in (Fan et al., 23 Jan 2026). CoLDE uses larger batches (50–200), learning rate 5e-5 (AdamW), up to 5 sections per document, chunk size 512, and temperature 0.5 (Jha et al., 2021).
- Regression Head Fine-Tuning: In (Fan et al., 23 Jan 2026), after freezing the backbone post-contrastive training, a linear regression head is fit on the embeddings for 5 epochs (MSE loss; learning rate 1e-3).
- Augmentation for Long Documents: If essays lack explicit structure, uniform segmentation into equal-length sections suffices; for very long documents, chunk and section counts are tuned according to GPU memory constraints (Jha et al., 2021).
Overfitting and batch imbalance are managed via stratified sampling and careful negative sampling strategies, as under- or overrepresented groups in matched pairings can degrade embedding fidelity.
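The second phase, fitting a regression head on frozen embeddings with MSE loss, amounts to plain mini-batch gradient descent on a linear model. This is an illustrative NumPy stand-in, not the paper's code; the default hyperparameters mirror those reported above:

```python
import numpy as np

def fit_regression_head(embeddings, scores, lr=1e-3, epochs=5, batch_size=32, seed=0):
    """Fit a linear scoring head on frozen essay embeddings with MSE loss.

    Mirrors the second training phase: the encoder is frozen, so only the
    head weights w and bias b are updated by mini-batch gradient descent.
    """
    rng = np.random.default_rng(seed)
    n, d = embeddings.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            X, y = embeddings[idx], scores[idx]
            err = X @ w + b - y                   # prediction error
            w -= lr * 2 * X.T @ err / len(idx)    # d(MSE)/dw
            b -= lr * 2 * err.mean()              # d(MSE)/db
    return w, b
```

Because the backbone is frozen, this step cannot undo the geometry learned contrastively; it only calibrates embeddings back onto the score scale.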
5. Evaluation Metrics, Results, and Interpretability
The effectiveness of contrastive learning with matched essay pairs is evaluated on both quantitative metrics and qualitative interpretability.
| Model | QWK | Bias Gap (High-Proficiency ESL) | Fairness Gain |
|---|---|---|---|
| Baseline DeBERTa-v3-base | 0.792 | 0.103 | – |
| Contrastive (Triplet Margin) | 0.756 | 0.062 | –39.9% |
| Ablation (α=2.0 margin) | 0.718 | 0.064 | Not optimal |
- Quadratic Weighted Kappa (QWK): Assesses agreement between model scores and human raters.
- Bias Gap: The score differential between high-proficiency ESL and native essays (same human-rated score).
- Fairness vs. Accuracy Trade-off: Margin hyperparameters directly modulate bias mitigation and scoring reliability (Fan et al., 23 Jan 2026).
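The two scalar metrics above are straightforward to compute. The following sketch implements quadratic weighted kappa for integer ratings and a simple mean-difference bias gap; the function names and data layout are illustrative:

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """Quadratic weighted kappa between two sets of integer ratings in [0, n_classes)."""
    O = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):          # observed agreement matrix
        O[t, p] += 1
    w = (np.arange(n_classes)[:, None] - np.arange(n_classes)[None, :]) ** 2
    w = w / (n_classes - 1) ** 2              # quadratic disagreement weights
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()  # chance agreement
    return 1.0 - (w * O).sum() / (w * E).sum()

def bias_gap(pred_esl, pred_native):
    """Mean predicted-score differential between matched ESL and native essays."""
    return float(np.mean(np.asarray(pred_native) - np.asarray(pred_esl)))
```

A QWK of 1.0 indicates perfect agreement with human raters; a bias gap near zero indicates that matched ESL and native essays receive comparable predicted scores.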
CoLDE evaluates document similarity (cosine between pooled embeddings), section-level alignment (section-section similarities), and chunk-level attention maps for interpretability. On document-matching benchmarks, CoLDE outperforms strong baselines on F1 and accuracy, with interpretability at document, section, and chunk granularity (Jha et al., 2021).
Post-hoc analyses using syntactic parsing and sentence-complexity metrics reveal that contrastive alignment eliminates confounding correlations between sentence complexity (e.g., subordinate clause usage) and predicted score for ESL essays, demonstrating successful disentanglement of structural linguistic features from genuine writing errors (Fan et al., 23 Jan 2026).
6. Significance, Broad Applicability, and Limitations
Contrastive learning with matched essay pairs advances both technical robustness and fairness in automated essay scoring systems and long-form document matching.
- Algorithmic Fairness: By enforcing alignment across demographic and linguistic groups, models correct for shortcut learning (e.g., penalizing L2 stylistic complexity) and yield more equitable, artifact-resistant scoring (Fan et al., 23 Jan 2026).
- Interpretable Document Matching: The CoLDE framework allows granular attribution of similarity and difference within long-form texts, benefiting applications such as plagiarism detection, paraphrase identification, and educational assessment (Jha et al., 2021).
- Limitations: Small datasets can lead to underfitting in contrastive setups; highly unbalanced positive/negative pairings may bias embedding geometry. Loss of score calibration is possible if contrastive objectives are not carefully balanced with downstream regression loss. For very long or unstructured essays, segmentation strategies and GPU constraints must be considered.
A plausible implication is that as more essay corpora are made available with detailed rater metadata and demographic tags, matched-pair contrastive frameworks will become foundational in both fairness-critical educational NLP and large-scale, interpretable document retrieval tasks.