Contrastive Bi-Encoders (CARP)

Updated 2 January 2026
  • Contrastive Bi-Encoders (CARP) are dual-encoder architectures that align semantically matched pairs with contrastive loss.
  • They employ varied backbones like RoBERTa, LaBSE, and BERT to serve tasks including narrative evaluation, preference modeling, NER, and paraphrase identification.
  • CARP models enable efficient, scalable inference by precomputing embeddings, supporting robust multilingual and zero-shot applications.

Contrastive Bi-Encoders (CARP) constitute a family of dual-encoder architectures paired with contrastive objectives, designed to map input pairs from distinct modalities or domains into a shared vector space such that semantically matched pairs align and mismatches repel. Originating in the context of zero-shot narrative evaluation, CARP and its derivatives have since demonstrated efficacy in diverse domains, including preference modeling for natural language generation, supervised and distantly supervised named entity recognition, and multilingual paraphrase identification.

1. Model Architectures and Encoders

CARP models universally employ a bi-encoder paradigm: two separate deep encoders process input pairs independently. For the prototypical application in narrative evaluation (Matiana et al., 2021), the model consists of two RoBERTa-based masked language models (MLMs) with matched architectures, where one encoder processes a story passage $s$ and the other a critique $c$. Each encoder yields token embeddings; these are aggregated (using masked summation or pooling), $\ell_2$-normalized, and projected into a fixed-dimensional latent space (e.g., 2048-dim).
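
A minimal sketch of one such bi-encoder, assuming PyTorch and Hugging Face `transformers`; the masked-summation pooling, $\ell_2$-normalization, and 2048-dim projection follow the description above, while class and variable names (`CarpBranch`, `story_encoder`, `critique_encoder`) are illustrative rather than taken from the released code.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

class CarpBranch(torch.nn.Module):
    """One branch of a CARP-style bi-encoder: a RoBERTa body plus a linear projection."""
    def __init__(self, backbone="roberta-base", latent_dim=2048):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(backbone)
        self.proj = torch.nn.Linear(self.encoder.config.hidden_size, latent_dim)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        # Masked summation over token embeddings, then projection and L2-normalization.
        summed = (hidden * attention_mask.unsqueeze(-1)).sum(dim=1)
        return F.normalize(self.proj(summed), dim=-1)

# Non-shared branches: one for story passages, one for critiques.
tok = AutoTokenizer.from_pretrained("roberta-base")
story_encoder, critique_encoder = CarpBranch(), CarpBranch()

stories = tok(["Once upon a time ..."], return_tensors="pt", padding=True)
critiques = tok(["The pacing drags in the middle."], return_tensors="pt", padding=True)
z_story = story_encoder(**stories)         # (batch, 2048), unit-norm
z_critique = critique_encoder(**critiques)
similarity = z_story @ z_critique.T        # cosine similarity matrix
```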

The CARP framework is compatible with a range of backbone models:

  • For story evaluation: RoBERTa Tiny/Base/Large (58M to 715M per branch, up to 1.43B parameters total) (Matiana et al., 2021).
  • For multilingual paraphrase identification (CARP for cRoss-lingual Paraphrase, Editor's term): LaBSE (12 layers, 768-dim embeddings, shared encoder for both sentence sides) (Fedorova et al., 2024).
  • For NER: Non-shared BERT-style transformers, one for text spans and one for type descriptions (Zhang et al., 2022).

In all cases, the encoders either share parameters (if inputs are symmetric, e.g., paraphrase pairs) or remain completely distinct, as in story–critique and span–type bi-encoders.

2. Contrastive Training Objectives

The core learning paradigm for CARP is the InfoNCE contrastive loss or its generalizations. Given a batch of $N$ aligned (positive) pairs $(x_i, y_i)$, and $N-1$ in-batch negatives per anchor, the objective enforces high similarity (via cosine, potentially temperature-scaled) for matched pairs and low for mismatches:

$$L = -\frac{1}{N} \sum_{i=1}^N \log \frac{\exp(\mathrm{sim}(x_i, y_i)/\tau)}{\sum_{j=1}^N \exp(\mathrm{sim}(x_i, y_j)/\tau)}$$

where $\tau$ is a (possibly learned) temperature. This form is used in both story–critique alignment (Matiana et al., 2021, Castricato et al., 2022) and for mapping entity spans to type embeddings in NER (Zhang et al., 2022).
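
A minimal PyTorch sketch of this in-batch objective, assuming the encoders already emit unit-norm embeddings; cross-entropy over rows with the diagonal as targets reproduces the loss above exactly, and the default temperature here is illustrative (the symmetric CLIP-style variant simply averages this with the transposed direction).

```python
import torch
import torch.nn.functional as F

def info_nce(z_x, z_y, tau=0.07):
    """InfoNCE over N aligned pairs; z_x, z_y are (N, d) unit-norm embeddings."""
    logits = (z_x @ z_y.T) / tau                      # sim(x_i, y_j) / tau
    targets = torch.arange(z_x.size(0), device=z_x.device)
    # Row i's positive is column i; the remaining N-1 columns act as in-batch negatives.
    return F.cross_entropy(logits, targets)           # = -1/N * sum_i log softmax_i[i]
```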

Extensions in some CARP variants include additive-margin softmax (ArcFace) losses and explicit hard-negative mining (e.g., for paraphrase identification (Fedorova et al., 2024)), where within-batch and externally mined negatives are incorporated to sharpen the decision boundary.
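
A simplified sketch of such an extension, combining in-batch negatives with externally mined hard negatives and an additive margin on the positive logit; this is an additive-margin approximation rather than the exact ArcFace angular margin used by Fedorova et al. (2024), and the margin and temperature values are illustrative.

```python
import torch
import torch.nn.functional as F

def margin_info_nce(z_x, z_y, z_hard, tau=0.05, margin=0.3):
    """z_x, z_y: (N, d) aligned pairs; z_hard: (M, d) mined hard negatives (all unit-norm)."""
    in_batch = z_x @ z_y.T                            # (N, N) in-batch similarities
    mined = z_x @ z_hard.T                            # (N, M) similarities to hard negatives
    logits = torch.cat([in_batch, mined], dim=1) / tau
    targets = torch.arange(z_x.size(0), device=z_x.device)
    # Additive margin: penalize the positive logit so the boundary must be met with slack.
    penalty = F.one_hot(targets, num_classes=logits.size(1)).to(logits.dtype) * (margin / tau)
    return F.cross_entropy(logits - penalty, targets)
```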

3. Application Domains and Datasets

Contrastive Bi-Encoders (CARP) are deployed in a range of downstream NLP tasks:

3.1 Zero-Shot Narrative Evaluation

CARP is introduced for zero-shot evaluation of computationally generated stories by contrasting story passages with corresponding human critiques (Matiana et al., 2021). The Story-Critique dataset provides 1.3M aligned pairs over 80,000 unique stories, enabling robust training and empirical benchmarking.

3.2 Preference Modeling for Language Generation

CARP models serve as preference reward models by scoring the semantic alignment between generated story candidates and natural language critiques. These similarity scores are used as trajectory-level rewards in reinforcement learning (e.g., PPO) to fine-tune LLMs for controlled generation (Castricato et al., 2022). Discretized preference classes (via pseudo-labeling and CoOp prompt-tuning) provide sharper, more robust rewards for policy optimization.
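
A hedged sketch of how such a reward could be computed, reusing the `story_encoder`/`critique_encoder` names from the architecture sketch above; in CARP-CoOp the hand-written preference prompt below is replaced by learned soft-prompt embeddings, so this should be read as an illustration of the reward interface rather than the exact published pipeline.

```python
import torch

@torch.no_grad()
def carp_reward(generated_texts, preference_text, tok, story_encoder, critique_encoder):
    """Score generated passages against a preference description; higher = better aligned."""
    passages = tok(generated_texts, return_tensors="pt", padding=True, truncation=True)
    preference = tok([preference_text], return_tensors="pt")
    z_story = story_encoder(**passages)               # (B, d), unit-norm
    z_pref = critique_encoder(**preference)           # (1, d), unit-norm
    return (z_story @ z_pref.T).squeeze(-1)           # (B,) scalar trajectory rewards for PPO

# rewards = carp_reward(samples, "This passage stays on the topic of space travel.",
#                       tok, story_encoder, critique_encoder)
```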

3.3 Named Entity Recognition

BINDER (Bi-Encoder for NER via Dynamic thresholding) formulates NER as instance-level representation learning, mapping spans and entity type descriptions into a joint space (Zhang et al., 2022). Dynamic per-type thresholds (learned from the [CLS] embedding in context) eliminate the need for an explicit Outside class and enable both flat and nested NER with efficient inference.
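
The dynamic-threshold idea can be sketched as follows: span and type-description embeddings are compared in the shared space, and a span is kept only if its similarity to some type exceeds that type's threshold, here derived from the sentence's [CLS] anchor; names and the exact anchor construction are simplified relative to Zhang et al. (2022).

```python
import torch

def assign_entity_types(span_emb, type_emb, cls_emb):
    """
    span_emb: (S, d) unit-norm embeddings of candidate spans in one sentence.
    type_emb: (K, d) unit-norm embeddings of entity-type descriptions.
    cls_emb:  (d,)   unit-norm [CLS] embedding of the sentence (dynamic anchor).
    """
    scores = span_emb @ type_emb.T                    # (S, K) span-type similarities
    thresholds = type_emb @ cls_emb                   # (K,) per-type dynamic thresholds
    keep = scores > thresholds                        # no explicit Outside class needed
    is_entity = keep.any(dim=1)                       # spans below every threshold are non-entities
    best_type = scores.masked_fill(~keep, float("-inf")).argmax(dim=1)
    return best_type, is_entity
```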

3.4 Multilingual Paraphrase Identification

Multilingual CARP bi-encoders (using fine-tuned LaBSE backbones) enable cross-lingual paraphrase detection by aligning semantically equivalent sentences in different languages (Fedorova et al., 2024). The model achieves competitive PAWS-X mean accuracies and is robust to zero-shot extension to new languages.
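
A hedged inference sketch using an off-the-shelf LaBSE checkpoint via the `sentence-transformers` library; the fine-tuned bi-encoder of Fedorova et al. (2024) would replace the base model, and the 0.6 decision threshold is purely illustrative.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")    # shared encoder for both sides

pairs = [
    ("The weather is nice today.", "Hoy hace buen tiempo."),      # cross-lingual paraphrase
    ("The weather is nice today.", "Le train est en retard."),    # non-paraphrase
]
left = model.encode([a for a, _ in pairs], normalize_embeddings=True)
right = model.encode([b for _, b in pairs], normalize_embeddings=True)

cosine = (left * right).sum(axis=1)    # row-wise dot product of unit vectors = cosine
threshold = 0.6                        # illustrative decision threshold
print([bool(c > threshold) for c in cosine])
```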

4. Empirical Results and Benchmarking

Extensive experimental results are reported across tasks:

  • Story Evaluation (Matiana et al., 2021):
    • CARP-Large achieves mean cosine similarities up to 0.9 with human critique selection distributions.
    • Outperforms zero-shot and finetuned autoregressive baselines (e.g., GPT-J-6B), with lower KL divergence to human judgments.
    • Scaling effects: validation accuracy improves almost linearly with model size.
  • Preference Modeling (Castricato et al., 2022):
    • Topic control accuracy with CARP CoOp reaches 0.62, compared to 0.37 (GeDi-GPT2) and 0.49 (GPT-NeoX-20B).
    • Empirically, a small RL-finetuned GPT-2 guided by CARP rewards outperforms models twenty times larger on human alignment criteria.
  • Named Entity Recognition (Zhang et al., 2022):
    • State-of-the-art F1 scores: ACE2004 (89.7), ACE2005 (90.0), GENIA (80.8), CoNLL2003 (93.3), BC5-chem (95.0).
    • Ablation studies indicate dynamic per-type thresholds provide superior entity/non-entity separation.
    • BINDER delivers efficient inference (2.6K steps/s train, 8.9K inf/s on ACE2005).
  • Paraphrase Identification (Fedorova et al., 2024):
    • The CARP bi-encoder (LaBSE backbone) attains 79.3% mean test accuracy on PAWS-X, versus 91.7% for ByT5 XXL cross-encoders, a ∼7–10% relative drop.
    • Performance by type: inter-lingual paraphrases (77.4%), intra-lingual (78.3%), inter-lingual-same (82.2%).
    • Embedding space quality is validated via Align and Uniform metrics.

5. Inference Procedures and Practical Considerations

The CARP paradigm enables efficient, scalable, and modular inference:

  • Story/critique and span/type embeddings can be precomputed and stored, enabling large-scale retrieval and fast similarity computation (matrix multiplication).
  • For story evaluation, candidate critiques are paraphrased for prompt-robustness; similarities are min-shifted and softmaxed to produce a normalized distribution over options (Matiana et al., 2021), as sketched after this list.
  • In preference modeling, learned prompt embeddings (CoOp) enable mapping discrete preference classes to dense vectors for reward calculation (Castricato et al., 2022).
  • For NER, dynamic thresholds per type allow seamless extension to new types by providing additional embeddings, enabling open-domain and zero-shot entity recognition (Zhang et al., 2022).
  • Multilingual CARP bi-encoders facilitate zero-shot adaptation to new languages by freezing the wordpiece embedding and fine-tuning only higher layers (Fedorova et al., 2024).
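
For the story-evaluation procedure referenced above, a minimal sketch of the min-shift-and-softmax step over precomputed critique embeddings; the paraphrase-averaging of candidate critiques is omitted, and treating the shifted similarities as softmax logits with unit temperature is an assumption.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def critique_distribution(z_story, z_candidates):
    """
    z_story:      (d,)   unit-norm embedding of the passage under evaluation.
    z_candidates: (C, d) precomputed unit-norm embeddings of candidate critiques.
    Returns a normalized probability distribution over the C candidates.
    """
    sims = z_candidates @ z_story       # (C,) cosine similarities via one matrix product
    sims = sims - sims.min()            # min-shift so the weakest candidate scores zero
    return F.softmax(sims, dim=0)       # normalized distribution over options
```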

6. Design Rationale, Limitations, and Extensions

Contrastive bi-encoder architectures offer several advantages:

  • Decoupled inference: Input pairs are processed independently, enabling offline embedding and sublinear retrieval at inference time.
  • Robustness to noisy/distant supervision: Contrastive learning tolerates noisy negatives and supports self-supervised or distantly supervised regimes (e.g., dictionary-based NER).
  • Generalization: Shared latent spaces foster transfer across domains, entity types, or languages.
  • Flexible extension: New labels or languages can be introduced via additional description embeddings or minimal adaptation.

Limitations include increased computational costs for large sets of candidate pairs (especially the $O(L^2)$ scaling over candidate spans for NER) and the sensitivity of threshold learning to contextual variance, which may require further adaptive calibration (Zhang et al., 2022). For paraphrase identification, there remains a small but stable performance gap relative to tightly-coupled cross-encoder models, although the modularity and efficiency of the bi-encoder remain desirable (Fedorova et al., 2024).

7. Representative Configurations

| Variant | Application Domain | Encoder(s) | Vector Dim | Key Loss |
|---|---|---|---|---|
| CARP Tiny/Base/Large (Matiana et al., 2021) | Story evaluation | RoBERTa | 2048 | InfoNCE + learned τ |
| CARP CoOp (Castricato et al., 2022) | RL preference reward | RoBERTa/BERT | 2048 | InfoNCE + soft prompt |
| BINDER (Zhang et al., 2022) | NER | BERT (non-shared) | d × K (typ.) | InfoNCE + dynamic threshold |
| CARP (LaBSE) (Fedorova et al., 2024) | Multilingual paraphrase ID | LaBSE | 768 | ArcFace + hard negatives |

Each configuration is tailored by domain requirements (e.g., whether parameter sharing is needed, scale of candidate space, or availability of hard negatives) and backbone properties.


Contrastive Bi-Encoders (CARP) define a principled approach to semantic alignment through dual encoding and contrastive learning, with demonstrated utility in evaluation, control, recognition, and cross-lingual understanding across multiple NLP domains (Matiana et al., 2021, Castricato et al., 2022, Zhang et al., 2022, Fedorova et al., 2024).
