
Retrieval-Based Metrics: P@1 & CSLS

Updated 13 January 2026
  • Retrieval-based metrics are evaluation methods that measure the alignment of cross-lingual embeddings by assessing if true parallel pairs are each other’s nearest neighbors.
  • They employ cosine similarity and local scaling to adjust for hubness, enabling precise ranking of source languages for tasks like NER and POS tagging.
  • Empirical studies show that both P@1 and CSLS correlate moderately (ρ ≈ 0.40) with downstream task performance, aiding in informed model and checkpoint selection.

Retrieval-based metrics, specifically Precision-at-1 (P@1) and Cross-domain Similarity Local Scaling (CSLS), are core evaluation methods for quantifying alignment between cross-lingual embedding spaces. Directly measuring whether true parallel sentence pairs are each other’s nearest neighbors in a shared space, these metrics are pivotal in assessing transferable representation quality for cross-lingual NLP, especially in low-resource settings. Their simple, retrieval-driven design allows practitioners to rank source languages for transfer learning, with proven effectiveness for downstream tasks such as Named Entity Recognition (NER) and Part-of-Speech (POS) tagging (Idris et al., 6 Jan 2026).

1. Formal Definitions: Precision-at-1 and CSLS

Given $\mathcal{S} = \{s_1, \ldots, s_N\}$ and $\mathcal{T} = \{t_1, \ldots, t_N\}$ as sets of L₂-normalized sentence embeddings for source and target languages, with $M_{ij} = s_i^\top t_j$ the cosine similarity matrix:

  • Precision at 1 (P@1):

For each $s_i$, if its nearest neighbor in $\mathcal{T}$ (by cosine similarity) is $t_i$ (i.e., its gold translation), count it as correct. The metric is:

$$\mathrm{P}@1(\mathcal{S}\to\mathcal{T}) = \frac{1}{N}\sum_{i=1}^{N} \mathbb{I}\Bigl(\arg\max_{j} s_i^{\top}t_j = i\Bigr)$$

where $\mathbb{I}(\cdot)$ is the indicator function.
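A minimal NumPy sketch of this definition (the function name and array layout are illustrative, not from the paper):

```python
import numpy as np

def precision_at_1(S: np.ndarray, T: np.ndarray) -> float:
    """P@1 for row-aligned, L2-normalized embedding matrices S, T of shape (N, d)."""
    M = S @ T.T                           # cosine similarities (rows are unit-norm)
    nearest = M.argmax(axis=1)            # index of each source sentence's nearest target
    return float((nearest == np.arange(len(S))).mean())
```

Because the rows are aligned gold pairs, a perfect alignment gives `precision_at_1(S, T) == 1.0`.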

  • Cross-domain Similarity Local Scaling (CSLS):

Designed to correct for "hubness," CSLS rescales neighbor similarities. For each $s_i$, define

$$r_T(s_i) = \frac{1}{k}\sum_{t_j \in \mathcal{N}_k(s_i)} s_i^{\top} t_j$$

with $\mathcal{N}_k(s_i)$ the $k$ nearest neighbors of $s_i$ in $\mathcal{T}$; $r_S(t_j)$ is defined analogously over the $k$ nearest neighbors of $t_j$ in $\mathcal{S}$. Then

$$\mathrm{CSLS}(s_i, t_j) = 2\, s_i^{\top} t_j - r_T(s_i) - r_S(t_j)$$

The mean CSLS across all gold pairs $(s_i, t_i)$ is reported, with $k = 10$.
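The mean-CSLS computation can be sketched in NumPy as follows (function name illustrative; the sort-based neighbor selection assumes no need for approximate search at this scale):

```python
import numpy as np

def mean_csls(S: np.ndarray, T: np.ndarray, k: int = 10) -> float:
    """Mean CSLS over gold pairs (s_i, t_i) for L2-normalized S, T of shape (N, d)."""
    M = S @ T.T
    k = min(k, M.shape[1])
    r_T = np.sort(M, axis=1)[:, -k:].mean(axis=1)   # mean sim of s_i to its k nearest targets
    r_S = np.sort(M, axis=0)[-k:, :].mean(axis=0)   # mean sim of t_j to its k nearest sources
    gold = np.diag(M)                               # s_i . t_i for each gold pair
    return float((2.0 * gold - r_T - r_S).mean())
```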

2. Computation and Implementation

The standard evaluation pipeline involves the following steps:

  • Embedding extraction: For each FLORES-200 sentence, mean-pool the final-layer token vectors of a pretrained encoder and L₂-normalize the result.
  • Similarity matrix: For the $N = 1{,}012$ parallel sentence pairs, form $M \in \mathbb{R}^{N \times N}$ by exhaustive dot products.
  • P@1 calculation: For each $s_i$, identify $\arg\max_{j} M_{ij}$ and report the proportion of cases where $j = i$.
  • CSLS calculation: For each $s_i$, average the top-$k$ target similarities to obtain $r_T(s_i)$; likewise $r_S(t_i)$. For each gold pair $(s_i, t_i)$, compute $\mathrm{CSLS}(s_i, t_i)$ and average over all pairs.

This approach is feasible for datasets of this scale and aligns with experimental procedures used in recent cross-lingual transfer research (Idris et al., 6 Jan 2026).
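The embedding-extraction step (mean pooling over non-padding tokens, then L₂ normalization) can be sketched as follows; in practice the token vectors would come from a pretrained encoder's final layer, replaced here by a toy array:

```python
import numpy as np

def sentence_embedding(token_vecs: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Mean-pool final-layer token vectors (T, d) over non-padding positions
    (mask of 0/1 per token), then L2-normalize to a unit sentence embedding."""
    pooled = (token_vecs * mask[:, None]).sum(axis=0) / mask.sum()
    return pooled / np.linalg.norm(pooled)
```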

3. Empirical Evaluation in Cross-Lingual Transfer

In systematic experiments involving 816 zero-shot transfer evaluations across 12 African languages, three multilingual pretrained models (AfriBERTa, AfroXLM-R, Serengeti), and three tasks (NER, POS, sentiment):

  • Moderate predictive power: Both P@1 and CSLS achieve a mean Spearman rank correlation of $\rho \approx 0.40$ with downstream task performance (F$_1$ or overall accuracy).
  • Superiority over CKA: Structural Centered Kernel Alignment (CKA) yields much weaker correlation ($\rho \approx 0.10$).
  • Comparable to cosine gap: Retrieval-based metrics are on par with, but largely complementary to, the cosine gap (mean $\rho = 0.41$).
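The reported values are Spearman rank correlations between a metric's per-source-language scores and downstream performance. A minimal sketch (no ties assumed; the score lists are invented for illustration):

```python
import numpy as np

def spearman_rho(x, y) -> float:
    """Spearman rank correlation for score lists without ties."""
    rx = np.argsort(np.argsort(x))   # rank positions of x
    ry = np.argsort(np.argsort(y))   # rank positions of y
    d = rx - ry
    n = len(x)
    return 1.0 - 6.0 * (d ** 2).sum() / (n * (n ** 2 - 1))

p_at_1 = [0.62, 0.41, 0.75, 0.30, 0.55]   # alignment score per candidate source (illustrative)
ner_f1 = [0.58, 0.40, 0.70, 0.35, 0.50]   # downstream F1 for the same sources (illustrative)
```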

Table 1: Overall metric correlations (NER/POS/sentiment, $n = 9$ model × task pairs)

  Metric   Mean ρ   Std    Significant cases
  P@1      0.40     0.14   7/9
  CSLS     0.40     0.14   6/9
  CKA      0.10     0.18   2/9

4. Task and Model-Specific Insights

The predictive strength of P@1 and CSLS varies by task and model:

  • NER: Correlations in the range 0.20–0.58, with top performance observed for AfriBERTa and Serengeti (P@1 up to 0.56, CSLS up to 0.58).
  • POS: Consistent correlations (0.41–0.53) across all models.
  • Sentiment: Lower, less stable correlations (0.23–0.37), likely reflecting domain mismatch with the test data.

Inter-metric agreement between P@1 and CSLS is exceptionally high ($\rho_{\text{P@1,CSLS}} = 0.97$ across 396 language–model combinations), indicating near-interchangeability in practical source-selection scenarios.

Simpson’s Paradox is observed: if correlations are computed across heterogeneous models, the correlation sign can reverse (e.g., the cosine-gap correlation with NER F$_1$ flips from strongly positive within each model to negative when models are pooled). A plausible implication is that retrieval-based metric evaluations must be restricted to per-model analysis for reliable conclusions.
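The pooling pitfall can be reproduced with synthetic numbers: two model groups, each with a positive within-group metric-vs-performance trend, can yield a negative pooled correlation (Pearson here for simplicity; all values are invented for illustration):

```python
import numpy as np

# Two "models", each showing a positive metric-vs-F1 trend within its own group
metric_a, f1_a = np.array([1.0, 2.0, 3.0]), np.array([10.0, 11.0, 12.0])
metric_b, f1_b = np.array([11.0, 12.0, 13.0]), np.array([0.0, 1.0, 2.0])

within_a = np.corrcoef(metric_a, f1_a)[0, 1]   # positive within group A
within_b = np.corrcoef(metric_b, f1_b)[0, 1]   # positive within group B
pooled = np.corrcoef(np.r_[metric_a, metric_b], np.r_[f1_a, f1_b])[0, 1]  # sign flips
```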

5. Contextual and Practical Considerations

Retrieval-based metrics directly evaluate cross-lingual alignment by testing if gold parallel pairs are each other's nearest neighbors, providing an explicit test of embedding-space quality for transfer tasks. Robust alignment as measured by high P@1 or CSLS is almost always associated with improved zero-shot discriminative task transfer (NER, POS).

Domain specificity is critical: metrics computed with formal, in-domain evaluation data (FLORES-200) do not consistently predict transfer in out-of-domain settings (e.g., social-media sentiment), especially with small $n$. Practitioners should restrict metric application to parallel data closely matching the intended deployment scenario (Idris et al., 6 Jan 2026).

Practically, P@1 and CSLS can be used to filter candidate source languages or model checkpoints for target tasks. Given their near-perfect redundancy, the choice between them may be guided by computational simplicity (P@1) versus explicit hubness correction (CSLS), with neighborhood size $k = 10$ recommended for CSLS.
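A first-pass filter over candidate source languages then reduces to sorting by the chosen metric; the language codes and scores below are purely illustrative:

```python
def top_sources(scores: dict, k: int = 3) -> list:
    """Keep the k candidate source languages with the highest alignment scores."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

candidates = {"swa": 0.71, "hau": 0.64, "yor": 0.38, "ibo": 0.52, "amh": 0.47}
shortlist = top_sources(candidates)   # ['swa', 'hau', 'ibo']
```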

6. Comparative and Hybrid Approaches

For sources where embedding-based metrics underperform (e.g., AfroXLM-R, mean $\rho = 0.22$ for sentiment), typological distance-based metrics such as those from URIEL may provide superior predictive power (e.g., POS: URIEL $|\rho| = 0.65$ vs. cosine gap $|\rho| = 0.49$). Optimal source selection may thus require hybrid ranking strategies leveraging both retrieval-based and typology-driven signals.

Concrete source-selection evidence confirms the effectiveness of retrieval-based metrics: for NER with AfriBERTa, the best source language is ranked top-1 by the cosine gap 50% of the time (vs. 9% at random) and top-3 83% of the time; P@1 and CSLS yield nearly identical performance.
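One simple hybrid strategy, sketched here under the assumption that plain rank aggregation suffices (the paper does not prescribe a specific combination rule), averages each language's rank under the retrieval metric (higher is better) and under typological distance (lower is better); all scores are invented:

```python
def hybrid_rank(retrieval: dict, typo_dist: dict) -> list:
    """Order candidate sources by the sum of their two rank positions
    (retrieval score: higher is better; typological distance: lower is better)."""
    langs = list(retrieval)
    r_rank = {l: i for i, l in enumerate(sorted(langs, key=retrieval.get, reverse=True))}
    t_rank = {l: i for i, l in enumerate(sorted(langs, key=typo_dist.get))}
    return sorted(langs, key=lambda l: r_rank[l] + t_rank[l])
```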

7. Recommendations and Limitations

Retrieval-based metrics offer a transparent, model-aware, and data-specific mechanism for source ranking in cross-lingual transfer, typically providing moderate, model-specific predictiveness (mean $\rho \approx 0.40$ for NER and POS). Their main limitations are their reliance on in-domain parallel data and susceptibility to Simpson’s Paradox in pooled analyses. Routine best practice is to validate all metric-task correlations per model, supplementing with typological distance measures where embedding-space similarity is less predictive. These metrics are particularly recommended as a first-pass filter under low-resource conditions, provided that in-domain data and model-specific evaluation are strictly observed (Idris et al., 6 Jan 2026).

References (1)
