Retrieval-Based Metrics: P@1 & CSLS
- Retrieval-based metrics are evaluation methods that measure the alignment of cross-lingual embeddings by assessing if true parallel pairs are each other’s nearest neighbors.
- They employ cosine similarity and local scaling to adjust for hubness, enabling precise ranking of source languages for tasks like NER and POS tagging.
- Empirical studies show that both P@1 and CSLS correlate moderately (ρ ≈ 0.40) with downstream task performance, aiding in informed model and checkpoint selection.
Retrieval-based metrics, specifically Precision-at-1 (P@1) and Cross-domain Similarity Local Scaling (CSLS), are core evaluation methods for quantifying alignment between cross-lingual embedding spaces. Directly measuring whether true parallel sentence pairs are each other’s nearest neighbors in a shared space, these metrics are pivotal in assessing transferable representation quality for cross-lingual NLP, especially in low-resource settings. Their simple, retrieval-driven design allows practitioners to rank source languages for transfer learning, with proven effectiveness for downstream tasks such as Named Entity Recognition (NER) and Part-of-Speech (POS) tagging (Idris et al., 6 Jan 2026).
1. Formal Definitions: Precision-at-1 and CSLS
Given $X = \{x_i\}_{i=1}^{N}$ and $Y = \{y_i\}_{i=1}^{N}$ as sets of L₂-normalized sentence embeddings for source and target languages, with $S = XY^\top$ (entries $S_{ij} = \cos(x_i, y_j)$) representing the cosine similarity matrix:
- Precision at 1 (P@1):
For each $x_i$, if its nearest neighbor in $Y$ (by cosine similarity) is $y_i$ (i.e., its gold translation), count it as correct. The metric is:
$$\mathrm{P@1} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\left[\arg\max_j S_{ij} = i\right],$$
where $\mathbb{1}[\cdot]$ is the indicator function.
- Cross-domain Similarity Local Scaling (CSLS):
Designed to correct “hubness,” CSLS rescales neighbor similarities. For each $x_i$, define
$$r_T(x_i) = \frac{1}{k}\sum_{y \in \mathcal{N}_T(x_i)} \cos(x_i, y),$$
with $\mathcal{N}_T(x_i)$ the $k$-nearest neighbors of $x_i$ in $Y$. Analogously, $r_S(y_j)$ is defined for each $y_j$ over $X$. Then
$$\mathrm{CSLS}(x_i, y_j) = 2\cos(x_i, y_j) - r_T(x_i) - r_S(y_j).$$
The mean CSLS across all gold pairs $(x_i, y_i)$ is reported.
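Both definitions can be computed directly from the similarity matrix of parallel pairs. A minimal NumPy sketch (synthetic embeddings stand in for real encoder output; the noise level is arbitrary):

```python
import numpy as np

def p_at_1(S: np.ndarray) -> float:
    """Fraction of source sentences whose nearest target neighbor is the
    gold translation (the diagonal of S for parallel data)."""
    return float(np.mean(S.argmax(axis=1) == np.arange(S.shape[0])))

def csls_matrix(S: np.ndarray, k: int = 10) -> np.ndarray:
    """CSLS(x_i, y_j) = 2*S_ij - r_T(x_i) - r_S(y_j)."""
    # r_T: mean similarity of each source vector to its k nearest targets
    r_T = np.sort(S, axis=1)[:, -k:].mean(axis=1)   # shape (N,)
    # r_S: mean similarity of each target vector to its k nearest sources
    r_S = np.sort(S, axis=0)[-k:, :].mean(axis=0)   # shape (N,)
    return 2 * S - r_T[:, None] - r_S[None, :]

# Synthetic "aligned" spaces: targets are noisy copies of the sources.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 64))
X /= np.linalg.norm(X, axis=1, keepdims=True)
Y = X + 0.1 * rng.normal(size=X.shape)
Y /= np.linalg.norm(Y, axis=1, keepdims=True)

S = X @ Y.T                                  # cosine similarities
print(p_at_1(S))                             # in [0, 1]; high when aligned
print(float(np.diag(csls_matrix(S)).mean())) # mean CSLS over gold pairs
```

Because the embeddings are L₂-normalized, the matrix product `X @ Y.T` already contains cosine similarities, so no separate normalization step is needed inside the metric functions.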
2. Computation and Implementation
The standard evaluation pipeline involves the following steps:
- Embedding extraction: Compute mean-pooled, final-layer, L₂-normalized token vectors using a pretrained encoder for each FLORES-200 sentence.
- Similarity matrix: For $N$ parallel sentence pairs, create $S = XY^\top$ by exhaustive dot products.
- P@1 calculation: For each $x_i$, identify $j^* = \arg\max_j S_{ij}$; calculate the proportion of pairs where $j^* = i$.
- CSLS calculation: For each $x_i$, average the top-$k$ target similarities to get $r_T(x_i)$; likewise $r_S(y_j)$ for each $y_j$. For each gold pair $(x_i, y_i)$, compute $\mathrm{CSLS}(x_i, y_i)$ and average.
This approach is feasible for datasets of this scale and aligns with experimental procedures used in recent cross-lingual transfer research (Idris et al., 6 Jan 2026).
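The first step of the pipeline, mean pooling followed by L₂ normalization, reduces to a masked average. In the sketch below, random arrays stand in for an encoder's final-layer token vectors, and the masking convention (1 = real token, 0 = padding) is an assumption of this sketch:

```python
import numpy as np

def mean_pool_l2(token_embs: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Masked mean pooling over final-layer token vectors, then L2
    normalization, so downstream dot products equal cosine similarities."""
    summed = (token_embs * mask[:, :, None]).sum(axis=1)       # (batch, dim)
    counts = np.clip(mask.sum(axis=1, keepdims=True), 1, None)  # avoid /0
    pooled = summed / counts
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

# Stand-in for encoder output: 2 sentences, 5 token slots, dim 8;
# the second sentence has 2 padding positions.
rng = np.random.default_rng(0)
toks = rng.normal(size=(2, 5, 8))
mask = np.array([[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]], dtype=float)
emb = mean_pool_l2(toks, mask)
print(np.linalg.norm(emb, axis=1))  # each row has unit norm
```

The explicit mask matters: averaging over padding positions would dilute short sentences' representations and distort the similarity matrix.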
3. Empirical Evaluation in Cross-Lingual Transfer
In systematic experiments involving 816 zero-shot transfer evaluations across 12 African languages, three multilingual pretrained models (AfriBERTa, AfroXLM-R, Serengeti), and three tasks (NER, POS, sentiment):
- Moderate predictive power: Both P@1 and CSLS achieve a mean Spearman rank correlation of $\rho \approx 0.40$ with downstream task performance (F1 or overall accuracy).
- Superiority over CKA: Structural Centered Kernel Alignment (CKA) yields a much weaker correlation ($\rho \approx 0.10$).
- Comparable to cosine gap: Retrieval-based metrics are on par with, but largely complementary to, the cosine gap.
Table 1: Overall Metric Correlations (NER/POS/Sentiment, models × tasks)
| Metric | Mean ρ | Std | Significant Cases |
|---|---|---|---|
| P@1 | 0.40 | 0.14 | 7/9 |
| CSLS | 0.40 | 0.14 | 6/9 |
| CKA | 0.10 | 0.18 | 2/9 |
4. Task and Model-Specific Insights
The predictive strength of P@1 and CSLS varies by task and model:
- NER: Correlations in the range 0.20–0.58, with top performance observed for AfriBERTa and Serengeti (P@1 up to 0.56, CSLS up to 0.58).
- POS: Consistent correlations (0.41–0.53) across all models.
- Sentiment: Lower, less stable correlations (0.23–0.37), likely reflecting domain mismatch with the test data.
Inter-metric agreement between P@1 and CSLS is exceptionally high across 396 language-model combinations, indicating near-interchangeability in practical source-selection scenarios.
Simpson’s Paradox is also observed: if correlations are computed across heterogeneous models, correlation signs can reverse (e.g., the cosine gap's correlation with NER F1 flips from strongly positive within-model to negative when models are pooled). A plausible implication is that retrieval-based metric evaluations should be restricted to per-model analysis for reliable conclusions.
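The pooling effect behind Simpson's Paradox is easy to reproduce on synthetic data. In the sketch below (all numbers invented), two simulated "models" each show a positive within-model correlation between an alignment metric and F1, yet the pooled correlation is negative:

```python
import numpy as np
from scipy.stats import spearmanr  # assumes SciPy is installed

rng = np.random.default_rng(1)
metric, f1, model = [], [], []
# Two synthetic "models": within each, higher alignment -> higher F1,
# but the model with higher metric values has lower F1 overall.
for g, (m0, f0) in enumerate([(0.2, 0.8), (0.7, 0.4)]):
    x = m0 + 0.1 * rng.random(20)                  # metric values
    y = f0 + 0.5 * (x - m0) + 0.01 * rng.normal(size=20)  # F1 values
    metric += list(x); f1 += list(y); model += [g] * 20

metric, f1, model = map(np.array, (metric, f1, model))
within = [spearmanr(metric[model == g], f1[model == g])[0] for g in (0, 1)]
pooled = spearmanr(metric, f1)[0]
print(within)   # both positive
print(pooled)   # negative: sign reverses when models are pooled
```

This is why per-model analysis is essential: pooling conflates between-model offsets with the within-model relationship that actually matters for source selection.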
5. Contextual and Practical Considerations
Retrieval-based metrics directly evaluate cross-lingual alignment by testing if gold parallel pairs are each other's nearest neighbors, providing an explicit test of embedding-space quality for transfer tasks. Robust alignment as measured by high P@1 or CSLS is almost always associated with improved zero-shot discriminative task transfer (NER, POS).
Domain specificity is critical: metrics computed with formal, in-domain evaluation data (FLORES-200) do not consistently predict transfer in out-of-domain (e.g., social-media sentiment) settings, especially with small evaluation sample sizes. Practitioners should restrict metric application to parallel data closely matching the intended deployment scenario (Idris et al., 6 Jan 2026).
Practically, P@1 and CSLS can be used to filter candidate source languages or model checkpoints for target tasks. Given their near-perfect redundancy, the choice between them may be guided by computational simplicity (P@1) versus explicit hubness correction (CSLS), with a small fixed neighborhood size $k$ (10 is the conventional default in the CSLS literature) recommended for CSLS.
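Such a first-pass filter reduces to sorting candidate sources by their alignment score. A sketch with hypothetical P@1 values (the language codes and scores are illustrative, not taken from the paper):

```python
# Hypothetical per-source P@1 scores against the target language
# (illustrative values only): keep a top-k shortlist for fine-tuning.
p1_scores = {"swa": 0.62, "hau": 0.48, "yor": 0.31, "amh": 0.27, "ibo": 0.44}

def shortlist(scores: dict, k: int = 3) -> list:
    """Rank candidate sources by alignment score and keep the top k."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(shortlist(p1_scores))  # ['swa', 'hau', 'ibo']
```

Only the shortlisted sources then need full fine-tuning runs, which is where the practical savings come from in low-resource settings.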
6. Comparative and Hybrid Approaches
For settings where embedding-based metrics underperform (e.g., AfroXLM-R on sentiment), typological distance-based metrics such as those from URIEL may provide superior predictive power, notably for POS tagging. Optimal source selection may thus require hybrid ranking strategies leveraging both retrieval-based and typology-driven signals.
Concrete source-selection evidence confirms the effectiveness of retrieval-based metrics: for NER using AfriBERTa, the best source is ranked top-1 by cosine gap 50% of the time (cf. 9% random) and top-3 83% of the time; P@1 and CSLS yield nearly identical performance.
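A top-1/top-3 check of this kind can be scripted in a few lines; the scores below are invented for illustration (`best_source_rank` is a hypothetical helper, not from the paper):

```python
def best_source_rank(metric_scores: dict, downstream_f1: dict) -> int:
    """1-based rank that the metric assigns to the truly best source
    (the language with the highest downstream F1)."""
    true_best = max(downstream_f1, key=downstream_f1.get)
    ranking = sorted(metric_scores, key=metric_scores.get, reverse=True)
    return ranking.index(true_best) + 1

# Illustrative numbers: the metric ranks the true best source 2nd,
# so this counts as a top-3 hit but not a top-1 hit.
metric = {"swa": 0.62, "hau": 0.70, "yor": 0.31}
f1 = {"swa": 0.81, "hau": 0.74, "yor": 0.60}
rank = best_source_rank(metric, f1)
print(rank <= 1, rank <= 3)  # False True
```

Averaging such hit indicators over many target languages yields exactly the top-1/top-3 percentages reported above.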
7. Recommendations and Limitations
Retrieval-based metrics offer a transparent, model-aware, and data-specific mechanism for source ranking in cross-lingual transfer, typically providing moderate, model-specific predictiveness (mean $\rho \approx 0.40$, strongest for NER and POS). Their main limitations are their reliance on in-domain parallel data and susceptibility to Simpson’s Paradox in pooled analyses. Routine best practice is to validate all metric-task correlations per model, supplementing with typological distance measures where embedding-space similarity is less predictive. These metrics are particularly recommended as a first-pass filter under low-resource conditions, provided that in-domain data and model-specific evaluation are strictly observed (Idris et al., 6 Jan 2026).