Retrieval-Based Metrics: P@1 & CSLS
- Retrieval-based metrics are evaluation methods that measure the alignment of cross-lingual embeddings by assessing if true parallel pairs are each other’s nearest neighbors.
- They employ cosine similarity and local scaling to adjust for hubness, enabling precise ranking of source languages for tasks like NER and POS tagging.
- Empirical studies show that both P@1 and CSLS correlate moderately (ρ ≈ 0.40) with downstream task performance, aiding in informed model and checkpoint selection.
Retrieval-based metrics, specifically Precision-at-1 (P@1) and Cross-domain Similarity Local Scaling (CSLS), are core evaluation methods for quantifying alignment between cross-lingual embedding spaces. Directly measuring whether true parallel sentence pairs are each other’s nearest neighbors in a shared space, these metrics are pivotal in assessing transferable representation quality for cross-lingual NLP, especially in low-resource settings. Their simple, retrieval-driven design allows practitioners to rank source languages for transfer learning, with proven effectiveness for downstream tasks such as Named Entity Recognition (NER) and Part-of-Speech (POS) tagging (Idris et al., 6 Jan 2026).
1. Formal Definitions: Precision-at-1 and CSLS
Given $X = \{x_i\}_{i=1}^{N}$ and $Y = \{y_i\}_{i=1}^{N}$ as sets of L₂-normalized sentence embeddings for source and target languages, with $S = XY^\top$ (entries $S_{ij} = \cos(x_i, y_j)$) representing the cosine similarity matrix:
- Precision at 1 (P@1):
For each $x_i$, if its nearest neighbor in $Y$ (by cosine similarity) is $y_i$ (i.e., its gold translation), count it as correct. The metric is:
$$\mathrm{P@1} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\left[\arg\max_j S_{ij} = i\right],$$
where $\mathbb{1}[\cdot]$ is the indicator function.
- Cross-domain Similarity Local Scaling (CSLS):
Designed to correct “hubness,” CSLS rescales neighbor similarities. For each $x_i$, define
$$r_T(x_i) = \frac{1}{k}\sum_{y \in \mathcal{N}_T(x_i)} \cos(x_i, y),$$
with $\mathcal{N}_T(x_i)$ the $k$-nearest neighbors of $x_i$ in $Y$. Analogously, $r_S(y_j)$ is defined for each $y_j$ over $X$. Then
$$\mathrm{CSLS}(x_i, y_j) = 2\cos(x_i, y_j) - r_T(x_i) - r_S(y_j).$$
The mean CSLS across all gold pairs $(x_i, y_i)$ is reported.
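Both definitions can be computed directly from the similarity matrix of parallel pairs. A minimal NumPy sketch (synthetic embeddings stand in for real encoder output; the noise level is arbitrary):

```python
import numpy as np

def p_at_1(S: np.ndarray) -> float:
    """Fraction of source sentences whose nearest target neighbor is the
    gold translation (the diagonal of S for parallel data)."""
    return float(np.mean(S.argmax(axis=1) == np.arange(S.shape[0])))

def csls_matrix(S: np.ndarray, k: int = 10) -> np.ndarray:
    """CSLS(x_i, y_j) = 2*S_ij - r_T(x_i) - r_S(y_j)."""
    # r_T: mean similarity of each source vector to its k nearest targets
    r_T = np.sort(S, axis=1)[:, -k:].mean(axis=1)   # shape (N,)
    # r_S: mean similarity of each target vector to its k nearest sources
    r_S = np.sort(S, axis=0)[-k:, :].mean(axis=0)   # shape (N,)
    return 2 * S - r_T[:, None] - r_S[None, :]

# Synthetic "aligned" spaces: targets are noisy copies of the sources.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 64))
X /= np.linalg.norm(X, axis=1, keepdims=True)
Y = X + 0.1 * rng.normal(size=X.shape)
Y /= np.linalg.norm(Y, axis=1, keepdims=True)

S = X @ Y.T                                  # cosine similarities
print(p_at_1(S))                             # in [0, 1]; high when aligned
print(float(np.diag(csls_matrix(S)).mean())) # mean CSLS over gold pairs
```

Because the embeddings are L₂-normalized, the matrix product `X @ Y.T` already contains cosine similarities, so no separate normalization step is needed inside the metric functions.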
2. Computation and Implementation
The standard evaluation pipeline involves the following steps:
- Embedding extraction: Compute mean-pooled, final-layer, L₂-normalized token vectors using a pretrained encoder for each FLORES-200 sentence.
- Similarity matrix: For $N$ parallel sentence pairs, create $S = XY^\top$ by exhaustive dot products.
- P@1 calculation: For each $x_i$, identify $j^* = \arg\max_j S_{ij}$; calculate the proportion of pairs where $j^* = i$.
- CSLS calculation: For each $x_i$, average the top-$k$ target similarities to get $r_T(x_i)$; likewise $r_S(y_j)$ for each $y_j$. For each gold pair $(x_i, y_i)$, compute $\mathrm{CSLS}(x_i, y_i)$ and average.
This approach is feasible for datasets of this scale and aligns with experimental procedures used in recent cross-lingual transfer research (Idris et al., 6 Jan 2026).
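The first step of the pipeline, mean pooling followed by L₂ normalization, reduces to a masked average. In the sketch below, random arrays stand in for an encoder's final-layer token vectors, and the masking convention (1 = real token, 0 = padding) is an assumption of this sketch:

```python
import numpy as np

def mean_pool_l2(token_embs: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Masked mean pooling over final-layer token vectors, then L2
    normalization, so downstream dot products equal cosine similarities."""
    summed = (token_embs * mask[:, :, None]).sum(axis=1)       # (batch, dim)
    counts = np.clip(mask.sum(axis=1, keepdims=True), 1, None)  # avoid /0
    pooled = summed / counts
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

# Stand-in for encoder output: 2 sentences, 5 token slots, dim 8;
# the second sentence has 2 padding positions.
rng = np.random.default_rng(0)
toks = rng.normal(size=(2, 5, 8))
mask = np.array([[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]], dtype=float)
emb = mean_pool_l2(toks, mask)
print(np.linalg.norm(emb, axis=1))  # each row has unit norm
```

The explicit mask matters: averaging over padding positions would dilute short sentences' representations and distort the similarity matrix.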
3. Empirical Evaluation in Cross-Lingual Transfer
In systematic experiments involving 816 zero-shot transfer evaluations across 12 African languages, three multilingual pretrained models (AfriBERTa, AfroXLM-R, Serengeti), and three tasks (NER, POS, sentiment):
- Moderate predictive power: Both P@1 and CSLS achieve a mean Spearman rank correlation of $\rho \approx 0.40$ with downstream task performance (F1 or overall accuracy).
- Superiority over CKA: Structural Centered Kernel Alignment (CKA) yields a much weaker correlation ($\rho \approx 0.10$).
- Comparable to cosine gap: Retrieval-based metrics are on par with, but largely complementary to, the cosine gap.
Table 1: Overall Metric Correlations (NER/POS/Sentiment, models × tasks)
| Metric | Mean ρ | Std | Significant Cases |
|---|---|---|---|
| P@1 | 0.40 | 0.14 | 7/9 |
| CSLS | 0.40 | 0.14 | 6/9 |
| CKA | 0.10 | 0.18 | 2/9 |
4. Task and Model-Specific Insights
The predictive strength of P@1 and CSLS varies by task and model:
- NER: Correlations in the range 0.20–0.58, with top performance observed for AfriBERTa and Serengeti (P@1 up to 0.56, CSLS up to 0.58).
- POS: Consistent correlations (0.41–0.53) across all models.
- Sentiment: Lower, less stable correlations (0.23–0.37), likely reflecting domain mismatch with the test data.
Inter-metric agreement between P@1 and CSLS is exceptionally high across 396 language-model combinations, indicating near-interchangeability in practical source-selection scenarios.
Simpson’s Paradox is also observed: if correlations are computed across heterogeneous models, correlation signs can reverse (e.g., the cosine gap's correlation with NER F1 flips from strongly positive within-model to negative when models are pooled). A plausible implication is that retrieval-based metric evaluations should be restricted to per-model analysis for reliable conclusions.
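The pooling effect behind Simpson's Paradox is easy to reproduce on synthetic data. In the sketch below (all numbers invented), two simulated "models" each show a positive within-model correlation between an alignment metric and F1, yet the pooled correlation is negative:

```python
import numpy as np
from scipy.stats import spearmanr  # assumes SciPy is installed

rng = np.random.default_rng(1)
metric, f1, model = [], [], []
# Two synthetic "models": within each, higher alignment -> higher F1,
# but the model with higher metric values has lower F1 overall.
for g, (m0, f0) in enumerate([(0.2, 0.8), (0.7, 0.4)]):
    x = m0 + 0.1 * rng.random(20)                  # metric values
    y = f0 + 0.5 * (x - m0) + 0.01 * rng.normal(size=20)  # F1 values
    metric += list(x); f1 += list(y); model += [g] * 20

metric, f1, model = map(np.array, (metric, f1, model))
within = [spearmanr(metric[model == g], f1[model == g])[0] for g in (0, 1)]
pooled = spearmanr(metric, f1)[0]
print(within)   # both positive
print(pooled)   # negative: sign reverses when models are pooled
```

This is why per-model analysis is essential: pooling conflates between-model offsets with the within-model relationship that actually matters for source selection.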
5. Contextual and Practical Considerations
Retrieval-based metrics directly evaluate cross-lingual alignment by testing if gold parallel pairs are each other's nearest neighbors, providing an explicit test of embedding-space quality for transfer tasks. Robust alignment as measured by high P@1 or CSLS is almost always associated with improved zero-shot discriminative task transfer (NER, POS).
Domain specificity is critical: metrics computed with formal, in-domain evaluation data (FLORES-200) do not consistently predict transfer in out-of-domain (e.g., social-media sentiment) settings, especially with small evaluation sample sizes. Practitioners should restrict metric application to parallel data closely matching the intended deployment scenario (Idris et al., 6 Jan 2026).
Practically, P@1 and CSLS can be used to filter candidate source languages or model checkpoints for target tasks. Given their near-perfect redundancy, the choice between them may be guided by computational simplicity (P@1) versus explicit hubness correction (CSLS), with a small fixed neighborhood size $k$ (10 is the conventional default in the CSLS literature) recommended for CSLS.
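Such a first-pass filter reduces to sorting candidate sources by their alignment score. A sketch with hypothetical P@1 values (the language codes and scores are illustrative, not taken from the paper):

```python
# Hypothetical per-source P@1 scores against the target language
# (illustrative values only): keep a top-k shortlist for fine-tuning.
p1_scores = {"swa": 0.62, "hau": 0.48, "yor": 0.31, "amh": 0.27, "ibo": 0.44}

def shortlist(scores: dict, k: int = 3) -> list:
    """Rank candidate sources by alignment score and keep the top k."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(shortlist(p1_scores))  # ['swa', 'hau', 'ibo']
```

Only the shortlisted sources then need full fine-tuning runs, which is where the practical savings come from in low-resource settings.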
6. Comparative and Hybrid Approaches
For settings where embedding-based metrics underperform (e.g., AfroXLM-R on sentiment), typological distance-based metrics such as those from URIEL may provide superior predictive power, notably for POS tagging. Optimal source selection may thus require hybrid ranking strategies leveraging both retrieval-based and typology-driven signals.
Concrete source-selection evidence confirms the effectiveness of retrieval-based metrics: for NER using AfriBERTa, the best source is ranked top-1 by cosine gap 50% of the time (cf. 9% random) and top-3 83% of the time; P@1 and CSLS yield nearly identical performance.
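A top-1/top-3 check of this kind can be scripted in a few lines; the scores below are invented for illustration (`best_source_rank` is a hypothetical helper, not from the paper):

```python
def best_source_rank(metric_scores: dict, downstream_f1: dict) -> int:
    """1-based rank that the metric assigns to the truly best source
    (the language with the highest downstream F1)."""
    true_best = max(downstream_f1, key=downstream_f1.get)
    ranking = sorted(metric_scores, key=metric_scores.get, reverse=True)
    return ranking.index(true_best) + 1

# Illustrative numbers: the metric ranks the true best source 2nd,
# so this counts as a top-3 hit but not a top-1 hit.
metric = {"swa": 0.62, "hau": 0.70, "yor": 0.31}
f1 = {"swa": 0.81, "hau": 0.74, "yor": 0.60}
rank = best_source_rank(metric, f1)
print(rank <= 1, rank <= 3)  # False True
```

Averaging such hit indicators over many target languages yields exactly the top-1/top-3 percentages reported above.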
7. Recommendations and Limitations
Retrieval-based metrics offer a transparent, model-aware, and data-specific mechanism for source ranking in cross-lingual transfer, typically providing moderate, model-specific predictiveness (mean $\rho \approx 0.40$, strongest for NER and POS). Their main limitations are their reliance on in-domain parallel data and susceptibility to Simpson’s Paradox in pooled analyses. Routine best practice is to validate all metric-task correlations per model, supplementing with typological distance measures where embedding-space similarity is less predictive. These metrics are particularly recommended as a first-pass filter under low-resource conditions, provided that in-domain data and model-specific evaluation are strictly observed (Idris et al., 6 Jan 2026).