Explain the MSMARCO metric discrepancy for Jasper embeddings

Determine the cause of the MSMARCO evaluation discrepancy observed for the Jasper text embedding model (jasper_en_vision_language_v1), namely why Normalized Discounted Cumulative Gain (NDCG) and Mean Reciprocal Rank (MRR) are high while Mean Average Precision (MAP) is very low, given that the teacher models stella_en_1.5B_v5 and NV-Embed-v2 do not exhibit this behavior.
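
For reference, the three metrics follow the standard definitions used in TREC-style retrieval evaluation (the formulas below assume binary relevance for MRR and MAP and graded relevance for NDCG; exact cutoffs and normalization conventions vary by toolkit):

\[
\mathrm{MRR} = \frac{1}{|Q|}\sum_{q\in Q}\frac{1}{\mathrm{rank}_q},\qquad
\mathrm{MAP} = \frac{1}{|Q|}\sum_{q\in Q}\frac{1}{R_q}\sum_{i=1}^{n} P_q(i)\,\mathrm{rel}_q(i),\qquad
\mathrm{nDCG@}k = \frac{1}{|Q|}\sum_{q\in Q}\frac{\mathrm{DCG}_q@k}{\mathrm{IDCG}_q@k},
\]

where \(\mathrm{rank}_q\) is the rank of the first relevant document for query \(q\), \(R_q\) is the number of relevant documents for \(q\), \(P_q(i)\) is precision at rank \(i\), \(\mathrm{rel}_q(i)\) is the relevance of the document at rank \(i\), and \(\mathrm{DCG}_q@k=\sum_{i=1}^{k}\bigl(2^{\mathrm{rel}_q(i)}-1\bigr)/\log_2(i+1)\).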

Background

The paper reports that Jasper, an embedding model distilled from stella_en_1.5B_v5 and NV-Embed-v2, shows an anomalous pattern of scores on the MSMARCO benchmark: NDCG and MRR are strong, yet MAP is very low. The teacher models do not exhibit this inconsistency.

Understanding why MAP diverges from NDCG and MRR for Jasper is important for diagnosing potential issues in training, evaluation setup, or retrieval behavior (e.g., ranking calibration, score scaling, or index/query preprocessing) and for improving reliability across metrics.
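
To make the divergence concrete, here is a minimal, self-contained sketch (standard binary-relevance implementations of the three metrics in Python; the query, the document identifiers, and the scenario itself are hypothetical and not drawn from the Jasper evaluation) showing one way MRR@10 and NDCG@10 can both be 1.0 while average precision collapses: a run that returns far fewer candidates than there are judged-relevant documents. It illustrates how the metrics can come apart; it does not claim to explain the Jasper result.

import math

def mrr_at_k(run, relevant, k=10):
    # Reciprocal rank of the first relevant document within the top k.
    for i, doc_id in enumerate(run[:k], start=1):
        if doc_id in relevant:
            return 1.0 / i
    return 0.0

def ndcg_at_k(run, relevant, k=10):
    # Binary-relevance NDCG@k: DCG of the run over the ideal DCG at the same cutoff.
    dcg = sum(1.0 / math.log2(i + 1)
              for i, doc_id in enumerate(run[:k], start=1)
              if doc_id in relevant)
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, min(len(relevant), k) + 1))
    return dcg / idcg if idcg > 0 else 0.0

def average_precision(run, relevant):
    # trec_eval-style AP: precision summed at each relevant hit, divided by the
    # TOTAL number of judged-relevant documents, retrieved or not.
    hits, prec_sum = 0, 0.0
    for i, doc_id in enumerate(run, start=1):
        if doc_id in relevant:
            hits += 1
            prec_sum += hits / i
    return prec_sum / len(relevant) if relevant else 0.0

# Hypothetical single query: 100 judged-relevant documents, but the run
# surfaces only 10 candidates, all of them relevant.
relevant = {f"d{i}" for i in range(100)}
run = [f"d{i}" for i in range(10)]

print("MRR@10 :", mrr_at_k(run, relevant))           # 1.0
print("NDCG@10:", ndcg_at_k(run, relevant))          # 1.0
print("AP     :", average_precision(run, relevant))  # 0.1

Whether something analogous (for example, a truncated candidate list, mismatched qrels, or a different MAP normalization) occurred in the Jasper evaluation is precisely the open question.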

References

After releasing the Jasper model, an enthusiastic user (username raghavlite, https://huggingface.co/raghavlite) pointed out that the model's MSMARCO NDCG/MRR score is perfect while its MAP score is very low. Jasper is distilled from stella_en_1.5B_v5 and NV-Embed-v2, and the teachers' MSMARCO scores do not show this pattern. As of now, we still have not been able to figure out what happened.

Jasper and Stella: distillation of SOTA embedding models (arXiv:2412.19048, Zhang et al., 26 Dec 2024), Discussion, Subsection "The Inconsistency of MSMARCO Scores".