Under-Recognition of Target-Language Entities Beyond Mention Statistics

Determine why large language models under-recognize entities when queried in target languages, regardless of those entities' multilingual mention statistics in the pretraining data, and identify which properties beyond mention multiplicity govern confidence in the source language, including the roles of knowledge consistency and duplication.

Background

The paper identifies entities as a hotspot for cross-lingual factuality gaps and shows that substituting source-language entities in target queries (SBET) recovers a large fraction of the gap, suggesting that entity handling is central to the problem.
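A minimal sketch of the substitution idea, under assumptions: the entity dictionary, helper name, and example query below are illustrative and do not reproduce the paper's actual SBET pipeline.

```python
# Hypothetical sketch of SBET-style entity substitution: swap the
# target-language surface form of an entity for its source-language
# (e.g., English) form inside a target-language query before prompting
# the model. The entity map and query are made-up examples.

# Maps a target-language entity mention to its source-language form.
ENTITY_MAP = {
    "ミシガン湖": "Lake Michigan",  # Japanese -> English (illustrative)
    "ミシガン州": "Michigan",
}


def substitute_entities(query: str, entity_map: dict[str, str]) -> str:
    """Return the query with known target-language entity mentions
    replaced by their source-language surface forms."""
    for target_form, source_form in entity_map.items():
        query = query.replace(target_form, source_form)
    return query


if __name__ == "__main__":
    target_query = "ミシガン湖はどの州にありますか？"  # "Which state is Lake Michigan in?"
    print(substitute_entities(target_query, ENTITY_MAP))
    # -> "Lake Michiganはどの州にありますか？"
```

The substituted query is then sent to the model in place of the original, and any accuracy recovered relative to the unmodified target-language query is attributed to entity handling.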

The authors further observe that an entity's multilingual web mentions correlate poorly with its multilingual accuracy, indicating that mention frequency alone does not explain recognition or confidence and motivating investigation into other governing factors.
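For intuition, the kind of check behind that observation can be sketched as a rank correlation between per-entity mention counts and per-entity multilingual accuracy. The numbers below are invented for illustration and are not the paper's data.

```python
# Illustrative check of whether multilingual mention counts predict
# multilingual accuracy: compute a rank correlation across entities.
from scipy.stats import spearmanr

# Per-entity multilingual web-mention counts (hypothetical).
mention_counts = [1_200_000, 45_000, 3_000_000, 800, 90_000]

# Per-entity accuracy when the same facts are queried in target
# languages (hypothetical).
multilingual_accuracy = [0.42, 0.55, 0.40, 0.48, 0.51]

rho, p_value = spearmanr(mention_counts, multilingual_accuracy)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")
# A rho near zero would mirror the reported finding that mention
# frequency alone does not explain recognition or confidence.
```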

References

But it is unclear why entities in a target language are under-recognized irrespective of their mention statistics in the pretraining data. If not their multilingual multiplicity in pretraining data, what then determines confidence in source? Is knowledge consistency or duplication important?

Rethinking Cross-lingual Gaps from a Statistical Viewpoint (Piratla et al., 17 Oct 2025; arXiv:2510.15551), Appendix "What determines the variance of responses?", subsection "(Multilingual) Popularity of Entities is uncorrelated with multilingual accuracy."