Under-Recognition of Target-Language Entities Beyond Mention Statistics
Ascertain why large language models under-recognize entities when queried in a target language, regardless of those entities’ multilingual mention statistics in the pretraining data. Determine which properties beyond mention multiplicity govern source-language confidence, including the roles of knowledge consistency and duplication.
References
But it is unclear why entities in a target language are under-recognized irrespective of their mention statistics in the pretraining data. If not their multilingual multiplicity in pretraining data, what then determines confidence in source? Is knowledge consistency or duplication important?
— Rethinking Cross-lingual Gaps from a Statistical Viewpoint
(2510.15551 - Piratla et al., 17 Oct 2025) in Appendix: What determines the variance of responses? — subsection “(Multilingual) Popularity of Entities is uncorrelated with multilingual accuracy.”