Causes of Increased Target-Language Response Variance

Investigate whether the increased response variance that large language models exhibit when answering knowledge-intensive questions in target languages is a coping mechanism that keeps the perplexity loss from escalating in the face of cross-language factual inconsistencies in the pretraining data. More broadly, rigorously identify the underlying causes of this variance.

Background

The paper argues that cross-lingual performance gaps on knowledge-intensive tasks are dominated by variance rather than bias, and presents empirical evidence that target-language responses exhibit higher variance without a systematic bias away from source-language answers.
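To make the distinction between variance and bias concrete, the following is a minimal sketch of how one might compare repeated model answers to the same question in a source and a target language. It is not the paper's estimator: the function names, the use of Gini impurity as a dispersion proxy, and the toy answer samples are all assumptions introduced here for illustration, and the answers are assumed to be already mapped into a shared label space.

```python
from collections import Counter


def answer_distribution(samples):
    """Empirical probability of each distinct answer among the samples."""
    counts = Counter(samples)
    total = len(samples)
    return {answer: count / total for answer, count in counts.items()}


def dispersion(samples):
    """Gini impurity, 1 - sum(p^2): 0 when every sample agrees.

    Assumption: used here as a simple stand-in for response variance.
    """
    dist = answer_distribution(samples)
    return 1.0 - sum(p * p for p in dist.values())


def mode_answer(samples):
    """Most frequent answer among the samples."""
    return Counter(samples).most_common(1)[0][0]


# Hypothetical toy samples, not data from the paper.
source = ["Paris"] * 9 + ["Lyon"]
target = ["Paris"] * 5 + ["Lyon"] * 3 + ["Marseille"] * 2

# Same mode in both settings suggests no systematic shift (bias) ...
same_mode = mode_answer(source) == mode_answer(target)
# ... while higher dispersion in the target suggests increased variance.
source_dispersion = dispersion(source)
target_dispersion = dispersion(target)
```

In this toy case the most frequent answer is the same in both settings, while the target samples are more spread out, which is the pattern the paragraph above describes as variance without bias.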

Despite establishing that variance increases in target settings, the authors note that the origins of this variance remain unexplained and hypothesize a potential mechanism tied to coping with inconsistencies across languages in pretraining data, calling for a dedicated analysis.

References

We demonstrated that variance of responses increases in target but it is unclear what led to it (Appendix). Is increased variance in target a coping mechanism of LLMs to keep perplexity loss from exploding due to cross-language factual inconsistencies in pretraining data? We leave such analysis also for future work.

Rethinking Cross-lingual Gaps from a Statistical Viewpoint (2510.15551 - Piratla et al., 17 Oct 2025), in Discussion, Future Work and Limitations