
A Comprehensive Study of Multilingual Confidence Estimation on Large Language Models (2402.13606v3)

Published 21 Feb 2024 in cs.CL

Abstract: The tendency of LLMs to generate hallucinations raises concerns about their reliability. Confidence estimates indicating how trustworthy a generation is therefore become essential. However, LLM confidence estimation in languages other than English remains underexplored. This paper addresses that gap by introducing a comprehensive investigation of Multilingual Confidence estimation (MlingConf) on LLMs, focusing on both language-agnostic (LA) and language-specific (LS) tasks to explore the performance and language-dominance effects of multilingual confidence estimation across tasks. The benchmark comprises four meticulously checked, human-evaluated, high-quality multilingual datasets for LA tasks and one for the LS task, tailored to the specific social, cultural, and geographical contexts of a language. Our experiments reveal that on LA tasks English exhibits notably stronger linguistic dominance in confidence estimation than other languages, whereas on LS tasks, prompting LLMs in the language of the question exhibits stronger linguistic dominance in multilingual confidence estimation. These phenomena motivate a simple yet effective native-tone prompting strategy that employs language-specific prompts for LS tasks, effectively improving LLMs' reliability and accuracy on LS tasks.
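The native-tone prompting strategy described above can be sketched in a few lines: for a language-specific question, select a prompt template written in the question's own language and elicit a verbalized confidence score alongside the answer. This is a minimal illustration, not the paper's implementation; the template wordings, the `build_native_tone_prompt` and `parse_verbalized_confidence` helpers, and the English fallback are all assumptions made for the sketch.

```python
import re

# Illustrative templates asking the model to answer and state a
# confidence in [0, 1], phrased in the question's native language.
# The exact wordings here are hypothetical, not taken from the paper.
CONFIDENCE_TEMPLATES = {
    "en": "Answer the question and give your confidence as a number between 0 and 1: {q}",
    "zh": "请回答问题，并给出0到1之间的置信度：{q}",
    "ja": "質問に答え、0から1の間の信頼度を示してください：{q}",
}

def build_native_tone_prompt(question: str, lang: str) -> str:
    """Pick the prompt template matching the question's language,
    falling back to English for unsupported languages."""
    template = CONFIDENCE_TEMPLATES.get(lang, CONFIDENCE_TEMPLATES["en"])
    return template.format(q=question)

def parse_verbalized_confidence(response: str, default: float = 0.5) -> float:
    """Extract the last number in [0, 1] from the model's free-text
    response; return a neutral default if none is found."""
    candidates = [float(m) for m in re.findall(r"\d*\.?\d+", response)]
    in_range = [c for c in candidates if 0.0 <= c <= 1.0]
    return in_range[-1] if in_range else default
```

For example, an LS question about a Japanese landmark would be wrapped with the `"ja"` template before being sent to the model, and the model's verbalized score would then be parsed from its reply with `parse_verbalized_confidence`.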

Authors (8)
  1. Boyang Xue (23 papers)
  2. Hongru Wang (62 papers)
  3. Rui Wang (996 papers)
  4. Sheng Wang (239 papers)
  5. Kam-Fai Wong (92 papers)
  6. Zezhong Wang (30 papers)
  7. Yiming Du (13 papers)
  8. Bin Liang (115 papers)