Robust Knowledge Extraction from Large Language Models using Social Choice Theory (2312.14877v2)

Published 22 Dec 2023 in cs.CL and cs.AI

Abstract: Large Language Models (LLMs) can support a wide range of applications like conversational agents, creative writing or general query answering. However, they are ill-suited for query answering in high-stakes domains like medicine because they are typically not robust: even the same query can result in different answers when prompted multiple times. In order to improve the robustness of LLM queries, we propose repeating ranking queries and aggregating the results using methods from social choice theory. We study ranking queries in diagnostic settings like medical and fault diagnosis and discuss how the Partial Borda Choice function from the literature can be applied to merge multiple query results. We discuss some additional interesting properties in our setting and evaluate the robustness of our approach empirically.

Understanding Robustness in AI Language Comprehension

When it comes to AI, particularly LLMs, the ability to answer a wide variety of questions has enabled applications that seemed out of reach only a few years ago, from conversational assistants to support for creative writing. However, there is a caveat: the robustness of LLM outputs, particularly their consistency across repeated queries, can be problematic in domains where accuracy is paramount, such as medicine and engineering diagnostics.

The Challenge of Varying Answers

A key limitation of LLMs in applications such as conversational agents or query-answering systems is their inconsistency: the same inquiry can yield different results when prompted multiple times. The problem is compounded by the fact that LLMs, by design, generate answers regardless of whether they truly "understand" the topic, which can lead to what is known as "hallucinated" responses. Moreover, subtle variations in question phrasing or the inclusion of irrelevant information can skew the outcomes significantly. This variance casts doubt on the reliability of these models in situations where precision is vital.
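To make this inconsistency concrete, the sketch below issues the same ranking prompt several times to a chat model and prints each response. The client library, model name, and prompt are illustrative assumptions (the paper does not prescribe a specific API); with a nonzero sampling temperature, the returned rankings will typically differ from run to run.

```python
from openai import OpenAI  # illustrative client; any LLM API with sampling works

client = OpenAI()  # assumes an API key is configured in the environment

prompt = ("A patient reports chest pain and shortness of breath. "
          "List the three most likely causes, ordered from most to least likely.")

responses = []
for _ in range(5):
    completion = client.chat.completions.create(
        model="gpt-4o-mini",   # hypothetical model choice for this sketch
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,       # nonzero temperature: repeated runs can differ
    )
    responses.append(completion.choices[0].message.content)

# Even though the prompt is identical each time, the rankings usually vary.
for i, text in enumerate(responses, start=1):
    print(f"--- run {i} ---\n{text}\n")
```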

A Novel Approach to Consistency

To tackle this issue, the researchers propose a method inspired by social choice theory, the discipline that studies how individual preferences can be combined into a collective choice. Applied to LLMs, the idea is straightforward: pose the same ranking query multiple times and then use the Partial Borda Choice function from social choice theory to merge the repeated results into a single, more reliable answer. This function scores each candidate answer according to how often and how highly it appears across the repeated rankings, yielding a final ranking that represents a collective preference over all prompts. For instance, an answer that consistently appears near the top of the returned rankings will score higher, indicating stronger confidence than answers that show up only sporadically or in low positions.
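As a minimal sketch of this aggregation step, the snippet below merges several repeated rankings with a simple Borda-style count: each answer earns one point per answer ranked below it in each list, and answers missing from a list earn nothing from it. The scoring convention and the diagnostic answers are illustrative assumptions; the Partial Borda Choice function used in the paper handles partial rankings with a more precise definition.

```python
from collections import defaultdict

def borda_aggregate(rankings):
    """Aggregate several ranked answer lists into one consensus ranking.

    Each ranking is ordered from most to least likely and may be partial
    (it need not contain every answer). Simple convention: an answer earns
    one point per answer ranked below it in a list, and no points from
    lists in which it does not appear.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        n = len(ranking)
        for position, answer in enumerate(ranking):
            scores[answer] += n - 1 - position
    # Sort answers by total score, highest first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Example: five repetitions of the same diagnostic ranking query
# (answers are invented for illustration).
rankings = [
    ["angina", "reflux", "muscle strain"],
    ["reflux", "angina", "anxiety"],
    ["angina", "muscle strain", "reflux"],
    ["angina", "reflux"],
    ["reflux", "angina", "muscle strain"],
]

for answer, score in borda_aggregate(rankings):
    print(f"{answer}: {score}")
```

An answer that repeatedly lands near the top (here, "angina") accumulates the highest score, while sporadic answers such as "anxiety" end up at the bottom of the consensus ranking.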

Experimentation and Validation

The approach has been empirically tested in diagnostic settings, such as medical and technical fault diagnosis, where the causes of particular conditions or failures need to be determined. Here, a query describing a set of symptoms is issued multiple times to yield a variety of potential causes, which are then aggregated using the Partial Borda Choice function. The experimental results show that this method notably improves the robustness of answers to query repetition and minor syntactic changes, compared to relying on a single query response or on simpler aggregation strategies.
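Robustness here means that the aggregated ranking changes little when the query is repeated or rephrased. One way to quantify this, sketched below, is to compare the aggregated ranking for the original prompt with the one for a paraphrase using a standard rank-correlation coefficient such as Kendall's tau. The answer lists and the choice of Kendall's tau are illustrative assumptions rather than the paper's exact evaluation protocol, and the sketch assumes both rankings cover the same set of answers.

```python
from scipy.stats import kendalltau

def rank_positions(ranking, answers):
    """Map each answer to its position in the ranking (0 = top)."""
    position = {answer: i for i, answer in enumerate(ranking)}
    return [position[answer] for answer in answers]

# Aggregated rankings obtained from the original prompt and a paraphrase
# (invented answers for illustration).
original   = ["angina", "reflux", "muscle strain", "anxiety"]
paraphrase = ["angina", "muscle strain", "reflux", "anxiety"]

answers = sorted(set(original))  # common reference order for both rankings
tau, p_value = kendalltau(rank_positions(original, answers),
                          rank_positions(paraphrase, answers))

# tau close to 1 means the two rankings largely agree, i.e. the
# aggregated answer is robust to the paraphrase.
print(f"Kendall's tau = {tau:.2f} (p = {p_value:.2f})")
```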

The Importance of Data Quality and Model Tailoring

The method's effectiveness is influenced not just by the voting system it leverages, but also by the quality of the data on which the LLM is trained. While the technique shows promise even when the underlying model is trained on mixed-quality sources such as the open internet, its reliability would be further enhanced if it were applied to domain-specific models trained or fine-tuned on high-quality, peer-reviewed data. In practice, however, the financial and computational resources required for such specialization can be substantial.

Conclusion and Future Directions

This paper affirms the potential of social choice theory as a bridge to a more reliable AI-driven decision-making process. By aggregating answers and thus reducing unpredictable variance, LLMs can step closer to becoming dependable assistants in critical domains. Looking forward, expanding this research to counter other types of uncertainty—caused by injected noise and adversarial attacks—could bolster the robustness of LLMs even further. As LLMs continue to evolve, so too must the methods of interpretation and validation to ensure that they can serve as trusted resources in decision-making processes.

Authors (5)
  1. Nico Potyka (27 papers)
  2. Yuqicheng Zhu (12 papers)
  3. Yunjie He (8 papers)
  4. Evgeny Kharlamov (34 papers)
  5. Steffen Staab (78 papers)