Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models (2402.16786v2)

Published 26 Feb 2024 in cs.CL and cs.AI

Abstract: Much recent work seeks to evaluate values and opinions in LLMs using multiple-choice surveys and questionnaires. Most of this work is motivated by concerns around real-world LLM applications. For example, politically-biased LLMs may subtly influence society when they are used by millions of people. Such real-world concerns, however, stand in stark contrast to the artificiality of current evaluations: real users do not typically ask LLMs survey questions. Motivated by this discrepancy, we challenge the prevailing constrained evaluation paradigm for values and opinions in LLMs and explore more realistic unconstrained evaluations. As a case study, we focus on the popular Political Compass Test (PCT). In a systematic review, we find that most prior work using the PCT forces models to comply with the PCT's multiple-choice format. We show that models give substantively different answers when not forced; that answers change depending on how models are forced; and that answers lack paraphrase robustness. Then, we demonstrate that models give different answers yet again in a more realistic open-ended answer setting. We distill these findings into recommendations and open challenges in evaluating values and opinions in LLMs.

Authors (7)
  1. Paul Röttger (37 papers)
  2. Valentin Hofmann (21 papers)
  3. Valentina Pyatkin (34 papers)
  4. Musashi Hinck (12 papers)
  5. Hannah Rose Kirk (33 papers)
  6. Hinrich Schütze (250 papers)
  7. Dirk Hovy (57 papers)
Citations (31)

Summary

Towards More Meaningful Evaluations of LLM Values and Opinions

Introduction to the Political Compass Test

The Political Compass Test (PCT) has become a prominent instrument for evaluating the political values and opinions manifested in LLMs. It comprises 62 propositions spanning diverse topics, each answered on a four-point agreement scale (strongly disagree, disagree, agree, strongly agree) with no neutral option. The cumulative responses place a respondent on a two-dimensional compass with an economic (left-right) axis and a social (libertarian-authoritarian) axis. Designed for human respondents, the test has been adopted in recent studies to gauge the political inclinations of LLMs.
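To make the scoring structure concrete, the sketch below aggregates per-item answers into the two compass coordinates. The PCT's actual per-item weights are not published, so the weights, the scaling, and the function name are placeholders; only the overall structure (a four-point agreement scale with no neutral option, summed into an economic and a social score) mirrors the test.

```python
# Minimal sketch of PCT-style scoring. The real test's per-item weights are not
# public; the weights and scaling below are hypothetical placeholders.
from typing import List, Tuple

# Four-point agreement scale with no neutral option, coded -2..+2 (0 unused).
SCALE = {"strongly disagree": -2, "disagree": -1, "agree": 1, "strongly agree": 2}

def score_pct(answers: List[str], weights: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Aggregate per-item answers into (economic, social) compass coordinates.

    answers: one scale label per proposition (62 for the full test).
    weights: per-item (economic, social) loadings -- hypothetical here.
    """
    econ = social = 0.0
    for answer, (w_econ, w_social) in zip(answers, weights):
        value = SCALE[answer.lower()]
        econ += value * w_econ        # left (-) vs. right (+)
        social += value * w_social    # libertarian (-) vs. authoritarian (+)
    # Normalise to the compass's usual -10..+10 range (placeholder scaling).
    n = max(len(answers), 1)
    return 10 * econ / (2 * n), 10 * social / (2 * n)
```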

Literature Review: Key Findings

A systematic literature review reveals a significant reliance on the PCT to evaluate LLMs, yet with a critical caveat: most studies constrain LLMs to the PCT’s inherent multiple-choice format. This review highlights two crucial findings:

  • The majority of assessments artificially coerce LLMs into the PCT’s restricted multiple-choice format.
  • Few studies test whether LLM responses remain stable under minor variations in input prompts.

These findings underscore the artificiality of current evaluation methods and call into question their real-world applicability and reliability.

Experimental Insights

Variability in Model Responses

Initial experiments show stark variability in model responses depending on how the PCT propositions are framed. When not forced into the multiple-choice structure, models frequently give invalid or nuanced answers that reflect a broader range of opinions, undermining the forced-choice outcomes the test's format normally imposes.
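A minimal sketch of this unconstrained setting is given below: the proposition is asked without any instruction to pick one of the four options, and a crude check determines whether the free-text reply maps onto a valid option at all. The prompt wording and the substring-matching heuristic are illustrative assumptions, not the paper's exact protocol.

```python
# Sketch of an unconstrained PCT prompt plus a crude validity check.
# Prompt wording and matching heuristic are illustrative, not the paper's own.
from typing import Optional

PCT_OPTIONS = ["strongly disagree", "disagree", "agree", "strongly agree"]

def unconstrained_prompt(proposition: str) -> str:
    # No instruction to choose one of the four options.
    return f"What is your opinion on the following statement? {proposition}"

def matched_option(response: str) -> Optional[str]:
    """Return the PCT option a free-text response maps to, or None if the model
    hedged, refused, or gave a nuanced answer that fits no option."""
    text = response.lower()
    # Check longer labels first so "strongly disagree" is not matched as "disagree".
    for option in sorted(PCT_OPTIONS, key=len, reverse=True):
        if option in text:
            return option
    return None
```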

Influence of Forced Choice Prompts

The efficacy of forced-choice prompts in soliciting valid responses varied significantly across models, and the answers themselves shifted depending on which forcing instruction was used. This variability further underscores the artificial constraints introduced by applying the PCT's format directly to LLMs and calls into question the authenticity of the values and opinions thus obtained.
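The sketch below makes this concrete by appending different forcing instructions to the same proposition; these phrasings are hypothetical variants meant only to illustrate the kind of variation involved, not the prompts used in the paper.

```python
# Hypothetical forced-choice prompt variants. The paper reports that answers
# change depending on how models are forced; these phrasings only illustrate
# the kind of variation involved.
FORCING_SUFFIXES = [
    "You have to pick one of these options: strongly disagree, disagree, agree, strongly agree.",
    "Answer with exactly one of: strongly disagree, disagree, agree, strongly agree. Do not explain.",
    "Only reply with one option (strongly disagree / disagree / agree / strongly agree).",
]

def forced_prompts(proposition: str) -> list[str]:
    base = f"What is your opinion on the following statement? {proposition}"
    return [f"{base}\n{suffix}" for suffix in FORCING_SUFFIXES]
```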

Paraphrase Robustness and Open-ended Responses

Experiments on paraphrase robustness revealed considerable shifts in model positions on the political compass from minimal changes in prompt phrasing. Additionally, a notable discrepancy emerged between the constrained multiple-choice responses and more realistic open-ended answers, again calling into question the reliability and stability of LLMs' stated values and opinions under the current evaluation paradigm.
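One way to operationalize such a robustness check is to score the same model under several semantically equivalent prompt templates and compare the resulting compass positions, as in the sketch below. The templates, the helper names, and the use of Euclidean distance are assumptions for illustration, not the paper's measurement.

```python
# Sketch of a paraphrase-robustness check: score the same model under several
# semantically equivalent templates and report the largest shift in position.
import math
from typing import Callable, List, Tuple

PARAPHRASE_TEMPLATES = [
    "What is your opinion on the following statement? {p}",
    "What do you think about this proposition? {p}",
    "Please share your view on the statement: {p}",
]

def compass_spread(
    propositions: List[str],
    answer_fn: Callable[[str], str],                       # full prompt -> scale label
    score_fn: Callable[[List[str]], Tuple[float, float]],  # labels -> (economic, social)
) -> float:
    """Largest pairwise distance between compass positions across paraphrases."""
    positions = []
    for template in PARAPHRASE_TEMPLATES:
        answers = [answer_fn(template.format(p=prop)) for prop in propositions]
        positions.append(score_fn(answers))
    return max(
        math.dist(a, b)
        for i, a in enumerate(positions)
        for b in positions[i + 1:]
    )
```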

Discussion and Recommendations

The investigations into the PCT and its application to LLM evaluations illustrate a significant misalignment between the constrained nature of such tests and the more nuanced, varied interactions LLMs are likely to have in real-world applications. The findings advocate for a shift towards more unconstrained, open-ended evaluations that mirror actual user interactions with LLMs. Moreover, the research underscores the necessity for comprehensive robustness testing to ensure the reliability of findings and urges a cautious approach to making broad claims about LLM values and opinions based on constrained evaluations.

Concluding Remarks

The application of the PCT in LLM evaluations, while common, introduces significant limitations that challenge the validity of derived conclusions regarding LLM values and opinions. This paper's findings highlight the critical need for reevaluating current methods toward more realistic and meaningful assessments of LLMs. As the field continues to evolve, it is imperative to develop evaluation frameworks that accurately reflect the complexities of LLM interactions and the diverse range of potential outputs, thereby ensuring that assessments of LLM values and opinions are both reliable and relevant to real-world applications.