Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models (2402.16786v2)
Abstract: Much recent work seeks to evaluate values and opinions in LLMs using multiple-choice surveys and questionnaires. Most of this work is motivated by concerns around real-world LLM applications. For example, politically biased LLMs may subtly influence society when they are used by millions of people. Such real-world concerns, however, stand in stark contrast to the artificiality of current evaluations: real users do not typically ask LLMs survey questions. Motivated by this discrepancy, we challenge the prevailing constrained evaluation paradigm for values and opinions in LLMs and explore more realistic unconstrained evaluations. As a case study, we focus on the popular Political Compass Test (PCT). In a systematic review, we find that most prior work using the PCT forces models to comply with the PCT's multiple-choice format. We show that models give substantively different answers when not forced; that answers change depending on how models are forced; and that answers lack paraphrase robustness. Then, we demonstrate that models give different answers yet again in a more realistic open-ended answer setting. We distill these findings into recommendations and open challenges in evaluating values and opinions in LLMs.
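To make the contrast between the two evaluation paradigms concrete, here is a minimal sketch of a constrained (forced-choice) and an unconstrained (open-ended) query on a single PCT proposition. This is an illustration under assumptions, not the paper's actual evaluation code: the model name, prompt wording, and decoding settings are placeholders.

```python
# Minimal sketch: constrained (forced-choice) vs. unconstrained (open-ended)
# evaluation of a chat LLM on one Political Compass Test proposition.
# Assumptions: the model name, prompt templates, and decoding settings below
# are illustrative placeholders, not the paper's actual setup.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",  # assumed; any chat model works
)

statement = "The freer the market, the freer the people."  # a PCT proposition

# Constrained setting: force the model into the PCT's multiple-choice format.
forced_prompt = (
    f'Statement: "{statement}"\n'
    "Respond only with one of: Strongly disagree, Disagree, Agree, Strongly agree."
)

# Unconstrained setting: let the model answer freely, as a real user might ask.
open_prompt = f'What do you think about the following statement? "{statement}"'

for label, prompt in (("forced choice", forced_prompt), ("open-ended", open_prompt)):
    chat = [{"role": "user", "content": prompt}]
    result = generator(chat, max_new_tokens=128, do_sample=False)
    # For chat-format input, generated_text holds the conversation,
    # with the model's reply as the final message.
    print(f"--- {label} ---")
    print(result[0]["generated_text"][-1]["content"])
```

Note that the forced-choice instruction above is only one of several possible ways of forcing a model (e.g., different prompt templates or first-token probability readouts); per the abstract, both the choice of forcing and the paraphrase of the proposition can change the recorded answer.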
Authors: Paul Röttger, Valentin Hofmann, Valentina Pyatkin, Musashi Hinck, Hannah Rose Kirk, Hinrich Schütze, Dirk Hovy