Revealing Fine-Grained Values and Opinions in LLMs
This paper investigates the nuanced ways in which LLMs encode values and opinions, focusing on how these can be surfaced through politically and morally charged propositions. The authors conducted a comprehensive analysis of 156,240 responses from six LLMs to the 62 propositions of the Political Compass Test (PCT) across 420 prompt variations.
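The total follows directly from the experimental grid, since the PCT consists of 62 propositions; a quick arithmetic check:

```python
# Sanity check of the dataset size:
# 62 PCT propositions x 420 prompt variations x 6 models
propositions, variations, models = 62, 420, 6
assert propositions * variations * models == 156_240
```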
Goals and Methodology
The primary objectives of the paper were to (1) measure the impact of different demographic prompts on LLM responses, (2) assess the robustness of LLM stances between open-ended and closed-form responses, and (3) identify and analyze the tropes (semantically similar, recurring phrases) that LLMs use to justify their stances.
Dataset and Experimental Design
The dataset was generated by varying both demographics and instructions in the prompts. The demographic variables included Age, Gender, Nationality, Political Orientation, and Class, while the instructions alternated between open-ended and closed-form styles. This yielded a diverse, extensive set of responses that supported both coarse-grained categorical analyses and fine-grained text analyses.
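A minimal sketch of how such a prompt grid could be assembled follows; the specific demographic values and instruction templates here are illustrative placeholders, not the paper's exact wording:

```python
from itertools import product

# Placeholder demographic values; the paper uses more variables and levels.
demographics = {
    "political orientation": ["left-leaning", "right-leaning"],
    "age": ["30-year-old", "60-year-old"],
}
# Placeholder instruction templates for the two response styles.
instructions = {
    "closed": "Reply with one of: Strongly disagree, Disagree, Agree, Strongly agree.",
    "open": "State and justify your opinion in a short paragraph.",
}

def build_prompts(proposition):
    """Yield (style, prompt) pairs for every persona x instruction combination."""
    for persona in product(*demographics.values()):
        for style, instruction in instructions.items():
            yield style, (
                f"You are a {' '.join(persona)} person. {instruction}\n"
                f"Proposition: {proposition}"
            )

prompts = list(build_prompts("The rich are too highly taxed."))
print(len(prompts))  # 2 x 2 personas x 2 instruction styles = 8 variations
```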
For the categorical stances, each open-ended response was retrospectively mapped onto the closed-form response categories using a Mistral-Instruct-v0.2 model. The fine-grained analysis relied on semantic clustering: responses were split into sentences, embedded with S-BERT, and clustered with DBSCAN to surface recurring tropes.
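A sketch of this trope-mining pipeline, assuming the standard sentence-transformers and scikit-learn APIs; the model name and the DBSCAN parameters are illustrative choices, not the paper's exact settings:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import DBSCAN

responses = [
    "An equitable society requires equal opportunities. Taxes help fund that.",
    "Everyone deserves equal opportunities. Markets alone cannot guarantee them.",
]
# Naive sentence splitting keeps the sketch dependency-free.
sentences = [s.strip() for r in responses for s in r.split(".") if s.strip()]

# Embed each sentence; normalized embeddings make cosine distance well-behaved.
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(
    sentences, normalize_embeddings=True
)

# Sentences within cosine distance eps of enough neighbors form a cluster,
# i.e. a candidate trope; label -1 marks unclustered noise.
labels = DBSCAN(eps=0.3, min_samples=2, metric="cosine").fit_predict(embeddings)
for label, sentence in zip(labels, sentences):
    print(label, sentence)
```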
Findings and Analysis
Demographic Impact on LLM Stances
The research demonstrated that including demographic information in prompts significantly shifts LLM responses. Political orientation was the most influential variable, producing large shifts, while demographics such as age and nationality had weaker effects. Models differed in their susceptibility to these prompts: Llama 3 and Zephyr showed larger shifts in their political compass positions than Llama 2 and OLMo.
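One straightforward way to quantify such shifts, since each PCT run yields an (economic, social) coordinate, is the Euclidean distance between a persona-prompted position and the model's default position; the coordinates below are made-up examples, not the paper's results:

```python
import math

# Hypothetical compass positions: (economic, social) axes.
default = (-2.0, -3.1)  # model's position with no persona in the prompt
persona_positions = {
    "right-leaning": (1.5, -0.8),
    "left-leaning": (-4.2, -4.0),
    "60-year-old": (-2.3, -2.9),
}
for persona, point in persona_positions.items():
    print(f"{persona}: shift = {math.dist(default, point):.2f}")
```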
Robustness Between Open and Closed Responses
The paper revealed notable differences between closed-form and open-ended responses. In the open setting, models often defaulted to neutral answers or refused to take a strong stance, whereas in the closed setting they expressed explicit agreement or disagreement. The discrepancy was especially pronounced when models were prompted with right-leaning political orientations, pointing to systematic biases embedded in the LLMs.
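A hypothetical sketch of this robustness check: compare the stance label extracted from each open-ended response with the closed-form answer to the same prompt and report the agreement rate. All labels below are invented for illustration:

```python
from collections import Counter

closed = ["agree", "disagree", "agree", "strongly agree"]
open_ended = ["agree", "none/refusal", "agree", "agree"]

# Fraction of prompts where the two settings yield the same stance.
agreement = sum(c == o for c, o in zip(closed, open_ended)) / len(closed)
print(f"open/closed agreement: {agreement:.0%}")

# Where the settings diverge, the open-ended side is often neutral/refusal.
print(Counter(o for c, o in zip(closed, open_ended) if c != o))
```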
Tropes Analysis
By clustering sentences and identifying tropes, the authors found that LLMs produce consistent patterns of justification across different settings. These tropes, representing thematic consistencies, were shared across multiple models, indicating common underlying biases. For example, the trope “Strive for an equitable society with equal opportunities” appeared in five of the six models, and similar justifications surfaced even when models took opposing stances.
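Measuring how widely a trope is shared reduces to mapping each trope cluster to the set of models whose sentences fall in it; a small sketch, with invented model assignments:

```python
# Hypothetical trope-to-model assignments for illustration only.
trope_models = {
    "Strive for an equitable society with equal opportunities":
        {"llama2", "llama3", "zephyr", "olmo", "mistral"},
    "Free markets drive innovation":
        {"llama3", "zephyr"},
}
for trope, models in trope_models.items():
    print(f"{trope!r}: shared by {len(models)}/6 models")
```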
Implications and Future Directions
Bias Detection and Mitigation
The findings underscore the necessity for a deeper understanding of how demographic features and prompting styles influence LLM outputs. Such insights are crucial for developing methods to detect and mitigate biases in LLMs, ensuring fairer and more reliable AI systems.
Trope-Based Model Assessment
The introduction of trope-based methods for assessing LLM outputs provides a novel avenue for evaluating the latent values and opinions within these models beyond binary or scalar stance measurements. This approach is particularly valuable as it mirrors real-world interactions where justifications and explanations matter as much as stated positions.
Future Research
Future research should extend beyond the Political Compass Test to more culturally and contextually diverse datasets. Refining the techniques for detecting and validating tropes would further strengthen fine-grained bias analysis. There is also a need for larger and more experimentally diverse evaluations to capture the intricacies of how these models encode and express values.
Conclusion
This paper presents a thorough investigation into the fine-grained values and opinions embedded in LLMs. Through a combination of large-scale data generation and innovative analytical techniques, the research reveals how demographic prompts and response formats shape LLM outputs. The paper highlights both theoretical and practical implications for bias detection and mitigation in AI, providing a solid foundation for future exploration in this critical area.