Revealing Fine-Grained Values and Opinions in LLMs
This paper investigates the nuanced ways in which LLMs encode values and opinions, focusing on how these can be surfaced through politically and morally charged propositions. The authors conducted a comprehensive analysis of 156,240 responses from six LLMs to the 62 propositions of the Political Compass Test (PCT) across 420 prompt variations.
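The total follows directly from the experimental grid, since the PCT consists of 62 propositions; a quick arithmetic check:

```python
# Sanity check of the dataset size:
# 62 PCT propositions x 420 prompt variations x 6 models
propositions, variations, models = 62, 420, 6
assert propositions * variations * models == 156_240
```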
Goals and Methodology
The primary objectives of the paper were to (1) measure the impact of different demographic prompts on LLM responses, (2) assess the robustness of LLM stances between open-ended and closed-form responses, and (3) identify and analyze the tropes (semantically similar, recurring phrases) that LLMs use to justify their stances.
Dataset and Experimental Design
The dataset was generated by varying both demographics and instructions in the prompts. The demographic variables included Age, Gender, Nationality, Political Orientation, and Class, while the instructions alternated between open-ended and closed-form styles. This yielded a diverse, extensive set of responses that supported both coarse-grained categorical analyses and fine-grained text analyses.
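A minimal sketch of how such a prompt grid could be assembled follows; the specific demographic values and instruction templates here are illustrative placeholders, not the paper's exact wording:

```python
from itertools import product

# Placeholder demographic values; the paper uses more variables and levels.
demographics = {
    "political orientation": ["left-leaning", "right-leaning"],
    "age": ["30-year-old", "60-year-old"],
}
# Placeholder instruction templates for the two response styles.
instructions = {
    "closed": "Reply with one of: Strongly disagree, Disagree, Agree, Strongly agree.",
    "open": "State and justify your opinion in a short paragraph.",
}

def build_prompts(proposition):
    """Yield (style, prompt) pairs for every persona x instruction combination."""
    for persona in product(*demographics.values()):
        for style, instruction in instructions.items():
            yield style, (
                f"You are a {' '.join(persona)} person. {instruction}\n"
                f"Proposition: {proposition}"
            )

prompts = list(build_prompts("The rich are too highly taxed."))
print(len(prompts))  # 2 x 2 personas x 2 instruction styles = 8 variations
```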
For the categorical stances, each open-ended response was retrospectively mapped onto the closed-form response categories using a Mistral-Instruct-v0.2 model. The fine-grained analysis relied on semantic clustering: responses were split into sentences, embedded with S-BERT, and clustered with DBSCAN to surface recurring tropes.
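A sketch of this trope-mining pipeline, assuming the standard sentence-transformers and scikit-learn APIs; the model name and the DBSCAN parameters are illustrative choices, not the paper's exact settings:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import DBSCAN

responses = [
    "An equitable society requires equal opportunities. Taxes help fund that.",
    "Everyone deserves equal opportunities. Markets alone cannot guarantee them.",
]
# Naive sentence splitting keeps the sketch dependency-free.
sentences = [s.strip() for r in responses for s in r.split(".") if s.strip()]

# Embed each sentence; normalized embeddings make cosine distance well-behaved.
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(
    sentences, normalize_embeddings=True
)

# Sentences within cosine distance eps of enough neighbors form a cluster,
# i.e. a candidate trope; label -1 marks unclustered noise.
labels = DBSCAN(eps=0.3, min_samples=2, metric="cosine").fit_predict(embeddings)
for label, sentence in zip(labels, sentences):
    print(label, sentence)
```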
Findings and Analysis
Demographic Impact on LLM Stances
The research demonstrated that including demographic information in prompts significantly shifts LLM responses. Political orientation was the most influential variable, producing large shifts, while demographics such as age and nationality had weaker effects. Models differed in their susceptibility to these prompts: Llama 3 and Zephyr showed larger shifts in their political compass positions than Llama 2 and OLMo.
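One straightforward way to quantify such shifts, since each PCT run yields an (economic, social) coordinate, is the Euclidean distance between a persona-prompted position and the model's default position; the coordinates below are made-up examples, not the paper's results:

```python
import math

# Hypothetical compass positions: (economic, social) axes.
default = (-2.0, -3.1)  # model's position with no persona in the prompt
persona_positions = {
    "right-leaning": (1.5, -0.8),
    "left-leaning": (-4.2, -4.0),
    "60-year-old": (-2.3, -2.9),
}
for persona, point in persona_positions.items():
    print(f"{persona}: shift = {math.dist(default, point):.2f}")
```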
Robustness Between Open and Closed Responses
The paper revealed notable differences between closed-form and open-ended responses. In the open setting, models often defaulted to neutral answers or refused to take a strong stance, whereas in the closed setting they expressed explicit agreement or disagreement. The discrepancy was especially pronounced when models were prompted with right-leaning political orientations, pointing to systematic biases embedded in the LLMs.
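A hypothetical sketch of this robustness check: compare the stance label extracted from each open-ended response with the closed-form answer to the same prompt and report the agreement rate. All labels below are invented for illustration:

```python
from collections import Counter

closed = ["agree", "disagree", "agree", "strongly agree"]
open_ended = ["agree", "none/refusal", "agree", "agree"]

# Fraction of prompts where the two settings yield the same stance.
agreement = sum(c == o for c, o in zip(closed, open_ended)) / len(closed)
print(f"open/closed agreement: {agreement:.0%}")

# Where the settings diverge, the open-ended side is often neutral/refusal.
print(Counter(o for c, o in zip(closed, open_ended) if c != o))
```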
Tropes Analysis
By clustering sentences and identifying tropes, the authors found that LLMs produce consistent patterns of justification across different settings. These tropes, representing thematic consistencies, were shared across multiple models, indicating common underlying biases. For example, the trope “Strive for an equitable society with equal opportunities” appeared in five of the six models, and similar justifications surfaced even when models took opposing stances.
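Measuring how widely a trope is shared reduces to mapping each trope cluster to the set of models whose sentences fall in it; a small sketch, with invented model assignments:

```python
# Hypothetical trope-to-model assignments for illustration only.
trope_models = {
    "Strive for an equitable society with equal opportunities":
        {"llama2", "llama3", "zephyr", "olmo", "mistral"},
    "Free markets drive innovation":
        {"llama3", "zephyr"},
}
for trope, models in trope_models.items():
    print(f"{trope!r}: shared by {len(models)}/6 models")
```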
Implications and Future Directions
Bias Detection and Mitigation
The findings underscore the necessity for a deeper understanding of how demographic features and prompting styles influence LLM outputs. Such insights are crucial for developing methods to detect and mitigate biases in LLMs, ensuring fairer and more reliable AI systems.
Trope-Based Model Assessment
The introduction of trope-based methods for assessing LLM outputs provides a novel avenue for evaluating the latent values and opinions within these models beyond binary or scalar stance measurements. This approach is particularly valuable as it mirrors real-world interactions where justifications and explanations matter as much as stated positions.
Future Research
Future research should extend beyond the Political Compass Test to more culturally and contextually diverse datasets. Refining the techniques for detecting and validating tropes would further strengthen fine-grained bias analysis. There is also a need for larger and more experimentally diverse evaluations to capture the intricacies of how these models encode and express values.
Conclusion
This paper presents a thorough investigation into the fine-grained values and opinions embedded in LLMs. Through a combination of large-scale data generation and innovative analytical techniques, the research reveals how demographic prompts and response formats shape LLM outputs. The paper highlights both theoretical and practical implications for bias detection and mitigation in AI, providing a solid foundation for future exploration in this critical area.