Stick to your Role! Stability of Personal Values Expressed in Large Language Models (2402.14846v4)
Abstract: The standard way to study LLMs with benchmarks or psychology questionnaires is to pose many different queries from similar minimal contexts (e.g., multiple-choice questions). However, because LLMs are highly context-dependent, conclusions from such minimal-context evaluations may provide little information about the model's behavior in deployment (where it will be exposed to many new contexts). We argue that context-dependence (specifically, value stability) should be studied as a specific property of LLMs and used as another dimension of LLM comparison (alongside others such as cognitive abilities, knowledge, or model size). We present a case study on the stability of value expression over different contexts (simulated conversations on different topics), as measured with a standard psychology questionnaire, the Portrait Values Questionnaire (PVQ), and with behavioral downstream tasks. Reusing methods from psychology, we study Rank-order stability at the population (interpersonal) level and Ipsative stability at the individual (intrapersonal) level. We consider two settings (with and without instructing LLMs to simulate particular personas), two simulated populations, and three downstream tasks. We observe consistent trends in the stability of models and model families: the Mixtral, Mistral, GPT-3.5, and Qwen families are more stable than LLaMa-2 and Phi. The consistency of these trends implies that some models exhibit higher value stability than others, and that this stability can be estimated with the introduced methodological tools. When instructed to simulate particular personas, LLMs exhibit low Rank-order stability, which diminishes further with conversation length. This highlights the need for future research on LLMs that coherently simulate different personas. This paper provides a foundational step in that direction and is, to our knowledge, the first study of value stability in LLMs.
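The two stability notions borrowed from psychology translate into simple correlation computations over value-score matrices. Below is a minimal, illustrative sketch of how they could be computed; the score matrices, population size, and noise model are hypothetical placeholders, not the paper's actual data or code:

```python
import numpy as np
from scipy.stats import spearmanr, pearsonr

# Hypothetical PVQ value scores for a simulated population:
# rows are personas, columns are the 10 Schwartz values.
# scores_a: scores in one context; scores_b: scores after a
# simulated conversation on a different topic.
rng = np.random.default_rng(0)
scores_a = rng.normal(size=(50, 10))
scores_b = scores_a + rng.normal(scale=0.5, size=(50, 10))

# Rank-order stability (population level): per value dimension,
# the Spearman correlation of the personas' rankings across contexts.
rank_order = np.mean([
    spearmanr(scores_a[:, v], scores_b[:, v])[0]
    for v in range(scores_a.shape[1])
])

# Ipsative stability (individual level): per persona, the Pearson
# correlation of its full value profile across the two contexts.
ipsative = np.mean([
    pearsonr(scores_a[i], scores_b[i])[0]
    for i in range(scores_a.shape[0])
])

print(f"mean rank-order stability: {rank_order:.2f}")
print(f"mean ipsative stability:   {ipsative:.2f}")
```

Both quantities lie in [-1, 1]; values near 1 indicate that a model preserves, respectively, the relative ordering of personas on each value and each persona's internal value hierarchy across contexts.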
- Grgur Kovač
- Rémy Portelas
- Masataka Sawayama
- Peter Ford Dominey
- Pierre-Yves Oudeyer