
The Better Angels of Machine Personality: How Personality Relates to LLM Safety (2407.12344v1)

Published 17 Jul 2024 in cs.CL and cs.CY

Abstract: Personality psychologists have analyzed the relationship between personality and safety behaviors in human society. Although LLMs demonstrate personality traits, the relationship between those traits and LLMs' safety abilities remains unclear. In this paper, we find that LLMs' personality traits, measured with the reliable MBTI-M scale, are closely related to their safety abilities, i.e., toxicity, privacy, and fairness. We also find that safety alignment generally increases the Extraversion, Sensing, and Judging traits of various LLMs. Based on these findings, we can edit LLMs' personality traits to improve their safety performance; e.g., inducing a personality change from ISTJ to ISTP resulted in relative improvements of approximately 43% and 10% in privacy and fairness performance, respectively. Additionally, we find that LLMs with different personality traits are differentially susceptible to jailbreak attacks. This study pioneers the investigation of LLM safety from a personality perspective, providing new insights into LLM safety enhancement.
