
SaGE: Evaluating Moral Consistency in Large Language Models (2402.13709v2)

Published 21 Feb 2024 in cs.CL and cs.AI

Abstract: Despite recent advancements showcasing the impressive capabilities of LLMs in conversational systems, we show that even state-of-the-art LLMs are morally inconsistent in their generations, which calls into question their reliability (and trustworthiness in general). Prior works in LLM evaluation focus on developing ground-truth data to measure accuracy on specific tasks. However, for moral scenarios that often lack universally agreed-upon answers, consistency in model responses becomes crucial for their reliability. To address this issue, we propose an information-theoretic measure called Semantic Graph Entropy (SaGE), grounded in the concept of "Rules of Thumb" (RoTs), to measure a model's moral consistency. RoTs are abstract principles learned by a model and can help explain its decision-making strategies effectively. To this end, we construct the Moral Consistency Corpus (MCC), containing 50K moral questions, responses to them by LLMs, and the RoTs that these models followed. Furthermore, to illustrate the generalizability of SaGE, we use it to investigate LLM consistency on two popular datasets -- TruthfulQA and HellaSwag. Our results reveal that task accuracy and consistency are independent problems, and there is a dire need to investigate these issues further.

Evaluating the Moral Consistency of LLMs with Semantic Graph Entropy

Introduction

LLMs have become integral components of AI-driven applications, offering impressive capabilities in conversational systems and beyond. However, the reliability and trustworthiness of these models are under scrutiny, especially concerning their moral consistency: an LLM should generate responses that are not only accurate but also uphold the same moral principles across semantically similar contexts. To this end, we present a framework for assessing the moral consistency of LLMs. Grounded in the concept of Rules of Thumb (RoTs) and built around an information-theoretic measure we call Semantic Graph Entropy (SaGE), the framework quantifies whether a model maintains non-contradictory moral values when faced with semantically similar situations.

Moral Consistency: A Crucial Evaluation Dimension

Moral consistency is an entity's ability to uphold the same moral values across differing scenarios. When an LLM is morally inconsistent, user trust erodes and the potential for misuse grows. To bridge this research gap, we introduce the Moral Consistency Corpus (MCC), comprising 50,000 moral questions together with the corresponding LLM-generated responses and RoTs. We also present the Semantic Graph Entropy (SaGE) metric, which leverages the structural and semantic information in these responses to assess consistency.
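
The section above describes what each MCC entry contains but not its exact schema; the record below is a minimal illustrative sketch, with field names and the example question being assumptions rather than the corpus's actual format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MCCRecord:
    """Illustrative entry for a moral-consistency corpus.

    Field names are assumptions for exposition; the real MCC schema may differ.
    """
    question: str                                          # original moral question
    paraphrases: List[str] = field(default_factory=list)   # semantically equivalent rephrasings
    responses: List[str] = field(default_factory=list)     # one LLM answer per paraphrase
    rots: List[str] = field(default_factory=list)          # Rule of Thumb the model gave with each answer

# Hypothetical usage: the same moral question posed in two semantically similar forms.
record = MCCRecord(
    question="Is it acceptable to lie to protect a friend's feelings?",
    paraphrases=[
        "Is it okay to tell a white lie so a friend isn't hurt?",
        "Should you ever lie to spare a friend's feelings?",
    ],
)
```

Consistency is then judged by comparing the responses and RoTs gathered for a single record against one another, not by checking them against a gold answer.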

Semantic Graph Entropy (SaGE): Innovating Evaluation Metrics

SaGE is a step forward in evaluating the moral consistency of LLMs. By constructing semantic graphs from the RoTs elicited for paraphrased questions and analyzing their entropy, SaGE provides a nuanced measure of consistency. Our findings indicate that even state-of-the-art LLMs exhibit notable moral inconsistency, underscoring a critical area for future research and model development. Interestingly, the analysis also reveals that conventional methods such as temperature-based sampling are ineffective at improving consistency, suggesting the need for fundamentally different approaches.
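
The exact SaGE formulation is not reproduced in this summary, so the sketch below is only a rough approximation of the described pipeline under stated assumptions: RoTs are embedded with a sentence-embedding model, a similarity graph is built over them, and a normalized entropy over the graph's clusters serves as the inconsistency score. The embedding model, similarity threshold, and entropy definition here are all assumptions, not the authors' exact metric.

```python
import numpy as np
import networkx as nx
from sentence_transformers import SentenceTransformer  # sentence embeddings (Sentence-BERT style)

def semantic_graph_entropy_sketch(rots, sim_threshold=0.8):
    """Rough consistency score over the RoTs a model produced for
    paraphrases of the same moral question.

    Illustrative approximation only: embed each RoT, connect pairs whose
    cosine similarity exceeds a threshold, and take the normalized Shannon
    entropy of the resulting cluster-size distribution.
    """
    model = SentenceTransformer("all-MiniLM-L6-v2")      # assumed embedding model
    emb = model.encode(rots, normalize_embeddings=True)  # unit-normalized vectors
    sims = emb @ emb.T                                   # pairwise cosine similarities

    graph = nx.Graph()
    graph.add_nodes_from(range(len(rots)))
    for i in range(len(rots)):
        for j in range(i + 1, len(rots)):
            if sims[i, j] >= sim_threshold:
                graph.add_edge(i, j)

    # Cluster sizes -> probability distribution -> Shannon entropy.
    sizes = np.array([len(c) for c in nx.connected_components(graph)])
    p = sizes / sizes.sum()
    entropy = -(p * np.log2(p)).sum()
    max_entropy = np.log2(len(rots)) if len(rots) > 1 else 1.0
    return entropy / max_entropy  # 0 = fully consistent, 1 = every RoT disagrees
```

Under this reading, a model that articulates essentially the same RoT for every paraphrase yields one dense cluster and a score near zero, while mutually contradictory RoTs fragment the graph and push the score toward one.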

Practical Implications and Future Horizons

Our examination extends beyond moral consistency to other tasks, namely commonsense reasoning (HellaSwag) and truthful question answering (TruthfulQA). The distinct lack of correlation between task accuracy and consistency underlines that these are independent problems and argues for more holistic evaluation frameworks. Encouragingly, preliminary experiments suggest that LLM consistency can be improved by explicitly incorporating RoTs into response generation, pointing the way toward more robust and ethically aligned modeling methodologies.
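
The summary does not specify how RoTs were injected during generation, so the prompt template below is a hypothetical illustration of the idea rather than the authors' setup; the function name and wording are assumptions.

```python
def build_rot_conditioned_prompt(question: str, rot: str) -> str:
    """Condition the model on an explicit Rule of Thumb before it answers.

    The template wording is hypothetical; the paper only reports that
    supplying RoTs alongside the question improves consistency.
    """
    return (
        "You are answering a moral question. Follow this rule of thumb "
        "consistently, even if the question is rephrased.\n"
        f"Rule of thumb: {rot}\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Hypothetical usage with a paraphrase of the earlier example question.
prompt = build_rot_conditioned_prompt(
    question="Should you ever lie to spare a friend's feelings?",
    rot="It is wrong to deceive people, even with good intentions.",
)
```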

Ethical Considerations and Limitations

The ethical dimension of this research merits careful consideration, especially in the generation and use of moral guidelines (RoTs). Our approach is descriptive: it evaluates consistency without making normative judgments about whether the RoTs themselves are correct. In addition, the reliance on various NLP tools and models means their limitations are inherited, and computational constraints restricted our experiments to 11 LLMs and a limited number of paraphrases.

Concluding Thoughts

The fidelity of LLMs in moral scenarios is paramount for their trustworthiness and effective real-world deployment. Our introduction of the Semantic Graph Entropy metric and the Moral Consistency Corpus establishes foundational steps toward more rigorous evaluation and development of morally consistent LLMs. Looking ahead, this research underscores the urgent need for innovative model architectures and training paradigms that inherently prioritize moral consistency, ensuring that AI technologies advance in alignment with ethical principles and human values.

Authors (5)
  1. Vamshi Krishna Bonagiri
  2. Sreeram Vennam
  3. Priyanshul Govil
  4. Ponnurangam Kumaraguru
  5. Manas Gaur