Denevil: Towards Deciphering and Navigating the Ethical Values of Large Language Models via Instruction Learning (2310.11053v3)

Published 17 Oct 2023 in cs.CL, cs.AI, and cs.CY

Abstract: Large Language Models (LLMs) have made unprecedented breakthroughs, yet their increasing integration into everyday life might raise societal risks due to generated unethical content. Despite extensive studies on specific issues such as bias, the intrinsic values of LLMs remain largely unexplored from a moral philosophy perspective. This work delves into ethical values utilizing Moral Foundation Theory. Moving beyond conventional discriminative evaluations with poor reliability, we propose DeNEVIL, a novel prompt generation algorithm tailored to dynamically exploit LLMs' value vulnerabilities and elicit violations of ethics in a generative manner, revealing their underlying value inclinations. On this basis, we construct MoralPrompt, a high-quality dataset comprising 2,397 prompts covering 500+ value principles, and then benchmark the intrinsic values across a spectrum of LLMs. We discovered that most models are essentially misaligned, necessitating further ethical value alignment. In response, we develop VILMO, an in-context alignment method that substantially enhances the value compliance of LLM outputs by learning to generate appropriate value instructions, outperforming existing competitors. Our methods are suitable for both black-box and open-source models, offering a promising initial step in studying the ethical values of LLMs.

Authors (6)
  1. Shitong Duan
  2. Xiaoyuan Yi
  3. Peng Zhang
  4. Tun Lu
  5. Xing Xie
  6. Ning Gu
Citations (6)