Dishonesty in Helpful and Harmless Alignment (2406.01931v2)

Published 4 Jun 2024 in cs.CL

Abstract: People tell lies when seeking rewards. LLMs are aligned to human values with reinforcement learning, in which they receive rewards for satisfying human preferences. We find that this also induces dishonesty in helpful and harmless alignment, where LLMs tell lies when generating harmless responses. Using the latest interpretability tools, we detect dishonesty, show how LLMs can become harmful if their honesty is increased, and analyze such conflicts at the parameter level. Given these preliminaries and the hypothesis that reward-seeking stimulates dishonesty, we theoretically show that dishonesty can in turn decrease alignment performance, and we augment reward-seeking alignment with representation regularization. Extensive results, including GPT-4-annotated win rates, perplexities, and case studies, demonstrate that we can train more honest, helpful, and harmless LLMs. We will open-source all our code and results upon this paper's acceptance.
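The abstract does not specify the form of the representation regularization used to augment reward-seeking alignment. The sketch below is a minimal, hypothetical illustration of one such regularizer, not the authors' implementation: it adds a penalty to a standard preference-based alignment loss when the policy's hidden states drift along a probed "dishonesty" direction relative to a frozen reference model. All names (`honesty_direction`, `alignment_loss_with_rep_reg`) and the pooling choice are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def alignment_loss_with_rep_reg(policy_hidden, ref_hidden, preference_loss,
                                honesty_direction, reg_weight=0.1):
    """Hypothetical combined objective: a preference-based alignment loss plus a
    representation regularizer that discourages the policy's hidden states from
    drifting toward a precomputed 'dishonesty' direction.

    policy_hidden:     (batch, hidden_dim) pooled hidden states of the policy model
    ref_hidden:        (batch, hidden_dim) pooled hidden states of a frozen reference model
    preference_loss:   scalar tensor, e.g. an already-computed RLHF/DPO-style loss
    honesty_direction: (hidden_dim,) vector separating honest vs. dishonest activations
    """
    direction = F.normalize(honesty_direction, dim=0)
    # Representation shift of the policy relative to the reference model.
    shift = policy_hidden - ref_hidden            # (batch, hidden_dim)
    # Signed projection onto the honesty direction (negative = toward dishonesty).
    drift = shift @ direction                     # (batch,)
    # Penalize only movement toward the dishonest side.
    rep_penalty = torch.relu(-drift).pow(2).mean()
    return preference_loss + reg_weight * rep_penalty


# Toy usage with random tensors, just to show the expected shapes.
if __name__ == "__main__":
    batch, dim = 4, 16
    policy_h = torch.randn(batch, dim, requires_grad=True)
    ref_h = torch.randn(batch, dim)
    pref_loss = torch.tensor(0.7)                 # placeholder preference loss
    honesty_dir = torch.randn(dim)
    loss = alignment_loss_with_rep_reg(policy_h, ref_h, pref_loss, honesty_dir)
    loss.backward()
    print(float(loss))
```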

Authors (7)
  1. Youcheng Huang (9 papers)
  2. Jingkun Tang (2 papers)
  3. Duanyu Feng (13 papers)
  4. Zheng Zhang (486 papers)
  5. Wenqiang Lei (66 papers)
  6. Jiancheng Lv (99 papers)
  7. Anthony G. Cohn (24 papers)
Citations (1)