But what is your honest answer? Aiding LLM-judges with honest alternatives using steering vectors (2505.17760v1)
Abstract: Recent safety evaluations of LLMs show that many models exhibit dishonest behavior, such as sycophancy. However, most honesty benchmarks focus exclusively on factual knowledge or explicitly harmful behavior and rely on external judges, which are often unable to detect less obvious forms of dishonesty. In this work, we introduce a new framework, Judge Using Safety-Steered Alternatives (JUSSA), which uses steering vectors trained on a single sample to elicit more honest responses from models, helping LLM judges detect dishonest behavior. To test our framework, we introduce a new manipulation dataset with prompts specifically designed to elicit deceptive responses. We find that JUSSA enables LLM judges to better differentiate between dishonest and benign responses, and helps them identify subtle instances of manipulative behavior.
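The abstract's core mechanism is a steering vector derived from a single sample. As a rough illustration of that idea (not the paper's exact JUSSA pipeline), the sketch below computes a steering direction from one contrastive honest/dishonest pair as a difference of hidden activations, then adds it to the residual stream during generation via a forward hook. The model name, layer index, scaling factor, and example prompts are all placeholders chosen for the sketch.

```python
# Minimal sketch of a one-sample steering vector with a HuggingFace causal LM.
# Assumptions: gpt2 as a stand-in model, an arbitrary layer, and a made-up
# contrastive pair; the paper's actual training and judging setup may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # placeholder model
LAYER_IDX = 6         # hypothetical layer to steer
ALPHA = 4.0           # hypothetical steering strength

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def last_token_activation(text: str) -> torch.Tensor:
    """Hidden state of the final token at LAYER_IDX for a single example."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so layer outputs start at index 1.
    return out.hidden_states[LAYER_IDX + 1][0, -1, :]

# One contrastive pair: an honest completion vs. a sycophantic one (illustrative).
honest = "Q: Is my plan flawless? A: No, it has several weaknesses worth fixing."
dishonest = "Q: Is my plan flawless? A: Yes, it is absolutely perfect as written."

steering_vector = last_token_activation(honest) - last_token_activation(dishonest)

def steering_hook(module, inputs, output):
    """Add the steering direction to every position of the block's output."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * steering_vector.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

# Attach the hook to the chosen transformer block and generate a steered response.
handle = model.transformer.h[LAYER_IDX].register_forward_hook(steering_hook)
prompt = "Q: Is my plan flawless? A:"
ids = tokenizer(prompt, return_tensors="pt")
steered = model.generate(**ids, max_new_tokens=40, do_sample=False)
handle.remove()

print(tokenizer.decode(steered[0], skip_special_tokens=True))
```

In the framework described by the abstract, the steered (more honest) response would then be shown to an LLM judge alongside the original response, giving the judge a contrastive alternative against which subtler dishonesty is easier to spot.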