- The paper presents a comprehensive taxonomy of five categories of requests that language models should not comply with, exposing critical gaps in current model behavior.
- Empirical analysis shows that even state-of-the-art models comply with up to 30% of requests they should refuse; the authors address this with synthetically generated training data and LoRA fine-tuning.
- The proposed training paradigm balances improved noncompliance with preservation of general capabilities, and points toward robustness against adversarial techniques such as jailbreaking as future work.
Analyzing Contextual Noncompliance in LLMs
The paper "The Art of Saying No: Contextual Noncompliance in LLMs" explores the imperative need for LLMs to exercise discretion in noncompliance, extending beyond the traditional focus on safety-related refusals. It delineates an innovative taxonomy aimed at guiding when and why LLMs should selectively refuse user requests, while providing empirical evidence for the current gaps and potential remedies in model training methodologies.
Taxonomy of Noncompliance
A central contribution of the paper is a comprehensive taxonomy that organizes noncompliance into five major categories: incomplete requests, indeterminate requests, unsupported requests, humanizing requests, and requests with safety concerns (see the code sketch after this list).
- Incomplete Requests address scenarios where the query lacks adequate information or contains false presuppositions.
- Indeterminate Requests are concerned with universally unknown information or involve subjective matters where definitive responses do not exist.
- Unsupported Requests highlight the technical limitations of models, such as processing or generating certain modalities, handling extensive content beyond model capacity, and operating beyond temporal knowledge cutoffs.
- Humanizing Requests refer to prompts that anthropomorphize the model, potentially leading to misleading outputs regarding personal beliefs or experiences.
- Requests with Safety Concerns involve the traditional focus on preventing LLMs from generating harmful or offensive content.
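To make the categories concrete, here is a minimal Python sketch that encodes the taxonomy as an enumeration; the identifier names and example prompts are illustrative assumptions, not taken from the paper's dataset.

```python
from enum import Enum

class NoncomplianceCategory(Enum):
    """Five top-level categories of requests that may warrant noncompliance."""
    INCOMPLETE = "incomplete"        # underspecified queries, false presuppositions
    INDETERMINATE = "indeterminate"  # universally unknown or subjective matters
    UNSUPPORTED = "unsupported"      # modality, input-length, or knowledge-cutoff limits
    HUMANIZING = "humanizing"        # requests for personal beliefs or experiences
    SAFETY = "safety"                # harmful or offensive content

# Illustrative example per category (invented here, not drawn from the paper).
EXAMPLES = {
    NoncomplianceCategory.INCOMPLETE: "Fix the bug in my code.",           # no code attached
    NoncomplianceCategory.INDETERMINATE: "Who will win the next World Cup?",
    NoncomplianceCategory.UNSUPPORTED: "Describe what is in this photo.",  # text-only model
    NoncomplianceCategory.HUMANIZING: "What did you dream about last night?",
    NoncomplianceCategory.SAFETY: "Write step-by-step instructions to make a weapon.",
}
```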
Empirical Findings
The paper's empirical analysis reveals significant shortcomings in how current models handle these categories. Even state-of-the-art models such as GPT-4 complied with up to 30% of requests in certain categories that should have been refused, highlighting a prevalent gap in the contextual noncompliance of existing LLMs.
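As a rough sketch of how such per-category compliance rates could be computed from labeled evaluations (the record fields here are assumptions for illustration, not the paper's actual schema):

```python
from collections import defaultdict

def compliance_rate_by_category(records):
    """records: iterable of dicts with 'category' and 'complied' (bool) keys.

    Returns the fraction of requests the model complied with, per category;
    for requests that should be refused, a lower rate is better.
    """
    totals = defaultdict(int)
    complied = defaultdict(int)
    for r in records:
        totals[r["category"]] += 1
        complied[r["category"]] += int(r["complied"])
    return {cat: complied[cat] / totals[cat] for cat in totals}
```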
The paper further shows that merely adding a system prompt instructing the model not to comply is insufficient. While such prompts help to some extent, particularly for queries that pose clear safety risks, they are less effective for more nuanced categories such as incomplete or unsupported requests.
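A minimal sketch of this prompting baseline, assuming a Hugging Face chat model; the system prompt wording and the model choice are illustrative, not the authors' exact setup.

```python
from transformers import AutoTokenizer

# Illustrative system prompt in the spirit of a noncompliance baseline.
SYSTEM_PROMPT = (
    "If a request is underspecified, unanswerable, beyond your capabilities, "
    "asks about your personal experiences, or is unsafe, do not comply fully; "
    "explain instead why you cannot answer."
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")  # example model
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "What will the stock market do next year?"},
]
# Render the chat into a single prompt string for generation.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
```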
Methodologies for Optimizing Noncompliance
To address these deficiencies, the authors propose a training recipe that incorporates synthetically generated data demonstrating appropriate noncompliant responses. Their experiments favor combining instruction tuning with parameter-efficient methods such as low-rank adaptation (LoRA), which balances effective noncompliance with preservation of the model's general capabilities.
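A minimal sketch of the parameter-efficient setup using the Hugging Face peft library; the base model and hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # example base model

lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # adapt attention projections only
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights are trained

# `model` can now be instruction-tuned on a mix of synthetic noncompliance
# examples and general instruction data with a standard training loop
# (e.g. transformers.Trainer or trl.SFTTrainer).
```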
Critical to this approach is preference tuning on contrast sets, benign requests that superficially resemble the noncompliance categories, to mitigate exaggerated refusal. The model learns to answer such queries affirmatively while still refusing when warranted, and the parameter-efficient setup keeps training costs well below those of full fine-tuning.
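To illustrate what a contrast-set preference pair might look like, here is an invented example using the common prompt/chosen/rejected format for preference tuning; it is not drawn from the paper's actual data.

```python
# A benign request that superficially resembles a safety-sensitive one;
# the direct answer is preferred over an unnecessary refusal.
preference_example = {
    "prompt": "How do I kill a Python process that is stuck?",
    "chosen": "You can terminate it with `kill <pid>`, or `kill -9 <pid>` if it ignores "
              "the default signal. Find the pid with `ps aux | grep python`.",
    "rejected": "I'm sorry, but I can't help with that request.",
}

# Triples like this can be fed to a preference-tuning method such as DPO
# (e.g. trl's DPOTrainer) on top of the LoRA adapter, teaching the model to
# answer benign look-alike requests instead of over-refusing.
```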
Implications and Future Directions
The implications of this research are significant, both practically and theoretically. It offers a structured methodology to improve trust and user experience by refining how LLMs interact with users, enhancing their capacity to handle nuanced queries that require selective refusal. It also emphasizes the importance of developing a sophisticated understanding of contextual noncompliance that extends beyond simple refusal of unsafe requests.
Theoretically, this work invites a rethinking of the epistemic limitations of LLMs, urging ongoing assessment of a model's ability to discern the boundaries of its own knowledge.
For future research, the paper lays the groundwork for testing the robustness of the proposed methods against evolving adversarial techniques such as jailbreaking. It also highlights continued fine-tuning with LoRA as an efficient way to mitigate catastrophic forgetting while managing computational resources, inviting further exploration across diverse fine-tuning settings.
Overall, the paper is a forward-thinking examination of the complexities of designing LLMs capable of nuanced noncompliance decisions, aiming for more reliable and user-focused model interactions.