I'm Afraid I Can't Do That: Predicting Prompt Refusal in Black-Box Generative Language Models (2306.03423v2)

Published 6 Jun 2023 in cs.AI

Abstract: Since the release of OpenAI's ChatGPT, generative LLMs have attracted extensive public attention. The increased usage has highlighted generative models' broad utility, but also revealed several forms of embedded bias. Some is induced by the pre-training corpus; but additional bias specific to generative models arises from the use of subjective fine-tuning to avoid generating harmful content. Fine-tuning bias may come from individual engineers and company policies, and affects which prompts the model chooses to refuse. In this experiment, we characterize ChatGPT's refusal behavior using a black-box attack. We first query ChatGPT with a variety of offensive and benign prompts (n=1,706), then manually label each response as compliance or refusal. Manual examination of responses reveals that refusal is not cleanly binary, and lies on a continuum; as such, we map several different kinds of responses to a binary of compliance or refusal. The small manually-labeled dataset is used to train a refusal classifier, which achieves an accuracy of 96%. Second, we use this refusal classifier to bootstrap a larger (n=10,000) dataset adapted from the Quora Insincere Questions dataset. With this machine-labeled data, we train a prompt classifier to predict whether ChatGPT will refuse a given question, without seeing ChatGPT's response. This prompt classifier achieves 76% accuracy on a test set of manually labeled questions (n=985). We examine our classifiers and the prompt n-grams that are most predictive of either compliance or refusal. Our datasets and code are available at https://github.com/maxwellreuter/chatgpt-refusals.

References (7)

Citations (4)

View on Semantic Scholar

Summary

The paper presents a method for predicting ChatGPT’s refusal responses by employing logistic regression, random forest, and BERT, with BERT achieving 96.5% accuracy in classification.
The study defines refusals through an eight-category system and reveals that contextual cues and sensitive topics significantly influence refusal behavior.
The analysis underscores the need for increased transparency in AI ethics and improved model alignment with societal norms to tackle inherent biases.

An Analysis of Prompt Refusal Prediction in ChatGPT

In the paper titled "I’m Afraid I Can’t Do That: Predicting Prompt Refusal in Black-Box Generative LLMs," the authors Reuter and Schulze delve into the intriguing topic of how generative models like ChatGPT handle and refuse certain prompts. This investigation stands at the intersection of AI ethics and natural language processing, focusing on the biases embedded in LLMs during training and fine-tuning stages and their impact on refusal behaviors.

Study Methods and Experiments

The authors sought to build models capable of predicting whether ChatGPT would refuse a given prompt. Their approach involved both manual and automated methods, showcasing a robust engagement with diverse datasets.

Dataset Compilation: The authors utilized a variety of sources for prompts, including the Quora Insincere Questions dataset, resulting in a comprehensive training set of 10,000 samples. A smaller hand-labeled dataset was also created, comprising 1,706 prompts and their corresponding ChatGPT responses.
Manual Refusal Labeling: The authors faced the complex task of defining what constitutes a "refusal." Initial binary classification of compliance and refusal evolved into a nuanced eight-category system to capture the variability in ChatGPT's responses.
Classifier Development: Logistic regression, random forest, and BERT models were deployed to predict the refusal behavior of ChatGPT. BERT displayed superior performance, achieving a 96.5% accuracy in classifying responses and a 75.9% accuracy in predicting refusal based solely on prompts.

Key Findings

The exploratory data analysis shed light on the linguistic features most indicative of refusals. Predictive phrases often involved demographic identifiers or sensitive topics like political figures. The paper highlights the inherent complexity in predicting refusals, noting that refusals can manifest across a range of expressions, from outright denial to subtle redirections.

The paper found that the predictors of refusal not only included explicit words like "cannot" and "sorry" but also contextual nuances in prompts indicating controversial or provocative content. This insight points to the sophisticated interplay between input text and the generative model's embedded alignment parameters, which aim to maintain social norms of appropriateness.

Implications and Future Trajectories

The implications of these findings are multifaceted, affecting both theoretical explorations and practical applications of generative models. The research illuminates the need for transparency in AI systems regarding how ethical norms are encoded and suggests pathways to enhance the sensitivity of models to context without imposing overly generalized censorship.

For future research, the authors suggest exploring multi-label classification approaches incorporating multiple annotators to further refine the understanding of refusal dynamics. Such efforts may facilitate an even more granular alignment with societal norms across diverse cultural contexts. Moreover, the potential variability in model behavior due to differing settings, such as temperature during generation, remains an exciting avenue for further exploration.

In conclusion, Reuter and Schulze's research contributes a valuable perspective to the ongoing dialogue about AI ethics, model interpretability, and the societal impacts of machine learning systems. Their work underscores the importance of addressing biases and transparency in AI technologies, a priority of growing significance as these systems become increasingly integrated into everyday information dissemination.

PDF Markdown

Related Papers

GitHub

GitHub - maxwellreuter/chatgpt-refusals: Datasets and code from our paper, where we use machine learning to predict if ChatGPT will refuse a given prompt. (34 stars)

Tweets

https://twitter.com/520078869/status/1666758208864780289

https://twitter.com/38374100/status/1666511364603662343

https://twitter.com/279718877/status/1666597169346871297

YouTube

Show All Videos