
I'm Afraid I Can't Do That: Predicting Prompt Refusal in Black-Box Generative Language Models

Published 6 Jun 2023 in cs.AI (arXiv:2306.03423v2)

Abstract: Since the release of OpenAI's ChatGPT, generative LLMs have attracted extensive public attention. The increased usage has highlighted generative models' broad utility, but also revealed several forms of embedded bias. Some is induced by the pre-training corpus; but additional bias specific to generative models arises from the use of subjective fine-tuning to avoid generating harmful content. Fine-tuning bias may come from individual engineers and company policies, and affects which prompts the model chooses to refuse. In this experiment, we characterize ChatGPT's refusal behavior using a black-box attack. We first query ChatGPT with a variety of offensive and benign prompts (n=1,706), then manually label each response as compliance or refusal. Manual examination of responses reveals that refusal is not cleanly binary, and lies on a continuum; as such, we map several different kinds of responses to a binary of compliance or refusal. The small manually-labeled dataset is used to train a refusal classifier, which achieves an accuracy of 96%. Second, we use this refusal classifier to bootstrap a larger (n=10,000) dataset adapted from the Quora Insincere Questions dataset. With this machine-labeled data, we train a prompt classifier to predict whether ChatGPT will refuse a given question, without seeing ChatGPT's response. This prompt classifier achieves 76% accuracy on a test set of manually labeled questions (n=985). We examine our classifiers and the prompt n-grams that are most predictive of either compliance or refusal. Our datasets and code are available at https://github.com/maxwellreuter/chatgpt-refusals.


Summary

  • The paper predicts prompt refusal behavior in ChatGPT by employing a binary classification approach on a hand-labeled dataset of 1,706 prompts and additional machine-labeled data.
  • The methodology leverages BERT, logistic regression, and random forest models, with BERT achieving a 96.5% accuracy in refusal classification and 75.9% in prediction.
  • Results show that specific n-grams and controversial terms are strong predictors of refusals, underlining the need for transparent fine-tuning and ethical AI practices.

Predicting Prompt Refusal in Black-Box Generative LLMs

Introduction

The paper "I'm Afraid I Can't Do That: Predicting Prompt Refusal in Black-Box Generative LLMs" (2306.03423) examines the behavior of OpenAI's ChatGPT, particularly focusing on the instances where it refuses to respond to certain prompts. This work addresses the biases introduced during the fine-tuning process, where subjective decisions impact which prompts are likely to be refused. By employing a black-box methodology, the study aims to predict instances of refusal and develop effective classifiers that account for the linguistic subtleties inherent in prompt-based interactions.

Methodology

The research involves constructing a refusal classifier based on a dataset compiled from various sources, including 1,706 hand-labeled prompts. A binary classification approach was adopted, where each response was categorized as either compliance or refusal. The initial phase involved analyzing the manually labeled dataset to gain insight into the patterns of refusal (Figure 1).

Figure 1: High-level overview of the process of training the prompt classifier. A large set of prompts is submitted to ChatGPT. Most responses are automatically labeled as refusal or compliance by the refusal classifier and serve as training data for the prompt classifier; a smaller set is manually labeled and serves as test data.
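The machine-labeling step in this pipeline can be sketched with a simple keyword heuristic. Note that the paper's actual refusal classifier is a trained model (BERT, at 96% accuracy); the marker list and function names below are purely illustrative:

```python
# Illustrative stand-in for the paper's refusal classifier: a keyword rule
# that flags common refusal phrasings. (The paper trains a BERT model for
# this step; these markers and names are hypothetical.)
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "as an ai")

def label_response(response: str) -> int:
    """Return 1 (refusal) if the response contains a refusal marker, else 0 (compliance)."""
    text = response.lower()
    return int(any(marker in text for marker in REFUSAL_MARKERS))

def machine_label(prompts, responses):
    """Pair each prompt with its response's label, yielding training data
    for the prompt classifier (which never sees the responses at test time)."""
    return [(p, label_response(r)) for p, r in zip(prompts, responses)]
```

The key design point is that labels are derived only from responses, while the downstream prompt classifier is trained on (prompt, label) pairs alone.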

The study further leverages the Quora Insincere Questions dataset to bootstrap a larger machine-labeled dataset, enabling the training of a more robust prompt classifier. Three model types were explored for both refusal identification and prediction: BERT, logistic regression, and random forest.
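A minimal baseline for the prompt classifier, in the spirit of the paper's logistic regression model, can be sketched with TF-IDF n-gram features. The toy prompts and labels below are invented for illustration and are not from the paper's datasets:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy machine-labeled data: (prompt, 1 = refusal expected, 0 = compliance expected).
# Repeated so the model sees a few examples per class; purely illustrative.
prompts = [
    "Write a poem about the ocean",
    "Explain photosynthesis simply",
    "How do I hotwire a car",
    "Write something offensive about my neighbor",
] * 5
labels = [0, 0, 1, 1] * 5

# TF-IDF over unigrams and bigrams feeds a logistic regression,
# mirroring the n-gram-based models explored in the paper.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(prompts, labels)

# Predict refusal for an unseen prompt without ever querying ChatGPT.
pred = clf.predict(["Explain gravity in simple terms"])
```

In the paper, BERT ultimately outperformed this kind of linear baseline, but the linear model has the advantage of directly interpretable n-gram coefficients.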

Results

The paper reports that the BERT model consistently outperformed the traditional logistic regression and random forest models. The refusal classifier achieved an accuracy of 96.5%, while prompt refusal prediction accuracy stood at 75.9% (Figure 2).

Figure 2: Top regression coefficients predictive of ChatGPT refusing to comply with a prompt (left), and of a given response being a refusal (right), on the combination of all our hand-labeled data.
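The reported accuracies come from comparing predictions against manually labeled test sets; the evaluation itself reduces to a simple fraction of agreements (the arrays below are invented examples, not the paper's data):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the manual labels."""
    assert len(y_true) == len(y_pred)
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# e.g. three of four predictions agree with the manual labels -> 0.75
score = accuracy([1, 0, 1, 0], [1, 0, 0, 0])
```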

Predictive features highlighted salient n-grams, such as specific demographic names and controversial figures, as strong predictors of prompt refusal. In responses, words like "cannot" and "sorry," along with references to ChatGPT's nature as an AI, were indicative of refusal (Figure 3).

Figure 3: Top regression coefficients predictive of ChatGPT refusing to comply with a prompt (left), and of a given response being a refusal (right), on 10,000 machine-labeled samples of the Quora Insincere Questions dataset.

Discussion

The investigation supports the hypothesis that ChatGPT's refusal behavior is heavily influenced by the sentiment and content embedded in prompts. The study demonstrates that while refusal behavior can be predicted with a fair degree of accuracy, the variability inherent in prompts spanning disparate topics introduces classification challenges.

The implications of these findings underscore the necessity for transparent fine-tuning datasets and methodologies, prompting a reconsideration of current ethical frameworks governing AI LLMs. Future research could explore extending the dataset size, employing alternative model architectures, or investigating the temporal stability of prediction models across various versions of ChatGPT.

Conclusion

This study contributes significantly to understanding how generative LLMs can be analyzed and improved through refined refusal prediction systems. By providing insights into the factors governing prompt refusal, this research lays the groundwork for more equitable and effective AI systems that balance freedom of expression with societal and ethical considerations.
