
Prompting GPT-3 To Be Reliable (2210.09150v2)

Published 17 Oct 2022 in cs.CL

Abstract: LLMs show impressive abilities via few-shot prompting. Commercialized APIs such as OpenAI GPT-3 further increase their use in real-world language applications. However, the crucial problem of how to improve the reliability of GPT-3 is still under-explored. While reliability is a broad and vaguely defined term, we decompose reliability into four main facets that correspond to the existing framework of ML safety and are well-recognized to be important: generalizability, social biases, calibration, and factuality. Our core contribution is to establish simple and effective prompts that improve GPT-3's reliability as it: 1) generalizes out-of-distribution, 2) balances demographic distribution and uses natural language instructions to reduce social biases, 3) calibrates output probabilities, and 4) updates the LLM's factual knowledge and reasoning chains. With appropriate prompts, GPT-3 is more reliable than smaller-scale supervised models on all these facets. We release all processed datasets, evaluation scripts, and model predictions. Our systematic empirical study not only sheds new insights on the reliability of prompting LLMs, but more importantly, our prompting strategies can help practitioners more reliably use LLMs like GPT-3.

An Analysis of Enhancing GPT-3 Reliability Through Prompting Techniques

The paper "Prompting GPT-3 To Be Reliable" addresses an underexplored aspect of LLMs: their reliability, examined here through OpenAI's GPT-3. Despite the model's widespread use via OpenAI's API, its dependability in real-world scenarios remains in question. The paper systematically examines four facets of reliability (generalizability, social bias, calibration, and factuality) and proposes distinct prompting strategies to improve each.

Generalizability

The first dimension of reliability examined is generalizability: the ability to perform well across varying data distributions. Empirical evaluations use MRQA for domain shift, and AdvGLUE and Contrast Sets for adversarial robustness. With few-shot prompting, GPT-3 proves more robust than smaller finetuned models such as BERT and RoBERTa. Notably, simply using randomly sampled examples from the source dataset as demonstrations lets GPT-3 maintain a minimal performance gap between in-distribution and out-of-distribution (OOD) test sets.
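The sampling strategy above is straightforward in practice: draw k demonstrations at random from the source-domain training set and prepend them to the test question. The sketch below illustrates the idea; the QA template and the toy examples are illustrative stand-ins, not the paper's exact prompt format.

```python
import random

def build_few_shot_prompt(train_examples, test_question, k=4, seed=0):
    """Assemble a k-shot QA prompt from randomly sampled source-domain
    examples, the simple strategy reported as robust under domain shift.
    train_examples: list of (question, answer) pairs."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    demos = rng.sample(train_examples, k)
    lines = [f"Question: {q}\nAnswer: {a}\n" for q, a in demos]
    lines.append(f"Question: {test_question}\nAnswer:")
    return "\n".join(lines)

# Toy source-domain training pool (illustrative).
train = [("What is 2+2?", "4"),
         ("What is the capital of France?", "Paris"),
         ("What is the largest planet?", "Jupiter"),
         ("What is H2O commonly called?", "water"),
         ("What color is the sky on a clear day?", "blue")]

prompt = build_few_shot_prompt(train, "Who wrote Hamlet?", k=3)
```

The returned string would then be sent to the model as-is; because the demonstrations come from the source distribution only, no OOD data is needed at prompting time.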

Social Bias Mitigation

Addressing social biases constitutes the second focal area. Using the WinoBias and BBQ datasets, the paper probes gender and broader societal biases in model predictions. Prompts balanced between pro-stereotypical and anti-stereotypical examples yield lower measured bias, and natural-language interventions reduce it further, steering GPT-3 toward fairer, more neutral output. This highlights the model's sensitivity to input structure and the value of prompts as a lever for steering model behavior ethically.

Calibration and Uncertainty

Calibration, the alignment of model confidence with prediction accuracy, serves as another critical reliability factor. Using the LM's token probabilities and self-consistency sampling as confidence estimates, GPT-3 shows better calibration than supervised models like DPR-BERT, particularly in OOD settings. The paper emphasizes that GPT-3's confidence scores support selective prediction: low-confidence answers can be withheld, effectively separating accurate outputs from unreliable ones.
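The self-consistency estimate is simple to compute: sample several completions for the same prompt at nonzero temperature, take the majority answer, and use its vote share as the confidence score. A minimal sketch, assuming the sampled answer strings have already been collected from the API:

```python
from collections import Counter

def self_consistency(samples, threshold=0.5):
    """Majority-vote answer with its empirical agreement rate as a
    confidence score. `samples` would come from multiple temperature>0
    completions of the same prompt; `threshold` gates selective
    prediction (abstain when agreement is too low)."""
    counts = Counter(samples)
    answer, votes = counts.most_common(1)[0]
    confidence = votes / len(samples)
    if confidence < threshold:
        return None, confidence  # abstain: answer deemed unreliable
    return answer, confidence

# Five hypothetical sampled answers for one question.
ans, conf = self_consistency(["Paris", "Paris", "Lyon", "Paris", "Paris"])
# ans == "Paris", conf == 0.8
```

Thresholding on this score is exactly the selective-prediction use case the paper highlights: predictions below the threshold are abstained from rather than risked.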

Factuality via Knowledge Updating

Factuality, the fourth facet, pertains to the model's ability to update or replace existing memorized knowledge with new information. Through experiments on datasets like NQ and SQuAD, the research illustrates GPT-3's capability to override memorized answers with updated counterparts when provided with counterfactual passages. When combined with retrieval-augmented prompting, factual QA performance is bolstered, particularly for multi-hop reasoning tasks like those in HotpotQA.
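The knowledge-updating setup amounts to placing the (possibly counterfactual) evidence passage in the prompt and instructing the model to answer from it rather than from memory. The instruction wording below is an illustrative approximation, not the paper's exact template:

```python
def knowledge_update_prompt(passage, question):
    """Build an in-context knowledge-updating prompt: the supplied
    passage (which may contradict the model's memorized answer) is
    given as the sole source of truth for the question."""
    return ("Read the passage and answer the question based only on "
            "the passage, even if it conflicts with what you know.\n\n"
            f"Passage: {passage}\n\n"
            f"Question: {question}\n"
            "Answer:")

# A counterfactual passage: the memorized answer would be "Paris".
prompt = knowledge_update_prompt(
    "The capital of France was moved to Lyon in 2021.",
    "What is the capital of France?")
```

For retrieval-augmented prompting the same template applies, with `passage` filled by retrieved evidence; for multi-hop questions (as in HotpotQA), several retrieved passages would be concatenated into the passage slot.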

Implications and Future Directions

The implications of the research are manifold, stressing the importance of appropriate prompting strategies to enhance reliability across various dimensions. The findings advocate for more nuanced approaches to leveraging LLMs in practical applications, particularly highlighting the balance between memorization and adaptability. In anticipation of future research, the outlined methodologies provide a foundation to explore additional reliability facets, such as adverse effect mitigation and further bias reduction in high-stakes domains.

In conclusion, the paper provides compelling evidence for the strategic use of prompting to enhance the reliability of LLMs such as GPT-3, without recourse to extensive post-hoc modification or retraining. It prompts (pun intended) a reevaluation of few-shot learning not only as a tool for task specification but also as an integral part of the reliability-enhancement toolkit for LLMs.

Authors (7)
  1. Chenglei Si (26 papers)
  2. Zhe Gan (135 papers)
  3. Zhengyuan Yang (86 papers)
  4. Shuohang Wang (69 papers)
  5. Jianfeng Wang (149 papers)
  6. Jordan Boyd-Graber (68 papers)
  7. Lijuan Wang (133 papers)
Citations (244)