InstructGPT (text-davinci-002) Overview

Updated 25 June 2026

The InstructGPT model (text-davinci-002) is a large-scale language model fine-tuned via supervised learning on high-quality instruction pairs.
It utilizes a 175B-parameter code-davinci-002 backbone and the FeedME procedure to optimize adherence to user instructions.
The model excels in NLU benchmarks and analogy generation with robust few-shot learning, though explanation accuracy can be limited.

The InstructGPT model (text-davinci-002) is a large-scale pre-trained LLM in the GPT-3.5 family, developed via supervised fine-tuning on code-davinci-002 (Codex) and designed to follow user instructions more closely. Unlike its successor models, text-davinci-002 does not employ reinforcement learning from human feedback (RLHF) but relies on extensive supervised learning using high-quality human-written input–output pairs. It demonstrates strong performance across natural language understanding (NLU) tasks, excels in analogy generation under strict prompt engineering, and uniquely benefits from explanations in few-shot prompting, although explanation faithfulness remains a known limitation (Bhavya et al., 2022; Ye et al., 2022; Ye et al., 2023).

1. Architecture and Training Paradigm

text-davinci-002 uses the code-davinci-002 (Codex) backbone, a 175B-parameter transformer model trained on diverse natural language and public code corpora. Its fine-tuning protocol is the “FeedME” procedure, in which human annotators author high-quality instruction–response pairs and humans rate the samples. Model parameters are then updated to minimize a standard cross-entropy objective: $\mathcal{L}_{\mathrm{SFT}}(\theta) = -\mathbb{E}_{(x,y)\sim \mathcal{D}}\bigl[\log \pi_\theta(y\mid x)\bigr]$, where $x$ is an instruction, $y$ is its reference response, $\mathcal{D}$ is the fine-tuning dataset, and $\pi_\theta$ is the model's output distribution. No RLHF stage is used in text-davinci-002, which distinguishes it from text-davinci-003 and later models, where human preference modeling refines outputs via reinforcement learning (Ye et al., 2023). The absence of RLHF correlates with distinctive behavioral and generalization properties in downstream evaluation.
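As a rough illustration of the cross-entropy objective above, the following sketch estimates the loss from per-token probabilities. It assumes the usual autoregressive factorization of $\log \pi_\theta(y\mid x)$ into a sum of token log-probabilities; the function name `sft_loss` and the toy probabilities are hypothetical and do not come from the FeedME pipeline itself.

```python
import math


def sft_loss(batch_token_probs):
    """Average negative log-likelihood over a batch of (x, y) pairs.

    batch_token_probs holds one list per example. Each entry is the
    probability the model assigned to a reference response token, given
    the instruction and the preceding response tokens (toy values here).
    """
    total_log_prob = 0.0
    for token_probs in batch_token_probs:
        # log pi(y | x) is the sum of the per-token log-probabilities.
        total_log_prob += sum(math.log(p) for p in token_probs)
    # Batch average as a Monte Carlo estimate of -E[log pi(y | x)].
    return -total_log_prob / len(batch_token_probs)


# Hypothetical probabilities for two short responses.
toy_batch = [[0.9, 0.8, 0.95], [0.5, 0.6]]
print(sft_loss(toy_batch))
```

A response whose tokens all receive probability 1 contributes zero loss, so lowering this quantity pushes probability mass toward the human-written responses.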

2. Prompt Engineering and Analogical Reasoning

Prompt sensitivity is a critical factor in text-davinci-002's capacity for analogy generation. Two formal tasks are defined (Bhavya et al., 2022):

Analogous Concept Generation (ACG, or no_src): Given a target concept $T$, generate a source concept $S$ analogous to $T$, possibly accompanied by an explanation $E$, with the requirement that $\mathrm{AnalogicalSimilarity}(T,S)\geq \tau_1$.
Analogous Explanation Generation (AEG, or wsrc): Given $T$ and a candidate $S$, produce an explanation $E$ such that $\mathrm{ExplanationQuality}(E\mid T,S)\geq \tau_2$.

Systematic studies contrasted imperative prompts (“Using an analogy, explain <target>.”) with interrogative forms (“What analogy is used to explain <target>?”), varied controlled synonyms (“analogous to”/“like”/“similar to”), and measured the effect of word order and prompt specificity. The best prompts were imperative and contained the exact wording “analogy” or “analogous to.” Question-form prompts underperformed by 5–10 BLEURT points relative to the best-performing imperatives (e.g., P3ₜₗ BLEURT = 0.462 vs. P4ₜₗ BLEURT = 0.427). Low-temperature decoding yielded higher fidelity and consistency, though high temperatures could rescue factual or creative failures for certain complex analogies.
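The two prompt styles can be sketched as simple templates. The template strings are the ones quoted above; the helper names `build_analogy_prompt` and `bleurt_gap` and the example target are hypothetical, and no model call is made.

```python
# Prompt wordings quoted in the text; {target} stands for <target>.
IMPERATIVE = "Using an analogy, explain {target}."
INTERROGATIVE = "What analogy is used to explain {target}?"


def build_analogy_prompt(template, target):
    """Fill a prompt template with the target concept."""
    return template.format(target=target)


def bleurt_gap(best, other):
    """Absolute and relative gap between two BLEURT scores."""
    absolute = best - other
    return absolute, absolute / best


print(build_analogy_prompt(IMPERATIVE, "a cell membrane"))
print(build_analogy_prompt(INTERROGATIVE, "a cell membrane"))

# The P3 vs. P4 scores cited above.
absolute, relative = bleurt_gap(0.462, 0.427)
print(f"absolute gap = {absolute:.3f}, relative gap = {relative:.1%}")
```

For the cited pair of scores, the gap is 0.035 in absolute BLEURT, or about 7.6% of the better score.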

3. Performance Across NLU Benchmarks

Comprehensive evaluations (Ye et al., 2023) spanned nine NLU task families—ABSA, MRC, NER, NLI, POS, RE, SC, SM, and WSC—across 21 datasets, using both zero-shot and few-shot prompt templates. Representative outcomes for text-davinci-002 included:

Aspect-Based Sentiment Analysis (SemEval-2014): 0-shot accuracy: [value missing] (Laptop), [value missing] (Restaurant).
Machine Reading Comprehension (SQuAD1.1): 0-shot F1: [value missing], EM: [value missing]; 3-shot F1: [value missing], EM: [value missing].
NLI (MNLI-m): 0-shot: [value missing], 3-shot: [value missing].
NER (CoNLL2003): 0-shot micro-F1: [value missing], 3-shot: [value missing].
SC (IMDB): 0-shot: [value missing], 3-shot: [value missing].
WSC: 0-shot: [value missing], 3-shot: [value missing].

Performance deltas over predecessors (davinci, text-davinci-001) were consistently positive, especially in zero-shot and few-shot NLU, affirming the effectiveness of the Supervised FeedME recipe (Ye et al., 2023). Compared with code-davinci-002, text-davinci-002 maintained or improved on most tasks except for select MRC and ABSA datasets, where code-davinci-002 retained a slight edge.
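For readers unfamiliar with the F1 and EM metrics reported for machine reading comprehension, the sketch below shows one common way to compute them for a single prediction. It is a simplified approximation of SQuAD-style scoring (lowercasing, stripping punctuation and English articles, whitespace tokenization), not the evaluation code used by Ye et al. (2023); all function names are illustrative.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction, reference):
    """1.0 if the normalized answers are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalize(prediction), normalize(reference)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower", "eiffel tower"))
print(token_f1("Eiffel Tower in Paris", "Eiffel Tower"))
```

EM rewards only answers that match after normalization, while F1 gives partial credit for overlapping tokens, which is why the two numbers are usually reported together.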

4. Explanations in Few-Shot In-Context Learning

text-davinci-002 demonstrated unique sensitivity to explanation-based prompting in few-shot in-context learning (Ye et al., 2022). Three prompt frameworks were evaluated in question answering (QA) and natural language inference (NLI):

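Few-Shot (no explanation): each demonstration pairs the input with its label only.
Explain-then-Predict (E-P): each demonstration gives the input, then an explanation, then the label.
Predict-then-Explain (P-E): each demonstration gives the input, then the label, then an explanation.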

Quantitative gains on text-davinci-002 were largest for E-P (+14.9% on Synth, +4.7% on AdvHotpot, +6.5% on E-SNLI accuracy vs. standard few-shot). Explanations yielded only marginal improvements for other models. P-E benefited less, reflecting order sensitivity for explanatory prompts.
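The three prompt frameworks differ only in what each demonstration contains and in what order. A minimal sketch follows, assuming a plain "Question / Explanation / Answer" layout; the field labels, function names, and example text are hypothetical and are not the templates used by Ye et al. (2022).

```python
def format_demo(question, answer, explanation, style):
    """Render one in-context demonstration in the requested style."""
    if style == "few-shot":
        body = f"Answer: {answer}"
    elif style == "E-P":
        # Explain-then-Predict: the explanation precedes the answer.
        body = f"Explanation: {explanation}\nAnswer: {answer}"
    elif style == "P-E":
        # Predict-then-Explain: the answer precedes the explanation.
        body = f"Answer: {answer}\nExplanation: {explanation}"
    else:
        raise ValueError(f"unknown style: {style}")
    return f"Question: {question}\n{body}"


def build_few_shot_prompt(demos, query, style):
    """Join the demonstrations and append the unanswered test question."""
    blocks = [format_demo(q, a, e, style) for q, a, e in demos]
    blocks.append(f"Question: {query}")
    return "\n\n".join(blocks)


# Hypothetical demonstration: (question, answer, explanation).
demos = [("Is the sky blue?", "yes", "Air scatters blue light the most.")]
print(build_few_shot_prompt(demos, "Is grass green?", "E-P"))
```

Under E-P the model is led to write its explanation before committing to an answer, whereas under P-E the explanation is generated after the answer is already fixed.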

Correlation analysis between explanation reliability (groundedness, logical consistency) and predictive accuracy on Synth and AdvHotpot produced Pearson correlations of [value missing] and [value missing], respectively, indicating that high-quality explanations aligned with greater likelihood of correct answers. However, generated explanations frequently failed factuality and consistency checks, motivating the use of post-hoc calibrators to exploit the unreliability signal as a proxy for answer confidence. This approach improved post-hoc performance by up to +9.1% on E-SNLI and +4.4 AUC on AdvHotpot.
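The correlation analysis can be illustrated with a small Pearson computation between a per-example reliability score and a 0/1 correctness indicator. The scores below are invented toy data, not results from the paper, and `pearson` is a hand-rolled helper rather than the authors' analysis code.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical reliability scores (e.g. a groundedness rating in [0, 1])
# and whether the model's answer was correct for the same example.
reliability = [0.9, 0.2, 0.8, 0.4, 0.7, 0.1]
correct = [1, 0, 1, 0, 1, 0]
print(pearson(reliability, correct))
```

A strongly positive coefficient on real data is what would justify using explanation reliability as a feature for a post-hoc calibrator.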

5. Robustness and Sensitivity Analyses

Robustness studies (Ye et al., 2023; Bhavya et al., 2022) revealed that text-davinci-002's outputs degrade under adversarial and synthetic perturbations, mirroring broader trends in GPT-3.5 models. On ABSA, AddDiff perturbations induced a 4-point accuracy drop; in MRC, AddSentDiverse reduced F1 by about 23 points. Named entity, relation, and sentiment tasks showed 10–25 point average drops upon task-specific text or label perturbations.

Spelling error sensitivity further exposed the model’s brittle in-context concept recognition: BLEURT decreased 3–7% for Replace, 4–6% for Delete/Permute, and 1–3% for Insert. Prompt style (imperative vs. interrogative) was the largest lever, with a BLEURT change in the range [values missing] and a consistent shortening of average output length when using questions.
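To make the four spelling perturbations concrete, here is a sketch of single-character edits applied to a target concept. The exact definitions used in the study are not given here, so the operations below (one random position, adjacent swap for Permute, lowercase ASCII letters) are assumptions, and `perturb` is a hypothetical helper.

```python
import random
import string


def perturb(word, kind, rng):
    """Apply one character-level edit of the given kind to `word`."""
    if len(word) < 2:
        return word
    # Position chosen so that i + 1 is always a valid index.
    i = rng.randrange(len(word) - 1)
    if kind == "replace":
        # Pick a letter that differs from the original character.
        choices = [c for c in string.ascii_lowercase if c != word[i]]
        return word[:i] + rng.choice(choices) + word[i + 1:]
    if kind == "delete":
        return word[:i] + word[i + 1:]
    if kind == "permute":
        # Swap two adjacent characters.
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if kind == "insert":
        return word[:i] + rng.choice(string.ascii_lowercase) + word[i:]
    raise ValueError(f"unknown perturbation: {kind}")


rng = random.Random(0)  # fixed seed so runs are repeatable
for kind in ("replace", "delete", "permute", "insert"):
    print(kind, perturb("mitochondria", kind, rng))
```

Each perturbed concept would then be substituted into the prompt, and the resulting analogy scored against the unperturbed baseline.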

6. Human Evaluation, Error Taxonomy, and Model Scaling

Human evaluations of analogy outputs (Bhavya et al., 2022) covered 1,407 generated analogies and 61 human references and employed majority “meaningful” judgments, with three-way agreement that was moderate for no_src (Fleiss’ κ [value missing]) and fair for wsrc (κ [value missing]). On no_src, the 175B Davinci variant achieved a 70.05% majority-yes rate (vs. 66.67% for the human references), but lagged significantly in wsrc explanations (53.79% vs. 71.88%). Small models (around the 1.3B-parameter scale) performed poorly across both tasks.
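The majority-yes rate and Fleiss' kappa mentioned above can be computed as in the following sketch, which assumes three binary "meaningful" judgments per analogy. The toy judgments are invented, and the function names are illustrative rather than taken from the study.

```python
def majority_yes_rate(judgments):
    """Percentage of items that at least 2 of 3 annotators marked 'yes'."""
    majority = [sum(votes) >= 2 for votes in judgments]
    return 100.0 * sum(majority) / len(majority)


def fleiss_kappa_binary(judgments):
    """Fleiss' kappa for binary labels with a fixed number of raters.

    Undefined (division by zero) when every rating is identical.
    """
    n_items = len(judgments)
    n_raters = len(judgments[0])
    yes_counts = [sum(votes) for votes in judgments]
    # Per-item agreement: share of rater pairs that agree.
    p_items = [
        (y * y + (n_raters - y) ** 2 - n_raters) / (n_raters * (n_raters - 1))
        for y in yes_counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall share of 'yes' and 'no' ratings.
    p_yes = sum(yes_counts) / (n_items * n_raters)
    p_chance = p_yes ** 2 + (1 - p_yes) ** 2
    return (p_bar - p_chance) / (1 - p_chance)


# Hypothetical judgments: one tuple of three annotator votes per analogy.
toy = [(True, True, False), (False, False, True),
       (True, True, True), (False, True, True)]
print(majority_yes_rate(toy))
print(fleiss_kappa_binary(toy))
```

Kappa corrects raw agreement for the agreement expected by chance, which is why a high majority-yes rate can coexist with only fair or moderate agreement.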

Error analysis exposed frequent failures:

“No analogy” responses replaced analogizing with definitions.
Irrelevant or mismatched source–target mappings, including polysemy confusions.
Factual inaccuracies and trivial or misleading structural parallels.

7. Limitations, Open Problems, and Prospects

Key limitations remain:

Alignment tax: RLHF in successor models (text-davinci-003) improved conversational alignment but not core NLU.
Robustness gaps and structural prediction: Persistent weaknesses in NER, RE, numeric reasoning, and stability under adversarial or paraphrastic shifts.
Explanation reliability: Generated explanations do not consistently entail the predicted outputs or align with context, even in extractive QA.
Prompt sensitivity: Crafting optimal imperative prompts is crucial; performance deteriorates with suboptimal prompt styles or minor grammatical/shallow errors.

Recommended future directions include incorporating task-specific reward signals into fine-tuning, adversarial data augmentation, hybrid objectives balancing human alignment with raw task accuracy, broadened domain coverage for analogy evaluation, exploration of few-shot and prompt-tuning methods, and investigation of memorization effects versus genuine analogical reasoning. These findings characterize text-davinci-002 as a strong instruction-following NLU model, with unique advantages and empirical constraints in explanation-rich and reasoning tasks (Bhavya et al., 2022, Ye et al., 2022, Ye et al., 2023).

References

Bhavya et al. (2022). Analogy Generation by Prompting Large Language Models: A Case Study of InstructGPT.

Ye et al. (2022). The Unreliability of Explanations in Few-shot Prompting for Textual Reasoning.

Ye et al. (2023). A Comprehensive Capability Analysis of GPT-3 and GPT-3.5 Series Models.
