
Non-Determinism in GPT Models

Updated 22 November 2025
  • Non-determinism in GPT models is the phenomenon where identical inputs yield different outputs due to inherent randomness in sampling methods.
  • Empirical evaluations show that variations in temperature, top-k, and top-p sampling affect consistency in text annotation and code generation tasks.
  • Techniques such as pooling, best-of-N sampling, and parameter calibration are employed to quantify and mitigate non-determinism for improved model performance.

Non-determinism in GPT models denotes the phenomenon where, for a fixed input prompt and constant model parameters, repeated queries yield variable outputs. This arises from the stochastic nature of the autoregressive generation process, typically introduced through sampling procedures (such as temperature scaling, top-k, or nucleus [top-p] sampling) and is further influenced by decoding parameters, randomness in token selection, and even minute variations in prompt construction. Non-determinism is fundamental in characterizing the variability of GPT model outputs, affects both scientific validity and practical deployment, and sets the context for performance reporting and evaluation in natural language and code generation tasks.

1. Formal Definitions and Mechanistic Sources

In GPT-style models, let $x$ denote the input prompt and $y$ the generated output sequence. The model defines a conditional distribution $P(y|x)$. Deterministic decoding (greedy or argmax selection) maps $x$ to the most probable $y$; stochastic decoding samples $y \sim P(\cdot \mid x)$, yielding a distribution of possible outputs.

Primary algorithmic sources of non-determinism include:

  • Decoding strategy: Greedy decoding yields deterministic outputs, while stochastic methods (temperature, top-$k$, top-$p$) make each token selection probabilistic.
  • Temperature: Scaling logits by $1/T$ (temperature $T$) gives token probabilities

$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)},$$

which increases entropy for $T > 1$ and sharpens the distribution for $T < 1$.

  • Top-p (nucleus) sampling: Tokens are sampled from the smallest prefix $V_p$ of the probability-sorted vocabulary such that $\sum_{y \in V_p} P(y|x) \geq p$; higher thresholds increase non-determinism, lower values make the output more deterministic.
  • Other parameters: Top-$k$ restriction, repetition penalty, and length penalty modulate distributional support, influencing both variance and diversity of outputs.
  • Prompt micro-variability: Slight changes in input wording alter the sampled sequence distribution, compounding non-determinism (Reiss, 2023).

In the code generation context, even with deterministic-seeming settings (e.g., temperature set to 0), non-determinism persists, as evidenced by output variability across multiple runs (Ouyang et al., 2023, Donato et al., 7 Feb 2025).
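The decoding mechanisms above can be sketched against a toy next-token distribution. The logits, helper name, and sample counts below are illustrative, not drawn from any cited implementation:

```python
import math
import random

def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Temperature-scale logits, truncate to the nucleus, and sample one index."""
    if temperature == 0.0:                      # greedy: deterministic argmax
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]    # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    # nucleus (top-p): keep the smallest probability-sorted prefix with mass >= top_p
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    renorm = sum(probs[i] for i in kept)
    r, acc = rng.random() * renorm, 0.0
    for i in kept:
        acc += probs[i]
        if r <= acc:
            return i
    return kept[-1]

logits = [2.0, 1.5, 0.5, -1.0]
rng = random.Random(0)
stochastic = {sample_token(logits, temperature=1.0, top_p=0.95, rng=rng) for _ in range(50)}
greedy = {sample_token(logits, temperature=0.0) for _ in range(50)}
print(len(greedy), len(stochastic))  # greedy collapses to one token; sampling does not
```

Repeated greedy calls always return the argmax index, while 50 stochastic draws visit several tokens — the core of decoding-level non-determinism.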

2. Formal Evaluation and Quantification

To rigorously quantify non-determinism, repeated sampling is applied. For a given prompt $x$, $N$ independent decoding runs yield outputs $y^{(1)}, \dots, y^{(N)}$, which are evaluated with a task-specific score function $s(\cdot)$. Empirical statistics for non-determinism include:

  • Expected score:

$$\bar{s} = \frac{1}{N} \sum_{i=1}^{N} s\big(y^{(i)}\big)$$

  • Variance and performance gap:

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} \big(s(y^{(i)}) - \bar{s}\big)^2$$

$$\Delta = \max_i s\big(y^{(i)}\big) - \min_i s\big(y^{(i)}\big)$$
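The repeated-run statistics described here (expected score, variance, best-worst gap) are a few lines of arithmetic; the five scores below are toy values standing in for per-run evaluation results:

```python
def score_statistics(scores):
    """Mean, (population) variance, and best-worst gap over N repeated-run scores."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / n
    gap = max(scores) - min(scores)
    return mean, var, gap

# e.g. test-pass rates from five independent completions of the same prompt
mean, var, gap = score_statistics([0.6, 0.8, 0.6, 1.0, 0.4])
print(mean, var, gap)  # 0.68, 0.0416, 0.6 (up to float rounding)
```

A single greedy run would report only one of these five numbers; the gap of 0.6 shows how much a one-shot evaluation can over- or under-state performance.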

For classification, reliability is measured using agreement metrics such as Cohen’s $\kappa$ and Krippendorff’s $\alpha$:

$$\alpha = 1 - \frac{D_o}{D_e},$$

where $D_o$ is the observed disagreement and $D_e$ is the expected disagreement by chance. In zero-shot text annotation, values of $\alpha$ below conventional reliability thresholds imply insufficient reliability (Reiss, 2023). For code generation, semantic (test-pass variance, output equivalence rate), syntactic (LCS, edit distance), and structural (AST similarity) measures are reported (Ouyang et al., 2023, Donato et al., 7 Feb 2025).
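A simplified nominal-scale sketch of the $\alpha = 1 - D_o/D_e$ computation over repeated annotation runs; it omits Krippendorff's finite-sample correction, and the labels are toy data:

```python
from itertools import combinations

def simple_alpha(runs):
    """Simplified nominal Krippendorff-style alpha over repeated annotation runs.

    `runs` is a list of label lists, one per run, aligned by item. Observed
    disagreement D_o is the average pairwise mismatch within each item; expected
    disagreement D_e comes from the pooled label distribution. Sketch only:
    the real coefficient applies a finite-sample correction to D_e.
    """
    n_items = len(runs[0])
    pairs = list(combinations(range(len(runs)), 2))
    d_o = sum(runs[a][i] != runs[b][i]
              for i in range(n_items) for a, b in pairs) / (n_items * len(pairs))
    pooled = [lab for run in runs for lab in run]
    freqs = {lab: pooled.count(lab) / len(pooled) for lab in set(pooled)}
    d_e = 1.0 - sum(p * p for p in freqs.values())
    return 1.0 - d_o / d_e

perfect = simple_alpha([["pos", "neg"], ["pos", "neg"], ["pos", "neg"]])
noisy = simple_alpha([["pos", "neg"], ["neg", "neg"], ["pos", "pos"]])
print(perfect, noisy)  # identical runs give 1.0; near-random runs give a low value
```

Identical runs yield $\alpha = 1$; runs that disagree about as often as chance predicts push $\alpha$ toward (or below) zero, which is the regime Reiss (2023) flags as unreliable.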

3. Empirical Findings Across Task Types and Configurations

Non-determinism exhibits significant variation across task domains, model configurations, and evaluation settings.

Text Classification and Annotation

  • Classification outputs from GPT-3.5-Turbo over identical prompts at high temperature yield low Krippendorff’s $\alpha$, improving to acceptable levels only with low temperature and majority-voting aggregation (Reiss, 2023).
  • Minor prompt wording changes drastically affect label reliability: mean $\alpha$ drops markedly across minimal instruction paraphrases, rarely exceeding scientific thresholds for replicability unless extensive pooling (10-vote aggregation) is used.

Code Generation

  • In CodeContests, at default sampling settings, a majority of tasks had zero overlap in outputs across five completions (OER = 0), signaling extreme semantic non-determinism (Ouyang et al., 2023).
  • Even at temperature 0, a substantial fraction of tasks had zero output equivalence, and useful diversity in plausible code coverage was accessible only via multiple completions.
  • Syntactic non-determinism, measured by mean LCS and Levenshtein edit distances, corroborated these trends, as did AST-level structure comparisons.
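The syntactic measures (LCS length and Levenshtein edit distance) admit standard dynamic-programming implementations; the two toy completions below are illustrative, not actual model outputs:

```python
def levenshtein(a, b):
    """Levenshtein edit distance between two strings (or token sequences)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # (mis)match
        prev = cur
    return prev[-1]

def lcs_len(a, b):
    """Longest common subsequence length between two sequences."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

# two completions of the same prompt, differing only in variable names
v1 = "def add(a, b): return a + b"
v2 = "def add(x, y): return x + y"
print(levenshtein(v1, v2), lcs_len(v1, v2))  # 4 23
```

High LCS and low edit distance indicate completions that are near-identical at the character level even when they are not byte-for-byte equal.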

Task and Model Scaling Effects

  • Open-ended and generative benchmarks (e.g., HumanEval, GSM8K) display high output variance (per-run score spreads up to 1.8 points, best-worst gaps often above 10 EM points), while constrained-output tasks (MMLU, MixEval) consistently show low variance (Song et al., 2024).
  • Model scaling does not provide a universal reduction in non-determinism; for fixed configurations, variance is nearly invariant with model size, from single-digit-billion-parameter models up to much larger checkpoints (Song et al., 2024).

4. Impact of Decoding Parameters

The influence of temperature and top-p on non-determinism is nuanced.

  • Varying temperature at fixed top-p increases the diversity of methods covered by plausible completions, but only moderately affects per-request plausibility (a gain of roughly 4 percentage points, from 22.9% to 26.7%) (Donato et al., 7 Feb 2025).
  • Top-p exerts greater control over output quality and diversity: reducing top-p from 0.95 to 0.0 increases the plausible-code rate by 22.6 percentage points (from 26.7% to 49.3% per request) and raises the fraction of methods with at least one plausible completion by about 7 percentage points (from 75.4% to 82.3%) (Donato et al., 7 Feb 2025).
  • For code, pass@k curves demonstrate diminishing returns: five repeats recover the bulk of maximal plausible code coverage, highlighting both the necessity and sufficiency of repetition for reliability.
Parameter setting                   Per-request plausibility   Methods covered (≥1 plausible in 5 reps)
higher temperature, top-p = 0.95    22.9%                      70.3%
lower temperature, top-p = 0.95     26.7%                      75.4%
lower temperature, top-p = 0.0      49.3%                      82.3%
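The diminishing returns of repetition can be seen with the standard unbiased pass@k estimator (Chen et al., 2021); the sample counts below are toy values, not figures from the cited studies:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 20 completions of one task, 6 of which pass the tests
for k in (1, 5, 10):
    print(k, round(pass_at_k(20, 6, k), 3))  # 0.3, 0.871, 0.995
```

Most of the gain arrives within the first five repeats (0.30 → 0.87), matching the observation that a handful of completions recovers the bulk of plausible coverage.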

5. Variance Reduction Techniques: Pooling, Best-of-N, and Alignment

  • Pooling and Majority Voting: Aggregating up to 10 independent completions per prompt and using majority voting improves reliability in classification (Krippendorff’s $\alpha$ increases substantially), but rarely eliminates risk from non-determinism unless temperature is low and pooling is extensive (Reiss, 2023).
  • Best-of-N Sampling: In language and code generation, best-of-N (oracle selection) and reward-model-based re-ranking significantly boost attainable scores; for example, Llama-3-8B can outperform GPT-4-Turbo when N is large, indicating “latent” capabilities accessible through non-determinism exploitation (Song et al., 2024).
  • Alignment and Calibration: Fine-tuning with methods such as DPO, KTO, and SimPO reduces sampling variance and increases average quality for select tasks. However, benefits are not universal; certain alignment methods may lower performance on constrained tasks like MMLU (Song et al., 2024).
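Pooling and best-of-N selection reduce to a few lines each; the vote labels and the length-based reward function below are hypothetical stand-ins for a real scorer or reward model:

```python
from collections import Counter

def majority_vote(labels):
    """Pooling: return the most frequent label among repeated completions."""
    return Counter(labels).most_common(1)[0][0]

def best_of_n(candidates, reward):
    """Best-of-N: keep the candidate a scorer (oracle or reward model) ranks highest."""
    return max(candidates, key=reward)

votes = ["pos", "neg", "pos", "pos", "neg"]
print(majority_vote(votes))  # pos

# hypothetical reward: prefer the shorter of several sampled answers
print(best_of_n(["a long rambling answer", "concise"], reward=lambda s: -len(s)))  # concise
```

Majority voting suppresses per-run label noise, while best-of-N exploits the same run-to-run variability to surface the strongest single completion.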

6. Implications for Evaluation Practices and Reproducibility

Non-determinism, if not accounted for, degrades scientific replicability and robustness of empirical claims. Key recommendations include:

  • Report sample statistics (mean, variance, maximum) alongside greedy and single-run results (Song et al., 2024, Ouyang et al., 2023).
  • For code or open-ended generation, perform multiple independent samples per prompt; in most cases, at least five repetitions are needed to recover reliable best-case or expected scores (Donato et al., 7 Feb 2025).
  • Report all relevant sampling parameters (temperature, top-p, top-$k$) and the number of repetitions; provide raw outputs and prompts for exact replication (Donato et al., 7 Feb 2025, Ouyang et al., 2023).
  • Use reliability metrics (Cohen’s $\kappa$, Krippendorff’s $\alpha$) to assess annotation outputs, and validate models against human-annotated references (Reiss, 2023).
  • Avoid using temperature 0 as a shortcut to determinism; non-determinism persists, and coverage may drop compared to moderate diversity settings (Donato et al., 7 Feb 2025).
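These reporting recommendations amount to serializing the sampling configuration alongside per-run scores and summary statistics; the field names and values below are an illustrative sketch, not a standard schema:

```python
import json

# Minimal reproducibility record for one repeated-sampling evaluation.
record = {
    "model": "example-model",  # hypothetical identifier
    "sampling": {"temperature": 0.8, "top_p": 0.95, "top_k": 50, "n_repeats": 5},
    "prompt": "Write a function that reverses a string.",
    "scores": [0.6, 0.8, 0.6, 1.0, 0.4],  # one score per independent run
}
record["stats"] = {
    "mean": sum(record["scores"]) / len(record["scores"]),
    "max": max(record["scores"]),
    "gap": max(record["scores"]) - min(record["scores"]),
}
print(json.dumps(record["stats"], sort_keys=True))
```

Archiving such a record (plus the raw outputs) per prompt is enough to let a third party replay the evaluation under the exact sampling configuration.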

7. Significance and Future Directions

Non-determinism must not be viewed as a peripheral artifact but rather as a principal axis of evaluation and optimization in GPT models. It is integral to understanding downstream task performance, guiding practical parameter settings, enabling advanced techniques (e.g., best-of-N harnessing “hidden” model competence), and ensuring reliable deployment in both scientific and real-world applications. Future research directions include the development of improved ranking architectures for output selection, exploration of calibration techniques that further align sampling distributions with optimal outputs, and the theoretical characterization of non-determinism’s limits under various decoding regimes (Song et al., 2024, Donato et al., 7 Feb 2025, Ouyang et al., 2023, Reiss, 2023).
