Non-Determinism in GPT Models
- Non-determinism in GPT models is the phenomenon where identical inputs yield different outputs due to inherent randomness in sampling methods.
- Empirical evaluations show that variations in temperature, top-k, and top-p sampling affect consistency in text annotation and code generation tasks.
- Techniques such as pooling, best-of-N sampling, and parameter calibration are employed to quantify and mitigate non-determinism for improved model performance.
Non-determinism in GPT models denotes the phenomenon where, for a fixed input prompt and constant model parameters, repeated queries yield variable outputs. This arises from the stochastic nature of the autoregressive generation process, typically introduced through sampling procedures (such as temperature scaling, top-k, or nucleus [top-p] sampling) and is further influenced by decoding parameters, randomness in token selection, and even minute variations in prompt construction. Non-determinism is fundamental in characterizing the variability of GPT model outputs, affects both scientific validity and practical deployment, and sets the context for performance reporting and evaluation in natural language and code generation tasks.
1. Formal Definitions and Mechanistic Sources
In GPT-style models, let $x$ denote the input prompt and $y$ the generated output sequence. The model defines a conditional distribution $p_\theta(y \mid x)$. Deterministic decoding (greedy or argmax decoding) maps $x$ to the most probable $y^{*} = \arg\max_y p_\theta(y \mid x)$; stochastic decoding samples $y \sim p_\theta(y \mid x)$, yielding a distribution of possible outputs.
Primary algorithmic sources of non-determinism include:
- Decoding strategy: Greedy decoding yields deterministic outputs, while stochastic methods (temperature, top-$k$, top-$p$) make each token selection probabilistic.
- Temperature: Scaling logits by $1/T$ (temperature $T > 0$), with $p_i = \exp(z_i/T) / \sum_j \exp(z_j/T)$, increases entropy for $T > 1$ and sharpens the distribution for $T < 1$.
- Top-p (nucleus) sampling: Tokens are sampled from the smallest probability-ranked prefix $V_p$ of the vocabulary such that $\sum_{w \in V_p} p(w) \ge p$; higher thresholds increase non-determinism, lower values make the output more deterministic.
- Other parameters: Top-$k$ restriction, repetition penalty, and length penalty modulate distributional support, influencing both variance and diversity of outputs.
- Prompt micro-variability: Slight changes in input wording alter the sampled sequence distribution, compounding non-determinism (Reiss, 2023).
In the code generation context, even with deterministic-seeming settings (e.g., temperature $T = 0$), nonzero non-determinism persists, as evidenced by persistent output variability across multiple runs (Ouyang et al., 2023, Donato et al., 7 Feb 2025).
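To make the sampling mechanics above concrete, the following is a minimal Python sketch of stochastic single-token decoding with temperature, top-$k$, and top-$p$ filtering; the function and array handling are illustrative and not tied to any particular model API.

```python
# Minimal sketch of stochastic decoding for a single token, assuming raw
# logits from a language-model head. Illustrative only; not a specific API.
import numpy as np

def sample_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Sample a token id from logits with temperature, top-k, and top-p filtering."""
    rng = rng or np.random.default_rng()
    if temperature == 0:                          # degenerate case: greedy decoding
        return int(np.argmax(logits))
    probs = np.exp((logits - np.max(logits)) / temperature)
    probs /= probs.sum()                          # softmax over temperature-scaled logits
    if top_k is not None:                         # keep only the k most probable tokens
        kth_largest = np.sort(probs)[-top_k]
        probs = np.where(probs >= kth_largest, probs, 0.0)
        probs /= probs.sum()
    if top_p is not None:                         # nucleus: smallest prefix with mass >= top_p
        order = np.argsort(probs)[::-1]
        cutoff = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
        mask = np.zeros_like(probs)
        mask[order[:cutoff]] = 1.0
        probs *= mask
    probs /= probs.sum()                          # renormalize over the surviving support
    return int(rng.choice(len(probs), p=probs))
```

Even with a fixed local seed, deployed APIs generally do not expose this level of control, and hardware-level effects (non-associative floating-point reductions, batching) can perturb the logits themselves, which is consistent with the residual variability reported at temperature $0$.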
2. Formal Evaluation and Quantification
To rigorously quantify non-determinism, repeated sampling is applied. For a given prompt $x$, $N$ independent decoding runs yield outputs $y^{(1)}, \ldots, y^{(N)}$, which are evaluated using a task-specific metric $s(\cdot)$. Empirical statistics for non-determinism include:
- Expected score: $\bar{s} = \frac{1}{N} \sum_{i=1}^{N} s(y^{(i)})$
- Variance and performance gap: $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} \big(s(y^{(i)}) - \bar{s}\big)^2$ and $\Delta = \max_i s(y^{(i)}) - \min_i s(y^{(i)})$
For classification, reliability is measured using agreement metrics such as Cohen's $\kappa$ and Krippendorff's $\alpha = 1 - D_o / D_e$, where $D_o$ is the observed disagreement and $D_e$ is the disagreement expected by chance. In zero-shot text annotation, $\alpha$ values below conventional reliability thresholds indicate insufficient reliability (Reiss, 2023). For code generation, semantic (test-pass variance, output equivalence rate), syntactic (LCS, edit distance), and structural (AST similarity) measures are reported (Ouyang et al., 2023, Donato et al., 7 Feb 2025).
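As a concrete illustration of the repeated-sampling protocol above, the sketch below runs the same prompt $N$ times and reports the empirical mean, spread, gap, and best-of-$N$ score; `generate` and `score` are hypothetical placeholders for a model call and a task metric, not a real library API.

```python
# Hedged sketch: quantifying non-determinism by repeated sampling.
# `generate` and `score` are placeholders, not a real library API.
import statistics

def nondeterminism_stats(generate, score, prompt, n=10):
    """Query the same prompt n times and summarize score variability."""
    scores = [score(generate(prompt)) for _ in range(n)]
    return {
        "mean": statistics.mean(scores),       # expected score (s-bar)
        "std": statistics.pstdev(scores),      # spread across runs
        "gap": max(scores) - min(scores),      # best-worst performance gap
        "best_of_n": max(scores),              # oracle best-of-N score
    }
```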
3. Empirical Findings Across Task Types and Configurations
Non-determinism exhibits significant variation across task domains, model configurations, and evaluation settings.
Text Classification and Annotation
- Classification outputs from GPT-3.5-Turbo over identical prompts at high temperature yield low Krippendorff's $\alpha$, improving only partially at low temperature with majority-voting aggregation (Reiss, 2023).
- Minor prompt wording changes drastically affect label reliability: mean $\alpha$ drops to $0.43$ across minimal instruction paraphrases, rarely exceeding scientific thresholds for replicability unless extensive pooling (10-vote aggregation) is used.
Code Generation
- In CodeContests under standard (nonzero-temperature) sampling, a large fraction of tasks had zero overlap in outputs across five completions (OER = 0), signaling extreme semantic non-determinism (Ouyang et al., 2023).
- Even at temperature $0$, a nontrivial share of tasks had zero output equivalence, and useful diversity in plausible code coverage was accessible only via multiple completions.
- Syntactic non-determinism, measured by mean LCS and Levenshtein edit distances, corroborated these trends, as did AST-level structure comparisons; a minimal measurement sketch follows this list.
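The sketch below illustrates the kinds of comparisons described in this list: a pairwise output equivalence rate over observed test outputs and a mean pairwise string-similarity ratio, using difflib's `SequenceMatcher` as a simple stand-in for the LCS and Levenshtein metrics used in the cited papers.

```python
# Illustrative comparison of multiple completions for one task.
# `test_outputs` are observed program outputs; `completions` are code strings.
from difflib import SequenceMatcher
from itertools import combinations

def output_equivalence_rate(test_outputs):
    """Fraction of completion pairs whose test outputs are exactly equal (OER)."""
    pairs = list(combinations(test_outputs, 2))
    return sum(a == b for a, b in pairs) / len(pairs) if pairs else 1.0

def mean_syntactic_similarity(completions):
    """Mean pairwise character-level similarity between generated code strings."""
    pairs = list(combinations(completions, 2))
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)
```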
Task and Model Scaling Effects
- Open-ended and generative benchmarks (e.g., HumanEval, GSM8K) display high output variance (standard deviations up to $1.8$ points, with score gaps often above 10 EM points), while constrained-output tasks (MMLU, MixEval) consistently show low variance (Song et al., 15 Jul 2024).
- Model scaling (from $0.5$B up to $7$B or larger) does not provide a universal reduction in non-determinism; variance is nearly invariant with model size for fixed configurations (Song et al., 15 Jul 2024).
4. Impact of Decoding Parameters
The influence of temperature and top-p on non-determinism is nuanced.
- Varying temperature at fixed top-p ($0.95$) increases the diversity of methods covered by plausible completions, but only moderately affects per-request plausibility (a gain of a few percentage points as temperature rises to $1.2$) (Donato et al., 7 Feb 2025).
- Top-p exerts greater control over output quality and diversity: reducing top-p from $0.95$ to $0.0$ roughly doubles the per-request plausible code rate (to 49.3%) and further increases the fraction of methods with at least one plausible completion (Donato et al., 7 Feb 2025); see the table below.
- For code, pass@k curves demonstrate diminishing returns: five repeats recover $80\%$ or more of maximal plausible code coverage, highlighting both the necessity and sufficiency of repetition for reliability (a pass@k estimator sketch follows the table).
| Configuration | Per-request plausibility | Methods covered (≥1 plausible in 5 reps) |
|---|---|---|
| Lower temperature, top-p = 0.95 | 22.9% | 70.3% |
| Higher temperature, top-p = 0.95 | 26.7% | 75.4% |
| top-p = 0.0 | 49.3% | 82.3% |
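The diminishing-returns behavior of repetition can be read off the standard unbiased pass@k estimator from the code-generation evaluation literature, sketched below for $n$ samples of which $c$ pass the tests.

```python
# Standard unbiased pass@k estimator: probability that at least one of k
# samples drawn (without replacement) from n generated solutions passes.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:          # fewer than k failing samples: every k-subset contains a pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with $n = 20$ samples of which $c = 4$ pass, pass@1 is $0.20$ while pass@5 is already about $0.72$, illustrating why a handful of repetitions recovers most of the attainable coverage.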
5. Variance Reduction Techniques: Pooling, Best-of-N, and Alignment
- Pooling and Majority Voting: Aggregating up to 10 independent completions per prompt and using majority voting improves reliability in classification (Krippendorff's $\alpha$ increases by $0.1$–$0.2$), but rarely eliminates risk from non-determinism unless both the temperature is low and pooling is extensive (Reiss, 2023); a minimal pooling/selection sketch follows this list.
- Best-of-N Sampling: In language and code generation, best-of-N (oracle selection) and reward-model-based re-ranking significantly boost attainable scores; for example, Llama-3-8B can outperform GPT-4-Turbo when N is large, indicating “latent” capabilities accessible through non-determinism exploitation (Song et al., 15 Jul 2024).
- Alignment and Calibration: Fine-tuning with methods such as DPO, KTO, and SimPO reduces sampling variance and increases average quality for select tasks. However, benefits are not universal; certain alignment methods may lower performance on constrained tasks like MMLU (Song et al., 15 Jul 2024).
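The first two strategies in this list admit very small implementations; the sketch below shows majority-vote pooling over repeated annotations and best-of-$N$ selection under an arbitrary scoring function (an oracle metric or a reward model). All names here are illustrative.

```python
# Hedged sketch of two mitigation strategies: majority-vote pooling and
# best-of-N selection. `score` may be an oracle metric or a reward model.
from collections import Counter

def majority_vote(labels):
    """Pool repeated classifications of one item into a single label."""
    return Counter(labels).most_common(1)[0][0]

def best_of_n(candidates, score):
    """Return the highest-scoring candidate among N completions."""
    return max(candidates, key=score)
```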
6. Implications for Evaluation Practices and Reproducibility
Non-determinism, if not accounted for, degrades scientific replicability and robustness of empirical claims. Key recommendations include:
- Always report sample statistics (mean, variance, and best-of-$N$ scores) alongside greedy and single-run results (Song et al., 15 Jul 2024, Ouyang et al., 2023).
- For code or open-ended generation, perform multiple independent samples per prompt; in most cases, at least five repetitions are needed to recover reliable best-case or expected scores (Donato et al., 7 Feb 2025).
- Report all relevant sampling parameters (temperature, top-p, top-$k$) and the number of repetitions; provide raw outputs and prompts for exact replication (Donato et al., 7 Feb 2025, Ouyang et al., 2023). A minimal logging sketch is given after this list.
- Use reliability metrics (Cohen's $\kappa$, Krippendorff's $\alpha$) to assess annotation outputs, and validate models against human-annotated references (Reiss, 2023).
- Avoid using temperature $0$ as a shortcut to determinism; non-determinism persists, and coverage may drop compared to moderate-diversity settings (Donato et al., 7 Feb 2025).
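A reproducibility-minded evaluation loop only needs to persist the prompt, the full sampling configuration, and every raw completion. The sketch below does this with a hypothetical `client.generate` call; the call signature and JSON field names are assumptions for illustration, not a real API.

```python
# Minimal reproducibility log for repeated sampling. `client.generate` and the
# JSON field names are assumptions for illustration, not a real API.
import json
import time

def run_and_log(client, prompt, n=5, temperature=0.8, top_p=0.95, top_k=None,
                path="runs.jsonl"):
    record = {
        "timestamp": time.time(),
        "prompt": prompt,
        "params": {"temperature": temperature, "top_p": top_p,
                   "top_k": top_k, "repetitions": n},
        "outputs": [client.generate(prompt, temperature=temperature,
                                    top_p=top_p, top_k=top_k)
                    for _ in range(n)],
    }
    with open(path, "a") as f:                 # append one JSON record per prompt
        f.write(json.dumps(record) + "\n")
    return record
```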
7. Significance and Future Directions
Non-determinism must not be viewed as a peripheral artifact but rather as a principal axis of evaluation and optimization in GPT models. It is integral to understanding downstream task performance, guiding practical parameter settings, enabling advanced techniques (e.g., best-of-N harnessing “hidden” model competence), and ensuring reliable deployment in both scientific and real-world applications. Future research directions include the development of improved ranking architectures for output selection, exploration of calibration techniques that further align sampling distributions with optimal outputs, and the theoretical characterization of non-determinism’s limits under various decoding regimes (Song et al., 15 Jul 2024, Donato et al., 7 Feb 2025, Ouyang et al., 2023, Reiss, 2023).