
What do Language Model Probabilities Represent? From Distribution Estimation to Response Prediction (2505.02072v1)

Published 4 May 2025 in cs.CL and cs.AI

Abstract: The notion of language modeling has gradually shifted in recent years from a distribution over finite-length strings to general-purpose prediction models for textual inputs and outputs, following appropriate alignment phases. This paper analyzes the distinction between distribution estimation and response prediction in the context of LLMs, and their often conflicting goals. We examine the training phases of LLMs, which include pretraining, in-context learning, and preference tuning, and also the common use cases for their output probabilities, which include completion probabilities and explicit probabilities as output. We argue that the different settings lead to three distinct intended output distributions. We demonstrate that NLP works often assume that these distributions should be similar, which leads to misinterpretations of their experimental findings. Our work sets firmer formal foundations for the interpretation of LLMs, which will inform ongoing work on the interpretation and use of LLMs' induced distributions.

Summary

  • The paper formalizes the shift from distribution estimation to response prediction, clarifying how LLM output probabilities represent different underlying distributions.
  • It details how varied training stages—pretraining, supervised fine-tuning, and RLHF—influence the model’s focus on either distribution estimation or response prediction.
  • It advises practitioners to carefully align inference methods with their goals to avoid misinterpreting mode-seeking behavior as reflecting true event probabilities.

This paper, "What do LLM Probabilities Represent? From Distribution Estimation to Response Prediction" (2505.02072), analyzes the fundamental shift in how LLMs are used and interpreted, moving from traditional language modeling (distribution estimation) towards general-purpose prediction and response generation. The authors argue that this shift, combined with common training and inference strategies, leads to differing interpretations of LLM output probabilities, which are often conflated in existing research, leading to misinterpretations.

The paper formalizes the distinction between three core concepts:

  1. Source Distribution Estimation: Learning a model that approximates the data-generating (source) distribution $p(y \mid x)$. This is the traditional goal of language modeling, aiming to capture the statistical patterns of language use.
  2. Target Distribution Estimation: Estimating a distribution $p^*(y \mid x)$ that differs from the source distribution, typically because the source data is biased or corrupted.
  3. Response Prediction: Returning a single optimal output $y$ for a given input $x$. Predictors are typically optimized to minimize a loss function, such as misclassification error, often by selecting the mode (argmax) of some distribution; a minimal sketch contrasting estimation and prediction follows this list.
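
To make the contrast concrete, here is a minimal, self-contained sketch in Python; the toy distribution and helper names are ours, not the paper's. A distribution estimator is scored on how closely its probabilities match the source, while a response predictor only needs to return the mode:

```python
import math

# Toy source distribution over responses y for a fixed input x
# (illustrative numbers, not from the paper).
source = {"yes": 0.6, "no": 0.3, "maybe": 0.1}

def estimation_loss(model: dict[str, float]) -> float:
    """Cross-entropy of the model against the source distribution;
    minimized exactly when model == source (distribution estimation)."""
    return -sum(p * math.log(model[y]) for y, p in source.items())

def predict(model: dict[str, float]) -> str:
    """Response prediction: return the single mode (argmax) of the model."""
    return max(model, key=model.get)

perfect_estimator = dict(source)                        # matches the source
mode_seeker = {"yes": 0.98, "no": 0.01, "maybe": 0.01}  # mass piled on the mode

print(estimation_loss(perfect_estimator))  # lowest achievable cross-entropy
print(estimation_loss(mode_seeker))        # much worse as an estimator...
print(predict(mode_seeker) == predict(perfect_estimator))  # ...same prediction
```

The point of the sketch is that the two objectives pull in different directions: a model that is optimal for one can look degenerate under the other.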

The paper examines how modern LLMs are trained and used, linking these processes to the different interpretations of probability. Common training stages include:

  • Pretraining: Typically involves large-scale next-token prediction, minimizing cross-entropy on a vast text corpus (the standard objective is written out after this list). This primarily trains the model for Source Distribution Estimation.
  • Adaptation/Fine-tuning:
    • Supervised Fine-Tuning (SFT): Training on datasets of instruction-response pairs, minimizing cross-entropy on "helpful" responses. This biases the model towards generating specific, desired outputs for given instructions, aligning it more with Response Prediction.
    • Human Preferences (e.g., RLHF): Training the model to maximize a reward function that reflects human preferences. This explicitly optimizes the model's output behavior for a given prompt, further emphasizing Response Prediction.
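
For reference, the pretraining objective described above is the standard next-token cross-entropy over a corpus $\mathcal{D}$ (notation ours):

$$\mathcal{L}_{\text{pretrain}}(\theta) = -\,\mathbb{E}_{w \sim \mathcal{D}} \sum_{t=1}^{|w|} \log p_\theta(w_t \mid w_{<t})$$

Minimizing this loss over a sufficiently expressive model class drives $p_\theta$ toward the data-generating distribution, which is why pretraining reads as Source Distribution Estimation.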

During inference, the model outputs probabilities over the next token. However, users are often interested in the probability of a referent event $y$, which may correspond to multiple possible output strings $w$. A mapping function $\phi: V^* \to \mathcal{Y}$ translates strings to events. The paper discusses two main inference approaches to obtain probabilities over events:

  • Logit Probabilities: Summing the generation probabilities $p(w \mid \text{context})$ over all strings $w$ that map to a specific event $y$ via $\phi(w) = y$; a toy sketch of this marginalization follows this list. This can be done with:
    • Naive completion: Using the raw generation probability $p(w \mid x)$ for context $x$.
    • Zero-shot/Few-shot Instruction: Using the generation probability $p(w \mid I(x))$ or $p(w \mid FS(x))$, where $I(x)$ or $FS(x)$ formats $x$ as an instruction, potentially with examples.
  • Explicit Probability Report: Instructing the model to verbally state probabilities (e.g., "The probability is 80%").
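
A toy sketch of the logit-probability route (the completion table, surface forms, and numbers below are invented for illustration): the probability of an event $y$ is obtained by summing the generation probabilities of all strings $w$ with $\phi(w) = y$.

```python
# Hypothetical completion probabilities p(w | context) for a yes/no question;
# the surface forms and numbers are illustrative, not from the paper.
p_completion = {
    "Yes": 0.40, "yes": 0.15, "Yeah": 0.05,  # surface forms of the event "yes"
    "No": 0.25, "no": 0.10, "Nope": 0.05,    # surface forms of the event "no"
}

def phi(w: str) -> str:
    """The mapping phi: strings -> events (here, a simple alias table)."""
    return "yes" if w.lower() in {"yes", "yeah"} else "no"

def event_probability(event: str) -> float:
    """Sum p(w | context) over all strings w that phi maps to the event."""
    return sum(p for w, p in p_completion.items() if phi(w) == event)

print(event_probability("yes"))  # 0.60
print(event_probability("no"))   # 0.40
```

With a real LLM, `p_completion` would come from the model's scored continuations; the marginalization step is the same.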

The paper argues that the appropriate interpretation of probabilities depends heavily on the user's goal and the inference method used.

  • Text completion: The goal is to generate text similar to human-written text, which aligns with Source Distribution Estimation. Naive completion (I1a) from a pretrained model (T1) is appropriate, ideally reflecting $p_{LM}(w \mid x)$.
  • Response generation: The goal is to produce a "correct" or preferred response to a query, aligning with Response Prediction. Instruction-based inference (I1b/I1c) with SFT (T2a) and/or RLHF (T2b) is used. The output probability distribution is shaped by the training objectives to prioritize high-quality responses.
  • Event modeling: The goal is to estimate the true probability of a world event $y$, aligning with Target Distribution Estimation. This is the most challenging use case. Logit probabilities obtained via instruction-based inference will likely reflect the response distribution (predicting the mode), not the true event distribution. Naive completion might reflect the event distribution only if the pretraining data is an unbiased sample of events, which is often not the case (e.g., reporting bias [paik-etal-2021-world]). Explicit probability reporting (I2) is presented as potentially the most aligned method, but it depends heavily on the model's ability to infer and express true event probabilities accurately, requiring extensive adaptation training; a minimal parsing sketch follows this list.
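
As a small illustration of the explicit-report route (I2), the verbalized statement must be converted back into a number; the parser below is our own sketch, not a method from the paper.

```python
import re

def parse_explicit_probability(text: str) -> float | None:
    """Extract a verbally reported probability such as 'The probability is 80%'
    or '... roughly 0.35'; returns a value in [0, 1], or None if none found."""
    m = re.search(r"(\d+(?:\.\d+)?)\s*%", text)  # percentage form first
    if m:
        return float(m.group(1)) / 100.0
    m = re.search(r"\b(0?\.\d+|0|1)\b", text)    # bare decimal in [0, 1]
    return float(m.group(1)) if m else None

print(parse_explicit_probability("The probability is 80%."))  # 0.8
print(parse_explicit_probability("I'd say roughly 0.35."))    # 0.35
```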

The authors use a simplified agent model based on Belief-Desire-Intention (BDI) to analyze ideal output distributions for different prompts and scenarios:

  • Multiple Descriptions of an Outcome: If an input does not impose a specific desire, the LM should reflect the frequency of different ways to express the same event, aligning with traditional language modeling and Source Distribution Estimation.
  • Observed Outcome: If the agent observes the outcome and its desire is to report it faithfully, the output probability should ideally reflect the true event probability. However, real-world training data and agent biases mean the reported distribution might not be unbiased, distinguishing this from ideal Target Distribution Estimation.
  • Unobserved Outcome: If the agent does not know the outcome and is instructed to predict it, the rational behavior for maximal accuracy is to predict the mode (the most likely outcome). The output distribution in this case should ideally place all mass on the mode, aligning with Response Prediction; a small numerical check follows this list. The paper discusses how training (T2a, T2b) and in-context examples (I1c) can influence the agent's choice of prediction function.
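
A tiny numerical check of the unobserved-outcome case (the numbers are ours): against a fixed true event distribution, probability matching is strictly worse in expected accuracy than always predicting the mode, which is why an accuracy-maximizing agent should collapse its output onto the mode.

```python
# True (unobserved) event distribution; illustrative numbers.
p_true = {"A": 0.6, "B": 0.4}

# Expected accuracy of probability matching: sum_y p(y) * q(y) with q = p.
matching_acc = sum(p * p for p in p_true.values())  # 0.36 + 0.16 = 0.52

# Expected accuracy of always predicting the mode: max_y p(y).
mode_acc = max(p_true.values())                     # 0.60

print(matching_acc, mode_acc)  # 0.52 < 0.60: mode-seeking wins on accuracy
```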

The paper highlights several common misconceptions arising from implicitly assuming equality between these different distributions:

  • Completion distribution $\neq$ Response distribution: It is unwarranted to compare raw completion probabilities (distribution estimation) with probabilities elicited via metalinguistic prompts such as "What word is most likely?", which reflect response prediction [hu-levy-2023-prompting]. Similarly, evaluating text generation with metrics suited to prediction (such as highest likelihood) overlooks that text generation is a distribution estimation task.
  • Response distribution $\neq$ Event distribution: Interpreting the generation probability of a response as a direct measure of the model's confidence, or of the correctness probability of the underlying assertion [yona-etal-2024-large, liu-etal-2023-cognitive], is incorrect. Calibration studies that treat generation probabilities as confidence scores [guo2017calibration, 10.1145/3618260.3649777, zhang-etal-2024-calibrating] often find poor calibration because a model's optimal strategy for response prediction is not necessarily to generate with a frequency matching the event probability; see the sketch after this list.
  • Completion distribution $\neq$ Event distribution: Assuming that the frequency of certain statements in pretraining data (completion distribution) should reflect the truthfulness or correctness probability of the underlying event (event distribution) [sorensen2024position] is problematic. Models trained on text reflecting population beliefs might generate incorrect information if the belief distribution doesn't match the truth (the "is-ought problem"). Equating annotator agreement distribution with correctness probability [baan-etal-2022-stop] also falls into this category.
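
To see concretely why such calibration studies can report poor calibration, consider a synthetic setup (ours, not the paper's): a mode-seeking model emits its answer with near-certain generation probability even when the underlying event is far less certain.

```python
# Synthetic binary QA setup: the model always outputs the mode with
# generation probability ~0.98, but the mode is correct only 60% of the time.
reported_confidence = 0.98  # probability mass the mode-seeking model emits
empirical_accuracy = 0.60   # how often the mode answer is actually correct

# Single-bin calibration gap (a degenerate expected calibration error):
calibration_gap = abs(reported_confidence - empirical_accuracy)
print(calibration_gap)  # 0.38: badly calibrated, yet an optimal predictor
```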

In conclusion, the paper emphasizes that LLMs, especially after adaptation phases like SFT and RLHF, are often optimized for response prediction, where the output probability distribution aims to maximize accuracy or preference, frequently leading to mode-seeking behavior. This is distinct from traditional language modeling (source distribution estimation) and the estimation of true world-event probabilities (target distribution estimation). Practitioners should be cautious when interpreting LLM probabilities, understanding which underlying distribution the model is likely representing based on its training and the inference strategy used. Explicit probability reporting, while challenging to implement robustly, appears most aligned with the goal of event modeling. The paper calls for clearer formal foundations for interpreting LLM probabilities to avoid misinterpretations in research and application.
