First-Token Log-Probabilities in Language Models

Updated 28 October 2025
  • First-token log-probabilities are the logarithm of the probability assigned to the first generated token, capturing model uncertainty and the influence of prior knowledge.
  • They play a critical role in model initialization, evaluation robustness, and semantic plausibility, linking bias initialization to training corpus statistics.
  • Practical applications include forecasting events, retrieval augmentation, hallucination detection, and efficient chain-of-thought reasoning compression.

First-token log-probabilities are a foundational concept in probabilistic language modeling, representing the logarithm of the probability assigned to the first token generated by an LLM after a prompt or conditioning context. The quantity encapsulates the model’s output uncertainty and prior knowledge before any further autoregressive generation occurs. This notion is central to a wide range of research on model initialization, evaluation robustness, semantic plausibility, hallucination detection, context sensitivity, domain adaptation, and model compression.

1. Mathematical Foundations and Model Architectures

In standard autoregressive LLMs, the conditional output probability for the next token $y$ given context $y_{<t}$ is computed via a softmax over a linear transformation:

$$P(y \mid y_{<t}) = \mathrm{softmax}(W h + b)_y$$

where $W$ is the output weight matrix, $h$ is the context-dependent representation, and $b$ is a bias vector. The probability can be factorized as

$$P(y \mid y_{<t}) \propto e^{(W h)_y} \cdot e^{b_y}.$$

At the first generation step (i.e., when $h$ encodes limited or no context), the model’s prediction can be dominated by the bias term, and the log-probabilities assigned to the first tokens critically reflect any prior encoded in $b$ (Meister et al., 2022).

In some models, explicit log-probability vectors over the vocabulary are available:

$$r = [\log P(t_1 \mid x),\ \log P(t_2 \mid x),\ \ldots,\ \log P(t_{|V|} \mid x)]$$

where $x$ denotes the full prompt/context and $t_j$ are the candidate tokens (Kainan et al., 26 Oct 2025).
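For concreteness, the vector $r$ can be read directly from a causal LM’s output logits. The following is a minimal sketch, assuming a Hugging Face causal LM in PyTorch; the model name "gpt2" and the helper name `first_token_logprobs` are illustrative choices, not part of the cited work.

```python
# Minimal sketch: extract the first-token log-probability vector r for a prompt.
# Assumes a Hugging Face causal LM (here "gpt2", purely as an example) and PyTorch.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def first_token_logprobs(prompt: str) -> torch.Tensor:
    """Return log P(t_j | prompt) over the full vocabulary (shape [|V|])."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits          # [1, seq_len, |V|]; logits = W h + b
    # The logits at the last prompt position parameterize the distribution
    # of the *first* generated token.
    return torch.log_softmax(logits[0, -1, :], dim=-1)

r = first_token_logprobs("The capital of France is")
top = torch.topk(r, k=5)
for logp, idx in zip(top.values, top.indices):
    print(tokenizer.decode(idx), float(logp))
```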

2. Statistical Priors, Bias Initialization, and Token Frequency

Early in training, models tend to mimic the unigram distribution of the training corpus. Initializing the bias vector as

$$b_y = \log \pi_y$$

where $\pi_y$ is the empirical unigram probability of token $y$, directly encodes this distribution into the first-token probabilities. This initialization ensures that, in the absence of context, first-token probabilities reproduce the corpus statistics, yielding improved sample efficiency and enabling the model to specialize subsequent contextual layers ($W h$) on non-frequency effects (Meister et al., 2022). During training, the bias term remains remarkably stable, capturing frequency statistics while the contextual parameters are free to model more complex dependencies.
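A minimal sketch of this initialization, assuming corpus token counts have already been gathered (the sizes and names below are illustrative placeholders):

```python
# Sketch of unigram bias initialization (b_y = log pi_y). Token counts are
# placeholders; a real run would use counts from the training corpus.
import torch
import torch.nn as nn

vocab_size, hidden_size = 50_000, 768
token_counts = torch.ones(vocab_size)            # placeholder: real corpus counts go here

# Add-one smoothing so unseen tokens do not receive a -inf bias.
unigram = (token_counts + 1) / (token_counts + 1).sum()

output_layer = nn.Linear(hidden_size, vocab_size, bias=True)
with torch.no_grad():
    output_layer.bias.copy_(unigram.log())       # encode the corpus prior in b

# With h near zero early in training, softmax(W h + b) ≈ softmax(b), i.e. the
# unigram distribution, so first-token probabilities reproduce corpus statistics.
```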

In output embedding spaces, there exists a log-linear relationship between average token probabilities and token embeddings:

$$-\log \alpha_{w,\mathcal{D},\theta} \approx A_\mathcal{D} \cdot E^{(o)}_w + B_\mathcal{D}$$

where $E^{(o)}_w$ is the output embedding for token $w$, and $A_\mathcal{D}$, $B_\mathcal{D}$ are dataset-dependent constants (Cho et al., 3 Jun 2024). This linear encoding is sparse, and early in training, output embeddings capture token frequency structure well before convergence of other model parameters.
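Given measured average token probabilities and the output embedding matrix, the constants $A_\mathcal{D}$ and $B_\mathcal{D}$ can be fit by ordinary least squares. The sketch below uses random placeholders for both quantities; it illustrates the stated relationship rather than the procedure of the cited work.

```python
# Hedged sketch: fit A_D and B_D in  -log(avg_prob_w) ≈ A_D · E_w + B_D
# by least squares. E_out and avg_prob are random placeholders here.
import numpy as np

V, d = 5_000, 128
E_out = np.random.randn(V, d)                  # placeholder output embeddings [|V|, d]
avg_prob = np.random.dirichlet(np.ones(V))     # placeholder average token probabilities

y = -np.log(avg_prob)                          # target: -log alpha_w
X = np.hstack([E_out, np.ones((V, 1))])        # extra column of ones for the intercept B_D
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
A_D, B_D = coef[:-1], coef[-1]

r2 = 1 - np.sum((X @ coef - y) ** 2) / np.sum((y - y.mean()) ** 2)
print("R^2 of the log-linear fit:", r2)
```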

3. Evaluation, Robustness, and Limitations

First-token log-probabilities have traditionally been used for evaluation in multiple-choice question answering (MCQA) by ranking candidate answers according to their first-token score. However, in instruction-tuned models, the first-token choice frequently disagrees with the answer given in the model’s full text response, with mismatch rates exceeding 60%, particularly when the model’s response style includes preambles or refusals (Wang et al., 22 Feb 2024). Mismatch between first-token predictions and completed text responses increases as the model is further fine-tuned for conversational intent or refusal handling.
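A minimal sketch of first-token MCQA scoring, reusing the illustrative `first_token_logprobs` helper and tokenizer from Section 1 (the option labels and prompt template are assumptions, not any benchmark’s exact format):

```python
# Hedged sketch of first-token MCQA scoring: rank answer options by the
# log-probability their label receives as the first generated token.
OPTIONS = ["A", "B", "C", "D"]

def first_token_choice(question: str, options: list[str]) -> str:
    prompt = question + "\n" + "\n".join(
        f"{label}. {text}" for label, text in zip(OPTIONS, options)
    ) + "\nAnswer:"
    r = first_token_logprobs(prompt)
    scores = {}
    for label in OPTIONS[: len(options)]:
        token_ids = tokenizer.encode(" " + label)   # leading space: typical BPE convention
        scores[label] = float(r[token_ids[0]])      # score = first-token log-probability
    return max(scores, key=scores.get)

# The mismatch rate discussed above compares this label with the label actually
# named in the model's free-form text answer.
```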

Robustness analysis reveals that text-based evaluations (classifying the full model output) are consistently less sensitive to perturbations in option order, typos, and swaps than first-token evaluations. This robustness gap widens as mismatch rates increase; even state-of-the-art debiasing techniques such as PriDe may fail to match text-based accuracy when the model answer does not directly start with an option token (Wang et al., 12 Apr 2024).

Marginalization over tokenizations is another nuance: computing log-probabilities solely for the default tokenization can lead to slight underestimation, especially in strings or languages with complex or ambiguous token boundaries. Importance sampling can yield more precise marginal estimates, but its computational cost is typically justified only in niche analytical scenarios, as the operational gap is less than 0.5% for most practical cases (Chirkova et al., 2023).

4. Semantic Plausibility and Context Sensitivity

First-token log-probabilities serve as a reliable proxy for model semantic world knowledge. By comparing log-probabilities assigned to plausible versus implausible continuations, rigorous metrics of semantic plausibility can be computed, e.g.

$$P(\text{expected target}) > P(\text{anomalous target})$$

and general/context-dependent extensions thereof. These metrics consistently track human judgments more closely than responses retrieved by explicit prompting, particularly when measured at the critical word/token rather than at the sentence level (Kauf et al., 21 Mar 2024). Contextual manipulations modulate first-token probabilities in expected ways; supportive context rescues the probability of otherwise anomalous tokens.
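A compact sketch of such a comparison at the critical word, again reusing the illustrative helper from Section 1 (the example sentence and target words are invented for illustration):

```python
# Hedged sketch: compare first-token log-probabilities of an expected vs. an
# anomalous target at the critical word position.
context = "The child spread the warm bread with"
expected, anomalous = " butter", " socks"   # leading spaces for BPE-style tokenizers

r = first_token_logprobs(context)
logp_expected = float(r[tokenizer.encode(expected)[0]])
logp_anomalous = float(r[tokenizer.encode(anomalous)[0]])
print("plausible continuation preferred:", logp_expected > logp_anomalous)
```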

Instruction-tuning may dilute the sensitivity of first-token probabilities as plausibility predictors, "washing out" low-level cues. Nonetheless, first-token scores remain preferred over prompt-based approaches when raw model knowledge (as opposed to conversational alignment) is the evaluation target.

5. Psycholinguistic Predictors and Entropy Estimation

Psycholinguistic research often uses the entropy of a model’s token distribution as a predictor for human reading times and processing difficulty. First-token entropy approximations compute

$$H(W_i \mid w_{1..i-1}) \approx -\sum_{t \in T} P(t \mid \text{context}) \log_2 P(t \mid \text{context})$$

using only the probability distribution over the first subword token. For words spanning multiple subword tokens, this approach systematically underestimates true entropy. Monte Carlo estimates, which sample full tokenizations and accumulate surprisal across all subword tokens, yield unbiased quantitative measures more predictive of behavioral data (Clark et al., 29 Jul 2025). The discrepancy is especially pronounced for open-class words and domains with high tokenization variance.
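The first-token approximation itself is straightforward to compute; a hedged sketch (reusing the illustrative helper from Section 1) is shown below, keeping in mind that it underestimates entropy for words split into several subword tokens:

```python
# Hedged sketch: first-token entropy approximation in bits, computed from the
# distribution over the first subword token only.
import math

def first_token_entropy_bits(context: str) -> float:
    r = first_token_logprobs(context)            # natural-log probabilities over |V|
    p = r.exp()
    # H = -sum p * log2 p; convert from nats to bits by dividing by ln 2.
    return float(-(p * r).sum() / math.log(2))

print(first_token_entropy_bits("The cat sat on the"))
```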

6. Practical Applications: Forecasting, Retrieval Augmentation, Hallucination Detection, and Efficient Reasoning

First-token log-probabilities underpin several recent applied frameworks:

  • Forecasting Future Events: Probabilities for each candidate token representing a probabilistic outcome are converted via $e^{w_i}$ to linear weights and combined in a weighted average to calibrate event predictions (see the sketch after this list). The Brier score quantifies forecast accuracy, with improvements observed over baseline and vanilla LLM systems (Soru et al., 8 Jan 2025).
  • Domain-Adaptive RAG: In Retrieval-Augmented Generation for domain-specific MCQA (e.g., telecommunications), first-token probability guides context chunk selection and retrieval hyperparameter adaptation. The model’s confidence is estimated using softmax-normalized first-token scores, driving iterative optimization of chunk number and window size for enhanced retrieval quality (Chen et al., 11 Jan 2025).
  • Token-Level Hallucination Detection: Variance in first-token log-probabilities across stochastic generations is used to identify tokens or spans exhibiting instability. If the variance $\mathrm{Var}_t$ for token position $t$ exceeds a threshold, the token is flagged as hallucinated. Larger, more instruction-tuned models present lower hallucination rates and more stable first-token distributions (Kumar, 5 Jul 2025).
  • Chain-of-Thought Reasoning Compression: In code reasoning tasks, step-level first-token surprisal metrics $S(x_t \mid x_{<t}) = -\log p(x_t \mid x_{<t})$ are used to rank and retain logically critical reasoning steps, reducing computational cost while maintaining high accuracy (Zeng et al., 8 Aug 2025).
  • Boilerplate Detection: First-token log-probability vectors, when clustered via t-SNE and analyzed with k-NN classifiers, separate substantive answers from boilerplate (refusals, greetings). This enables early inference termination or dynamic routing to smaller models for cost savings (Kainan et al., 26 Oct 2025).
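As referenced in the forecasting item above, the conversion from first-token log-probabilities to a calibrated event probability reduces to a softmax-style weighted average. The sketch below is a simplified illustration under assumed candidate outcomes ("10" through "90", read as percentages), not the exact pipeline of the cited work:

```python
# Hedged sketch of the forecasting conversion: turn first-token log-probabilities
# w_i of candidate outcome tokens into linear weights e^{w_i}, combine them into a
# weighted-average probability, and score the forecast with the Brier score.
import math

def forecast_probability(logprobs_by_outcome: dict[str, float]) -> float:
    """Map outcome tokens (e.g. '30', meaning 30%) with their first-token log-probs
    to a single calibrated probability in [0, 1]."""
    weights = {tok: math.exp(lp) for tok, lp in logprobs_by_outcome.items()}
    total = sum(weights.values())
    return sum((int(tok) / 100) * w for tok, w in weights.items()) / total

def brier_score(forecast: float, outcome: int) -> float:
    """outcome is 1 if the event happened, 0 otherwise."""
    return (forecast - outcome) ** 2

example = {"10": -3.2, "30": -1.1, "50": -0.9, "70": -2.5, "90": -4.0}  # invented values
p = forecast_probability(example)
print(f"forecast={p:.3f}, Brier score if the event occurs: {brier_score(p, 1):.3f}")
```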

7. Synthesis and Emerging Directions

First-token log-probabilities offer a rich signal for model prior knowledge, context sensitivity, uncertainty quantification, and computational optimization. While effective for model introspection, semantic evaluation, and diagnostic applications, their limitations in capturing full answer fidelity—especially in instruction-tuned, free-form generation domains—necessitate caution and alternative downstream evaluation protocols. Marginalization, context alignment, and multi-token aggregation methods represent advancing directions to address these challenges.

Consistent themes include the early emergence of frequency/statistical priors, the utility of first-token metrics in model and application diagnostics, and the importance of leveraging full output context for robust evaluation. These insights have broad implications for LLM research, psycholinguistics, information retrieval, and the design of efficient and reliable generative systems.
