Token-Level Log-Probability Checking
- Token-level log-probability checking is the systematic extraction and analysis of per-token probabilities to quantify prediction uncertainty and detect generation anomalies.
- It employs mathematical foundations such as softmax, log-softmax, and entropy calculations to enable localized uncertainty quantification and model calibration.
- Methodologies include divergence measures, variance-based hallucination detection, and code provenance mapping to validate and interpret model outputs.
Token-level log-probability checking refers to the systematic extraction, calibration, and analysis of an LLM's token-wise log-probabilities, typically derived from the model's softmax output layer, to measure prediction uncertainty, detect generation anomalies, and compare model predictions with theoretical or empirical expectations. This approach is foundational to tasks including model calibration, uncertainty quantification (UQ), hallucination detection, code provenance analysis, and the study of probability encodings within neural architectures.
1. Mathematical Foundations and Log-Probability Extraction
At each autoregressive step, an LLM produces logits $z_i$ for each token $i$ in its vocabulary. The transformation to probabilities and log-probabilities is canonical:
- Softmax yields probabilities: $p_i = \dfrac{\exp(z_i)}{\sum_j \exp(z_j)}$
- Log-softmax provides log-probabilities directly: $\log p_i = z_i - \log \sum_j \exp(z_j)$
For most practical systems (e.g., via the OpenAI and DeepSeek APIs), only the top-$k$ log-probabilities are returned: $\{\log p_{(1)}, \ldots, \log p_{(k)}\}$ with $p_{(1)} \geq \cdots \geq p_{(k)}$.
Entropy at the token level is computed as $H = -\sum_i p_i \log_2 p_i$ (in bits).
For APIs returning truncated (top-$k$) distributions, the remaining mass $1 - \sum_{i=1}^{k} p_{(i)}$ is assigned to an "other" pseudo-token for entropy computation. These outputs underpin all subsequent methodological advances in token-level probability checking.
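These definitions can be made concrete in a few lines of Python. The sketch below (function names are illustrative) maps raw logits to log-probabilities with a numerically stable log-softmax, and computes entropy over a truncated top-$k$ list using the "other" pseudo-token convention described above.

```python
import math

def log_softmax(logits):
    """Map raw logits to log-probabilities using the numerically stable max-shift."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def entropy_bits(top_logprobs, fold_other_mass=True):
    """Shannon entropy (bits) of a possibly truncated top-k distribution.

    Probability mass missing from the top-k list is folded into an
    <OTHER> pseudo-token, as described above.
    """
    probs = [math.exp(lp) for lp in top_logprobs]
    if fold_other_mass:
        other = 1.0 - sum(probs)
        if other > 0:
            probs.append(other)
    return -sum(p * math.log2(p) for p in probs if p > 0)
```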
2. Uncertainty Quantification and Calibration Metrics
Token-level log-probabilities permit localized UQ by quantifying the sharpness, spread, or variance of the predicted distribution:
- Token entropy quantifies uncertainty at a position.
- Calibration error such as Expected Calibration Error (ECE) benchmarks how well the predicted probabilities align with empirical frequencies: $\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|$
- Divergence from theoretical distributions (relevant in probabilistic scenarios) is measured by KL and JS divergences: $D_{\mathrm{KL}}(P \,\|\, Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$ and $D_{\mathrm{JS}}(P \,\|\, Q) = \tfrac{1}{2} D_{\mathrm{KL}}(P \,\|\, M) + \tfrac{1}{2} D_{\mathrm{KL}}(Q \,\|\, M)$, with $M = \tfrac{1}{2}(P + Q)$.
Discrepancies may be reported as the probability difference $\Delta p = p_{\text{model}} - p_{\text{theory}}$ or as the percent error in entropy, $(H_{\text{theory}} - H_{\text{model}})/H_{\text{theory}}$.
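These metrics can be computed directly from extracted token probabilities. The following sketch is illustrative (the function names and the equal-width binning scheme are not taken from the cited works): ECE takes per-prediction confidences paired with correctness labels, and the divergences take token-to-probability dicts.

```python
import math

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average |accuracy - mean confidence| over confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        bins[min(int(c * n_bins), n_bins - 1)].append((c, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        acc = sum(ok for _, ok in b) / len(b)
        conf = sum(c for c, _ in b) / len(b)
        ece += (len(b) / n) * abs(acc - conf)
    return ece

def kl(P, Q):
    """KL(P || Q) for probability dicts over a shared support."""
    return sum(p * math.log(p / Q[t]) for t, p in P.items() if p > 0)

def js_divergence(P, Q):
    """Jensen-Shannon divergence: symmetrized KL against the mixture M = (P + Q) / 2."""
    keys = set(P) | set(Q)
    M = {t: 0.5 * (P.get(t, 0.0) + Q.get(t, 0.0)) for t in keys}
    return 0.5 * kl(P, M) + 0.5 * kl(Q, M)
```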
3. Methodologies for Token-Level Analysis
Multiple approaches leverage extracted log-probabilities for higher-level analyses:
A. Probabilistic Task Alignment
Assessment of LLMs in tasks requiring output-distribution alignment to a known (often uniform) reference distribution reveals substantial divergence, even when response validity is perfect. Empirical examples (GPT-4.1, DeepSeek-Chat):

| Scenario | Theory $p$ | GPT-4.1 $p$ | $\Delta p$ | $H_{\text{theory}}$ | $H_{\text{GPT-4.1}}$ | % err |
|----------|------------|-------------|------------|---------------------|----------------------|-------|
| Coin flip | 0.50 | 1.00 | +0.50 | 1.00 bit | 0.0002 bit | ≈ 100% |
| Die roll | 0.167 | 0.924 | +0.757 | 2.585 bit | 0.447 bit | 83% |
| 52 cards | 0.019 | 0.13 | +0.111 | 5.7 bit | 3.49 bit | 39% |
B. Hallucination and Fact Verification
Several uncertainty-based methods identify hallucinations or unsupported claims:
- Variance-based detection: Variance in token log-probabilities across multiple stochastic generations strongly correlates with hallucinated content (Kumar, 5 Jul 2025).
- Claim-Conditioned Probability (CCP): Introduces an NLI-based filter to isolate uncertainty about factual claim content from linguistic or contextual uncertainty (Fadeeva et al., 7 Mar 2024). For each token, CCP retains only the probability mass of top-$k$ alternatives that entail the original claim (as judged by an NLI model), normalized over entailing and contradicting alternatives.
Aggregated claim-level CCP has demonstrated ROC-AUC 0.78 (Vicuna 13B, human evaluation) versus 0.72 for external-knowledge fact checking.
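The filtering idea can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the NLI model is stubbed out as a caller-supplied `nli_label` function, and the verdict table in the usage example is hypothetical.

```python
def claim_conditioned_probability(alternatives, nli_label):
    """CCP-style filtering of a token's top-k alternatives (illustrative sketch).

    `alternatives` maps each alternative token/completion to its probability;
    `nli_label(alt)` is a caller-supplied NLI judgment returning one of
    "entail", "contra", or "neutral" with respect to the original claim.
    """
    entail_mass = sum(p for alt, p in alternatives.items() if nli_label(alt) == "entail")
    considered = sum(p for alt, p in alternatives.items()
                     if nli_label(alt) in ("entail", "contra"))
    if considered == 0:
        return 1.0  # no claim-relevant alternatives -> no claim-level uncertainty
    return entail_mass / considered

# Hypothetical usage: a fixed verdict table stands in for a real NLI model.
verdicts = {"Paris": "entail", "Lyon": "contra", "the": "neutral"}
alts = {"Paris": 0.80, "Lyon": 0.15, "the": 0.05}
print(claim_conditioned_probability(alts, lambda a: verdicts.get(a, "neutral")))  # ~0.842
```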
C. Code Provenance by Log-Probability Maps
CodeVision (Xu et al., 6 Jan 2025) constructs 2D matrices of per-position log-probabilities for code, preserving spatial structure. Feeding these as “grayscale images” to ViT/ResNet models achieves code-generation detection AUC up to 0.99 with ≤20 million parameters, far outperforming single-dimensional heuristics.
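The construction can be sketched as follows; the padding width, zero-padding, and min-max normalization are illustrative assumptions rather than the CodeVision specification.

```python
import numpy as np

def logprob_map(token_logprobs, line_lengths, max_width=64):
    """Arrange per-token log-probabilities into a 2D map, one code line per row.

    `token_logprobs` is a flat list of per-token log-probabilities and
    `line_lengths` gives how many tokens fall on each source line. Rows are
    truncated/right-padded to `max_width` and the matrix is min-max normalized
    to [0, 1] so it can be fed to a small vision model as a grayscale image.
    """
    rows, pos = [], 0
    for n in line_lengths:
        row = token_logprobs[pos:pos + n][:max_width]
        rows.append(row + [0.0] * (max_width - len(row)))
        pos += n
    m = np.array(rows, dtype=np.float32)
    lo, hi = m.min(), m.max()
    return (m - lo) / (hi - lo) if hi > lo else np.zeros_like(m)
```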
4. Algorithmic Implementations and Recipes
A minimal Python workflow for log-probability extraction and distributional testing, using the OpenAI chat completions API (openai>=1.0 Python client), proceeds as follows:

```python
import math
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def get_token_stats(prompt, model="gpt-4.1"):
    """Return the top-k next-token distribution and its entropy in bits."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        max_tokens=1,
        logprobs=True,
        top_logprobs=20,
    )
    top = resp.choices[0].logprobs.content[0].top_logprobs
    probs = {t.token: math.exp(t.logprob) for t in top}
    # Fold any probability mass outside the top-k list into an <OTHER> pseudo-token.
    other_mass = 1.0 - sum(probs.values())
    if other_mass > 0:
        probs["<OTHER>"] = other_mass
    # Shannon entropy (bits) of the truncated distribution.
    H = -sum(p * math.log2(p) for p in probs.values() if p > 0)
    return probs, H

def kl_divergence(P, Q):
    """KL(P || Q) for probability dicts defined over a shared support."""
    return sum(p * math.log(p / Q[t]) for t, p in P.items() if p > 0)
```
- Increase the number of requested log-probabilities (`top_logprobs`) to capture tail behavior.
- Compute any divergence metric (e.g., KL, JS) against a reference distribution.
- Aggregate over samples for empirical calibration analysis.
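A brief usage sketch of the helpers above, comparing the model's next-token distribution against a theoretical fair-coin reference (the prompt wording and token spellings are illustrative; actual tokenization may differ):

```python
# Hypothetical prompt; the returned token strings may include leading spaces
# or subword pieces depending on the tokenizer.
probs, H = get_token_stats("Flip a fair coin. Reply with exactly one word: Heads or Tails.")
theory = {"Heads": 0.5, "Tails": 0.5}
# Restrict the comparison to the theoretical support before computing KL.
observed = {t: probs.get(t, 1e-12) for t in theory}
print(f"entropy = {H:.3f} bits, KL(theory || observed) = {kl_divergence(theory, observed):.3f}")
```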
Variance-based hallucination flags are produced by computing the variance of each token's log-probability across independent stochastic samples and flagging positions whose variance exceeds a settable threshold. For code provenance, the log-probability matrices are formatted as images and classified via small vision models.
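The variance-based flagging step above can be sketched as follows (token positions are assumed to be aligned across samples, and the threshold must be tuned per model and domain):

```python
import statistics

def flag_high_variance_tokens(logprob_runs, threshold):
    """Flag token positions whose log-probability varies strongly across samples.

    `logprob_runs` is a list of per-sample lists of token log-probabilities,
    aligned by position; `threshold` is the variance cutoff (an assumption
    here, to be tuned empirically).
    """
    return [statistics.variance(values) > threshold for values in zip(*logprob_runs)]
```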
5. Probability Encoding Within LM Architectures
Cho et al. (3 Jun 2024) demonstrate that output embeddings of LMs encode log-probability structure in a log-linear form: $\overline{\log p}(x) \approx \mathbf{e}_x^{\top} \mathbf{w} + b$. Here, $\mathbf{e}_x$ is the output embedding of token $x$, $\mathbf{w}$ is a learned direction (estimated via regression on average log-probabilities over a corpus), and $b$ is a normalization bias. Only 30–40% of embedding dimensions carry meaningful signal. This permits:
- Recovery of average token log-probabilities directly from output embeddings (average error <0.2 nats).
- Dimensionality pruning, yielding compression with negligible entropy/divergence increase.
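The regression step described above can be sketched as an ordinary least-squares fit; the setup below (matrix shapes, bias handling, and function names) is an assumption about how such a fit could be arranged, not the authors' code.

```python
import numpy as np

def fit_logprob_direction(E, avg_logprobs):
    """Least-squares fit of a direction w and bias b so that E @ w + b ~ avg_logprobs.

    E is a (vocab_size, d) matrix of output embeddings; `avg_logprobs` holds the
    corpus-averaged log-probability of each token.
    """
    X = np.hstack([E, np.ones((E.shape[0], 1))])   # append a bias column
    coef, *_ = np.linalg.lstsq(X, avg_logprobs, rcond=None)
    return coef[:-1], coef[-1]                     # (w, b)

def predict_avg_logprobs(E, w, b):
    """Recover average token log-probabilities directly from output embeddings."""
    return E @ w + b
```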
6. Experimental Results, Performance, and Limitations
- In classic probabilistic scenarios (coin flip, dice roll, random card), LLMs’ response validity is perfect but token-level output probabilities may diverge by >0.1 to >0.75 from ground truth and entropies may err by up to 100% (Toney-Wails et al., 1 Nov 2025).
- Hallucination detection via variance attains per-token recall up to 72% for small models (GPT-Neo 125M) and ≥25% for large models (Mistral 7B) on reference-free, unanswerable QA (Kumar, 5 Jul 2025).
- Detection of LLM-generated code with probability maps as inputs to ViT/ResNet reaches AUC 0.98–0.99, with negligible runtime for the vision backbone (Xu et al., 6 Jan 2025).
- Claim-level fact verification with CCP outperforms standard entropy and max-prob baselines (English ROC-AUC up to 0.78 vs. 0.67–0.69 for others) (Fadeeva et al., 7 Mar 2024).
Potential limitations include overhead from multiple forward passes (variance-based methods), diminished informativeness in highly deterministic settings (short or closed-class outputs), and the need for domain- or language-specific calibration of some metrics. In contrast, pruning output-embedding dimensions can be performed without observable degradation of the output distribution.
7. Applications and Extensions
Token-level log-probability checking underpins:
- Robust uncertainty quantification for decision-support and mission-critical LLM deployment.
- Hallucination detection (reference-free) in both structured and open-ended outputs, with per-token flagging capability.
- Calibration studies of LLMs in stochastic or probabilistic tasks.
- Code provenance detection resilient to changes in code formatting and language.
- Compression and model analysis, leveraging sparsity in probability-carrying embedding dimensions.
- Fact verification pipelines that do not depend on external retrieval corpora, adaptable across languages and LLM families.
A plausible implication is that as LLMs are deployed in increasingly diverse and accountable roles, precise and interpretable token-level log-probability checking will become essential for fine-grained model evaluation, self-diagnosis, and trustworthy operation, especially where reference information is unavailable or where the theoretical distribution is sharply defined.