
SpikeScore: Hallucination Detection in LLMs

Updated 3 February 2026
  • SpikeScore is a quantitative metric that identifies hallucinations in large language models by analyzing abrupt second-order differences in uncertainty scores over multi-turn dialogues.
  • It computes the maximum absolute curvature in the sequence of scores, distinguishing factual responses from hallucinated ones with statistical guarantees and high AUROC.
  • Empirical evaluations show SpikeScore’s robust performance across domains and models, outperforming traditional detection probes in noisy and adversarial scenarios.

SpikeScore is a quantitative metric for hallucination detection in LLMs under cross-domain generalization constraints. By measuring abrupt curvature in the trajectory of uncertainty or truthfulness scores during multi-turn self-dialogue, SpikeScore provides a domain-agnostic signal to differentiate hallucinated from factual LLM continuations. Empirical and theoretical analyses demonstrate that SpikeScore achieves state-of-the-art cross-domain separability, outperforming traditional detection probes and recent generalization-focused baselines (Deng et al., 27 Jan 2026).

1. Formal Definition and Mathematical Formulation

Given a sequence $S = (S^1, \ldots, S^K)$ of per-turn uncertainty or truthfulness scores, as assigned by a probe over a $K$-turn LLM-generated dialogue, SpikeScore is formally defined via the discrete second-order difference operator

\Delta^2 S^k = S^{k+1} - 2 S^k + S^{k-1}, \quad 1 < k < K.

The SpikeScore for the sequence $S$ is

\mathrm{SpikeScore}(S) = \max_{1 < k < K} \left| \Delta^2 S^k \right|,

i.e., the maximum absolute "curvature" of the score series. Large SpikeScore values correspond to abrupt rises and falls in the probe's confidence, which are characteristic signatures of hallucination-initiated multi-turn continuations (Deng et al., 27 Jan 2026).
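The definition above amounts to one line of array arithmetic. A minimal NumPy sketch (the function name `spike_score` is illustrative, not from the paper):

```python
import numpy as np

def spike_score(scores):
    """Maximum absolute second-order difference of a per-turn score series.

    scores: sequence of K per-turn probe scores S^1..S^K (K >= 3).
    Returns max over 1 < k < K of |S^{k+1} - 2 S^k + S^{k-1}|.
    """
    s = np.asarray(scores, dtype=float)
    if s.size < 3:
        raise ValueError("need at least 3 turns to form a second difference")
    # np.diff with n=2 computes S[k+1] - 2*S[k] + S[k-1] at interior points
    return float(np.max(np.abs(np.diff(s, n=2))))

# A smooth (factual-like) trajectory vs. one with an abrupt mid-dialogue drop:
smooth = [0.80, 0.81, 0.79, 0.80, 0.82]
spiky  = [0.80, 0.81, 0.20, 0.78, 0.80]
print(spike_score(smooth))  # small curvature
print(spike_score(spiky))   # large curvature
```

The abrupt drop and recovery in `spiky` produces a second difference far larger than any in `smooth`, which is exactly the signature the metric targets.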

2. Multi-Turn Self-Dialogue Simulation Protocol

SpikeScore relies on simulated multi-turn self-dialogue by the LLM. For an input question $Q$ and the LLM's initial answer $A^1$, responses are recursively elicited as follows:

A^i \sim \mathbb{P}_\theta(\cdot \mid Q, A^1, P^2, A^2, \ldots, P^{i-1}, A^{i-1}, P^i),

where prompts $P^2, \ldots, P^K$ are sampled from a library spanning diverse types (e.g., Encouraging, Analytical, Stepwise) and are sequenced to increase "strength" over turns. At each step, a probe $p_W$ assigns a score $S(A^i) = p_W(E_\theta(A^i \mid \cdots)) \in [0, 1]$. Empirical score plots reveal that hallucinated continuations induce prominent spikes in $\Delta^2$, while factual dialogues exhibit smoother profiles, a phenomenon formalized by SpikeScore (Deng et al., 27 Jan 2026).
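The elicitation loop can be sketched as follows. The prompt wordings, the strength schedule, and the stub `generate`/`probe` callables are all assumptions for illustration; the paper's library has five strength levels and eight types:

```python
import random

# Illustrative prompt library keyed by "strength" level; the prompt types
# (Encouraging, Analytical, Stepwise) come from the text, the wording does not.
PROMPT_LIBRARY = {
    1: ["Encouraging: Are you sure? Take another look."],
    2: ["Analytical: Re-examine your reasoning step by step."],
    3: ["Stepwise: Verify each claim in your last answer one at a time."],
}

def self_dialogue_scores(question, first_answer, generate, probe, K=20):
    """Elicit K turns of self-dialogue and record per-turn probe scores.

    generate(history) -> next answer string (stands in for A^i ~ P_theta).
    probe(answer, history) -> score in [0, 1] (stands in for p_W(E_theta(...))).
    """
    history = [("Q", question), ("A", first_answer)]
    scores = [probe(first_answer, history)]
    for i in range(2, K + 1):
        # Weak-to-strong schedule: prompt strength grows with the turn index.
        level = min(len(PROMPT_LIBRARY), 1 + (i * len(PROMPT_LIBRARY)) // (K + 1))
        prompt = random.choice(PROMPT_LIBRARY[level])
        history.append(("P", prompt))
        answer = generate(history)
        history.append(("A", answer))
        scores.append(probe(answer, history))
    return scores

# Exercise the loop with trivial stubs standing in for the LLM and probe:
scores = self_dialogue_scores(
    "Q?", "A1",
    generate=lambda h: f"answer-{len(h)}",
    probe=lambda a, h: 0.8,
    K=5,
)
print(len(scores))  # one score per turn
```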

3. Theoretical Separability Guarantees

Under mild moment conditions, the SpikeScore statistic offers provable probabilistic separability between hallucination and non-hallucination distributions in mixtures of test domains. Denoting the SpikeScore by $X$ for hallucination and $Y$ for non-hallucination samples:

  • $\mathbb{E}[X] > 2\,\mathbb{E}[Y]$ (mean gap)
  • $\mathrm{Std}[X]/\mathrm{Std}[Y] \leq 2.5$
  • Coefficient of variation (CV) of $Y$: $\mathrm{Std}[Y]/\mathbb{E}[Y] \leq 0.2$

Cantelli’s inequality establishes that

\Pr[X > Y] \;\geq\; \frac{(\delta - 1)^2}{(r^2 + 1)\, c^2 + (\delta - 1)^2} \;\geq\; \frac{1}{1 + 0.0725\, t^2},

where $\delta = \mathbb{E}[X]/\mathbb{E}[Y]$, $r = \mathrm{Std}[X]/\mathrm{Std}[Y]$, $c = \mathrm{Std}[Y]/\mathbb{E}[Y]$, and $t > 0$. For $t = 2$, this yields $\Pr[X > Y] \geq 0.775$. This result indicates that, with high probability, SpikeScore ranks a hallucinated chain above a factual chain across domains (Deng et al., 27 Jan 2026).
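A quick numeric sanity check of the bound at the boundary values of the three conditions confirms the stated 0.775 figure:

```python
# Plug the boundary values of Observations 1-3 into Cantelli's bound at t = 2:
# delta = E[X]/E[Y] = 2, r = Std[X]/Std[Y] = 2.5, c = Std[Y]/E[Y] = 0.2.
delta, r, c, t = 2.0, 2.5, 0.2, 2.0

cantelli = (delta - 1) ** 2 / ((r ** 2 + 1) * c ** 2 + (delta - 1) ** 2)
simplified = 1.0 / (1.0 + 0.0725 * t ** 2)

print(round(cantelli, 4))    # 0.7752
print(round(simplified, 4))  # 0.7752
```

At these values the two bounds coincide, since $(r^2 + 1)c^2 = 7.25 \cdot 0.04 = 0.29 = 0.0725\,t^2$ for $t = 2$.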

4. Experimental Evaluation

SpikeScore’s efficacy is validated on diverse datasets and models:

Dataset         Domain Type
TriviaQA        Knowledge QA
CommonsenseQA   Commonsense MCQ
Belebele        Multilingual RC
CoQA            Conversational QA
MATH            Competition Math
SVAMP           Word Problems

Model                   Configuration
Llama-3.2-3B-Instruct   temp = 0.2, top-p = 0.9
Llama-3.1-8B-Instruct   temp = 0.2, top-p = 0.9
Qwen3-8B-Instruct       reasoning mode, temp = 0.6, top-p = 0.95
Qwen3-14B-Instruct      reasoning mode, temp = 0.6, top-p = 0.95 (same as 8B)

Protocol: Training is performed on a single dataset, with leave-one-out cross-domain evaluation over the remaining five ($K = 20$ turns). Hallucination detection performance is measured by AUROC. Paired with SAPLMA as the probe, SpikeScore achieves mean AUROC of approximately 0.786 (Llama-3.1-8B) to 0.787 (Qwen3-14B), outperforming both training-free baselines (mean ~0.70) and cross-domain specialized methods (mean ~0.74) (Deng et al., 27 Jan 2026).

SpikeScore maintains robust performance under noisy retrieval-augmented generation (RAG) scenarios (TriviaQA, RAGTruth), reaching AUROC up to 0.87. Its backbone-agnostic design yields consistent cross-domain gains with probes including Perplexity, Reasoning Score, ICS, SEP, and SAPLMA.
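AUROC here equals the probability that a randomly drawn hallucinated chain receives a higher SpikeScore than a randomly drawn factual one. A dependency-free sketch of that rank statistic (the scores are made-up illustrative numbers, not the paper's data):

```python
def auroc(pos_scores, neg_scores):
    """AUROC = P(pos > neg) + 0.5 * P(tie), computed by exhaustive pairwise
    comparison (O(n*m), fine for modest evaluation-set sizes)."""
    wins = ties = 0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1
            elif p == n:
                ties += 1
    return (wins + 0.5 * ties) / (len(pos_scores) * len(neg_scores))

# Hypothetical SpikeScores for hallucinated (positive) vs. factual chains:
halluc  = [0.9, 0.7, 0.8, 0.4]
factual = [0.2, 0.3, 0.1, 0.5]
print(auroc(halluc, factual))  # 0.9375
```

The single overlapping pair (0.4 vs. 0.5) is the only ranking error, giving 15/16 = 0.9375.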

5. Algorithmic Description and Parameterization

The SpikeScore detection algorithm proceeds as follows:

  1. Initialize the dialogue history $H$ with $(Q, A^1)$.
  2. For $i$ in $2 \ldots K$:
    • Sample prompt $P^i$ (weak-to-strong schedule).
    • Obtain $A^i = \mathrm{LLM\_generate}(H \parallel P^i)$.
    • Append $(P^i, A^i)$ to $H$.
    • Compute $S[i] = p_W(E_\theta(A^i \mid H))$.
  3. Compute $s = \max_{1 < i < K} |S[i+1] - 2\,S[i] + S[i-1]|$.
  4. Output $y = 1$ (hallucination) if $s \geq \lambda$; else $y = 0$.

Key parameters:

  • $K = 20$ dialogue turns (detectability saturates at 15–20; longer chains may introduce drift)
  • Threshold $\lambda$ tuned on a validation split (typically ~0.1–0.3)
  • Prompt library: five strength levels across eight prompt types (Encouraging → Reflective)
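Steps 1–4 can be sketched end to end. The `generate` and `probe` callables stand in for the LLM and the trained probe $p_W$; the placeholder follow-up prompts and the $\lambda = 0.2$ default (inside the ~0.1–0.3 range above) are illustrative assumptions:

```python
def detect_hallucination(question, answer, generate, probe, K=20, lam=0.2):
    """End-to-end SpikeScore detector following the four steps above.

    generate(history) -> next answer; probe(answer, history) -> score in [0,1].
    Returns (label, s): label = 1 if hallucination flagged, plus the raw score.
    """
    history = [question, answer]
    scores = [probe(answer, history)]
    for i in range(2, K + 1):
        prompt = f"follow-up prompt {i}"   # placeholder for the scheduled P^i
        history.append(prompt)
        a_i = generate(history)
        history.append(a_i)
        scores.append(probe(a_i, history))
    # Step 3: maximum absolute second difference of the score series.
    s = max(abs(scores[i + 1] - 2 * scores[i] + scores[i - 1])
            for i in range(1, len(scores) - 1))
    # Step 4: threshold decision.
    return (1 if s >= lam else 0), s

# Stub whose probe score collapses mid-dialogue, producing a spike:
trace = iter([0.8] * 9 + [0.1] + [0.8] * 10)
label, s = detect_hallucination(
    "Q?", "A1",
    generate=lambda h: "a",
    probe=lambda a, h: next(trace),
    K=20,
)
print(label, round(s, 2))  # 1 1.4
```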

6. Statistical Phenomena and Empirical Properties

Empirical findings substantiating SpikeScore’s cross-domain robustness include:

  • Mean SpikeScore for hallucinations exceeds $2\times$ the non-hallucination mean (Observation 1).
  • The standard-deviation ratio of SpikeScore between hallucination and non-hallucination chains is at most 2.5 (Observation 2).
  • The coefficient of variation of factual SpikeScore is consistently $\leq 0.2$ across leave-one-out settings (Observation 3) (Deng et al., 27 Jan 2026).

These properties underpin the observed cross-domain generalization, as the underlying spike phenomenon persists across question genres, languages, and model variants.
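The three observations are simple sample statistics and can be checked directly. The two samples below are synthetic, chosen only to mimic populations satisfying the conditions:

```python
import statistics as st

# Synthetic SpikeScore samples standing in for hallucinated (X) and
# factual (Y) chains -- illustrative values, not the paper's data:
X = [0.52, 0.60, 0.48, 0.58, 0.54]
Y = [0.15, 0.25, 0.18, 0.22, 0.20]

mean_gap  = st.mean(X) / st.mean(Y)     # Observation 1: should exceed 2
std_ratio = st.stdev(X) / st.stdev(Y)   # Observation 2: should be <= 2.5
cv_y      = st.stdev(Y) / st.mean(Y)    # Observation 3: should be <= 0.2

print(mean_gap > 2, std_ratio <= 2.5, cv_y <= 0.2)  # True True True
```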

7. Limitations and Prospective Directions

Documented limitations of SpikeScore include:

  • Dialogue-length Sensitivity: Factual continuations may yield spurious spikes for $K > 20$, necessitating budgets of 15–20 turns.
  • Prompt-Design Dependence: Although moderately robust to prompt seeds, adversarial or out-of-library prompt regimes present an open challenge.
  • Robustness to Alignment Attacks: Mild “polite” instructions have little effect, while aggressive instruction tuning or adversarial alignment may attenuate the spike signal.
  • Threshold Calibration: While $\lambda$ is generally stable, its automatic adjustment for zero-shot domains requires further research.

Possible extensions:

  • Application to structured prediction pipelines (e.g., tool use, chain-of-thought prompting).
  • Exploration of higher-order temporal differences or alternative sequence curvature statistics.
  • Integration with calibrated confidence estimators for deployment monitoring.
  • Adaptation to multilingual or multimodal LLMs, to accommodate modality-specific hallucination signatures.

SpikeScore operationalizes multi-turn instability as a domain-invariant hallmark of hallucination, providing theoretically justified and empirically validated detection in cross-domain settings (Deng et al., 27 Jan 2026).
