SpikeScore: Hallucination Detection in LLMs
- SpikeScore is a quantitative metric that identifies hallucinations in large language models by analyzing abrupt second-order differences in uncertainty scores over multi-turn dialogues.
- It computes the maximum absolute curvature in the sequence of scores, distinguishing factual responses from hallucinated ones with statistical guarantees and high AUROC.
- Empirical evaluations show SpikeScore’s robust performance across domains and models, outperforming traditional detection probes in noisy and adversarial scenarios.
SpikeScore is a quantitative metric for hallucination detection in LLMs under cross-domain generalization constraints. By measuring abrupt curvature in the trajectory of uncertainty or truthfulness scores during multi-turn self-dialogue, SpikeScore provides a domain-agnostic signal to differentiate hallucinated from factual LLM continuations. Empirical and theoretical analyses demonstrate that SpikeScore achieves state-of-the-art cross-domain separability, outperforming traditional detection probes and recent generalization-focused baselines (Deng et al., 27 Jan 2026).
1. Formal Definition and Mathematical Formulation
Given a sequence of per-turn uncertainty or truthfulness scores $s_1, \dots, s_K$, as assigned by a probe over a $K$-turn LLM-generated dialogue, SpikeScore is formally defined via the discrete second-order difference operator:

$$\Delta^2 s_t = s_{t+1} - 2 s_t + s_{t-1}, \qquad t = 2, \dots, K-1.$$

The SpikeScore for the sequence is

$$\mathrm{SpikeScore}(s_{1:K}) = \max_{2 \le t \le K-1} \left| \Delta^2 s_t \right|,$$

i.e., the maximum absolute "curvature" of the score series. Large SpikeScore values correspond to abrupt rises and falls in the probe's confidence, which are characteristic signatures of hallucination-initiated multi-turn continuations (Deng et al., 27 Jan 2026).
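In code, the metric reduces to a maximum over discrete second differences. A minimal sketch (function and variable names are illustrative, not from the paper):

```python
def spike_score(scores):
    """Maximum absolute discrete second difference of a per-turn score series."""
    if len(scores) < 3:
        raise ValueError("need at least three per-turn scores")
    return max(
        abs(scores[t + 1] - 2 * scores[t] + scores[t - 1])
        for t in range(1, len(scores) - 1)
    )
```

A smooth score trajectory yields a value near zero, while a single abrupt dip or jump dominates the maximum.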
2. Multi-Turn Self-Dialogue Simulation Protocol
SpikeScore relies on simulated multi-turn self-dialogue by the LLM. For an input question $q$ and the LLM's initial answer $a_0$, responses are recursively elicited as follows:

$$a_t = \mathrm{LLM}(q, a_0, p_1, a_1, \dots, p_t), \qquad t = 1, \dots, K,$$

where prompts $p_t$ are sampled from a library spanning diverse types (e.g., Encouraging, Analytical, Stepwise) and are sequenced to increase "strength" over turns. At each step, a probe assigns a score $s_t$. Empirical score plots reveal that hallucinated continuations induce prominent spikes in the score trajectory, while factual dialogues exhibit smoother profiles, a phenomenon formalized by SpikeScore (Deng et al., 27 Jan 2026).
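A weak-to-strong prompt schedule can be sketched as follows. The prompt types and wordings below are hypothetical stand-ins; the paper's actual library spans eight types at five strength levels:

```python
import random

# Illustrative prompt library: type -> variants ordered weak to strong.
# (Hypothetical wordings; not the paper's actual prompts.)
PROMPT_LIBRARY = {
    "Encouraging": ["Take another look at your answer.",
                    "Are you certain? Reconsider your answer very carefully."],
    "Analytical":  ["Check your reasoning.",
                    "Rigorously verify every claim in your answer."],
    "Stepwise":    ["Walk through it again.",
                    "Re-derive the answer step by step, justifying each move."],
}

def sample_prompt(rng, turn, total_turns):
    """Pick a random prompt type; strength escalates as the dialogue progresses."""
    variants = PROMPT_LIBRARY[rng.choice(sorted(PROMPT_LIBRARY))]
    level = min(len(variants) - 1, turn * len(variants) // total_turns)
    return variants[level]
```

Early turns draw weak nudges; later turns draw the strongest variant of a randomly chosen type.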
3. Theoretical Separability Guarantees
Under mild moment conditions, the SpikeScore statistic offers provable probabilistic separability between hallucination and non-hallucination distributions in mixtures of test domains. Denote the SpikeScore by $X$ for hallucination samples and $Y$ for non-hallucination samples, with means $\mu_X > \mu_Y$ and standard deviations $\sigma_X, \sigma_Y$:
- $\mu_X - \mu_Y > 0$ (mean gap)
- Coefficient of variation of the factual distribution bounded ($\mathrm{CV}(Y) = \sigma_Y / \mu_Y$)

Cantelli's inequality establishes that

$$\Pr(X > Y) \;\ge\; \frac{(\mu_X - \mu_Y)^2}{(\mu_X - \mu_Y)^2 + \sigma_X^2 + \sigma_Y^2},$$

treating $X$ and $Y$ as independent; instantiating this bound with the moment ratios reported in Section 6 gives the paper's quantitative separability guarantee. This result indicates that, with high probability, SpikeScore ranks a hallucinated chain above a factual chain across domains (Deng et al., 27 Jan 2026).
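The bound can be sanity-checked numerically. A sketch assuming Gaussian score distributions and illustrative moments (both assumptions for demonstration only, not values from the paper):

```python
import random

def cantelli_lower_bound(mu_x, mu_y, sd_x, sd_y):
    """Cantelli lower bound on P(X > Y) for independent X, Y with these moments."""
    gap = mu_x - mu_y
    return gap ** 2 / (gap ** 2 + sd_x ** 2 + sd_y ** 2)

def empirical_p_x_gt_y(mu_x, mu_y, sd_x, sd_y, n=100_000, seed=0):
    """Monte Carlo estimate of P(X > Y) under a Gaussian assumption."""
    rng = random.Random(seed)
    wins = sum(rng.gauss(mu_x, sd_x) > rng.gauss(mu_y, sd_y) for _ in range(n))
    return wins / n
```

With illustrative moments such as $\mu_X = 0.6$, $\sigma_X = 0.25$ (hallucinated) and $\mu_Y = 0.2$, $\sigma_Y = 0.1$ (factual), the Cantelli bound is roughly 0.69, comfortably below the empirical ranking probability, as the inequality requires.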
4. Experimental Evaluation
SpikeScore’s efficacy is validated on diverse datasets and models:
| Dataset | Domain Type |
|---|---|
| TriviaQA | Knowledge QA |
| CommonsenseQA | Commonsense MCQ |
| Belebele | Multilingual RC |
| CoQA | Conversational QA |
| MATH | Competition Math |
| SVAMP | Word Problems |

| Model | Configuration |
|---|---|
| Llama-3.2-3B-Instruct | temp=0.2, top-p=0.9 |
| Llama-3.1-8B-Instruct | temp=0.2, top-p=0.9 |
| Qwen3-8B-Instruct | reasoning mode, temp=0.6, top-p=0.95 |
| Qwen3-14B-Instruct | same as Qwen3-8B |
Protocol: Training is performed on a single dataset, with leave-one-out cross-domain evaluation over the remaining five ($K = 20$ turns). Hallucination detection performance is measured by AUROC. SpikeScore, when paired with SAPLMA as the probe, achieves mean AUROC up to $0.787$ (Qwen3-14B), outperforming both training-free baselines (mean 0.70) and cross-domain specialized methods (mean 0.74) (Deng et al., 27 Jan 2026).
SpikeScore maintains robust performance under noisy retrieval-augmented generation (RAG) scenarios (TriviaQA, RAGTruth), reaching AUROC up to 0.87. Its backbone-agnostic design yields consistent cross-domain gains with probes including Perplexity, Reasoning Score, ICS, SEP, and SAPLMA.
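AUROC here is the probability that a randomly drawn hallucinated chain receives a higher SpikeScore than a randomly drawn factual chain. A minimal pairwise implementation (illustrative; evaluation code is not from the paper):

```python
def auroc(hallucinated_scores, factual_scores):
    """AUROC as a pairwise ranking probability: the chance a hallucinated
    chain's SpikeScore exceeds a factual chain's, ties counted as one half."""
    pairs = [(h, f) for h in hallucinated_scores for f in factual_scores]
    wins = sum(1.0 if h > f else 0.5 if h == f else 0.0 for h, f in pairs)
    return wins / len(pairs)
```

An AUROC of 1.0 means every hallucinated chain outranks every factual one; 0.5 is chance.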
5. Algorithmic Description and Parameterization
The SpikeScore detection algorithm proceeds as follows:
- Initialize dialogue history $H \leftarrow (q, a_0)$.
- For $t$ in $1, \dots, K$:
  - Sample prompt $p_t$ (weak-to-strong schedule).
  - Obtain response $a_t = \mathrm{LLM}(H, p_t)$ and score $s_t = \mathrm{probe}(a_t)$.
  - Append $(p_t, a_t)$ to $H$.
- Compute $\Delta^2 s_t = s_{t+1} - 2 s_t + s_{t-1}$ for $t = 2, \dots, K-1$.
- Compute $\mathrm{SpikeScore} = \max_t |\Delta^2 s_t|$.
- Output $1$ (hallucination) if $\mathrm{SpikeScore} > \tau$; else $0$.
Key parameters:
- $K$ dialogue turns (detectability saturates at 15–20; longer chains may introduce drift)
- $\tau$ (threshold) tuned on validation split (typical range 0.1–0.3)
- Prompt library: five strength levels across eight prompt types (Encouraging→Reflective).
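The steps above can be sketched end to end. The `llm` and `probe` callables below are placeholders for a real model and probe, and `tau=0.2` is an illustrative threshold from the stated tuning range:

```python
def detect_hallucination(question, answer, llm, probe, prompts, tau=0.2):
    """Threshold the SpikeScore of a simulated multi-turn self-dialogue.

    llm(history, prompt) -> next response  (stand-in for a real model call)
    probe(response)      -> per-turn uncertainty/truthfulness score
    prompts              -> K follow-up prompts, ordered weak to strong
    """
    history = [(question, answer)]
    scores = []
    for p in prompts:
        response = llm(history, p)
        scores.append(probe(response))
        history.append((p, response))
    # Maximum absolute second difference of the score trajectory.
    spike = max(
        abs(scores[t + 1] - 2 * scores[t] + scores[t - 1])
        for t in range(1, len(scores) - 1)
    )
    return int(spike > tau)
```

Swapping in a real model and probe only changes the two callables; the detection logic itself is a dozen lines.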
6. Statistical Phenomena and Empirical Properties
Empirical findings substantiating SpikeScore’s cross-domain robustness include:
- Mean SpikeScore for hallucinations exceeds the non-hallucination mean (Observation 1).
- Standard deviation ratio for SpikeScore between hallucination and non-hallucination chains is at most 2.5 (Observation 2).
- Coefficient of variation of factual SpikeScore is consistently bounded across leave-one-out settings (Observation 3) (Deng et al., 27 Jan 2026).
These properties underpin the observed cross-domain generalization, as the underlying spike phenomenon persists across question genres, languages, and model variants.
7. Limitations and Prospective Directions
Documented limitations of SpikeScore include:
- Dialogue-length Sensitivity: Factual continuations may yield spurious spikes at dialogue lengths outside the recommended range, necessitating budgets of 15–20 turns.
- Prompt-Design Dependence: Although moderately robust to prompt seeds, adversarial or out-of-library prompt regimes present an open challenge.
- Robustness to Alignment Attacks: Mild “polite” instructions have little effect, while aggressive instruction tuning or adversarial alignment may attenuate the spike signal.
- Threshold Calibration: While $\tau$ is generally stable, its automatic adjustment for zero-shot domains requires further research.
Possible extensions:
- Application to structured prediction pipelines (e.g., tool use, chain-of-thought prompting).
- Exploration of higher-order temporal differences or alternative sequence curvature statistics.
- Integration with calibrated confidence estimators for deployment monitoring.
- Adaptation to multilingual or multimodal LLMs, to accommodate modality-specific hallucination signatures.
SpikeScore operationalizes multi-turn instability as a domain-invariant hallmark of hallucination, providing theoretically justified and empirically validated detection in cross-domain settings (Deng et al., 27 Jan 2026).