
Prompt-Response Semantic Divergence Metrics for Faithfulness Hallucination and Misalignment Detection in Large Language Models (2508.10192v1)

Published 13 Aug 2025 in cs.CL, cs.AI, cs.LG, and q-fin.CP

Abstract: The proliferation of LLMs is challenged by hallucinations, critical failure modes where models generate non-factual, nonsensical, or unfaithful text. This paper introduces Semantic Divergence Metrics (SDM), a novel lightweight framework for detecting Faithfulness Hallucinations -- events of severe deviation of LLM responses from input contexts. We focus on a specific manifestation of these LLM errors, confabulations, defined as responses that are arbitrary and semantically misaligned with the user's query. Existing methods like Semantic Entropy test for arbitrariness by measuring the diversity of answers to a single, fixed prompt. Our SDM framework improves upon this by being more prompt-aware: we test for a deeper form of arbitrariness by measuring response consistency not only across multiple answers but also across multiple, semantically equivalent paraphrases of the original prompt. Methodologically, our approach uses joint clustering on sentence embeddings to create a shared topic space for prompts and answers. A heatmap of topic co-occurrences between prompts and responses can be viewed as a quantified two-dimensional visualization of the user-machine dialogue. We then compute a suite of information-theoretic metrics to measure the semantic divergence between prompts and responses. Our practical score, $\mathcal{S}_H$, combines the Jensen-Shannon divergence and Wasserstein distance to quantify this divergence, with a high score indicating a Faithfulness Hallucination. Furthermore, we identify the KL divergence $\mathrm{KL}(\text{Answer} \,\|\, \text{Prompt})$ as a powerful indicator of Semantic Exploration, a key signal for distinguishing different generative behaviors. These metrics are further combined into the Semantic Box, a diagnostic framework for classifying LLM response types, including the dangerous, confident confabulation.

Summary

  • The paper presents a novel framework that quantifies prompt-response semantic divergence to detect faithfulness hallucinations in large language models.
  • It utilizes joint clustering of sentence embeddings and multiple information-theoretic metrics, including Jensen-Shannon divergence and Wasserstein distance, for detailed response evaluation.
  • Experimental analysis demonstrates the framework’s capability to detect varying semantic shifts from factual to creative prompts, emphasizing its diagnostic utility.

Semantic Divergence Metrics for Faithfulness Hallucination Detection in LLMs

Introduction

The paper "Prompt-Response Semantic Divergence Metrics for Faithfulness Hallucination and Misalignment Detection in LLMs" introduces a novel framework for detecting intrinsic faithfulness hallucinations in LLMs. These hallucinations occur when model responses deviate significantly from the given input context. The proposed Semantic Divergence Metrics (SDM) framework seeks to quantify the semantic alignment between prompts and responses, offering a prompt-aware approach that improves upon the limitations of existing methods.

Methodology

Overview

The SDM framework utilizes joint clustering on sentence embeddings to establish a shared topic space for both prompts and responses, allowing for a detailed semantic comparison. The framework is designed to operate in a black-box setting, focusing on real-time detection of faithfulness hallucinations.

Key Components

  1. Prompt-aware Testing: The framework generates multiple responses to semantically equivalent paraphrased prompts, deepening the analysis of response consistency.
  2. Joint Embedding Clustering: Sentence embeddings from prompts and responses are clustered together to create a shared semantic topic space (see the sketch after this list).
  3. Semantic Divergence Metrics: A suite of information-theoretic metrics, including Jensen-Shannon divergence and Wasserstein distance, is computed to measure semantic drift.
  4. Semantic Box Framework: This diagnostic tool integrates various metrics to classify different LLM response types, such as confident confabulations.
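
A minimal sketch of the joint clustering step, assuming the sentence-transformers and scikit-learn libraries; the encoder name, example sentences, and cluster count are illustrative choices, not the paper's settings.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
import numpy as np

# Any sentence encoder works; this model name is an assumption for illustration.
model = SentenceTransformer("all-MiniLM-L6-v2")

prompt_sentences = ["What did Hubble discover about distant galaxies?"]
answer_sentences = [
    "Hubble observed that distant galaxies recede from us.",
    "Their recession velocity grows in proportion to their distance.",
]

# Embed prompts and answers together so the clusters define one SHARED topic space.
embeddings = model.encode(prompt_sentences + answer_sentences)

k = 2  # topic count; the paper's model-selection procedure is not reproduced here
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)

# Split the joint labels back into per-side topic distributions.
n_p = len(prompt_sentences)
P = np.bincount(labels[:n_p], minlength=k) / n_p
Q = np.bincount(labels[n_p:], minlength=k) / len(answer_sentences)
print(P, Q)  # inputs to the divergence metrics described next
```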

Algorithmic Framework

The paper presents a detailed, multi-step algorithm for implementing the SDM framework:

  1. Data Generation and Embedding: Generate paraphrases and corresponding responses, and embed the sentences using a sentence-transformer.
  2. Joint Clustering: Cluster the combined prompt-response embeddings to identify the shared topic space.
  3. Metric Calculation: Compute the divergence metrics, such as Jensen-Shannon divergence and Wasserstein distance, to assess semantic alignment.
  4. Hallucination Score: The final hallucination score $\mathcal{S}_H$ is calculated as a weighted sum of the divergence metrics, normalized by prompt complexity (see the sketch after this list).
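
A minimal, hedged sketch of steps 3 and 4, assuming the topic distributions P (prompt side) and Q (answer side) produced by the clustering step; the equal weights and unit prompt-complexity normalization are illustrative placeholders, not the paper's calibrated settings.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import wasserstein_distance

def hallucination_score(P, Q, w_js=0.5, w_w=0.5, prompt_complexity=1.0):
    """Combine Jensen-Shannon divergence and Wasserstein distance into S_H."""
    jsd = jensenshannon(P, Q, base=2) ** 2  # SciPy returns the JS *distance*; square it
    topics = np.arange(len(P))
    wass = wasserstein_distance(topics, topics, u_weights=P, v_weights=Q)
    return (w_js * jsd + w_w * wass) / prompt_complexity

# Toy topic distributions from the clustering step.
P = np.array([0.7, 0.3])
Q = np.array([0.2, 0.8])
print(f"S_H = {hallucination_score(P, Q):.3f}")  # higher => stronger divergence
```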

Experimental Analysis

Experiment Set A: Stability Gradient

  • Prompts: Designed to test varying degrees of stability, ranging from factual (Hubble) through interpretive (Hamlet) to creative (AGI Dilemma).
  • Findings: The framework effectively tracked the stability gradient, with higher $\mathcal{S}_H$ scores for more creative prompts.
  • Diagnostic Power: Visual heatmaps and co-occurrence distributions illustrated distinct response patterns across different stability scenarios (Figure 1).

Figure 1: Averaged Topic Co-occurrence Distributions for Experiment Set A. The heatmaps show the averaged joint probability $P_{\text{avg}}(X,Y)$.
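
For intuition, here is a hedged sketch of one way such a co-occurrence matrix could be assembled from per-sentence topic labels; the paper's exact estimator may differ, and the toy labels are assumptions.

```python
import numpy as np

def cooccurrence_matrix(prompt_topics, answer_topics, k):
    """Joint distribution P(X, Y) over prompt-topic X and answer-topic Y."""
    M = np.zeros((k, k))
    # Count every (prompt sentence topic, answer sentence topic) pair.
    for x in prompt_topics:
        for y in answer_topics:
            M[x, y] += 1
    return M / M.sum()

# Average over paraphrase/response pairs to approximate P_avg(X, Y).
pairs = [([0], [0, 0, 1]), ([0], [0, 1, 1])]  # toy topic labels per pair
P_avg = np.mean([cooccurrence_matrix(p, a, k=2) for p, a in pairs], axis=0)
print(P_avg)  # rows: prompt topics, columns: answer topics
```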

Experiment Set B: Diverse Prompt Types

  • Prompts: Spanned a range from factual to nonsensical, including a forced hallucination scenario.
  • Insights: Low $\mathcal{S}_H$ scores were observed for grounded responses, while the forced hallucination prompt revealed the model's confident evasion strategy.
  • Semantic Exploration: The KL divergence $\mathrm{KL}(\text{Answer} \,\|\, \text{Prompt})$ served as a critical indicator of the model's generative exploration under different constraints (Figure 2; see the sketch below).

Figure 2: Averaged Topic Co-occurrence Distributions for Experiment Set B. The heatmaps show the averaged joint probability $P_{\text{avg}}(X,Y)$.
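
A minimal sketch of the exploration signal over the topic distributions; the smoothing constant is an assumption added to keep the divergence finite when the answer touches topics absent from the prompt.

```python
import numpy as np
from scipy.stats import entropy

def semantic_exploration(P_prompt, Q_answer, eps=1e-9):
    """KL(Answer || Prompt): mass the answer places on topics the prompt lacks."""
    p = np.asarray(P_prompt, float) + eps  # smoothing keeps the divergence finite
    q = np.asarray(Q_answer, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return entropy(q, p)  # SciPy's entropy(pk, qk) computes KL(pk || qk)

# An answer concentrated on a topic absent from the prompt scores high.
print(semantic_exploration([0.7, 0.3, 0.0], [0.1, 0.2, 0.7]))
```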

Conclusion

The SDM framework advances the detection of hallucinations in LLMs by providing a prompt-aware, diagnostic approach grounded in information theory. It captures both thematic and content-level semantic shifts, offering insights into semantic exploration and response stability. While effective, the framework's context-dependent nature suggests future work on self-calibrating methods to enhance its applicability across diverse tasks. This paper presents a step towards more reliable and interpretable hallucination assessment for modern LLM deployments.
