
Semantic Faithfulness and Entropy Production Measures to Tame Your LLM Demons and Manage Hallucinations (2512.05156v1)

Published 4 Dec 2025 in cs.AI, cs.CL, cs.IT, cs.LG, and q-fin.CP

Abstract: Evaluating faithfulness of LLMs to a given task is a complex challenge. We propose two new unsupervised metrics for faithfulness evaluation using insights from information theory and thermodynamics. Our approach treats an LLM as a bipartite information engine where hidden layers act as a Maxwell demon controlling transformations of context $C$ into answer $A$ via prompt $Q$. We model Question-Context-Answer (QCA) triplets as probability distributions over shared topics. Topic transformations from $C$ to $Q$ and $A$ are modeled as transition matrices ${\bf Q}$ and ${\bf A}$ encoding the query goal and actual result, respectively. Our semantic faithfulness (SF) metric quantifies faithfulness for any given QCA triplet by the Kullback-Leibler (KL) divergence between these matrices. Both matrices are inferred simultaneously via convex optimization of this KL divergence, and the final SF metric is obtained by mapping the minimal divergence onto the unit interval [0,1], where higher scores indicate greater faithfulness. Furthermore, we propose a thermodynamics-based semantic entropy production (SEP) metric in answer generation, and show that high faithfulness generally implies low entropy production. The SF and SEP metrics can be used jointly or separately for LLM evaluation and hallucination control. We demonstrate our framework on LLM summarization of corporate SEC 10-K filings.

Summary

  • The paper introduces two novel black-box metrics, Semantic Faithfulness (SF) and Semantic Entropy Production (SEP), to quantify LLM alignment and hallucination risk.
  • It employs a framework grounded in information theory and thermodynamics, using convex optimization over sentence embedding clusters to derive interpretable topic transitions.
  • Empirical analyses reveal significant correlations between question entropy, SF, and SEP, offering actionable insights for robust LLM evaluation and prompt engineering.

Semantic Faithfulness and Entropy Production for LLM Evaluation

Introduction

This work introduces a principled framework for quantitative, reference-free evaluation of LLM faithfulness and hallucination detection, based on information-theoretic and thermodynamic foundations. The central contribution is a pair of novel black-box metrics: Semantic Faithfulness (SF), which quantifies alignment between the question-induced and answer-induced transformations of context, and Semantic Entropy Production (SEP), which characterizes the thermodynamic irreversibility of the answer generation channel.

The paper conceptualizes LLMs as bipartite information engines, invoking the Maxwell's demon paradigm: a hidden controller mediates the transformation of input context $C$ into answer $A$ under the specification of a question $Q$. Topical representations are employed to construct marginal distributions over latent clusters for each element in the triplet, leading to the inference of optimal topic transition matrices via convex optimization. The resulting framework is both theoretically principled and practically feasible, requiring only sentence embeddings and standard optimization.

Topical Information Flow and Faithfulness Metric

The approach formalizes each QCA triplet $(Q, C, A)$ as a set of marginal distributions $(p^{(q)}, p^{(c)}, p^{(a)})$ over semantic topics identified via clustering in embedding space. Two key row-stochastic transition matrices, $\mathbf{Q}$ and $\mathbf{A}$, encode the mapping of context topics to question and answer topics, respectively. The Semantic Faithfulness score is then defined as

$$\mathcal{F}_S = \frac{1}{1 + D_{\min}},$$

where $D_{\min}$ is the minimum Kullback-Leibler divergence $D(\mathbf{A} \,\|\, \mathbf{Q})$ over all pairs of feasible matrices matching the constraints induced by the marginals. This setup exploits information geometry, a key property being joint convexity, which ensures global optimality via alternating minimization. High $\mathcal{F}_S$ signals strong alignment of the semantic transformation specified by the prompt with that realized in the answer (Figure 1).

Figure 2: Optimal transition matrices $\mathbf{Q}^{\star}$ and $\mathbf{A}^{\star}$, both exhibiting sparse, interpretable alignments between context topics and those in the question and answer.

The transition matrices themselves serve as interpretable artifacts that reveal how context is projected semantically into both question and answer spaces.
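The final scoring step can be sketched in a few lines; a minimal numpy sketch, assuming the divergence between $\mathbf{A}$ and $\mathbf{Q}$ is computed row-wise and weighted by the context topic marginal $p^{(c)}$ (the exact weighting and the constrained alternating minimization that infers the matrices are omitted here):

```python
import numpy as np

def kl_div(p, q, eps=1e-12):
    """KL divergence D(p || q) between two discrete distributions, in nats."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def semantic_faithfulness(A, Q, p_c):
    """SF = 1 / (1 + D_min): divergence between answer and question
    transition matrices, here weighted row-wise by the context marginal."""
    A, Q, p_c = map(np.asarray, (A, Q, p_c))
    D = sum(p_c[i] * kl_div(A[i], Q[i]) for i in range(len(p_c)))
    return 1.0 / (1.0 + D)

# Toy example: three context topics mapped onto two question/answer topics.
Q = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
A_good = Q.copy()                                      # answer realizes the query goal
A_bad = np.array([[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]])  # answer drifts from it
p_c = np.array([0.5, 0.3, 0.2])

print(semantic_faithfulness(A_good, Q, p_c))  # 1.0 (identical channels)
print(semantic_faithfulness(A_bad, Q, p_c))   # below 1: semantic drift
```

Identical channels yield zero divergence and a score of exactly 1, while any mismatch between the query goal and the realized answer transformation pushes the score toward 0.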

Thermodynamic Interpretation: Semantic Entropy Production

LLM answer generation is further analyzed through the lens of stochastic thermodynamics. The transition from $C$ to $A$ is treated as an out-of-equilibrium process, and SEP is derived as the minimal KL divergence between the forward (optimal) and a class of reverse transition matrices. The total entropy production $\dot{S}_{\text{tot}}$ decomposes into:

$$\dot{S}_{\text{tot}} = H(p^{(a)}) - H(p^{(c)}) + \dot{S}_m,$$

where $\dot{S}_m$ is the dissipated "heat", which, when negative, indicates that the LLM draws on its internal knowledge base to produce answers more informative (higher entropy) than the context alone provides (Figure 3).

Figure 4: SEP decomposition shows negative $\dot{S}_m$ for many QCA triplets, supporting the interpretation of knowledge injection from the model.

The SEP provides a theoretically motivated alternative to the recently proposed semantic entropy metric [farquhar2024semantic], incorporating context complexity inherently and capturing process irreversibility.
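The decomposition above can be illustrated numerically; a minimal sketch with made-up topic marginals and a placeholder value for $\dot{S}_m$, which in the actual framework comes from the reverse-channel optimization:

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy H(p) in nats."""
    p = np.asarray(p, float)
    p = p / p.sum()
    return float(-np.sum(p * np.log(p + eps)))

# Illustrative topic marginals: a diffuse context and a sharper answer.
p_c = np.array([0.25, 0.25, 0.25, 0.25])
p_a = np.array([0.60, 0.25, 0.10, 0.05])

dH_system = entropy(p_a) - entropy(p_c)  # system term H(p^(a)) - H(p^(c))
S_m = 0.15                               # placeholder dissipated-"heat" term
S_tot = dH_system + S_m                  # total entropy production

print(f"H(p_a) - H(p_c) = {dH_system:.3f} nats")
print(f"S_tot = {S_tot:.3f} nats")
```

Here the answer is more focused than the context, so the system term is negative; whether $\dot{S}_{\text{tot}}$ stays positive then depends on the dissipated term.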

Empirical Evaluation

The framework was validated on 10 QCA triplets constructed from NVIDIA's 2024 10-K risk factor disclosures, probing both comprehensive and competitive-risk-focused questions.

  • Semantic distributions: Sentence embeddings (Qwen3-Embedding-0.6B) are clustered into 23 topics via UDIB; topic probabilities are derived from the cluster assignments of sentences.
  • Algorithm: Convex alternating minimization over $\mathbf{A}$ and $\mathbf{Q}$ for SF; dual Lagrangian maximization for the reverse-channel SEP lower bound.
  • Primary findings:
    • Both question structure and entropy affect faithfulness. A positive Pearson correlation of $r = 0.695$ is observed between question entropy $H(Q)$ and $\mathcal{F}_S$.
    • SEP and SF are negatively correlated ($r = -0.612$), but the linear relationship deviates from the naive $1/\mathcal{F}_S - 1$ approximation, supporting their complementarity.
    • Figure 5: Higher question entropy is associated with increased semantic faithfulness; Group A (multi-topic) shows broader variation.
    • Figure 3: Negative association between $\mathcal{F}_S$ and SEP (more faithful answers are thermodynamically less irreversible), with different regimes for question types.
    • Figure 6: Marginal topic distributions in triplet A0, illustrating how sparse question demands induce focused answer topics from a diffuse context.

Comprehensive questions evoke greater semantic expansion (higher $H(A) - H(C)$) and SEP, while focused competitive questions manifest low-dispersion, low-entropy changes.
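The embedding-and-clustering step that produces these marginals can be sketched end to end; in this sketch, random vectors stand in for the Qwen3 sentence embeddings and a plain k-means loop stands in for UDIB clustering:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=50):
    """Tiny k-means: returns a cluster label for each row of X."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def topic_marginal(labels, k):
    """Empirical p(topic) from sentence-to-cluster assignments."""
    counts = np.bincount(labels, minlength=k)
    return counts / counts.sum()

# Stand-in for embedding the context sentences (40 sentences, dim 8).
emb_context = rng.normal(size=(40, 8))
k = 5  # toy topic count; the paper uses 23 topics
labels = kmeans(emb_context, k)
p_c = topic_marginal(labels, k)
print(p_c, p_c.sum())
```

The same procedure applied to the question and answer sentences yields $p^{(q)}$ and $p^{(a)}$, which then constrain the transition-matrix inference.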

Qualitative and LLM-as-a-Judge Analysis

LLM-based evaluation (Claude Sonnet 4.5) was used to adjudicate SF-guided answer selection. While high- and low-$\mathcal{F}_S$ answers exhibited equivalent human-perceived quality by standard criteria (faithfulness, completeness, coherence, relevance), additional experiments revealed SF can sometimes expose semantic drift or hallucination undetected by surface-level metrics. Thus, $\mathcal{F}_S$ provides a distinct axis of evaluative insight, emphasizing alignment in information flow rather than stylistic or surface coverage.

Theoretical and Practical Implications

This framework addresses several key limitations of existing evaluation protocols:

  • Reference-free assessment: SF does not require gold references, providing a rigorous alternative to BLEU/ROUGE.
  • Disentanglement of prompt and context: SEP and SF implicitly factor context complexity, obviating the pathologies of marginal-answer-entropy-only methods.
  • Hallucination control: Large SEP or low SF values signal semantic drift indicative of hallucination, enabling robust candidate selection or post-hoc detection.
  • Explainability: The inferred transition matrices and topic probabilities provide interpretable windows into LLM operation.
  • Scalability: Methods are compatible with any embedding model and are highly efficient due to convexity (Figure 2).

    Figure 1: Scatter plot over many synthetic QCA triplets, validating the general negative SF–SEP trend but demonstrating nontrivial residual variability.
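The hallucination-control use case suggests a simple selection rule; the following hypothetical sketch (the threshold and tie-breaking scheme are illustrative assumptions, not from the paper) ranks candidate answers by SF and uses SEP to break ties and flag risky candidates:

```python
def select_answer(candidates, sf_threshold=0.5):
    """Rank candidates by semantic faithfulness (higher is better),
    breaking ties with semantic entropy production (lower is better).
    candidates: list of (answer_text, sf_score, sep_score) tuples."""
    # Candidates below the (hypothetical) SF threshold are flagged as risky.
    admissible = [c for c in candidates if c[1] >= sf_threshold]
    pool = admissible or candidates  # fall back if everything is flagged
    best = max(pool, key=lambda c: (c[1], -c[2]))
    flagged = [c[0] for c in candidates if c[1] < sf_threshold]
    return best[0], flagged

cands = [
    ("answer-1", 0.81, 0.42),
    ("answer-2", 0.34, 1.90),  # low SF, high SEP: likely drift
    ("answer-3", 0.81, 0.55),
]
best, flagged = select_answer(cands)
print(best, flagged)  # answer-1 wins the SEP tie-break; answer-2 is flagged
```

This mirrors the paper's suggestion of using the two metrics jointly: SF as the primary alignment signal, SEP as a secondary irreversibility check.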

Future Directions

  • Scaling and domain transfer: Application to larger, more diverse datasets and different domains.
  • RAG and multi-turn: Adaptation to retrieval-augmented architectures and multi-step dialogues.
  • Robust uncertainty quantification: Integration with Bayesian uncertainty and epistemic risk monitors.
  • Prompt and answer optimization: Using metrics for systematic prompt engineering and answer curation.

Conclusion

This work establishes an information-theoretic and thermodynamic foundation for evaluating LLM semantic faithfulness and hallucination risk, operationalized through the black-box metrics of SF and SEP (2512.05156). These advances extend the semantic divergence framework [halperin2025sdm, halperin2025dib], providing practical, theoretically justified tools for answer selection, algorithmic governance, and empirical audit in high-stakes LLM deployment. The joint use of SF and SEP yields richer diagnostics than either metric alone, with broad implications for both foundational LLM analysis and downstream applications.
