Automated Self-Consistency Evaluation for LLMs
- Automated self-consistency evaluation systematically measures a model's internal agreement across structured tasks to detect and quantify contradictions, enhancing trustworthiness and interpretability.
- The process utilizes an inconsistency metric and employs methods like graph-based cycle removal or energy-based modeling to diagnose issues and correct inconsistent outputs.
- This evaluation reveals that current large language models exhibit widespread inconsistencies even on simple tasks such as temporal or spatial reasoning, highlighting a challenge for reliable AI.
Automated self-consistency evaluation is the process of systematically measuring and, where feasible, improving the internal agreement of a model’s predicted relations across a set of structured, compositional tasks. The objective is to detect and quantify contradictions that undermine the trustworthiness and interpretability of LLMs. “Existing LLMs Are Not Self-Consistent For Simple Tasks” introduces a general framework for this evaluation, combines rigorously defined inconsistency metrics with automated graph- and energy-based post-processing, and investigates the persistence of inconsistency in current LLMs, even on basic reasoning setups.
1. Foundations: Self-Consistency and Its Importance
Self-consistency, in the AI context, refers to the absence of contradictions in a model’s internal reasoning about structured domains. For LLMs, this means the model should not simultaneously assert relations that cannot all be true together—for example, stating both “A is before B” and “B is before A,” or inferring inconsistent kinship, spatial, or temporal orders. Ensuring self-consistency is an essential quality for reliable, interpretable AI, as inconsistencies can propagate unpredictable errors or undermine user trust, especially in settings where explanations or auditability are required.
The paper focuses on simple but compositionally rich tasks: temporal sequences, spatial arrangements (e.g., identifying the order or directionality between cities on a map), and family tree reasoning. These tasks are ideal testbeds for self-consistency as their relations must obey strict transitive and compositional constraints.
2. Formalization: The Inconsistency Metric
The central quantitative tool is the inconsistency metric, which measures the proportion of a model’s output relations that conflict with a compositionally consistent set.
Let:
- $N$: the total number of entities.
- $\mathcal{P}$: the set of all possible directed pairs (ordered relationships) among the entities.
- $R$: the set of all relations predicted by the model.
- $C$: any trusted context (ranging from the empty set to a full set of ground-truth relations).
The metric is defined as
$$\mathrm{Inc}(R \mid C) = \min_{S \in \mathcal{K}(C)} \frac{|R \setminus S|}{|R|},$$
where $\mathcal{K}(C)$ is the set of all supersets of $C$ that are composition-consistent (i.e., admit an ordering or structure without internal contradiction).
Interpretation: This is the minimum (normalized) number of the model’s predictions that need to be edited (removed) to restore internal consistency, respecting any given context. It is a “distance to feasibility” metric: lower is better, zero indicates perfect self-consistency.
In explicit cases:
- With no trusted context ($C = \emptyset$), $\mathrm{Inc}(R \mid \emptyset)$ is determined by the minimal number of "reverse edges" (contradictory relations, e.g., $a \prec b$ and $b \prec a$) that must be deleted to admit a single ordering.
- With full ground truth (a context $C^{*}$ that fully specifies the order), $\mathrm{Inc}(R \mid C^{*})$ reduces to the normalized classification error rate.
Formula in the special cases:
$$\mathrm{Inc}(R \mid \emptyset) = \frac{r(R)}{|R|}, \qquad \mathrm{Inc}(R \mid C^{*}) = \frac{e(R, C^{*})}{|R|},$$
where $r(R)$ counts reverse (cycle) edges and $e(R, C^{*})$ counts outright contradictions to the context.
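For concreteness, the following minimal Python sketch implements the two special cases above, assuming relations are encoded as (a, b) pairs meaning "a precedes b". The function names are illustrative, and the no-context version counts only direct two-cycles; longer cycles require the feedback-arc-set machinery of Section 3.

```python
# Minimal sketch of Inc(R | C) in its two special cases.
# Assumes relations are (a, b) tuples meaning "a precedes b".

def inconsistency_no_context(relations):
    """Inc(R | {}): fraction of edges that must be deleted to break
    direct contradictions. Counts only two-cycles (a, b) / (b, a);
    longer cycles need the feedback-arc-set approximation of Section 3."""
    rel = set(relations)
    two_cycles = sum(1 for (a, b) in rel if (b, a) in rel) // 2
    return two_cycles / len(rel) if rel else 0.0

def inconsistency_full_context(relations, ground_truth):
    """Inc(R | C*): normalized classification error against full ground truth."""
    truth = set(ground_truth)
    wrong = sum(1 for r in relations if r not in truth)
    return wrong / len(relations) if relations else 0.0
```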
3. Automated Consistency Correction: Graph-Based and Energy-Based Methods
Two practical, theoretically motivated methods are introduced to post-process and “fix” self-inconsistent model outputs, offering both diagnostic and improvement tools.
A. Graph-Based Cycle Removal
This approach leverages the tight link between composition-consistency and directed acyclic structure:
- Construct a Directed Graph where entities are nodes and predicted pairwise relations are directed edges.
- Detect Strongly Connected Components (SCCs): Cyclic SCCs represent inconsistent reasoning (e.g., a set of entities where the model claims a circular order).
- Identify Minimal Feedback Arc Set: Use a topological sort (by node in-degree) to approximate the smallest set of edges whose removal would break all cycles (analogous to finding minimal contradiction).
- Restore Consistency: Remove (or flip) detected reverse edges, yielding a simply ordered (i.e., consistent) structure. The paper provides formal justification that this process produces a composition-consistent ordering, minimal in edit distance from the model outputs.
This algorithm is efficient for moderately sized tasks and is especially well-suited for one-dimensional and two-dimensional compositions (e.g., time lines, spatial directions).
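A minimal Python sketch of this procedure, assuming the same (a, b) pair encoding as above; the greedy in-degree heuristic stands in for an exact minimum feedback arc set, which is NP-hard to compute:

```python
# Minimal sketch of graph-based cycle removal over (a, b) "a precedes b" pairs.
from collections import defaultdict

def greedy_topological_order(relations):
    """Order entities by repeatedly extracting a node of minimal in-degree.
    On an acyclic graph this is a topological sort; on a cyclic graph the
    minimal-in-degree choice breaks cycles greedily."""
    edges = set(relations)
    nodes = {x for pair in edges for x in pair}
    indeg = {v: 0 for v in nodes}
    out = defaultdict(list)
    for a, b in edges:
        out[a].append(b)
        indeg[b] += 1
    order, remaining = [], set(nodes)
    while remaining:
        v = min(remaining, key=lambda u: (indeg[u], u))  # deterministic tie-break
        order.append(v)
        remaining.remove(v)
        for w in out[v]:
            if w in remaining:
                indeg[w] -= 1
    return order

def remove_reverse_edges(relations, order):
    """Keep only edges that agree with the order; the result is acyclic."""
    pos = {v: i for i, v in enumerate(order)}
    return [(a, b) for (a, b) in relations if pos[a] < pos[b]]
```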
B. Energy-Based Modeling (EBM)
The EBM-based approach frames the problem as finding a coordinate assignment (e.g., real numbers in 1D for time orders, 2D vectors for spatial layouts) to minimize a global energy:
- Each object $i$ is assigned a coordinate $x_i$.
- For each asserted relation $(i, j)$ ("$i$ should precede $j$"), define a penalty that is positive exactly when the order is violated, e.g., a hinge term $E_{ij} = \max(0,\, x_i - x_j + \epsilon)$ with margin $\epsilon > 0$.
- The total energy is $E(x) = \sum_{(i,j) \in R} E_{ij}$.
- Coordinates are optimized (e.g., by gradient descent) to minimize $E$. After convergence, the relations induced by the coordinate ordering are necessarily self-consistent.
EBM is generalizable and never produces cycles, ensuring strict composition-consistency.
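A minimal NumPy sketch for the 1D (temporal) case, assuming the hinge energy above; the step count, learning rate, and margin are illustrative hyperparameters:

```python
# Minimal sketch of the 1D energy-based correction, assuming the hinge
# energy E_ij = max(0, x_i - x_j + margin) for each "i precedes j" pair.
import numpy as np

def ebm_order(relations, n_steps=2000, lr=0.05, margin=1.0, seed=0):
    nodes = sorted({x for pair in relations for x in pair})
    idx = {v: k for k, v in enumerate(nodes)}
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(len(nodes))          # one 1D coordinate per entity
    pairs = np.array([(idx[a], idx[b]) for a, b in relations])
    for _ in range(n_steps):
        # The hinge is active only where "i precedes j" is currently violated.
        active = (x[pairs[:, 0]] - x[pairs[:, 1]] + margin) > 0
        grad = np.zeros_like(x)
        np.add.at(grad, pairs[active, 0], 1.0)   # dE/dx_i = +1 on active terms
        np.add.at(grad, pairs[active, 1], -1.0)  # dE/dx_j = -1 on active terms
        x -= lr * grad
    # Sorting by coordinate induces a total order, hence no cycles.
    return sorted(nodes, key=lambda v: x[idx[v]])
```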
4. Experimental Findings and Analysis
The authors apply these evaluation and correction methods to a suite of models on tasks spanning spatial order, temporal order, and family relations. All smaller models, and even state-of-the-art models such as DeepSeek-R1 and GPT-o4-mini, manifest numerous contradictions, often in basic settings.
Key findings:
- Inconsistencies are widespread even on “simple” tasks. No tested model was fully self-consistent.
- Graph- and EBM-based corrections dramatically reduce but do not eliminate the underlying error; residual inconsistencies correspond to cases of intractable model confusion or insufficient model structure.
- Consistency scores from the graph- and EBM-based methods are highly correlated with each other, validating the approach.
- The approach efficiently pinpoints “reverse edges” for targeted diagnosis, and applies to both domain-agnostic and domain-specific settings.
5. Implications for Reliability, Interpretability, and Model Development
Ensuring self-consistency is foundational for interpretable AI: consistent models are less likely to contradict themselves, support trustworthy explanations, and facilitate safe deployment in domains such as law, medicine, or scientific reasoning. The evaluation framework’s strengths include:
- Diagnostic clarity: Automated extraction of minimal fixes reveals systematic failure modes and helps direct attention to necessary architectural or data improvements.
- Interpretability: By enforcing global consistency via post-processing, explanations and derived outputs are made more intelligible and reliable, even if baseline model predictions are noisy.
- Limitation recognition: Self-consistency is necessary but not sufficient for factual correctness—models can be consistently wrong. The framework distinguishes between error types and informs where factual ground-truth accuracy must be ensured.
- Structural insight: The graph- and energy-based corrections elucidate the structural properties of LLM reasoning, especially in compositional domains.
6. Challenges in Achieving Full Self-Consistency
The paper highlights fundamental and practical barriers:
- NP-hardness of optimal correction: finding a minimum feedback arc set is NP-hard (its decision version is NP-complete), so exact correction does not scale to very large graphs; approximations suffice for the studied tasks.
- Intrinsic architectural limitations: Autoregressive, left-to-right LLMs do not internally enforce bidirectional or compositional constraints, making global consistency difficult to guarantee.
- Limits of finetuning: Standard supervised approaches cannot, by themselves, always yield globally consistent reasoning.
The need for advanced architectures or targeted learning principles—possibly inspired by compositional or categorical mathematics—is underscored.
7. Representative Formulations and Algorithms
Inconsistency Score
$$\mathrm{Inc}(R \mid C) = \min_{S \in \mathcal{K}(C)} \frac{|R \setminus S|}{|R|},$$
with application-dependent simplifications for $C = \emptyset$ or $C$ fully known (see Section 2).
Graph-based Cycle Removal (Sketch)
- Build directed graph from relations.
- Topological sort to identify order, remove “reverse” edges.
- Output the minimally edited, consistent order as the corrected prediction set.
EBM Objective
$$E(x) = \sum_{(i,j) \in R} \max(0,\, x_i - x_j + \epsilon),$$
with coordinates updated via gradient descent.
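For concreteness, a hypothetical end-to-end run of the earlier sketches on a toy temporal task with one planted contradiction (the function names refer to the illustrative snippets in Sections 2 and 3):

```python
# "C precedes A" contradicts the chain A -> B -> C.
relations = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "A")]

print(inconsistency_no_context(relations))     # 0.25: one of four edges must go
order = greedy_topological_order(relations)    # e.g., ['A', 'B', 'C']
print(remove_reverse_edges(relations, order))  # drops the reverse edge ("C", "A")
print(ebm_order(relations))                    # coordinate-induced, cycle-free order
```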
8. Synthesis and Outlook
Automated self-consistency evaluation, as implemented and analyzed in the referenced work, demonstrates that state-of-the-art LLMs often fail to maintain even basic global consistency across simple relational reasoning tasks. The proposed metrics and correction algorithms provide both actionable diagnostics and partial post-hoc remedies; however, persistent inconsistencies signal a deeper challenge inherent in current model paradigms.
Self-consistency is a nontrivial and non-optional requirement for reliable, interpretable AI. Robust evaluation and targeted correction, as described, are necessary interim steps toward safer and more trustworthy LLMs.