Papers
Topics
Authors
Recent
Search
2000 character limit reached

The Consistency Dilemma in LLMs: Generator-Evaluator Agreement and Vulnerability to Mistakes

Published 16 Jun 2026 in cs.CY and cs.AI | (2606.30653v1)

Abstract: LLMs are increasingly deployed in agentic pipelines that depend on the model evaluating its own outputs without external verification. The reliability of these pipelines depends on an implicit assumption: that the model applies relevant concepts the same way when it generates an output and later evaluates that output. We propose a new measure, generator-evaluator self-consistency, to test this assumption directly and apply it to 10 frontier models across 491 concepts. We find, first, that there is substantial variation in self-consistency. Second, we find that in a clinical setting with physician-validated mistakes (Proniakin et al., 2025), across models, those with higher self-consistency are linked to greater vulnerability to mistakes. Thus, even when models consistently apply concepts they may not be safe to deploy. This is evidence of a consistency dilemma in LLMs: self-consistency is operationally useful, but models that are more consistent are also more prone to mistakes.

Summary

No one has generated a summary of this paper yet.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Continue Learning

We haven't generated follow-up questions for this paper yet.

Collections

Sign up for free to add this paper to one or more collections.