Fair Play, Coherence & Surprise Metrics
- Metrics for fair play, coherence, and surprise constitute a quantitative framework that evaluates narrative integrity in detective fiction via probabilistic reader models.
- The approach operationalizes genre conventions by contrasting a gullible reader with a know-it-all reader, measuring surprise and coherence in how clues are revealed.
- Practical estimation with language models enables systematic evaluation and improvement of story generation by monitoring the inferential divergence between the two readers.
Metrics for fair play, coherence, and surprise form the foundation of a quantitative framework for evaluating the narrative integrity and reader experience in detective fiction. These metrics enable formal analysis of the implicit “contract” between author and reader, encapsulating genre principles such as clue transparency, story logic, and the maintenance of uncertainty until revelation. Recent work has introduced a probabilistic framework that operationalizes these concepts and supplies practical and theoretically grounded measures for their assessment, particularly within the context of LLM–generated stories (Wagner et al., 18 Jul 2025).
1. Probabilistic Foundation for Narrative Evaluation
The formalism models a detective story as a finite sequence of paragraphs $x_1, \dots, x_n$. The revelation point $r$ is defined as the earliest paragraph identifying the true culprit, denoted $c^*$. The universe of suspects is a finite set $\mathcal{C}$, with $d \in \mathcal{C}$ representing a prominent distractor.
Two generative processes are distinguished: an internal clue process with distribution $P(z_{1:n})$ over latent clue sequences $z_1, \dots, z_n$, and an external surface narrative produced autoregressively by a story model (SM), $P_{\mathrm{SM}}(x_t \mid x_{1:t-1})$. Causally, the text encodes the clues, which in turn determine the true culprit ($x \rightarrow z \rightarrow c^*$). Distinct idealized reader and detective models translate a prefix (of clues or text) into a distribution over suspect assignments, $R(\cdot \mid x_{1:t})$, enabling fine-grained comparison of divergent inferential strategies.
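To make the formalism concrete, the following minimal sketch represents a story as a paragraph sequence and a reader model as a map from text prefixes to suspect distributions. This is an illustration of the abstractions above, not the authors' implementation; the names `Story`, `ReaderModel`, and `posterior_curve` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Suspect = str
Distribution = Dict[Suspect, float]  # probabilities over the suspect universe C, summing to 1

@dataclass
class Story:
    paragraphs: List[str]    # x_1, ..., x_n
    suspects: List[Suspect]  # the universe C (including the distractor d)
    true_culprit: Suspect    # c*
    revelation: int          # r: 1-indexed position of the earliest paragraph naming c*

# A reader model maps a text prefix x_{1:t} (plus the suspect list) to a posterior over suspects.
ReaderModel = Callable[[List[str], List[Suspect]], Distribution]

def posterior_curve(reader: ReaderModel, story: Story) -> List[float]:
    """Probability the reader assigns to the true culprit after each pre-revelation prefix."""
    return [
        reader(story.paragraphs[:t], story.suspects)[story.true_culprit]
        for t in range(1, story.revelation)  # t = 1, ..., r-1
    ]
```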
2. Definition and Operationalization of Fair Play
Fair play is construed not as a single closed-form quantity but as the simultaneous attainment of two desiderata. First, coherence: a maximally informed or “brilliant” reader should progressively gain support for the true culprit as clues accrue. Second, surprise: a “gullible” or heuristic-driven reader should be systematically misled, maintaining low credence in the true culprit until the revelation. This dual requirement mirrors the genre convention that a fair mystery should be solvable by attentive readers while remaining non-obvious to casual deduction.
3. Quantitative Metrics: Coherence, Surprise, and Fair Play
Let $R_g$ denote the internal (gullible) reader and $R_k$ the external know-it-all reader. For paragraph position $t$, revelation point $r$, and story length $n$, the fundamental metrics are as follows (a computational sketch appears after the summary table below):
- Surprise score (S):
$$S = \frac{1}{r-1} \sum_{t=1}^{r-1} R_g(c^* \mid x_{1:t}),$$
representing the mean probability the gullible reader assigns to the true culprit before the revelation. Lower values indicate higher surprise.
- Coherence score (C):
$$C = \frac{1}{r-1} \sum_{t=1}^{r-1} R_k(c^* \mid x_{1:t}),$$
quantifying the mean probability the know-it-all assigns to the true culprit, with higher values reflecting better coherence.
- Fair play score (FP):
$$\mathrm{FP} = C - S,$$
capturing the area between the two reader curves; higher FP indicates a better trade-off.
Auxiliary metrics are defined to deepen the analysis:
- Cross-entropy:
$$\mathrm{CE}_t(R) = -\sum_{c \in \mathcal{C}} R_k(c \mid x_{1:t}) \log R(c \mid x_{1:t}),$$
measuring how uninformed a reader $R$ is relative to the know-it-all.
- Clue-effectiveness:
$$E_t = R_k(c^* \mid x_{1:t}) - R_k(c^* \mid x_{1:t-1}),$$
indicating the informativeness of step $t$.
- Internal coherence up to $t$:
$$C_{\le t} = \frac{1}{t} \sum_{s=1}^{t} R_g(c^* \mid x_{1:s}).$$
- Intelligence gap:
$$G_t = \mathrm{KL}\!\left(R_k(\cdot \mid x_{1:t}) \,\middle\|\, R(\cdot \mid x_{1:t})\right) = \mathrm{CE}_t(R) - H\!\left(R_k(\cdot \mid x_{1:t})\right).$$
“Strong surprise” is present if the reader’s cross-entropy exceeds $\log|\mathcal{C}|$ (worse than chance); “weak surprise” if the cross-entropy remains near $\log|\mathcal{C}|$ (uninformed).
| Metric | Formula | Interpretation |
|---|---|---|
| Surprise (S) | $S = \frac{1}{r-1}\sum_{t=1}^{r-1} R_g(c^* \mid x_{1:t})$ | Gullible belief in true culprit (low = surprise) |
| Coherence (C) | $C = \frac{1}{r-1}\sum_{t=1}^{r-1} R_k(c^* \mid x_{1:t})$ | Know-it-all belief in true culprit (high = coherence) |
| Fair play (FP) | $\mathrm{FP} = C - S$ | Degree of targeted misdirection with eventual solvability |
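Under the definitions above, the aggregate metrics reduce to simple averages and divergences over the two readers' posterior curves. The sketch below computes S, C, FP, the per-step cross-entropy, and the intelligence gap; the helper names are hypothetical, and `posterior_curve` refers to the earlier sketch.

```python
import math
from typing import Dict, List

def surprise_coherence_fairplay(gullible_curve: List[float],
                                knowitall_curve: List[float]) -> Dict[str, float]:
    """S, C, and FP = C - S from per-paragraph true-culprit probabilities (t = 1, ..., r-1)."""
    s = sum(gullible_curve) / len(gullible_curve)    # mean gullible belief in c*
    c = sum(knowitall_curve) / len(knowitall_curve)  # mean know-it-all belief in c*
    return {"surprise": s, "coherence": c, "fair_play": c - s}

def cross_entropy(p_knowitall: Dict[str, float], p_reader: Dict[str, float]) -> float:
    """CE_t(R): how uninformed reader R is relative to the know-it-all at step t."""
    return -sum(p * math.log(max(p_reader[c], 1e-12))
                for c, p in p_knowitall.items() if p > 0)

def intelligence_gap(p_knowitall: Dict[str, float], p_reader: Dict[str, float]) -> float:
    """G_t = KL(R_k || R): the cross-entropy minus the know-it-all's own entropy."""
    entropy = -sum(p * math.log(p) for p in p_knowitall.values() if p > 0)
    return cross_entropy(p_knowitall, p_reader) - entropy
```

Note that a uniform (maximally uninformed) reader over $|\mathcal{C}|$ suspects has cross-entropy exactly $\log|\mathcal{C}|$, which is the chance threshold separating strong from weak surprise.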
4. Theoretical Limits and the Coherence–Surprise Trade-off
The central theoretical result asserts a fundamental incompatibility, for a single reader model, between intense surprise and high intelligence (expressed as a low intelligence gap $G_t$). Precisely, if a reader is sufficiently intelligent (i.e., draws robust inferences from the text), then strong surprise cannot be maintained. Conversely, if surprise is only weak (the reader stays uninformed), internal coherence and surprise become mutually constraining. These results are formalized as inequalities bounding the attainable joint values of internal coherence, surprise, and intelligence gap, and they demonstrate the necessity of evaluating with at least two reader models simultaneously to meaningfully quantify fair play (Wagner et al., 18 Jul 2025). This suggests that achieving classic mystery-genre properties mechanistically requires engineered reader-model divergence throughout the narrative arc.
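The paper's formal inequalities are not reproduced here, but a toy numeric illustration conveys the mechanism: if a single reader's posterior is an interpolation between chance and the know-it-all's curve, shrinking that gap drives S toward C and collapses FP.

```python
# Toy illustration (not the paper's formal bound): as a single reader approaches
# the know-it-all (intelligence gap -> 0), its surprise headroom vanishes.
knowitall = [0.2, 0.4, 0.6, 0.8]  # rising belief in c*: a coherent clue trail
chance = 0.25                     # uniform belief over |C| = 4 suspects

for w in (1.0, 0.5, 0.1, 0.0):    # w = weight on chance; w -> 0 means a smarter reader
    reader = [w * chance + (1 - w) * p for p in knowitall]
    s = sum(reader) / len(reader)
    c = sum(knowitall) / len(knowitall)
    print(f"w={w:.1f}  S={s:.3f}  C={c:.3f}  FP={c - s:.3f}")  # FP shrinks to 0 as w -> 0
```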
5. Practical Estimation Procedures
Empirical estimation is practical for LLM-generated stories, where the governing story model can be sampled directly. The main protocol is as follows:
- At each paragraph position $t$, the prefix $x_{1:t}$ is provided to a “gullible” LLM (e.g., o1-mini), which is prompted to output a categorical distribution over suspects; the probability assigned to the actual culprit is recorded and averaged over the pre-revelation story to yield the surprise score.
- For coherence, given a known SM, independent continuations of each prefix are sampled; for each completed story, a judge LLM determines the designated culprit, enabling an empirical (Monte Carlo) estimate of $R_k(c^* \mid x_{1:t})$ as the fraction of continuations whose judged culprit is $c^*$ (see the sketch after this list).
- The fair play score is computed as the difference of these means.
- An auxiliary metric, Expected Revelation Content (ERC), quantifies the mutual information between pre- and post-reveal clues via a masked-paragraph multiple-choice task adjudicated by an LLM.
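A minimal sketch of the coherence step, assuming hypothetical wrappers `sample_continuation` (around the SM) and `judge_culprit` (around the judge LLM):

```python
from typing import Callable, List

def estimate_knowitall_posterior(
    prefix: List[str],                                       # x_{1:t}
    true_culprit: str,                                       # c*
    sample_continuation: Callable[[List[str]], List[str]],   # SM: prefix -> completed story
    judge_culprit: Callable[[List[str]], str],               # judge LLM: story -> designated culprit
    num_samples: int = 32,
) -> float:
    """Monte-Carlo estimate of R_k(c* | x_{1:t}): the fraction of sampled
    continuations of the prefix whose judged culprit is the true one."""
    hits = sum(
        judge_culprit(sample_continuation(prefix)) == true_culprit
        for _ in range(num_samples)
    )
    return hits / num_samples
```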
6. Implications for Narrative Generation and Evaluation
Quantitative fair play, coherence, and surprise metrics offer a principled basis for automatic evaluation and improvement of generative models of detective fiction. The separation of reader models is essential: effective narratives must be constructed so that weaker heuristics are intentionally misled while stronger inference remains feasible, in line with “golden-age” fair play rules. Being reference-less, these metrics are suitable as objectives for training or reranking in LLM-based story generators, and they allow continuous monitoring of inferential divergence during interactive narrative planning. Since martingale-like probability estimation alone cannot deliver jointly high coherence and high surprise, explicit design interventions are required to maintain this tension. A plausible implication is that fully automatable “fair play” evaluation is tractable for LLMs, but the narrative craft of targeted misdirection remains a nontrivial generative challenge (Wagner et al., 18 Jul 2025).
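As one example of the reference-less use case, candidate stories can be reranked directly by estimated fair play; composing the estimators above into a single `fair_play_score` is left as a hypothetical.

```python
from typing import Callable, List, Tuple

def rerank_by_fair_play(
    candidates: List[List[str]],                    # candidate stories as paragraph lists
    fair_play_score: Callable[[List[str]], float],  # FP = C - S estimator built from the sketches above
) -> List[Tuple[float, List[str]]]:
    """Reference-less reranking: order candidate stories by estimated fair play."""
    return sorted(((fair_play_score(story), story) for story in candidates),
                  key=lambda pair: pair[0], reverse=True)
```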