AA-Omniscience Benchmark
- AA-Omniscience Benchmark is a comprehensive framework that integrates epistemic logic, information theory, and machine learning to quantify both finite (attainable) and infinite (full) knowledge.
- It distinguishes between finite evidence-based deductions and global introspection, enabling precise evaluation of knowledge reliability and communication efficiency.
- Empirical benchmarks, including an Omniscience Index, assess factual recall and calibration in large language models across diverse domains to guide robust AI design.
The AA-Omniscience Benchmark provides a rigorous, multifaceted framework for evaluating omniscience and knowledge reliability in computational systems. The name encompasses both formal models from epistemic logic—specifically the interplay between attainable (finite evidence-based) and full (infinite evidence-based) knowledge—and concrete empirical benchmarks testing cross-domain factuality and self-calibration in LLMs. AA-Omniscience unifies modal logic, information theory, and applied machine learning measurement under a common goal: to quantify not only what a system knows, but also its awareness of its own knowledge gaps and its efficiency in acquiring or exchanging knowledge.
1. Modal Logic Foundations: Attainable Knowledge and Omniscience
The epistemic starting point for AA-Omniscience is a bi-modal logic distinguishing "attainable knowledge" (written here as $\mathsf{A}\varphi$), that which can be deduced from some finite subset of the evidence, from "omniscience" ($\mathsf{O}\varphi$), which draws on the totality of potentially infinite evidence. The language is generated by the grammar $\varphi ::= p \mid \neg\varphi \mid \varphi \to \varphi \mid \mathsf{A}\varphi \mid \mathsf{O}\varphi$, interpreted in evidence-based models $(W, E, \{\sim_e\}_{e \in E}, \pi)$ with a set of worlds $W$, a set of evidence pieces $E$, indexed indistinguishability relations $\sim_e$ (where $w \sim_U u$ abbreviates indistinguishability under every piece of evidence in $U \subseteq E$), and a valuation $\pi$ of propositional atoms (Naumov et al., 2017).
The modalities are semantically characterized as:
- $w \Vdash \mathsf{A}\varphi$ iff there exists a finite $U \subseteq E$ such that $u \Vdash \varphi$ for all $u \in W$ with $w \sim_U u$,
- $w \Vdash \mathsf{O}\varphi$ iff $u \Vdash \varphi$ for all $u \in W$ with $w \sim_E u$.
The system's axiomatics combine S5 for $\mathsf{O}$ and S4 for $\mathsf{A}$, supplemented by a monotonicity axiom ($\mathsf{A}\varphi \to \mathsf{O}\varphi$) and a "mixed 5" axiom ($\neg\mathsf{A}\varphi \to \mathsf{O}\neg\mathsf{A}\varphi$), with corresponding soundness and completeness results established by canonical evidence-model constructions. This formal apparatus separates knowledge attainable with bounded evidence access from knowledge requiring full closure or global introspection (Naumov et al., 2017).
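To make the two semantic clauses concrete, the following minimal Python sketch evaluates $\mathsf{A}$ and $\mathsf{O}$ over a small, hand-built evidence model; the class and attribute names are illustrative choices, not part of the cited formalism. Note that in any model with finitely many evidence pieces the two modalities coincide (one may simply take $U = E$), so the genuine separation requires infinite evidence, as in the Hilbert Hotel archetype discussed in Section 4.

```python
from itertools import chain, combinations

# Minimal sketch of a finite evidence-based model, with illustrative names:
# worlds W, evidence pieces E (each inducing an indistinguishability partition
# of W), and a valuation of propositional atoms.
class EvidenceModel:
    def __init__(self, worlds, evidence_partitions, valuation):
        self.worlds = set(worlds)
        # evidence piece -> partition of W into blocks of indistinguishable worlds
        self.partitions = {e: [set(block) for block in blocks]
                           for e, blocks in evidence_partitions.items()}
        self.valuation = valuation  # atom -> set of worlds where the atom holds

    def indistinguishable(self, w, u, evidence_subset):
        """w ~_U u: w and u fall in the same block of every partition in U."""
        return all((w in block) == (u in block)
                   for e in evidence_subset
                   for block in self.partitions[e])

    def attainable(self, w, atom):
        """A(atom) at w: some finite evidence subset U settles the atom."""
        evidence = list(self.partitions)
        finite_subsets = chain.from_iterable(
            combinations(evidence, k) for k in range(len(evidence) + 1))
        return any(all(u in self.valuation[atom]
                       for u in self.worlds if self.indistinguishable(w, u, U))
                   for U in finite_subsets)

    def omniscient(self, w, atom):
        """O(atom) at w: the totality of evidence E settles the atom."""
        return all(u in self.valuation[atom]
                   for u in self.worlds
                   if self.indistinguishable(w, u, self.partitions))

# Toy model: e1 cannot tell w1 from w2, e2 can; p holds only at w1.
m = EvidenceModel(
    worlds={"w1", "w2"},
    evidence_partitions={"e1": [{"w1", "w2"}], "e2": [{"w1"}, {"w2"}]},
    valuation={"p": {"w1"}},
)
# In any finite model A and O coincide (take U = E); both print True here.
print(m.attainable("w1", "p"), m.omniscient("w1", "p"))
```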
2. Communication for Omniscience: Information-Theoretic Models and Algorithms
In information theory, the AA-Omniscience paradigm is reflected in the communication for omniscience (CO) problem: distributed users, each with partial knowledge of a random source, exchange information so that all of them attain "omniscience" of the full source at minimum communication cost. For a user set $V$ observing random variables $Z_V = (Z_i)_{i \in V}$ with a known joint distribution, the achievable rate vectors $r = (r_i)_{i \in V}$ are characterized by Slepian–Wolf type inequalities
$$\sum_{i \in B} r_i \;\ge\; H(Z_B \mid Z_{V \setminus B}) \qquad \text{for every nonempty } B \subsetneq V,$$
and the minimum sum-rate is the smallest value of $\sum_{i \in V} r_i$ over this region (Milosavljevic et al., 2011, Ding et al., 2016). It is computed via set-function optimization, Dilworth truncations, and efficient combinatorial algorithms such as Modified Edmonds or the Modified Decomposition Algorithm (MDA), all supported by submodular function minimization primitives.
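As a naive illustration of this rate region, the sketch below computes the minimum sum-rate for a small three-user source by listing every Slepian–Wolf type constraint and solving the resulting linear program with scipy.optimize.linprog. The XOR source and all names are assumptions made for illustration, not an example taken from the cited papers.

```python
import itertools
import math

from scipy.optimize import linprog

# Illustrative three-user source: Z1, Z2 independent fair bits, Z3 = Z1 XOR Z2.
pmf = {(z1, z2, z1 ^ z2): 0.25 for z1, z2 in itertools.product([0, 1], repeat=2)}
users = (0, 1, 2)

def entropy(subset):
    """Shannon entropy H(Z_B), in bits, of the coordinates listed in `subset`."""
    marginal = {}
    for outcome, p in pmf.items():
        key = tuple(outcome[i] for i in subset)
        marginal[key] = marginal.get(key, 0.0) + p
    return -sum(p * math.log2(p) for p in marginal.values() if p > 0)

H_all = entropy(users)

# One Slepian-Wolf type constraint per proper nonempty subset B:
#   sum_{i in B} r_i >= H(Z_B | Z_{V\B}) = H(Z_V) - H(Z_{V\B}),
# rewritten as A_ub @ r <= b_ub by negating both sides.
A_ub, b_ub = [], []
for k in range(1, len(users)):
    for B in itertools.combinations(users, k):
        complement = [i for i in users if i not in B]
        A_ub.append([-1.0 if i in B else 0.0 for i in users])
        b_ub.append(-(H_all - entropy(complement)))

# Minimize the sum-rate r_1 + r_2 + r_3 over the omniscience rate region.
res = linprog(c=[1.0] * len(users), A_ub=A_ub, b_ub=b_ub,
              bounds=[(0, None)] * len(users))
print("minimum sum-rate (bits):", round(res.fun, 3))           # 1.5 for this source
print("an optimal rate vector:", [round(r, 3) for r in res.x])  # [0.5, 0.5, 0.5]
```

This brute-force formulation enumerates exponentially many subset constraints; the Dilworth-truncation and MDA machinery referenced above (and itemized below) exists precisely to avoid that blow-up by exploiting the submodularity of the entropy function.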
Key algorithms and results include:
- Polynomial-time determination of optimal rates and partitions via iterative or greedy methods (Milosavljevic et al., 2011, Ding et al., 2016).
- Construction of explicit network codes for achieving omniscience in both uncoded and linearly correlated packet models.
- Weighted generalizations to minimize application-specific communication costs.
- Empirical benchmarks comparing sum-rates, run-time, and partition complexity across parameter regimes.
This information-theoretic machinery serves as a natural testbed for evaluating the efficiency and robustness of omniscience-attaining protocols, providing ground truth for the attainable-knowledge limit in distributed data systems (Milosavljevic et al., 2011, Ding et al., 2016).
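For very small user sets, the partition-based characterization underlying the Dilworth-truncation and MDA approaches can also be checked by brute force. The sketch below reuses the assumed XOR toy source, enumerates every partition of the user set, and evaluates the asymptotic-model sum-rate expression; it is exponential in the number of users and intended only as a cross-check of the LP sketch above.

```python
import itertools
import math

# Same illustrative XOR source as in the previous sketch.
pmf = {(z1, z2, z1 ^ z2): 0.25 for z1, z2 in itertools.product([0, 1], repeat=2)}
users = (0, 1, 2)

def entropy(subset):
    marginal = {}
    for outcome, p in pmf.items():
        key = tuple(outcome[i] for i in subset)
        marginal[key] = marginal.get(key, 0.0) + p
    return -sum(p * math.log2(p) for p in marginal.values() if p > 0)

def set_partitions(items):
    """All set partitions of `items` (exponential; only viable for tiny user sets)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in set_partitions(rest):
        for i, block in enumerate(smaller):
            yield smaller[:i] + [[first] + block] + smaller[i + 1:]
        yield [[first]] + smaller

H_all = entropy(users)

# Asymptotic-model minimum sum-rate via the partition characterization:
#   R_CO = max over partitions P of V with |P| >= 2 of
#          sum_{C in P} (H(Z_V) - H(Z_C)) / (|P| - 1)
best = max(
    sum(H_all - entropy(block) for block in P) / (len(P) - 1)
    for P in set_partitions(list(users)) if len(P) >= 2
)
print("minimum sum-rate (bits):", round(best, 3))  # 1.5, agreeing with the LP sketch
```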
3. Applied Benchmark: Factual Recall and Calibration in LLMs
The contemporary operationalization of AA-Omniscience is embodied in the benchmark "AA-Omniscience: Evaluating Cross-Domain Knowledge Reliability in LLMs" (Jackson et al., 17 Nov 2025). This suite measures both factual recall and knowledge calibration across 6,000 questions spanning 42 subtopics within six highly relevant domains: Business, Humanities & Social Sciences, Health, Law, Software Engineering, and Science/Engineering/Mathematics.
Benchmark construction draws exclusively from authoritative, first-party, or widely recognized sources. Question design is automated (using GPT-5 as the generation agent), with subsequent filtering to ensure difficulty, clarity, unambiguous answerability, and up-to-date factuality. This produces a challenging, scalable, and domain-extensible testbed.
Central to evaluation is the "Omniscience Index" (OI), which rewards correct answers, penalizes incorrect answers with equal magnitude, grants fractional credit to partially correct answers, and treats abstentions as neutral. Writing $n_c$, $n_p$, $n_i$, and $n_a$ for the numbers of correct, partially correct, incorrect, and abstained responses, the score takes the form
$$\mathrm{OI} \;=\; 100 \times \frac{n_c + w\,n_p - n_i}{n_c + n_p + n_i + n_a},$$
with a partial-credit weight $w \in [0,1)$. This penalty/reward structure sharply separates high-accuracy but hallucination-prone models from more conservative systems, rewarding both factuality and calibrated uncertainty (Jackson et al., 17 Nov 2025).
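A minimal scoring sketch under the form given above follows; the default partial-credit weight of 0.5 and the grade labels are assumptions made for illustration rather than published constants of the benchmark.

```python
from collections import Counter

VALID_GRADES = {"correct", "partial", "incorrect", "abstain"}

def omniscience_index(grades, partial_credit=0.5):
    """Reward/penalty score of the form given above: correct +1, incorrect -1,
    abstention 0, partial answers a fractional credit (the 0.5 default is an
    assumed placeholder, not a published constant), scaled to [-100, 100]."""
    counts = Counter(grades)
    unknown = set(counts) - VALID_GRADES
    if unknown:
        raise ValueError(f"unexpected grades: {unknown}")
    total = sum(counts.values())
    score = counts["correct"] + partial_credit * counts["partial"] - counts["incorrect"]
    return 100.0 * score / total

# A model with higher raw accuracy can still score lower if it hallucinates
# instead of abstaining when unsure.
aggressive = ["correct"] * 55 + ["incorrect"] * 45                   # 55% accuracy
cautious = ["correct"] * 40 + ["abstain"] * 55 + ["incorrect"] * 5   # 40% accuracy
print(omniscience_index(aggressive))  # 10.0
print(omniscience_index(cautious))    # 35.0
```

The toy comparison illustrates why the OI rewards calibration: the second model answers fewer questions correctly but abstains rather than hallucinating, and ends up with the higher score.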
4. Benchmark Methodologies and Metrics
AA-Omniscience evaluates system performance using several key metrics and design patterns across both logic-based and applied frameworks:
- Granular Evaluations: Metrics are computed overall, per domain, and per topic, exposing domain-specialized strengths and weaknesses (Jackson et al., 17 Nov 2025).
- Evidence-Extraction and Omniscience Problems: In logic-based settings, tasks are organized into finite evidence-extraction ($\mathsf{A}$-type) and global omniscience or introspection ($\mathsf{O}$-type) benchmarks (Naumov et al., 2017).
- Empirical Metrics:
- Minimum sum-rate (information theory) or OI (LLMs).
- Accuracy, hallucination rates, and abstention frequency.
- Algorithm run-time, submodular function minimization (SFM) call counts, and partition complexity (Ding et al., 2016).
- Test Case Archetypes: Includes standard constructs such as the Hilbert Hotel, where "vacancy" is attainable via $\mathsf{A}$ from finite evidence (a single empty room suffices) while "fullness" is knowable only via $\mathsf{O}$, to separate and diagnose attainable versus omniscient reasoning (Naumov et al., 2017).
- Algorithmic Reproducibility: Open specification of input distributions, random seeds, SFM libraries, and platform characteristics, supporting analytic reproducibility (Ding et al., 2016).
These methodological practices ensure AA-Omniscience can serve as both a specification benchmark and a tool for detecting calibration and reasoning limitations in existing and future systems.
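The granular, per-domain breakdown listed above can be sketched as a small aggregation routine; the record format, domain labels, and the convention of computing hallucination rate over attempted (non-abstained) answers are assumptions for illustration.

```python
from collections import defaultdict

# Each record pairs a domain with a graded outcome; all values are illustrative.
results = [
    ("Law", "correct"), ("Law", "incorrect"), ("Law", "abstain"),
    ("Health", "correct"), ("Health", "correct"), ("Health", "incorrect"),
]

def per_domain_metrics(records):
    """Per-domain accuracy, hallucination rate (here: share of attempted,
    i.e. non-abstained, answers that are incorrect; an assumed convention),
    and abstention frequency."""
    by_domain = defaultdict(lambda: defaultdict(int))
    for domain, grade in records:
        by_domain[domain][grade] += 1
    report = {}
    for domain, counts in by_domain.items():
        total = sum(counts.values())
        attempted = total - counts["abstain"]
        report[domain] = {
            "accuracy": counts["correct"] / total,
            "hallucination_rate": counts["incorrect"] / attempted if attempted else 0.0,
            "abstention_rate": counts["abstain"] / total,
        }
    return report

for domain, metrics in per_domain_metrics(results).items():
    print(domain, {name: round(value, 2) for name, value in metrics.items()})
```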
5. Experimental Results and Comparative Analysis
Experimental findings from the AA-Omniscience LLM benchmark reveal persistent factuality and calibration deficits among frontier models (Jackson et al., 17 Nov 2025):
- Out of 36 evaluated models, only three (Claude 4.1 Opus, GPT-5.1, Grok 4) achieve an OI above zero, with highest observed values near 4.8 (Claude 4.1 Opus).
- Models with high raw accuracy often have large hallucination rates, which, due to the symmetric penalty structure of the OI, result in negative or near-zero scores.
- Performance is highly domain-dependent; no single model dominates all domains, and smaller models sometimes outperform larger ones due to lower hallucination rates.
- The Artificial Analysis Intelligence Index (a measure of general capability) correlates only weakly with OI, highlighting that general task performance is not a reliable proxy for knowledge reliability.
- Empirical results also support the argument that scale alone does not guarantee omniscience-calibrated performance.
For information-theoretic omniscience, polynomial-time algorithms match information-theoretic lower bounds up to block-length constraints, and explicit constructions can be achieved deterministically for all standard side-information models (Milosavljevic et al., 2011, Ding et al., 2016). Empirical benchmarks confirm tightness of theoretical cuts and practical tractability across a variety of distributed knowledge scenarios.
6. Limitations and Directions for Extension
Identified limitations of current AA-Omniscience formulations (Jackson et al., 17 Nov 2025):
- The question set in the LLM benchmark is English-centric and predominantly based on US/UK sources, necessitating expansion for global and multilingual coverage.
- The reliance on a single automated question-generation model (GPT-5) may introduce systemic bias, suggesting future use of ensembles for diversification.
- In the current OI, all incorrect answers are penalized uniformly; alternate penalty structures (e.g., softer partial penalties) are under consideration to better capture nuanced reliability differences.
- Formal logic-based benchmarks can be further expanded to encompass richer classes of introspective or adversarial evidence scenarios (Naumov et al., 2017).
Information-theoretic omniscience frameworks could be generalized further to richer network-coding models, secrecy against eavesdroppers, or adaptive communication settings (Milosavljevic et al., 2011).
7. Significance and Impact
AA-Omniscience integrates logical, information-theoretic, and statistical frameworks to deliver a comprehensive, interpretable assessment of what it means for a system to "know" and to reliably recognize its own ignorance or uncertainty. By providing a modular and extensible set of benchmarks—spanning modal logic task types, practical knowledge-centric evaluation, and foundational data-exchange protocols—AA-Omniscience underpins both principled reasoning system design and end-to-end measurement of deployed model reliability across economically and scientifically critical domains (Naumov et al., 2017, Jackson et al., 17 Nov 2025, Milosavljevic et al., 2011, Ding et al., 2016).
A plausible implication is that future omniscience benchmarks will continue to interleave logic, communication efficiency, and knowledge calibration, reflecting a multi-perspective synthesis essential for robust, trustworthy AI and distributed systems.