- The paper presents a formal solution to the grain of truth problem by constructing reflective-oracle computable strategies that include all computable and Bayes-optimal policies.
- It leverages reflective oracles to overcome recursive reasoning challenges, ensuring convergence to ε-Nash equilibria in both known and unknown computable games.
- The construction is computationally grounded: the strategy class is effectively enumerable and all key objects are limit-computable, paving the way for self-predictive agents and advanced multi-agent learning frameworks.
This paper addresses the grain of truth problem in Bayesian multi-agent learning within arbitrary computable extensive-form games. The grain of truth problem, originally posed by Kalai and Lehrer, asks whether it is possible to construct a sufficiently rich class of strategies such that every Bayes-optimal policy (with respect to this class) is itself contained within the class, allowing for mutually consistent beliefs among Bayesian agents. Previous work established only limited classes with this property, and several impossibility results suggested that the problem is intractable for general strategy classes.
The authors present a formal solution by constructing a class of reflective-oracle computable strategies (Prefl) that includes all computable strategies and Bayes-optimal strategies for any reasonable prior. This construction leverages reflective oracles to resolve the infinite regress of agents reasoning about each other's reasoning, enabling consistent Bayesian inference in multi-agent settings.
Reflective Oracles and Computability Foundations
Reflective oracles are central to the solution. They allow probabilistic Turing machines (pTMs) to query about their own behavior, circumventing diagonalization barriers and enabling self-referential reasoning. The paper extends the definition of reflective oracles to non-binary alphabets and introduces typed oracles to handle distinct action and percept spaces.
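As a toy illustration (not the paper's construction), a reflective oracle can be thought of as answering threshold queries about a probabilistic machine's output distribution, with license to randomize exactly at the threshold; this slack is what lets a consistent oracle exist despite diagonalization. Here the machine is represented directly by its output probability, a simplification the real construction cannot make:

```python
import random

def oracle(prob_of_one: float, p: float) -> int:
    """Answer the query "does the machine output 1 with probability > p?".

    Returns 1 if the probability exceeds p, 0 if it is below p, and an
    arbitrary (randomized) answer exactly at the boundary: the freedom
    that makes a consistent reflective oracle possible.
    """
    if prob_of_one > p:
        return 1
    if prob_of_one < p:
        return 0
    return random.randint(0, 1)
```

In the actual definition, queries name a probabilistic Turing machine that may itself call the oracle, and the existence proof shows that some assignment of answers is consistent with all machines' induced measures.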
Key computability notions are formalized:
- Limit-computable functions: Limits of computable sequences of approximations; the approximation becomes arbitrarily precise, but no computable bound on the error is available at any finite stage.
- Lower semicomputable (l.s.c.) functions: Approximable from below.
- Estimable functions: Approximable to arbitrary pre-specified precision.
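A small numerical sketch (illustrative, not from the paper) of the difference between the first and last notions, using the series for e: a limit-computable value is the limit of a computable sequence with no known error bound, while an estimable value comes with a stopping rule for any requested precision.

```python
def limit_approx(n: int) -> float:
    """n-th term of a computable sequence converging to e (partial sums of 1/k!).

    Limit-computable flavor: the sequence converges, but nothing here tells
    the caller how large n must be for a desired accuracy.
    """
    total, fact = 0.0, 1.0
    for k in range(n + 1):
        total += 1.0 / fact
        fact *= k + 1
    return total

def estimate(eps: float) -> float:
    """Estimable flavor: stop once a known tail bound (2/fact) drops below eps."""
    total, fact, k = 0.0, 1.0, 0
    while 2.0 / fact >= eps:
        total += 1.0 / fact
        fact *= k + 1
        k += 1
    return total
```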
The authors prove that for any pTM, the induced semimeasure is l.s.c., and conversely, any l.s.c. semimeasure can be sampled by a pTM. With reflective oracle access, these results generalize to O-sampled and O-estimable semimeasures, forming the basis for the strategy class Prefl.
Multi-Agent Game Model and Strategy Class Construction
The paper formalizes multi-agent games as functions mapping histories and action profiles to distributions over percepts. Each agent's strategy is a mapping from its local history to a distribution over actions. The subjective environment for each agent is defined by marginalizing over the other agents' actions and percepts.
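A minimal sketch of this interface (the names and the matching-pennies payoff are illustrative; the paper's games are fully general):

```python
import random

def matching_pennies(histories, actions):
    """Game step: map the joint history and action profile to one percept per
    agent (deterministic here; in general a distribution). Agent 0's reward
    is 1 on a match, agent 1's on a mismatch."""
    match = actions[0] == actions[1]
    return (int(match), int(not match))

def uniform_strategy(local_history):
    """A stochastic strategy: a distribution over actions given local history."""
    return [0.5, 0.5]

def play(game, strategies, steps, seed=0):
    rng = random.Random(seed)
    histories = [[] for _ in strategies]  # each agent sees only its own history
    for _ in range(steps):
        actions = tuple(rng.choices([0, 1], weights=s(h))[0]
                        for s, h in zip(strategies, histories))
        percepts = game(histories, actions)
        for h, a, e in zip(histories, actions, percepts):
            h.append((a, e))  # an agent observes only its own action and percept
    return histories
```

Marginalizing over the other agents' hidden actions and percepts is what turns this joint process into each agent's subjective environment.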
The strategy class Prefl consists of all reflective-oracle computable strategies. The authors show that Prefl is effectively enumerable and contains a dominant mixture policy ζ that multiplicatively dominates all other strategies in the class. This dominance property is crucial for establishing the grain of truth property.
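The dominance property can be sketched in a few lines (a finite truncation with made-up policies; the paper's ζ mixes over an enumeration of all of Prefl): with prior weights w_i > 0, the mixture zeta(a | h) = Σ_i w_i π_i(a | h) satisfies zeta(a | h) ≥ w_i π_i(a | h) for every i, so each strategy retains a "grain" of size at least w_i.

```python
def mixture(policies, weights, history, action):
    """zeta(action | history) as a weighted sum over the class."""
    return sum(w * p(history)[action] for p, w in zip(policies, weights))

def always_zero(history):
    return [1.0, 0.0]

def uniform(history):
    return [0.5, 0.5]

policies, weights = [always_zero, uniform], [0.5, 0.5]
# Multiplicative dominance: the mixture never assigns less than w_i times
# what policy i assigns, at every history and action.
for action in (0, 1):
    z = mixture(policies, weights, [], action)
    assert all(z >= w * p([])[action] for p, w in zip(policies, weights))
```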
Existence of Reflective-Oracle Computable Nash Equilibria
The authors construct Nash equilibria in the class of reflective-oracle computable strategies. For any computable multi-agent game, they show that mutually optimal response strategies exist and are reflective-oracle computable. The construction uses Kleene's second recursion theorem to resolve the circular dependencies among agents' strategies.
The value function for each agent is defined as the expected sum of discounted rewards, and optimal strategies are constructed via reflective-oracle guided maximization. The Nash equilibrium obtained is subgame perfect, as agents act optimally even on histories that are reached with probability zero.
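In symbols (notation hedged; the paper works with a general summable discount sequence γ_t and normalizer Γ_t), agent i's value of the strategy profile σ after history h is

```latex
V_i^{\sigma}(h) \;=\; \frac{1}{\Gamma_{|h|}}\,
  \mathbb{E}^{\sigma}\!\left[\,\sum_{t=|h|}^{\infty} \gamma_t\, r_{i,t} \;\middle|\; h \right],
\qquad \Gamma_t := \sum_{k=t}^{\infty} \gamma_k,
```

and a best-response strategy selects, at each history, an action maximizing this value against the agent's beliefs about the other players.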
Convergence of Bayesian Agents and Grain of Truth Property
The paper proves that Bayesian agents with priors supported on Prefl converge to ε-Nash equilibria in infinitely repeated computable games. The construction of the dominant mixture policy ζ ensures that every Bayes-optimal strategy is assigned nonzero probability, satisfying the grain of truth property.
For unknown games, the authors extend the analysis to Thompson sampling strategies, showing that agents using limit-computable Thompson sampling policies converge to ε-Nash equilibria in arbitrary unknown computable multi-agent environments. The strong grain of truth property is established for the class Prefl and the corresponding environment class Mrefl.
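A toy sketch of the Thompson sampling step (a two-armed bandit stand-in for the general environment class Mrefl; all names are illustrative): the agent samples one environment hypothesis from its posterior, acts optimally for it, and updates the posterior on the observed percept.

```python
import random

def thompson_act(posterior, envs, rng):
    """Sample an environment hypothesis from the posterior, then act optimally
    for it. envs[i][a] is the reward probability of arm a under hypothesis i."""
    i = rng.choices(range(len(envs)), weights=posterior)[0]
    return max(range(len(envs[i])), key=lambda a: envs[i][a])

def bayes_update(posterior, envs, arm, reward):
    """Posterior update on a single observed reward bit."""
    new = [w * (env[arm] if reward else 1.0 - env[arm])
           for w, env in zip(posterior, envs)]
    total = sum(new)
    return [w / total for w in new]
```

The paper's version, roughly, resamples on a schedule of growing effective horizons and follows the sampled environment's optimal policy over each segment; that schedule is what drives convergence in mean to ε-Nash equilibria.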
Impossibility Results and Avoidance
The paper discusses classical impossibility results (Nachbar, Foster & Young) and shows that the constructed class Prefl avoids them by failing the purity condition: Prefl need not contain, for a given stochastic policy, a deterministic policy that always selects actions to which the stochastic policy assigns positive probability. The countability and computability constraints on Prefl prevent the pathologies that lead to impossibility in uncountable or unrestricted strategy classes.
Asymptotic Optimality in Unknown Games
The authors generalize the convergence results to settings where agents are not initially aware of the game or the existence of other agents. By considering the class of reflective-oracle computable environments (Mrefl), they show that asymptotically optimal policies (in mean) converge to ε-Nash equilibria. Thompson sampling is shown to be reflective-oracle computable under estimable priors, and the limit-computability of reflective oracles ensures practical approximability.
Application to Self-Predictive Agents
A novel application is presented: the construction of self-predictive agents (Self-AIXI) that maintain consistent beliefs about their own future policy. The machinery developed for the grain of truth problem enables the definition of a stochastic self-predictive policy within Prefl, providing a principled alternative to planning-based RL agents.
Implementation Considerations
- Enumerability: Both Prefl and Mrefl are effectively enumerable, allowing for practical implementation of Bayesian mixtures and Thompson sampling.
- Limit-computability: All key constructions (reflective oracles, mixture policies, optimal strategies) are limit-computable, enabling arbitrary precision approximation.
- Typed Oracles: The extension to non-binary and typed reflective oracles supports heterogeneous action and percept spaces, facilitating deployment in complex multi-agent systems.
- Resource Requirements: The limit-computable constructions can in principle be approximated on real hardware, though simulating Turing machines and approximating reflective oracles carries substantial computational cost.
Implications and Future Directions
The results provide a rigorous foundation for Bayesian learning in general multi-agent environments, justifying the emergence of Nash equilibria from rational learning dynamics. The framework supports both known and unknown games, and the use of reflective oracles resolves longstanding issues in recursive reasoning and self-prediction.
Future research directions include:
- Characterizing the centrality and uniqueness of reflective oracles among solutions to the grain of truth problem.
- Investigating the intersection and union of Prefl across different reflective oracles.
- Exploring the practical deployment of reflective-oracle based agents in real-world multi-agent systems, including human-computer interaction and cooperative AI architectures.
Conclusion
The paper provides a comprehensive solution to the grain of truth problem for arbitrary computable extensive-form games, constructing a limit-computable class of strategies and environments that supports consistent Bayesian learning and convergence to Nash equilibria. The use of reflective oracles enables principled recursive reasoning and self-prediction, with broad implications for the theory and practice of multi-agent reinforcement learning and game theory.