Normalization of reflective-oracle–completed weights for Thompson sampling

Determine whether directly completing lower-semicomputable posterior weights w(ρ | h_{<t}) via a reflective oracle produces normalized weights, that is, weights that sum to one. Here each weight generator is treated as an oracle probabilistic Turing machine (pTM) that outputs 1 with probability w(ρ | h_{<t}) and otherwise fails to halt. A positive answer would allow the stepwise Thompson sampling policy π_T over the reflective-oracle–computable environment class M_refl to be defined even when the weights are only lower semicomputable and potentially defective.

Background

Thompson sampling in this paper requires a posterior over environments (or games and opponent strategies) with normalized weights that sum to one. When the posterior weights are merely lower semicomputable and defective (i.e., their sum is less than one), the algorithm may fail to sample an environment, making the policy under-specified. The authors consider using reflective oracles to complete such weights, analogous to how reflective oracles complete semimeasures to measures for policies and environments.
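
The sampling failure described above can be illustrated with a minimal sketch. It assumes a finite list of fixed numeric weights and plain inverse-CDF sampling; the function name `sample_environment` and the example weights are hypothetical and not taken from the paper, which works with lower-semicomputable weights over a countable class.

```python
import random


def sample_environment(weights, rng):
    """Inverse-CDF sampling over a list of weights.

    Returns the index of the sampled environment, or None when the
    uniform draw lands in the mass that the weights fail to cover.
    """
    u = rng.random()  # uniform draw in [0, 1)
    cumulative = 0.0
    for index, w in enumerate(weights):
        cumulative += w
        if u < cumulative:
            return index
    return None  # draw fell into the missing mass: no environment is sampled


# Hypothetical defective weights: they sum to 0.8, so 0.2 of the mass is missing.
defective = [0.5, 0.2, 0.1]
rng = random.Random(0)
draws = [sample_environment(defective, rng) for _ in range(10_000)]
failure_rate = draws.count(None) / len(draws)

# Normalized weights never fail in this sketch.
normalized = [0.5, 0.5]
normalized_draws = [sample_environment(normalized, rng) for _ in range(10_000)]
```

In this toy setting `failure_rate` should be close to the missing mass, while the normalized weights always return an index. This is the sense in which the policy is under-specified when the weights are defective.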

However, while completing the mixture environment ξ is straightforward, Thompson sampling needs access to explicit normalized coefficients. The proposal to directly complete each individual weight using a reflective oracle raises a technical uncertainty: whether the completed weights still form a normalized distribution, which is essential for defining the Thompson sampling strategy π_T within the reflective-oracle framework.
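
To make the concern concrete, the following toy sketch contrasts a joint correction with an independent one. It is not the reflective-oracle construction: it assumes, purely for illustration, that an independent completion may move each weight anywhere between its original value and 1 without looking at the other weights. The function names and the `slack` parameter are hypothetical.

```python
import math


def complete_jointly(weights):
    """Rescale all weights together so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


def complete_independently(weights, slack):
    """Toy model (an assumption, not the oracle's behavior): weight i moves a
    fraction slack[i] of the way from w_i toward 1, ignoring the other weights."""
    return [w + s * (1.0 - w) for w, s in zip(weights, slack)]


defective = [0.5, 0.2, 0.1]  # hypothetical weights summing to 0.8

joint_sum = sum(complete_jointly(defective))
unchanged_sum = sum(complete_independently(defective, [0.0, 0.0, 0.0]))
partial_sum = sum(complete_independently(defective, [0.25, 0.25, 0.25]))
maximal_sum = sum(complete_independently(defective, [1.0, 1.0, 1.0]))
```

Under this assumption the joint correction sums to one by construction, while the independent one can stay below one, overshoot it, or reach the number of weights, depending on choices made separately for each weight. Whether the actual reflective-oracle completion is constrained enough to rule this out is exactly what is left open.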

References

Generalizing to l.s.c. weights, it is natural to try to use the reflective oracle to somehow complete Thompson sampling. We could try to complete \pi_T's environment mixture \xi. Unfortunately this would not explicitly complete the weights which Thompson sampling needs access to; \pi_T requires not a dominant environment but explicit coefficients. The reflective oracle could be used to directly complete each weight from an oracle pTM generating it (in the sense of outputting 1 with probability w(\rho | h_{<t}) and otherwise failing to halt) but it is unclear whether the individually completed weights would still sum to 1.

Limit-Computable Grains of Truth for Arbitrary Computable Extensive-Form (Un)Known Games (2508.16245 - Wyeth et al., 22 Aug 2025) in Section 7, Asymptotic Optimality in Unknown Games, paragraph “(Un)normalized weights w”