
B-PAC Reasoning: Safety & Efficiency

Updated 6 February 2026
  • B-PAC reasoning is an online decision-making framework that combines inverse propensity scoring, betting martingales, and anytime-valid PAC guarantees to ensure safety and efficiency.
  • It dynamically adjusts the use of high-cost models based on a statistical betting process that monitors uncertainty via thresholding to control risk.
  • Empirical results demonstrate that B-PAC meets stringent performance guarantees and substantially reduces expensive model invocations in non-stationary environments.

Betting Probably Approximately Correct (B-PAC) reasoning refers to a class of online decision-making frameworks that certify performance to within a user-specified tolerance $\epsilon$ with high probability $1-\alpha$, while minimizing invocations of an expensive, high-accuracy “thinking model.” The B-PAC paradigm integrates sequential importance sampling, statistical betting martingales, and anytime-valid PAC (Probably Approximately Correct) guarantees. It addresses scenarios where only partial feedback on errors is available and the system must remain robust to non-stationary query streams, yielding practical and theoretically sound online reasoning mechanisms for large reasoning models and boundedly rational agents (Yu et al., 30 Jan 2026, Oesterheld et al., 2023).

1. Problem Setting: Online Reasoning under Partial Feedback

B-PAC reasoning is motivated by the deployment of two black-box models: a high-accuracy, high-cost “thinking model” $f:\mathcal{X}\to\mathcal{Y}$, and a low-cost, lower-accuracy “non-thinking model” $f':\mathcal{X}\to\mathcal{Y}$. At each time $t$, a query $X_t$ is drawn from a possibly non-stationary stream. The output $f(X_t)$ is available only when the expensive model is invoked. The decision $f_t(X_t)\in\{f'(X_t),f(X_t)\}$ is determined online by a composite policy.

Crucially, the true response $Y_t$ is never directly observed; performance loss is measured relative to $f(X_t)$ as $\ell(f_t(X),f(X)) \in [0,1]$, where $\ell$ is a loss function. The population risk at time $t$ under a thresholding policy is defined as

$$R_t(f_t) = \mathbb{E}_X[\ell(f_t(X), f(X))].$$

The key objective is to guarantee, with high probability $\geq 1-\alpha$, that $R_t(f_t) < \epsilon$ for all $t$ while minimizing invocations of the expensive model.
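As a concrete illustration of this setting, the sketch below (all models and names are illustrative stand-ins, not from the paper) measures the empirical risk of the always-delegate policy under a 0/1 disagreement loss relative to the expensive model's output:

```python
def f(x):
    """Stand-in for the expensive, high-accuracy "thinking" model."""
    return x % 7

def f_prime(x):
    """Stand-in for the cheap model: disagrees with f on 1 in 5 queries."""
    return x % 7 if x % 5 else (x % 7) + 1

def loss(y_policy, y_expensive):
    """0/1 disagreement loss l(f_t(X), f(X)) in [0, 1]."""
    return 0.0 if y_policy == y_expensive else 1.0

# Empirical risk of always delegating to f' on a query stream; such a
# policy is only safe if f' disagrees with f rarely enough.
stream = range(1000)
risk = sum(loss(f_prime(x), f(x)) for x in stream) / len(stream)
print(risk)  # 0.2 here, so always delegating would violate eps = 0.08
```

In this toy stream the cheap model errs on 20% of queries, so a B-PAC policy would have to route the high-uncertainty queries to $f$ to keep risk under the tolerance.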

2. B-PAC Objective: Simultaneous Safety and Efficiency

The dual mandate of B-PAC reasoning is:

  • Safety: Enforce the uniform bound $R_t(f_t) \leq \epsilon$ at all times $t$ with probability $\geq 1-\alpha$ (the “anytime $(\epsilon,\alpha)$-PAC efficiency guarantee”).
  • Efficiency: Minimize the cumulative fraction of queries routed to $f$, measured as

$$\mathrm{ECP}_t = \frac{1}{t} \sum_{i=1}^t \mathbb{I}\{\text{call } f \text{ at } i\},$$

or the equivalent token-level savings.

To achieve both objectives, B-PAC dynamically adapts a threshold $\hat{u}_t$ on a scalar uncertainty score $U_t$ for $f'$. If $U_t < \hat{u}_t$, $f'$ is used; otherwise, $f$ is invoked. Raising $\hat{u}_t$ expands the usage of $f'$, but only when statistical evidence, quantified via a betting process, suffices to certify that the performance loss remains within $\epsilon$ (Yu et al., 30 Jan 2026).
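A minimal sketch of this routing rule together with the $\mathrm{ECP}_t$ metric (function and variable names are illustrative; exploration, covered in Section 4, is folded in as a fixed probability of calling $f$ even below threshold):

```python
import random

def route(u_t, u_hat, p_explore, rng):
    """One routing decision. Returns (call_expensive, t_prob), where
    t_prob is the sampling probability T_t kept for IPS reweighting."""
    if u_t >= u_hat:
        return True, 1.0                    # at/above threshold: call f
    return rng.random() < p_explore, p_explore  # below: explore w.p. p_t

rng = random.Random(0)
calls = [route(u, u_hat=0.5, p_explore=0.05, rng=rng)[0]
         for u in [0.9, 0.1, 0.7, 0.2]]
ecp = sum(calls) / len(calls)  # ECP_t: cumulative expensive-call fraction
```

Raising `u_hat` shrinks the set of queries that hit the first branch, which is exactly the efficiency lever that the betting process must certify before it is pulled.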

3. Core Methodology: IPS Risk Estimation and Betting Martingales

The system must estimate the risk of delegating to $f'$ under selective feedback. Since $\ell(f'(X_t),f(X_t))$ is observed only when $f$ is called, B-PAC employs an inverse propensity scoring (IPS) estimator to correct selection bias. For any threshold $u$, the estimator is:

$$Z_t(u) = (1 - p_{\min})\sum_{i=1}^t \frac{\mathbb{I}\{U_i < u\}\,\mathbb{I}\{\theta_i = 1\}\,\ell(f'(X_i), f(X_i))}{T_i},$$

where $T_i$ is the sampling probability (which depends on $U_i$ and previous thresholds), $\theta_i \sim \mathrm{Bernoulli}(T_i)$ indicates whether $f$ was called, and $p_{\min} = \inf_t p_t$ is the minimal exploration rate.
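The estimator can be sketched as follows (a hypothetical helper, assuming the observed loss $\ell(f'(X_i), f(X_i))$ enters the numerator whenever $f$ was called, so that dividing by $T_i$ unbiases the selectively observed losses):

```python
def ips_risk(records, u, p_min):
    """IPS estimate Z_t(u). Each record is a tuple
    (u_i, theta_i, t_i, loss_i): uncertainty score, whether f was called,
    the sampling probability T_i, and the loss (observed only if theta_i)."""
    total = 0.0
    for u_i, theta_i, t_i, loss_i in records:
        if u_i < u and theta_i:
            total += loss_i / t_i  # reweight observed losses by 1/T_i
    return (1.0 - p_min) * total

# Three rounds; the loss in round 2 was never observed (theta_i = 0).
records = [(0.2, 1, 0.5, 1.0), (0.3, 0, 0.5, None), (0.1, 1, 0.5, 0.0)]
z = ips_risk(records, u=0.4, p_min=0.05)  # (1 - 0.05) * (1.0 / 0.5)
```

Because each observed loss is divided by its own sampling probability $T_i$, rounds where $f$ happened not to be called contribute nothing yet introduce no bias in expectation.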

A “test supermartingale” is constructed as a wealth process $K_t(u)$. At each round, a nonnegative, predictable bet $A_t(u)$ is placed, and

$$K_t(u) = K_{t-1}(u)\bigl(1 + A_t(u) D_t(u)\bigr)$$

with $D_t(u) = \epsilon - Z_t(u)$. Under the null hypothesis that the true risk at threshold $u$ exceeds $\epsilon$, $K_t(u)$ forms a supermartingale, ensuring anytime-valid control via Ville’s inequality (Yu et al., 30 Jan 2026).
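The wealth recursion can be sketched per round (a simplified sketch treating the payoff as $\epsilon$ minus a per-round risk-estimate increment; variable names are illustrative):

```python
def wealth_update(k_prev, bet, d_t):
    """One betting round: K_t = K_{t-1} * (1 + A_t * D_t). The bet A_t
    must be predictable (chosen before D_t is revealed) and small enough
    to keep the wealth nonnegative."""
    return k_prev * (1.0 + bet * d_t)

eps, alpha = 0.08, 0.1
k = 1.0
for z_t in [0.02, 0.01, 0.03, 0.00]:  # per-round risk-estimate increments
    k = wealth_update(k, bet=0.5, d_t=eps - z_t)

# By Ville's inequality, wealth crossing 1/alpha certifies risk <= eps
certified = k >= 1.0 / alpha
```

Wealth grows whenever the estimated risk stays below $\epsilon$, and the $1/\alpha$ crossing level is exactly what makes the certificate anytime-valid: under the null, the chance the supermartingale ever reaches $1/\alpha$ is at most $\alpha$.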

4. Thresholding Algorithm and Theoretical Guarantees

A finite grid of candidate thresholds $U = \{u^{(1)},\ldots,u^{(N)}\}$ is maintained. At each round $t$:

  1. Compute $f'(X_t)$ and $U_t$.
  2. With probability $T_t$ (determined by comparison to $\hat{u}_{t-1}$ and the exploration parameter $p_t$), decide whether to call $f$ (i.e., $\theta_t \sim \mathrm{Bernoulli}(T_t)$).
  3. Update the IPS estimator $Z_t(u)$, betting fraction $A_t(u)$, and wealth $K_t(u)$ for each $u$.
  4. Set $\hat{u}_t$ to the largest $u^{(i)}$ for which $K_t(u^{(j)}) \geq 1/\alpha$ for all $j \leq i$; otherwise, set $\hat{u}_t = 0$.

The principal safety guarantee is: with probability $\geq 1-\alpha$, for all $t$, $R_t(\hat{u}_t) \leq \epsilon$ under i.i.d. queries or certain non-stationary regimes. The adaptive betting strategy, based on online convex optimization (using a projected fraction $A_t(u)$ incorporating a second-order Taylor surrogate), ensures logarithmic regret $O(\log T)$ relative to the best fixed threshold, enabling rapid convergence to optimal efficiency (Yu et al., 30 Jan 2026).
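Step 4 of the loop above, the prefix-certified threshold selection, can be sketched as (names illustrative):

```python
def select_threshold(grid, wealth, alpha):
    """Largest grid point whose entire prefix of wealth processes has
    crossed the 1/alpha evidence level; 0.0 if even the smallest has not."""
    u_hat = 0.0
    for u in sorted(grid):
        if wealth[u] < 1.0 / alpha:
            break          # prefix condition fails at u and beyond
        u_hat = u
    return u_hat

wealth = {0.2: 15.0, 0.4: 12.0, 0.6: 3.0}  # current K_t(u) per candidate
u_hat = select_threshold([0.2, 0.4, 0.6], wealth, alpha=0.1)  # -> 0.4
```

Requiring the whole prefix $j \leq i$ to be certified (rather than a single grid point) prevents the policy from jumping to a large threshold whose smaller neighbors have not yet accumulated evidence.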

5. Connections to Bounded Inductive Rationality and Hypothesis Betting

B-PAC style reasoning generalizes to settings explored in the theory of bounded inductive rationality (Oesterheld et al., 2023). In this broader context, boundedly rational inductive agents (BRIAs) sequentially choose among a finite set of options $\mathcal{DP}_t$ each round, using an internal “betting” process over a countable family of hypotheses $\mathcal{H}$ with computable recommendations and reward promises. BRIAs maintain a policy that never overestimates rewards and test each hypothesis infinitely often. The agent’s “wealth” for each hypothesis is updated via allowance schedules and observed payoffs. Coverage of hypotheses via statistical tests ensures that if any hypothesis promises consistently higher rewards, it is followed sufficiently often.

The general PAC-like guarantee in this setting asserts that, when a valid, high-reward hypothesis exists, the agent’s empirical mean reward converges with high probability to within $\epsilon$ of the best expected reward, provided payoffs along the hypothesis’ trajectory are sufficiently random (e.g., von Mises–Wald–Church or Schnorr randomness). In repeated games, pairs of BRIAs can implement any strictly individually rational correlated equilibrium, with empirical plays converging accordingly (Oesterheld et al., 2023).

6. Implementation and Empirical Results

B-PAC reasoning incurs negligible computational overhead compared to large-model inference: $O(N)$ per-round updates for a threshold grid of $N \approx 1000$ are typical. Exploration probabilities $p_t$ are managed by a two-stage schedule: a warm-up phase with $p_{\text{warm}} \approx 0.7$ for the initial $T_{\text{warm}} \approx 200$ rounds, then $p_{\text{deploy}} \approx 0.05$ for steady-state operation. Asynchronous or sharded implementations of the betting martingale allow for high-throughput deployment (Yu et al., 30 Jan 2026).
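The two-stage exploration schedule reported above is straightforward to implement (function name is illustrative):

```python
def exploration_prob(t, t_warm=200, p_warm=0.7, p_deploy=0.05):
    """p_t for round t: high exploration during warm-up so the IPS
    estimator sees enough expensive-model labels, then a low steady rate."""
    return p_warm if t <= t_warm else p_deploy

probs = [exploration_prob(t) for t in (1, 200, 201, 10_000)]
```

Under this schedule $p_{\min} = \inf_t p_t = 0.05$, the quantity that appears in the $(1 - p_{\min})$ correction factor of the IPS estimator.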

Empirical benchmarks on MATH, MMLU-Pro, BIG-Bench Hard (BBH), and Magpie demonstrate substantial computational savings with stringent safety. For example, on Magpie with parameters $\epsilon = 0.08$, $\alpha = 0.1$, B-PAC maintains empirical loss $\mathrm{ER} \approx 0.03 < 0.08$ throughout, invokes the expensive model on only $19\%$ of queries, and reserves $59\%$ of tokens for it, outperforming offline PAC thresholds and ablation baselines that either violate safety or are overly conservative. Under non-stationary drifts, B-PAC adaptively tightens the threshold to preserve the safety guarantee, in contrast to fixed offline PAC (Yu et al., 30 Jan 2026).

7. Comparative Analysis, Limitations, and Significance

B-PAC’s integration of IPS estimation, betting supermartingales, and dynamic thresholding provides anytime-valid, model-agnostic, online control of reasoning risks, substantially generalizing classical PAC approaches and outperforming heuristic routing or simple union-bound-based methods. Naive estimators tend to violate safety or incur prohibitive expert-call rates ($\approx 100\%$). Classical Hoeffding-style methods are conservative, achieving safety at the expense of efficiency. Heuristic policies such as Chain-of-Draft or NoThinking do not reliably control loss.

A plausible implication is that B-PAC offers a broadly applicable, robust strategy for AI reasoning architectures under uncertainty and limited feedback, and connects deeply to theories of inductive rationality in both single-agent and multi-agent settings. The folk theorem for BRIAs reinforces that B-PAC-type “betting” approaches are not artifacts of black-box model selection, but arise naturally in settings where optimality and bounded rationality must be simultaneously addressed (Oesterheld et al., 2023).
