Restart for Posterior Sampling (RePS)

Updated 27 November 2025

RePS is a framework that augments posterior sampling with randomized restarts to efficiently explore complex, multimodal posteriors.
It is applied in reinforcement learning, diffusion-based inverse problems, and Bayesian nonparametrics, achieving error contraction and scalable computation.
By segmenting inference into pseudo-episodes and injecting noise, RePS balances exploration and estimation to deliver robust theoretical guarantees.

Restart for Posterior Sampling (RePS) refers to a class of techniques that augment posterior sampling by introducing randomized restart or resampling steps, enabling improved exploration, efficiency, and robustness in posterior inference across domains including reinforcement learning, Bayesian nonparametrics, and diffusion-based inverse problems. The central idea is to inject randomness into either the model-selection or optimization process, thereby partitioning computation into pseudo-episodes, contracting accumulated approximation errors, or traversing multiple posterior modes.

1. Key Principles of Restart-Based Posterior Sampling

RePS leverages randomization at key steps of the posterior sampling process to enhance the exploration of model space or optimization landscape. This is achieved by periodically resampling models, restarting optimization procedures, or injecting noise during sampling. The rationale is multifold:

Efficient exploration: Restarting or resampling prevents a single posterior sample or local optimizer from being stuck in a subspace, improving coverage in multimodal or complex posteriors.
Error contraction: In diffusion and gradient-based samplers, large injected noise at restart contracts errors that accumulate during deterministic integration.
Scalability: Many RePS variants permit embarrassingly parallel computation because restarts or pseudo-episodes are independent.

Table 1 summarizes the major RePS paradigms and their core mechanisms.

Application Domain	Restart Mechanism	Main Benefit
Reinforcement Learning	Bernoulli-coin model resampling	Bayesian regret reduction
Diffusion-Based Inverse Problems	ODE noise injection at checkpoints	Error contraction, memory efficiency
Bayesian Nonparametrics	Random restarts in optimization	Multimodal posterior exploration

2. RePS in Continuing Reinforcement Learning Environments

In the context of finite Markov Decision Processes (MDPs) with continuing agent–environment interfaces (i.e., no explicit episode boundaries), RePS is formalized as an extension of posterior sampling for reinforcement learning (PSRL), called continuing PSRL (CPSRL) (Xu et al., 2022). At its core, CPSRL maintains a Bayesian posterior over unknown MDP dynamics and follows the optimal $\gamma$-discounted policy derived from a sampled model. With probability $1-\gamma$ at each time $t$, a new model is sampled from the posterior, and the agent’s policy is recomputed accordingly; otherwise, the current policy is continued.
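The resampling loop described above can be sketched for a small tabular MDP. This is a minimal illustration and not the implementation of Xu et al.: the function names (`solve_discounted`, `cpsrl`), the Dirichlet(1, ..., 1) prior over transitions, the known reward table `R`, and the use of plain value iteration are all assumptions made for the example.

```python
import numpy as np

def solve_discounted(P, R, gamma, n_iter=200):
    """Value iteration for transitions P[s, a, s'] and rewards R[s, a].
    Returns the greedy policy of the gamma-discounted problem."""
    V = np.zeros(P.shape[0])
    for _ in range(n_iter):
        Q = R + gamma * (P @ V)      # (S, A, S) @ (S,) -> (S, A)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

def cpsrl(P_true, R, gamma, T, rng):
    """Hypothetical CPSRL loop: resample the model with probability 1 - gamma."""
    S, A, _ = P_true.shape
    counts = np.ones((S, A, S))      # assumed Dirichlet(1, ..., 1) prior

    def sample_policy():
        # Draw one transition model from the Dirichlet posterior and plan in it.
        P_hat = np.array([[rng.dirichlet(counts[s, a]) for a in range(A)]
                          for s in range(S)])
        return solve_discounted(P_hat, R, gamma)

    policy = sample_policy()
    s, total_reward, n_pseudo_episodes = 0, 0.0, 1
    for t in range(T):
        # Bernoulli coin flip: start a new pseudo-episode with probability 1 - gamma.
        if t > 0 and rng.random() < 1.0 - gamma:
            policy = sample_policy()
            n_pseudo_episodes += 1
        a = policy[s]
        s_next = rng.choice(S, p=P_true[s, a])
        counts[s, a, s_next] += 1    # conjugate posterior update
        total_reward += R[s, a]
        s = s_next
    return total_reward, n_pseudo_episodes

rng = np.random.default_rng(0)
P_true = rng.dirichlet(np.ones(3), size=(3, 2))   # random 3-state, 2-action MDP
R = rng.random((3, 2))                            # rewards in [0, 1), assumed known
total, n_ep = cpsrl(P_true, R, gamma=0.9, T=500, rng=rng)
print(total, n_ep)
```

Note that no episode-termination rule or per-state counter drives the resampling; the only trigger is the coin flip.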

Each resampling event marks the beginning of a “pseudo-episode” whose length is geometrically distributed with mean $1/(1-\gamma)$. The expected number of pseudo-episodes up to step $T$ is $(1-\gamma)T + 1$. The parameter $\gamma$, which serves as both the planning horizon and the restart-rate control, is set optimally as $1-\gamma = \sqrt{SA/T}$, where $S$ is the number of states, $A$ the number of actions, and $T$ the time horizon. This choice achieves a tradeoff between estimation error (favoring fewer restarts) and planning bias (favoring more frequent resampling).
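As a quick numerical check of these quantities, the sketch below computes the restart rate $1-\gamma = \sqrt{SA/T}$ and compares simulated pseudo-episode statistics with the stated means. The helper names and the example sizes ($S=10$, $A=4$, $T=10^5$) are illustrative assumptions.

```python
import numpy as np

def restart_rate(S, A, T):
    """Restart probability 1 - gamma = sqrt(S * A / T)."""
    return np.sqrt(S * A / T)

def count_pseudo_episodes(T, gamma, rng):
    """One pseudo-episode starts at time 0; each of the T steps then
    starts a new one with probability 1 - gamma."""
    return 1 + int(rng.binomial(T, 1.0 - gamma))

rng = np.random.default_rng(0)
S, A, T = 10, 4, 100_000
p = restart_rate(S, A, T)            # 0.02, so gamma = 0.98
gamma = 1.0 - p

# Pseudo-episode lengths are geometric with mean 1 / (1 - gamma) = 50.
lengths = rng.geometric(p, size=200_000)
print("mean length:", lengths.mean(), "theory:", 1.0 / p)

# Expected number of pseudo-episodes is (1 - gamma) * T + 1 = 2001.
counts = [count_pseudo_episodes(T, gamma, rng) for _ in range(2000)]
print("mean count:", np.mean(counts), "theory:", p * T + 1)
```

With this choice of $\gamma$ the expected number of pseudo-episodes equals $\sqrt{SAT} + 1$, so it grows sublinearly in $T$.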

The Bayesian regret of the CPSRL algorithm, measured as the expected shortfall of the agent’s cumulative reward relative to the optimal average reward accumulated over $T$ steps, is bounded as

$\mathbb E\left[\mathrm{Regret}(T)\right] = \tilde O(\tau\,S\,\sqrt{A\,T})$

where $\tau$ denotes the reward-averaging time—the minimal uniform bound on the deviation of cumulative rewards from the average (Xu et al., 2022).

Intuitively, random resampling segments the infinite-horizon control problem into manageable pseudo-episodes and injects diversity into the exploration process. Posterior sampling lemmas, value decompositions via Bellman errors, and confidence set constructions underpin the proof of the regret bound. This framework eliminates the need for complex or policy-dependent episode boundaries and, because the reset mechanism is a single coin flip, scales naturally to deep RL methods such as Bootstrapped DQN.

3. RePS in Diffusion-Based Inverse Problems

In inverse problems solved via diffusion models, RePS instantiates as a restart-based posterior sampling scheme that alternates between deterministic reverse ODE trajectories and periodic large-noise injections (Ahmed et al., 24 Nov 2025). This is motivated by the observation that pure ODE-based samplers accumulate modeling error from approximate score functions, while SDE-based samplers have higher discretization error but regularize via injected noise.

In this setting, the forward process is a variance-exploding SDE,

$d x_t = \sigma'(t)\, d w_t$

with reverse-time sampling achieved via a measurement-conditioned ODE,

$d x_t = -\sigma'(t)\,\frac{x_t - \mu_t(x_t, y)}{\sigma(t)}\,dt$

where $\mu_t(x_t, y)$ is the measurement-conditioned posterior mean, approximated by a MAP solve that balances fidelity to measurement $y$ and proximity to the unconditional denoising estimate.

RePS introduces a sequence of restart noise levels $\{\sigma_r\}_{r=0}^{R}$ interpolating between $\sigma_{\max}$ and $\sigma_{\min}$. At each block, the ODE is solved via Euler steps; upon reaching $\sigma_{r+1}$, noise is injected:

$x \leftarrow x_{\text{end}} + \sqrt{\sigma_r^2 - \sigma_{r+1}^2}\,\xi,\quad \xi\sim\mathcal N(0,I)$

Contracting errors in this manner improves both convergence and reconstruction quality relative to vanilla ODE/SDE sampling, while also enabling substantial savings in memory and computation by avoiding back-propagation through the score network.
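The injection step alone can be illustrated numerically. Because independent Gaussian variances add, a sample carrying noise of standard deviation $\sigma_{r+1}$ carries noise of standard deviation $\sigma_r$ after the update. The sketch below checks this on a zero signal; the function names, the geometric spacing of the restart levels, and the values of $\sigma_{\max}$ and $\sigma_{\min}$ are assumptions for illustration, not the schedule used by Ahmed et al.

```python
import numpy as np

def restart_levels(sigma_max, sigma_min, R):
    """Assumed schedule: R + 1 geometrically spaced levels, decreasing."""
    return np.geomspace(sigma_max, sigma_min, R + 1)

def inject_restart_noise(x_end, sigma_r, sigma_next, rng):
    """x <- x_end + sqrt(sigma_r^2 - sigma_next^2) * xi, with xi ~ N(0, I)."""
    if sigma_r < sigma_next:
        raise ValueError("sigma_r must be at least sigma_next")
    scale = np.sqrt(sigma_r**2 - sigma_next**2)
    return x_end + scale * rng.standard_normal(x_end.shape)

rng = np.random.default_rng(0)
sigmas = restart_levels(sigma_max=80.0, sigma_min=0.01, R=5)
sigma_r, sigma_next = sigmas[2], sigmas[3]

# Toy check on a zero signal: x_end has residual noise of std sigma_next.
x_end = sigma_next * rng.standard_normal(200_000)
x_new = inject_restart_noise(x_end, sigma_r, sigma_next, rng)
print("std after injection:", x_new.std(), "target sigma_r:", sigma_r)
```

The check covers only the variance bookkeeping of the injection; it does not model the ODE blocks or the MAP solve for $\mu_t(x_t, y)$.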

Unlike prior methods restricted to linear measurement operators or reliant on backpropagation, RePS accommodates arbitrary differentiable measurement models.

4. RePS in Nonparametric Bayesian Posterior Sampling

In Bayesian nonparametric inference, RePS denotes a “posterior bootstrap + random restart” scheme for exact sampling from nonparametric posteriors defined via Dirichlet process priors (Fong et al., 2019). The method proceeds as follows:

Posterior bootstrap: Sample a discrete data-generating distribution $F^*$ from the Dirichlet process posterior $DP(\alpha + n, G_n)$, where $G_n$ is a convex mixture of the prior and the empirical distribution.
Perturbed objective minimization: For loss function $\ell(y, \theta)$ (typically negative log-likelihood), a single draw $\theta^*$ is obtained by solving

$\theta^* = \arg\min_\theta J^*(\theta),\quad J^*(\theta) = \sum_{i=1}^{n} w_i\,\ell(y_i, \theta) + \sum_{k=1}^{T}\,\hat{w}_k\,\ell(\hat{y}_k, \theta)$

where the weights $(w_{1:n}, \hat{w}_{1:T})$ are drawn jointly from a Dirichlet distribution and the pseudo-samples $\hat{y}_{1:T}$ are drawn from the prior.

Random restarts: To address nonconvexity in $J^*(\theta)$ and multimodality in the posterior, RePS performs $R$ restarts per draw, initializing local optimizers with $\theta_{\text{init}}^{(r)}$ from a broad distribution $\pi_0$, and selects the minimizer with the lowest objective value.

Each bootstrap–restart chain yields an independent sample from the implied nonparametric posterior, which is exact whenever the restarts locate the global minimizer of $J^*$. As $R\to\infty$, the probability of hitting all modes approaches 1 (assuming finitely many basins). The approach is highly parallelizable.
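The three steps above can be sketched on a toy problem: a two-component Gaussian mixture with equal weights and unit variances, whose weighted negative log-likelihood is nonconvex in the two means. Everything specific here is an assumption made for illustration: the Dirichlet concentration $(1,\dots,1,\alpha/T,\dots,\alpha/T)$, the $N(0, 5^2)$ prior for pseudo-samples, the $N(0, 3^2)$ initializer standing in for $\pi_0$, and weighted EM as the local optimizer.

```python
import numpy as np

def objective(theta, y, w):
    """J*(theta): weighted negative log-likelihood (additive constants dropped)."""
    log_comp = -0.5 * (y[:, None] - theta[None, :]) ** 2
    return -np.sum(w * np.logaddexp(log_comp[:, 0], log_comp[:, 1]))

def local_optimize(theta, y, w, n_iter=50):
    """Weighted EM updates for the two means; converges to a local minimum of J*."""
    for _ in range(n_iter):
        log_comp = -0.5 * (y[:, None] - theta[None, :]) ** 2
        log_norm = np.logaddexp(log_comp[:, 0], log_comp[:, 1])
        resp = np.exp(log_comp - log_norm[:, None])   # responsibilities
        wr = w[:, None] * resp
        theta = (wr * y[:, None]).sum(axis=0) / np.maximum(wr.sum(axis=0), 1e-12)
    return theta

def posterior_bootstrap_draw(y, rng, alpha=1.0, n_pseudo=10, n_restarts=5):
    """One draw theta*: Dirichlet weights, pseudo-samples, best of R restarts."""
    n = len(y)
    y_pseudo = rng.normal(0.0, 5.0, size=n_pseudo)    # assumed prior
    conc = np.concatenate([np.ones(n), np.full(n_pseudo, alpha / n_pseudo)])
    w = rng.dirichlet(conc)                           # weights for data + pseudo-data
    y_all = np.concatenate([y, y_pseudo])
    best_theta, best_val = None, np.inf
    for _ in range(n_restarts):
        theta0 = rng.normal(0.0, 3.0, size=2)         # assumed broad initializer
        theta = local_optimize(theta0, y_all, w)
        val = objective(theta, y_all, w)
        if val < best_val:                            # keep the lowest objective
            best_theta, best_val = theta, val
    return np.sort(best_theta)                        # sort to fix label switching

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-3, 1, 50), rng.normal(3, 1, 50)])
draws = np.array([posterior_bootstrap_draw(y, rng) for _ in range(20)])
print(draws.mean(axis=0))
```

Each call to `posterior_bootstrap_draw` uses only its own random weights and restarts, so the draws can be distributed across workers without communication.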

Empirical evaluations show that random restarts are critical for fully exploring multimodal posteriors, with performance closely matching MCMC (e.g., NUTS) in test log predictive density and often outperforming in computational efficiency and posterior sparsity (Fong et al., 2019).

5. Core Quantities and Theoretical Guarantees

Several theoretical constructs underpin RePS methodologies:

Reward averaging time $\tau$ (reinforcement learning): Uniform mixing/transient bound on cumulative reward under any policy; crucial for regret analysis (Xu et al., 2022).
Posterior bootstrap exactness: For Dirichlet process priors, the law of $\theta^* = \arg\min J^*(\theta)$ is exact for the nonparametric posterior, with asymptotic consistency as $n\to\infty$.
Restart pseudo-episode analysis: Geometric pseudo-episode lengths allow concentration arguments analogous to classical episode-based analysis.
Error contraction in diffusion: Large injected Gaussian noise at restart checkpoints contracts modeling error, with ODE blocks in-between minimizing discretization error (Ahmed et al., 24 Nov 2025).

The regret bound for CPSRL is

$\mathbb E\left[\mathrm{Regret}(T)\right] = \tilde O(\tau\,S\,\sqrt{A\,T})$

provided $\gamma$ is set optimally relative to $T, S, A$. In nonparametric settings, the posterior sampling is exact modulo optimization: random restarts do not guarantee the global minimum of $J^*$ for any finite $R$, but they reach every local minimum with probability approaching 1 as $R\to\infty$.

6. Computational Considerations and Scalability

RePS designs typically admit efficient and parallelizable implementations:

RL/MDP setting: The single Bernoulli coin-flip restart eliminates per-state counters or episode-termination rules, scaling directly to deep RL setups such as Bootstrapped DQN.
Diffusion inverse problems: No gradient backpropagation through the score model is required during MAP solves; memory is reduced by ~50% and larger batch sizes become feasible. Empirically, RePS matches or exceeds the efficiency of alternative diffusion-based samplers at comparable numerical fidelity (Ahmed et al., 24 Nov 2025).
Nonparametric sampling: Each bootstrap–restart path is independent; perfect parallel scaling is achievable with minimal communication—only the final minimizer is required per chain. Total computational complexity is $O(B\,R\,(n+T)N_{\text{iter}})$ for $B$ bootstrap samples, $R$ restarts, and $N_{\text{iter}}$ optimizer steps.

In all domains, the separation of restart events from model or data structures enhances modularity and enables practical system implementations for high-dimensional or nonconvex inference tasks.

7. Intuitive Insights and Domain-Specific Impact

The shared motif across all RePS variants is that randomized restart or resampling partitions posterior inference into independently manageable blocks, ensuring effective exploration, robust error contraction, and adaptation to practical constraints such as misspecification and multimodality. In RL, the resampling pseudo-episodes mimic classical episodic learning without necessitating explicit environment resets and deliver regret comparable to or improving upon previous PSRL approaches. In inverse problems, RePS enables efficient posterior inference when existing diffusion-based methods are either inapplicable or computationally prohibitive. Nonparametric variants robustly track multiple predictive patterns or shrinkage regimes inaccessible to conventional bootstrap or variational inference.

A plausible implication is that the RePS framework provides a principled foundation for scalable, reliable posterior sampling across heterogeneous modeling domains, particularly when the inference target is complex, multimodal, or misspecified. By reducing the reliance on bespoke episodic or sampling heuristics and by leveraging randomization to induce exploration, RePS-type methods are likely to remain central to the ongoing development of robust Bayesian computation and exploration-driven learning.

Relevant foundational works include "Posterior Sampling for Continuing Environments" (Xu et al., 2022), "Solving Diffusion Inverse Problems with Restart Posterior Sampling" (Ahmed et al., 24 Nov 2025), and "Scalable Nonparametric Sampling from Multimodal Posteriors with the Posterior Bootstrap" (Fong et al., 2019).