Block Policy Mirror Descent (BPMD)

Updated 5 December 2025
  • BPMD is a policy gradient algorithm that leverages block-coordinate mirror descent to update only sampled states, significantly lowering per-iteration computational and sample complexities.
  • The method employs diverse sampling schemes—including uniform, on-policy, and hybrid—to drive efficient updates and secure linear convergence under exploratory conditions.
  • Its stochastic extension (SBPMD) offers rollout-based estimates with provable sample complexity advantages over traditional batch policy gradient methods in large-scale MDPs.

Block Policy Mirror Descent (BPMD) is a class of policy gradient (PG) algorithms for regularized reinforcement learning (RL) that leverages block-coordinate mirror descent to efficiently solve large-scale Markov Decision Processes (MDPs). Unlike traditional batch PG methods that update the policy at all states simultaneously, BPMD operates via partial updates focused on sampled states, resulting in substantially lower per-iteration computational and sample complexity while retaining global optimality guarantees under convex regularization (Lan et al., 2022).

1. Problem Formulation and Regularization

BPMD addresses discounted, finite MDPs

$$\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{P}, c, \gamma),$$

where $\mathcal{S}$ is the finite state space, $\mathcal{A}$ the finite action space, $\mathcal{P}$ the transition kernel, $c$ the cost function, and $\gamma \in (0,1)$ the discount factor. Policies $\pi: \mathcal{S} \to \Delta_{\mathcal{A}}$ are mappings from states to distributions over actions.

Regularization is applied in a state-wise fashion: $h^\pi(s)$ is a closed convex function of $\pi(\cdot \mid s)$, and it is called $\mu$-strongly convex (with respect to the KL divergence) when, for all $s$, $h^\pi(s)$ is $\mu$-strongly convex in $\pi(\cdot \mid s)$. The regularized state-action and state value functions are

$$Q^\pi(s,a) = c(s,a) + h^\pi(s) + \gamma \sum_{s'} \mathcal{P}(s' \mid s,a)\, V^\pi(s'),$$

$$V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s,a),$$

and the global objective is to solve

$$\min_{\pi \in \Pi} f(\pi) = \mathbb{E}_{s \sim \nu^*}\left[V^\pi(s)\right], \tag{P}$$

where $\nu^*$ is the stationary distribution under the optimal policy $\pi^*$ (Lan et al., 2022).
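
To make these definitions concrete, the sketch below evaluates the regularized value functions of a tabular MDP stored as numpy arrays. It is a minimal sketch, assuming a (scaled) negative-entropy regularizer; the function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def evaluate_regularized_policy(P, c, gamma, pi, mu=1.0, eps=1e-12):
    """Solve V^pi(s) = sum_a pi(a|s) Q^pi(s,a) with Q^pi as defined above.

    P:  transition kernel, shape (S, A, S), with P[s, a, s2] = P(s2 | s, a)
    c:  cost table, shape (S, A)
    pi: policy, shape (S, A), each row a distribution over actions
    Regularizer (assumed): h^pi(s) = mu * sum_a pi(a|s) ln pi(a|s)  (scaled negative entropy).
    """
    S = P.shape[0]
    c_pi = np.einsum("sa,sa->s", pi, c)                 # expected one-step cost under pi
    h_pi = mu * np.sum(pi * np.log(pi + eps), axis=1)   # state-wise regularizer values
    P_pi = np.einsum("sa,sap->sp", pi, P)               # state transition matrix under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, c_pi + h_pi)
    Q = c + h_pi[:, None] + gamma * np.einsum("sap,p->sa", P, V)
    return V, Q

# Purely illustrative usage with random data; nu_star @ V realizes objective (P)
# for a given weighting nu_star.
rng = np.random.default_rng(0)
S, A = 5, 3
P = rng.dirichlet(np.ones(S), size=(S, A))   # (S, A, S) transition kernel
c = rng.uniform(size=(S, A))
pi = rng.dirichlet(np.ones(A), size=S)       # (S, A) random policy
V, Q = evaluate_regularized_policy(P, c, gamma=0.9, pi=pi)
```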

2. BPMD Algorithm

BPMD is rooted in block-coordinate mirror descent with the KL divergence as the Bregman distance. Define

$$w(x) = \sum_a x_a \ln x_a,$$

with corresponding Bregman divergence,

$$D_w(p, q) = w(p) - w(q) - \langle \nabla w(q),\, p - q \rangle = \mathrm{KL}(p \,\|\, q).$$

At each iteration $k$:

  1. Sample $s_k \sim \rho_k$, where $\rho_k$ is a possibly time-varying distribution over states.
  2. Update only $\pi(\cdot \mid s_k)$ via the mirror descent step:

$$\pi_{k+1}(\cdot \mid s_k) = \arg\min_{p \in \Delta_{\mathcal{A}}} \left\{ \eta_k \left[ \langle Q^{\pi_k}(s_k, \cdot),\, p \rangle + h^p(s_k) \right] + D_w\big(p,\, \pi_k(\cdot \mid s_k)\big) \right\},$$

while $\pi_{k+1}(\cdot \mid s) = \pi_k(\cdot \mid s)$ for $s \neq s_k$.

The update admits the closed form (via convex duality):

$$\pi_{k+1}(\cdot \mid s_k) = \nabla w^*\Big(\nabla w\big(\pi_k(\cdot \mid s_k)\big) - \eta_k \big[\, Q^{\pi_k}(s_k, \cdot) + \partial h^{\pi_{k+1}}(s_k) \,\big]\Big),$$

where $w^*$ is the convex conjugate of $w$ (Lan et al., 2022).
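
For the (scaled) negative-entropy regularizer $h^\pi(s) = \mu \sum_a \pi(a \mid s)\ln\pi(a \mid s)$, the prox step above reduces to an explicit softmax-style multiplicative update, so one BPMD iteration is only a few lines. The sketch below is illustrative (the function names and uniform-sampling driver are assumptions, not the paper's implementation); `evaluate_policy` can be any exact evaluation routine, e.g. the helper sketched in Section 1.

```python
import numpy as np

def bpmd_step(pi, Q_row, s_k, eta, mu=0.0):
    """Update pi(.|s_k) only; the policy at every other state is left untouched.

    For h^pi(s) = mu * sum_a pi(a|s) ln pi(a|s), the KL prox step has the closed form
      pi_{k+1}(a|s_k)  proportional to  pi_k(a|s_k)^(1/(1+eta*mu)) * exp(-eta*Q_row[a]/(1+eta*mu)).
    """
    pi_new = pi.copy()
    logits = (np.log(pi[s_k] + 1e-12) - eta * Q_row) / (1.0 + eta * mu)
    logits -= logits.max()                       # numerical stabilization
    pi_new[s_k] = np.exp(logits) / np.exp(logits).sum()
    return pi_new

def bpmd(P, c, gamma, n_iters, eta, evaluate_policy, mu=0.0, seed=0):
    """Deterministic BPMD with uniform state sampling and exact per-iteration evaluation.

    evaluate_policy(P, c, gamma, pi, mu) -> (V, Q), e.g. the helper sketched in Section 1.
    """
    rng = np.random.default_rng(seed)
    S, A = P.shape[0], P.shape[1]
    pi = np.full((S, A), 1.0 / A)                # uniform initial policy
    for _ in range(n_iters):
        _, Q = evaluate_policy(P, c, gamma, pi, mu)
        s_k = rng.integers(S)                    # uniform sampling: rho_k(s) = 1/|S|
        pi = bpmd_step(pi, Q[s_k], s_k, eta, mu)
    return pi
```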

3. Sampling Schemes and Exploratory Distributions

BPMD allows for diverse state sampling distributions $\rho_k$ during block updates:

  • Uniform sampling: $\rho_k(s) \equiv 1/|\mathcal{S}|$.
  • On-policy sampling: $\rho_k$ matches the current policy's stationary distribution $\nu^{\pi_k}$.
  • Hybrid schemes: use an approximate $\tilde{\nu}$ for a burn-in phase, then switch to uniform sampling.

A distribution $\rho$ is exploratory if

$$\rho^\dagger = \min_{s:\, \nu^*(s)>0} \rho(s) > 0.$$

Exploratory sampling is necessary and sufficient for linear convergence rates. On-policy sampling converges only if all $\rho_k$ remain exploratory with probability 1. Hybrid schemes can provide instance-dependent acceleration by initially prioritizing “heavy” states, then switching to uniform sampling to guarantee global convergence (Lan et al., 2022).
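
The sketch below shows how these sampling schemes and the exploratory check $\rho^\dagger > 0$ might be realized for a tabular MDP. It is a minimal sketch under simplifying assumptions (ergodic state chain, illustrative helper names); it is not code from the paper.

```python
import numpy as np

def stationary_distribution(P, pi, n_iter=5000):
    """Approximate nu^{pi} by power iteration on the state chain P_pi (assumes ergodicity)."""
    S = P.shape[0]
    P_pi = np.einsum("sa,sap->sp", pi, P)    # P_pi[s, s2] = sum_a pi(a|s) P(s2|s, a)
    nu = np.full(S, 1.0 / S)
    for _ in range(n_iter):
        nu = nu @ P_pi
    return nu / nu.sum()

def sample_state(k, S, rng, scheme="uniform", nu_tilde=None, burn_in=0):
    """Draw s_k ~ rho_k under one of the sampling schemes listed above."""
    if scheme == "uniform" or (scheme == "hybrid" and k >= burn_in):
        return int(rng.integers(S))          # rho_k(s) = 1/|S|
    # "on-policy", or "hybrid" during burn-in: sample from (an estimate of) nu^{pi_k}
    return int(rng.choice(S, p=nu_tilde))

def is_exploratory(rho, nu_star):
    """rho is exploratory iff rho_dagger = min_{s: nu*(s) > 0} rho(s) > 0."""
    return float(rho[nu_star > 0].min()) > 0.0
```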

4. Convergence Theory

BPMD achieves provable convergence rates under both strongly convex and non-strongly convex regularization:

  • Strongly convex case ($\mu > 0$): for a fixed exploratory $\rho$ with $\rho^\dagger > 0$ and a suitable step size $\eta$, BPMD achieves

$$\mathbb{E}[f(\pi_k) - f(\pi^*)] \leq \big(1-(1-\gamma)\rho^\dagger\big)^k\, C,$$

where $C$ depends logarithmically on $|\mathcal{A}|$ and on the initial conditions. The linear rate is ensured if $1 + \eta\mu \geq 1/\gamma$ (Lan et al., 2022).

  • Non-strongly convex case ($\mu = 0$):
    • Using exponentially increasing step sizes $\eta_k$, linear convergence is retained.
    • With a constant step size $\eta$, sublinear ($O(1/k)$) rates are obtained.

The analysis uses a generic recursion that sums over states and exploits the properties of the Bregman divergence

$$\phi(\pi^*, \pi) = \sum_s \nu^*(s)\, D_{w}\big(\pi^*(\cdot \mid s),\, \pi(\cdot \mid s)\big).$$

The contractive property of BPMD is established by a telescoping sum that mixes the expected suboptimality with an averaged Bregman distance term. Hybrid sampling improves practical performance when $\nu^*$ is “light-tailed” by concentrating updates on frequent states early on (Lan et al., 2022).
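
As a concrete consequence of the strongly convex rate above, using $\ln(1 - x) \le -x$, the bound $\mathbb{E}[f(\pi_k) - f(\pi^*)] \le \epsilon$ is reached once

$$k \;\ge\; \frac{1}{(1-\gamma)\,\rho^\dagger}\,\ln\frac{C}{\epsilon};$$

with uniform sampling, $\rho^\dagger = 1/|\mathcal{S}|$, which recovers the $O\!\left(\frac{|\mathcal{S}|}{1-\gamma}\ln\frac{1}{\epsilon}\right)$ iteration count quoted in Section 6.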

5. Stochastic Extension (SBPMD)

When the transition probabilities $\mathcal{P}$ are unavailable but a generative model is provided, BPMD extends to the stochastic setting (SBPMD) via rollout-based estimates:

$$Q^{\pi_k, \xi_k, s_k}(s_k, a) = \sum_{t=0}^{T-1} \gamma^t \big(c(s_t, a_t) + h^{\pi_k}(s_t)\big), \qquad (s_0, a_0) = (s_k, a),$$

with samples generated by following the current policy $\pi_k$. For $T = O\big((1/(1-\gamma))\ln(1/\epsilon)\big)$, the estimator is unbiased up to $O(\epsilon)$ and has bounded variance.
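
A minimal sketch of this rollout estimator under a generative-model interface follows; the sampler signature and the entropy regularizer are illustrative assumptions, not the paper's API.

```python
import numpy as np

def rollout_Q_estimate(sample_next_state, c, gamma, pi, s_k, T, mu=0.0, seed=None):
    """Truncated-rollout estimate of Q^{pi_k}(s_k, a) for every action a.

    sample_next_state(s, a) -> s2 draws s2 ~ P(.|s, a) from a generative model.
    c: cost table, shape (S, A); pi: current policy pi_k, shape (S, A).
    Regularizer (assumed): h^pi(s) = mu * sum_a pi(a|s) ln pi(a|s).
    Choosing T ~ (1/(1-gamma)) ln(1/eps) keeps the truncation bias at O(eps).
    """
    rng = np.random.default_rng(seed)
    A = c.shape[1]
    h = mu * np.sum(pi * np.log(pi + 1e-12), axis=1)   # state-wise regularizer values
    Q_hat = np.zeros(A)
    for a0 in range(A):                                # one rollout per action at s_k
        s, a = s_k, a0
        for t in range(T):
            Q_hat[a0] += gamma**t * (c[s, a] + h[s])
            s = sample_next_state(s, a)
            a = int(rng.choice(A, p=pi[s]))            # follow the current policy pi_k
    return Q_hat
```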

Sample complexity results:

  • Strongly convex case: $\tilde{\mathcal{O}}(|\mathcal{S}|\,|\mathcal{A}|/\epsilon)$ samples are required to reach $\epsilon$-optimality.
  • Non-strongly convex case: $\tilde{\mathcal{O}}(|\mathcal{S}|\,|\mathcal{A}|/\epsilon^2)$ samples suffice (Lan et al., 2022).

Step size selection for these regimes is $\eta_k \sim 1/k$ for the strongly convex case and $\eta_k \sim 1/\sqrt{k}$ for the non-strongly convex regime.
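
A minimal sketch of these schedules; the constant `c0` is an illustrative placeholder, since the paper's step sizes involve problem-dependent constants.

```python
def sbpmd_stepsize(k, mu, c0=1.0):
    """Diminishing SBPMD step sizes: eta_k ~ 1/k when mu > 0, eta_k ~ 1/sqrt(k) when mu = 0."""
    return c0 / (k + 1) if mu > 0 else c0 / (k + 1) ** 0.5
```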

6. Computational Complexity and Comparison to Batch Policy Gradient Methods

BPMD achieves major computational advantages relative to batch PG methods:

  • Batch PG methods (PMD, NPG, etc.):
    • Update the policy at all states in every iteration.
    • Exact policy evaluation costs $O(|\mathcal{S}|^3 + |\mathcal{S}|^2 |\mathcal{A}|)$ and the improvement step $O(|\mathcal{S}|\,|\mathcal{A}|)$, for a total per-iteration cost of $O(|\mathcal{S}|^3 + |\mathcal{S}|^2 |\mathcal{A}|)$.
  • BPMD:
    • Updates a single policy block, requiring only a rank-one inversion update (see the sketch after this list), giving $O(|\mathcal{S}|^2 + |\mathcal{S}|\,|\mathcal{A}|)$ per iteration.
    • Iteration complexity is $O\!\left(\frac{|\mathcal{S}|}{1-\gamma} \ln \frac{1}{\epsilon}\right)$, so the total computational complexity matches that of batch PG.
    • Overall, BPMD delivers an $O(|\mathcal{S}|)$-fold per-iteration speedup (Lan et al., 2022).
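
Since a block update changes the policy at a single state, the matrix $I - \gamma P_\pi$ used in exact policy evaluation changes in only one row. One way to realize the rank-one inversion update mentioned above is a Sherman-Morrison refresh of its inverse, sketched below under these assumptions (an illustrative realization; the paper states the cost of the rank-one update, not this particular implementation).

```python
import numpy as np

def update_inverse_after_block_step(M_inv, P, pi_old_row, pi_new_row, s_k, gamma):
    """Sherman-Morrison update of M_inv = (I - gamma * P_pi)^{-1} after changing pi(.|s_k).

    P: transition kernel, shape (S, A, S); pi_old_row, pi_new_row: policy at s_k, shape (A,).
    Only row s_k of P_pi changes, by d = (pi_new_row - pi_old_row) @ P[s_k], so the new
    matrix is M + u d^T with u = -gamma * e_{s_k}; its inverse is refreshed in O(S^2).
    """
    d = (pi_new_row - pi_old_row) @ P[s_k]   # change in row s_k of P_pi, shape (S,)
    Mi_u = -gamma * M_inv[:, s_k]            # M^{-1} u
    d_Mi = d @ M_inv                         # d^T M^{-1}
    denom = 1.0 + d @ Mi_u                   # 1 + d^T M^{-1} u
    return M_inv - np.outer(Mi_u, d_Mi) / denom
```

The refreshed inverse then yields the updated value function via $V^\pi = (I - \gamma P_\pi)^{-1}(c_\pi + h_\pi)$ at $O(|\mathcal{S}|^2)$ cost, since $c_\pi$ and $h_\pi$ also change only in their $s_k$-th entries.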

In stochastic settings, BPMD uses $O(|\mathcal{A}|)$ samples per iteration compared to $O(|\mathcal{S}|\,|\mathcal{A}|)$ for batch PG, yielding another $|\mathcal{S}|$-fold per-iteration sampling advantage while maintaining identical sample complexity scaling in $\epsilon$.

7. Practical Considerations and Implementation Insights

Key recommendations and properties for effective BPMD use include:

  • Granularity and early stopping: Each iteration affects only one or a small subset of states, allowing early interruption and resource-efficient operation.
  • Per-iteration efficiency: For large-$|\mathcal{S}|$ problems, the $O(|\mathcal{S}|^2 + |\mathcal{S}|\,|\mathcal{A}|)$ per-iteration cost is preferable to $O(|\mathcal{S}|^3)$ batch evaluation.
  • Parallelizable block updates: BPMD naturally extends to simultaneous updates at multiple states while maintaining its convergence rates.
  • Sampling strategies: If an approximate visitation distribution $\tilde{\nu}$ is accessible, it is advantageous to sample according to $\tilde{\nu}$ early on, then switch to uniform sampling to ensure optimality.
  • Regularizer selection: Negative-entropy regularization $h^\pi(s) = \sum_a \pi(a \mid s)\ln \pi(a \mid s)$ is compatible with KL mirror descent, but other convex penalties are supported if they admit efficient mirror steps (Lan et al., 2022).

BPMD provides a block-coordinate mirror-descent framework for policy optimization in RL, achieving linear convergence under strong regularization, reduced computational and sample costs, and architectural flexibility for large-scale applications. The approach establishes a new perspective for block-coordinate descent in policy gradient optimization frameworks.

