Block Policy Mirror Descent (BPMD)

Updated 5 December 2025
  • BPMD is a policy gradient algorithm that leverages block-coordinate mirror descent to update only sampled states, significantly lowering per-iteration computational and sample complexities.
  • The method employs diverse sampling schemes—including uniform, on-policy, and hybrid—to drive efficient updates and secure linear convergence under exploratory conditions.
  • Its stochastic extension (SBPMD) offers rollout-based estimates with provable sample complexity advantages over traditional batch policy gradient methods in large-scale MDPs.

Block Policy Mirror Descent (BPMD) is a class of policy gradient (PG) algorithms for regularized reinforcement learning (RL) that leverages block-coordinate mirror descent to efficiently solve large-scale Markov Decision Processes (MDPs). Unlike traditional batch PG methods that update the policy at all states simultaneously, BPMD operates via partial updates focused on sampled states, resulting in substantially lower per-iteration computational and sample complexity while retaining global optimality guarantees under convex regularization (Lan et al., 2022).

1. Problem Formulation and Regularization

BPMD addresses discounted, finite MDPs

$$\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{P}, c, \gamma),$$

where $\mathcal{S}$ is the finite state space, $\mathcal{A}$ the finite action space, $\mathcal{P}$ the transition kernel, $c$ the cost function, and $\gamma \in (0,1)$ the discount factor. Policies $\pi: \mathcal{S} \to \Delta_{\mathcal{A}}$ are mappings from states to distributions over actions.

Regularization is applied in a state-wise fashion: $h^\pi(s)$ is a closed convex function of $\pi(\cdot \mid s)$, and it is called $\mu$-strongly convex (with respect to the KL divergence) when, for all $s$, $h^\pi(s)$ is $\mu$-strongly convex in $\pi(\cdot \mid s)$. The regularized state-action and state value functions are

$$Q^\pi(s,a) = c(s,a) + h^\pi(s) + \gamma \sum_{s'} \mathcal{P}(s' \mid s,a)\, V^\pi(s'),$$

$$V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s,a),$$

and the global objective is to solve

$$\min_{\pi \in \Pi} f(\pi) = \mathbb{E}_{s \sim \nu^*}\left[V^\pi(s)\right], \tag{P}$$

where $\nu^*$ is the stationary distribution under the optimal policy $\pi^*$ (Lan et al., 2022).
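
To make these definitions concrete, the sketch below evaluates the regularized value functions of a tabular MDP stored as numpy arrays. It is a minimal sketch, assuming a (scaled) negative-entropy regularizer; the function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def evaluate_regularized_policy(P, c, gamma, pi, mu=1.0, eps=1e-12):
    """Solve V^pi(s) = sum_a pi(a|s) Q^pi(s,a) with Q^pi as defined above.

    P:  transition kernel, shape (S, A, S), with P[s, a, s2] = P(s2 | s, a)
    c:  cost table, shape (S, A)
    pi: policy, shape (S, A), each row a distribution over actions
    Regularizer (assumed): h^pi(s) = mu * sum_a pi(a|s) ln pi(a|s)  (scaled negative entropy).
    """
    S = P.shape[0]
    c_pi = np.einsum("sa,sa->s", pi, c)                 # expected one-step cost under pi
    h_pi = mu * np.sum(pi * np.log(pi + eps), axis=1)   # state-wise regularizer values
    P_pi = np.einsum("sa,sap->sp", pi, P)               # state transition matrix under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, c_pi + h_pi)
    Q = c + h_pi[:, None] + gamma * np.einsum("sap,p->sa", P, V)
    return V, Q

# Purely illustrative usage with random data; nu_star @ V realizes objective (P)
# for a given weighting nu_star.
rng = np.random.default_rng(0)
S, A = 5, 3
P = rng.dirichlet(np.ones(S), size=(S, A))   # (S, A, S) transition kernel
c = rng.uniform(size=(S, A))
pi = rng.dirichlet(np.ones(A), size=S)       # (S, A) random policy
V, Q = evaluate_regularized_policy(P, c, gamma=0.9, pi=pi)
```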

2. BPMD Algorithm

BPMD is rooted in block-coordinate mirror descent with the KL divergence as the Bregman distance. Define

$$w(x) = \sum_a x_a \ln x_a,$$

with corresponding Bregman divergence,

$$D_w(p, q) = w(p) - w(q) - \langle \nabla w(q),\, p - q \rangle = \mathrm{KL}(p \,\|\, q).$$

At each iteration $k$:

  1. Sample $s_k \sim \rho_k$, where $\rho_k$ is a possibly time-varying distribution over states.
  2. Update only $\pi(\cdot \mid s_k)$ via the mirror descent step:

$$\pi_{k+1}(\cdot \mid s_k) = \arg\min_{p \in \Delta_{\mathcal{A}}} \left\{ \eta_k \left[ \langle Q^{\pi_k}(s_k, \cdot),\, p \rangle + h^p(s_k) \right] + D_w\big(p,\, \pi_k(\cdot \mid s_k)\big) \right\},$$

while $\pi_{k+1}(\cdot \mid s) = \pi_k(\cdot \mid s)$ for $s \neq s_k$.

The update admits the closed form (via convex duality):

$$\pi_{k+1}(\cdot \mid s_k) = \nabla w^*\Big(\nabla w\big(\pi_k(\cdot \mid s_k)\big) - \eta_k \big[\, Q^{\pi_k}(s_k, \cdot) + \partial h^{\pi_{k+1}}(s_k) \,\big]\Big),$$

where $w^*$ is the convex conjugate of $w$ (Lan et al., 2022).
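
For the (scaled) negative-entropy regularizer $h^\pi(s) = \mu \sum_a \pi(a \mid s)\ln\pi(a \mid s)$, the prox step above reduces to an explicit softmax-style multiplicative update, so one BPMD iteration is only a few lines. The sketch below is illustrative (the function names and uniform-sampling driver are assumptions, not the paper's implementation); `evaluate_policy` can be any exact evaluation routine, e.g. the helper sketched in Section 1.

```python
import numpy as np

def bpmd_step(pi, Q_row, s_k, eta, mu=0.0):
    """Update pi(.|s_k) only; the policy at every other state is left untouched.

    For h^pi(s) = mu * sum_a pi(a|s) ln pi(a|s), the KL prox step has the closed form
      pi_{k+1}(a|s_k)  proportional to  pi_k(a|s_k)^(1/(1+eta*mu)) * exp(-eta*Q_row[a]/(1+eta*mu)).
    """
    pi_new = pi.copy()
    logits = (np.log(pi[s_k] + 1e-12) - eta * Q_row) / (1.0 + eta * mu)
    logits -= logits.max()                       # numerical stabilization
    pi_new[s_k] = np.exp(logits) / np.exp(logits).sum()
    return pi_new

def bpmd(P, c, gamma, n_iters, eta, evaluate_policy, mu=0.0, seed=0):
    """Deterministic BPMD with uniform state sampling and exact per-iteration evaluation.

    evaluate_policy(P, c, gamma, pi, mu) -> (V, Q), e.g. the helper sketched in Section 1.
    """
    rng = np.random.default_rng(seed)
    S, A = P.shape[0], P.shape[1]
    pi = np.full((S, A), 1.0 / A)                # uniform initial policy
    for _ in range(n_iters):
        _, Q = evaluate_policy(P, c, gamma, pi, mu)
        s_k = rng.integers(S)                    # uniform sampling: rho_k(s) = 1/|S|
        pi = bpmd_step(pi, Q[s_k], s_k, eta, mu)
    return pi
```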

3. Sampling Schemes and Exploratory Distributions

BPMD allows for diverse state sampling distributions $\rho_k$ during block updates:

  • Uniform sampling: $\rho_k(s) \equiv 1/|\mathcal{S}|$.
  • On-policy sampling: $\rho_k$ matches the current policy's stationary distribution $\nu^{\pi_k}$.
  • Hybrid schemes: use an approximate $\tilde{\nu}$ for a burn-in phase, then switch to uniform sampling.

A distribution $\rho$ is exploratory if

$$\rho^\dagger = \min_{s:\, \nu^*(s)>0} \rho(s) > 0.$$

Exploratory sampling is necessary and sufficient for linear convergence rates. On-policy sampling converges only if all $\rho_k$ remain exploratory with probability 1. Hybrid schemes can provide instance-dependent acceleration by initially prioritizing “heavy” states, then switching to uniform sampling to guarantee global convergence (Lan et al., 2022).
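
The sketch below shows how these sampling schemes and the exploratory check $\rho^\dagger > 0$ might be realized for a tabular MDP. It is a minimal sketch under simplifying assumptions (ergodic state chain, illustrative helper names); it is not code from the paper.

```python
import numpy as np

def stationary_distribution(P, pi, n_iter=5000):
    """Approximate nu^{pi} by power iteration on the state chain P_pi (assumes ergodicity)."""
    S = P.shape[0]
    P_pi = np.einsum("sa,sap->sp", pi, P)    # P_pi[s, s2] = sum_a pi(a|s) P(s2|s, a)
    nu = np.full(S, 1.0 / S)
    for _ in range(n_iter):
        nu = nu @ P_pi
    return nu / nu.sum()

def sample_state(k, S, rng, scheme="uniform", nu_tilde=None, burn_in=0):
    """Draw s_k ~ rho_k under one of the sampling schemes listed above."""
    if scheme == "uniform" or (scheme == "hybrid" and k >= burn_in):
        return int(rng.integers(S))          # rho_k(s) = 1/|S|
    # "on-policy", or "hybrid" during burn-in: sample from (an estimate of) nu^{pi_k}
    return int(rng.choice(S, p=nu_tilde))

def is_exploratory(rho, nu_star):
    """rho is exploratory iff rho_dagger = min_{s: nu*(s) > 0} rho(s) > 0."""
    return float(rho[nu_star > 0].min()) > 0.0
```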

4. Convergence Theory

BPMD achieves provable convergence rates under both strongly convex and non-strongly convex regularization:

  • Strongly convex case ($\mu > 0$): for a fixed exploratory $\rho$ with $\rho^\dagger > 0$ and a suitable step size $\eta$, BPMD achieves

$$\mathbb{E}[f(\pi_k) - f(\pi^*)] \leq \big(1-(1-\gamma)\rho^\dagger\big)^k\, C,$$

where $C$ depends logarithmically on $|\mathcal{A}|$ and on the initial conditions. The linear rate is ensured if $1 + \eta\mu \geq 1/\gamma$ (Lan et al., 2022).

  • Non-strongly convex case ($\mu = 0$):
    • Using exponentially increasing step sizes $\eta_k$, linear convergence is retained.
    • With a constant step size $\eta$, sublinear ($O(1/k)$) rates are obtained.

The analysis uses a generic recursion that sums over states and exploits the properties of the Bregman divergence

$$\phi(\pi^*, \pi) = \sum_s \nu^*(s)\, D_{w}\big(\pi^*(\cdot \mid s),\, \pi(\cdot \mid s)\big).$$

The contractive property of BPMD is established by a telescoping sum that mixes the expected suboptimality with an averaged Bregman distance term. Hybrid sampling improves practical performance when $\nu^*$ is “light-tailed” by concentrating updates on frequent states early on (Lan et al., 2022).
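
As a concrete consequence of the strongly convex rate above, using $\ln(1 - x) \le -x$, the bound $\mathbb{E}[f(\pi_k) - f(\pi^*)] \le \epsilon$ is reached once

$$k \;\ge\; \frac{1}{(1-\gamma)\,\rho^\dagger}\,\ln\frac{C}{\epsilon};$$

with uniform sampling, $\rho^\dagger = 1/|\mathcal{S}|$, which recovers the $O\!\left(\frac{|\mathcal{S}|}{1-\gamma}\ln\frac{1}{\epsilon}\right)$ iteration count quoted in Section 6.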

5. Stochastic Extension (SBPMD)

When the transition probabilities $\mathcal{P}$ are unavailable but a generative model is provided, BPMD extends to the stochastic setting (SBPMD) via rollout-based estimates:

$$Q^{\pi_k, \xi_k, s_k}(s_k, a) = \sum_{t=0}^{T-1} \gamma^t \big(c(s_t, a_t) + h^{\pi_k}(s_t)\big), \qquad (s_0, a_0) = (s_k, a),$$

with samples generated by following the current policy $\pi_k$. For $T = O\big((1/(1-\gamma))\ln(1/\epsilon)\big)$, the estimator is unbiased up to $O(\epsilon)$ and has bounded variance.
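
A minimal sketch of this rollout estimator under a generative-model interface follows; the sampler signature and the entropy regularizer are illustrative assumptions, not the paper's API.

```python
import numpy as np

def rollout_Q_estimate(sample_next_state, c, gamma, pi, s_k, T, mu=0.0, seed=None):
    """Truncated-rollout estimate of Q^{pi_k}(s_k, a) for every action a.

    sample_next_state(s, a) -> s2 draws s2 ~ P(.|s, a) from a generative model.
    c: cost table, shape (S, A); pi: current policy pi_k, shape (S, A).
    Regularizer (assumed): h^pi(s) = mu * sum_a pi(a|s) ln pi(a|s).
    Choosing T ~ (1/(1-gamma)) ln(1/eps) keeps the truncation bias at O(eps).
    """
    rng = np.random.default_rng(seed)
    A = c.shape[1]
    h = mu * np.sum(pi * np.log(pi + 1e-12), axis=1)   # state-wise regularizer values
    Q_hat = np.zeros(A)
    for a0 in range(A):                                # one rollout per action at s_k
        s, a = s_k, a0
        for t in range(T):
            Q_hat[a0] += gamma**t * (c[s, a] + h[s])
            s = sample_next_state(s, a)
            a = int(rng.choice(A, p=pi[s]))            # follow the current policy pi_k
    return Q_hat
```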

Sample complexity results:

  • Strongly convex case: $\tilde{\mathcal{O}}(|\mathcal{S}|\,|\mathcal{A}|/\epsilon)$ samples are required to reach $\epsilon$-optimality.
  • Non-strongly convex case: $\tilde{\mathcal{O}}(|\mathcal{S}|\,|\mathcal{A}|/\epsilon^2)$ samples suffice (Lan et al., 2022).

Step size selection for these regimes is $\eta_k \sim 1/k$ for the strongly convex case and $\eta_k \sim 1/\sqrt{k}$ for the non-strongly convex regime.
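
A minimal sketch of these schedules; the constant `c0` is an illustrative placeholder, since the paper's step sizes involve problem-dependent constants.

```python
def sbpmd_stepsize(k, mu, c0=1.0):
    """Diminishing SBPMD step sizes: eta_k ~ 1/k when mu > 0, eta_k ~ 1/sqrt(k) when mu = 0."""
    return c0 / (k + 1) if mu > 0 else c0 / (k + 1) ** 0.5
```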

6. Computational Complexity and Comparison to Batch Policy Gradient Methods

BPMD achieves major computational advantages relative to batch PG methods:

  • Batch PG methods (PMD, NPG, etc.):
    • Update the policy at all states in every iteration.
    • Exact policy evaluation costs $O(|\mathcal{S}|^3 + |\mathcal{S}|^2 |\mathcal{A}|)$ and the improvement step $O(|\mathcal{S}|\,|\mathcal{A}|)$, for a total per-iteration cost of $O(|\mathcal{S}|^3 + |\mathcal{S}|^2 |\mathcal{A}|)$.
  • BPMD:
    • Updates a single policy block, requiring only a rank-one inversion update (see the sketch after this list), giving $O(|\mathcal{S}|^2 + |\mathcal{S}|\,|\mathcal{A}|)$ per iteration.
    • Iteration complexity is $O\!\left(\frac{|\mathcal{S}|}{1-\gamma} \ln \frac{1}{\epsilon}\right)$, so the total computational complexity matches that of batch PG.
    • Overall, BPMD delivers an $O(|\mathcal{S}|)$-fold per-iteration speedup (Lan et al., 2022).
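
Since a block update changes the policy at a single state, the matrix $I - \gamma P_\pi$ used in exact policy evaluation changes in only one row. One way to realize the rank-one inversion update mentioned above is a Sherman-Morrison refresh of its inverse, sketched below under these assumptions (an illustrative realization; the paper states the cost of the rank-one update, not this particular implementation).

```python
import numpy as np

def update_inverse_after_block_step(M_inv, P, pi_old_row, pi_new_row, s_k, gamma):
    """Sherman-Morrison update of M_inv = (I - gamma * P_pi)^{-1} after changing pi(.|s_k).

    P: transition kernel, shape (S, A, S); pi_old_row, pi_new_row: policy at s_k, shape (A,).
    Only row s_k of P_pi changes, by d = (pi_new_row - pi_old_row) @ P[s_k], so the new
    matrix is M + u d^T with u = -gamma * e_{s_k}; its inverse is refreshed in O(S^2).
    """
    d = (pi_new_row - pi_old_row) @ P[s_k]   # change in row s_k of P_pi, shape (S,)
    Mi_u = -gamma * M_inv[:, s_k]            # M^{-1} u
    d_Mi = d @ M_inv                         # d^T M^{-1}
    denom = 1.0 + d @ Mi_u                   # 1 + d^T M^{-1} u
    return M_inv - np.outer(Mi_u, d_Mi) / denom
```

The refreshed inverse then yields the updated value function via $V^\pi = (I - \gamma P_\pi)^{-1}(c_\pi + h_\pi)$ at $O(|\mathcal{S}|^2)$ cost, since $c_\pi$ and $h_\pi$ also change only in their $s_k$-th entries.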

In stochastic settings, BPMD uses $O(|\mathcal{A}|)$ samples per iteration compared to $O(|\mathcal{S}|\,|\mathcal{A}|)$ for batch PG, yielding another $|\mathcal{S}|$-fold per-iteration sampling advantage while maintaining identical sample complexity scaling in $\epsilon$.

7. Practical Considerations and Implementation Insights

Key recommendations and properties for effective BPMD use include:

  • Granularity and early stopping: Each iteration affects only one or a small subset of states, allowing early interruption and resource-efficient operation.
  • Per-iteration efficiency: For large-$|\mathcal{S}|$ problems, the $O(|\mathcal{S}|^2 + |\mathcal{S}|\,|\mathcal{A}|)$ per-iteration cost is preferable to $O(|\mathcal{S}|^3)$ batch evaluation.
  • Parallelizable block updates: BPMD naturally extends to simultaneous updates at multiple states while maintaining its convergence rates.
  • Sampling strategies: If an approximate visitation distribution $\tilde{\nu}$ is accessible, it is advantageous to sample according to $\tilde{\nu}$ early on, then switch to uniform sampling to ensure optimality.
  • Regularizer selection: Negative-entropy regularization $h^\pi(s) = \sum_a \pi(a \mid s)\ln \pi(a \mid s)$ is compatible with KL mirror descent, but other convex penalties are supported if they admit efficient mirror steps (Lan et al., 2022).

BPMD provides a block-coordinate mirror-descent framework for policy optimization in RL, achieving linear convergence under strong regularization, reduced computational and sample costs, and architectural flexibility for large-scale applications. The approach establishes a new perspective for block-coordinate descent in policy gradient optimization frameworks.

