Block Policy Mirror Descent (BPMD)
- BPMD is a policy gradient algorithm that leverages block-coordinate mirror descent to update only sampled states, significantly lowering per-iteration computational and sample complexities.
- The method employs diverse sampling schemes—including uniform, on-policy, and hybrid—to drive efficient updates and secure linear convergence under exploratory conditions.
- Its stochastic extension (SBPMD) offers rollout-based estimates with provable sample complexity advantages over traditional batch policy gradient methods in large-scale MDPs.
Block Policy Mirror Descent (BPMD) is a class of policy gradient (PG) algorithms for regularized reinforcement learning (RL) that leverages block-coordinate mirror descent to efficiently solve large-scale Markov Decision Processes (MDPs). Unlike traditional batch PG methods that update the policy at all states simultaneously, BPMD operates via partial updates focused on sampled states, resulting in substantially lower per-iteration computational and sample complexity while retaining global optimality guarantees under convex regularization (Lan et al., 2022).
1. Problem Formulation and Regularization
BPMD addresses discounted, finite MDPs $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{P}, c, \gamma)$,
where $\mathcal{S}$ is the finite state space, $\mathcal{A}$ the finite action space, $\mathcal{P}$ the transition kernel, $c$ the cost function, and $\gamma \in (0,1)$ the discount factor. Policies $\pi$ are mappings from states to distributions over actions, $\pi(\cdot|s) \in \Delta_{|\mathcal{A}|}$.
Regularization is applied in a state-wise fashion: a regularizer $h(s, \pi(\cdot|s))$ is added to the cost at each state, where $h(s, \cdot)$ is a closed convex function of the action distribution $\pi(\cdot|s)$. The regularizer is $\mu$-strongly convex (with respect to the KL divergence) if, for all $s \in \mathcal{S}$, $h(s, \cdot)$ is $\mu$-strongly convex in $\pi(\cdot|s)$. The per-state objective is the regularized value function
$V^{\pi}(s) := \mathbb{E}\big[\textstyle\sum_{t=0}^{\infty} \gamma^{t}\big(c(s_t, a_t) + h(s_t, \pi(\cdot|s_t))\big) \,\big|\, s_0 = s\big],$
and the global objective is to solve
$\min_{\pi} \; f(\pi) := \mathbb{E}_{s \sim \nu^{*}}\big[V^{\pi}(s)\big],$
where $\nu^{*}$ is the stationary state distribution under the optimal policy (Lan et al., 2022).
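To make the formulation concrete, the following minimal sketch (NumPy, with illustrative names such as `evaluate_policy`, `P`, `c`, and `tau` that are not from the paper) evaluates the entropy-regularized value function of a fixed policy on a small synthetic MDP by solving the regularized Bellman linear system.

```python
import numpy as np

def evaluate_policy(P, c, pi, gamma, tau):
    """Entropy-regularized policy evaluation for a tabular MDP.

    P     : (S, A, S) transition kernel, P[s, a, s'] = Pr(s' | s, a)
    c     : (S, A) cost function
    pi    : (S, A) policy, pi[s, a] = pi(a | s)
    gamma : discount factor; tau : weight of the negative-entropy regularizer
    Returns V of shape (S,) solving V = c_pi + h_pi + gamma * P_pi @ V.
    """
    S = c.shape[0]
    c_pi = np.einsum("sa,sa->s", pi, c)                         # expected one-step cost under pi
    h_pi = tau * np.einsum("sa,sa->s", pi, np.log(pi + 1e-12))  # per-state regularizer h(s, pi(.|s))
    P_pi = np.einsum("sa,sat->st", pi, P)                       # state-to-state kernel under pi
    return np.linalg.solve(np.eye(S) - gamma * P_pi, c_pi + h_pi)

# usage on a small random MDP
rng = np.random.default_rng(0)
S, A, gamma, tau = 5, 3, 0.9, 0.1
P = rng.dirichlet(np.ones(S), size=(S, A))  # each P[s, a] is a distribution over next states
c = rng.uniform(size=(S, A))
pi = np.full((S, A), 1.0 / A)               # uniform policy
V = evaluate_policy(P, c, pi, gamma, tau)
```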
2. BPMD Algorithm
BPMD is rooted in block-coordinate mirror descent with the KL divergence as the Bregman distance. Define the distance-generating function (negative entropy)
$w(p) := \textstyle\sum_{a \in \mathcal{A}} p(a) \log p(a),$
with corresponding Bregman divergence
$D^{p}_{q} := w(p) - w(q) - \langle \nabla w(q),\, p - q \rangle = \mathrm{KL}(p \,\|\, q).$
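As a quick numerical sanity check (a sketch, not code from the paper), the Bregman divergence generated by the negative entropy coincides with the KL divergence:

```python
import numpy as np

def neg_entropy(p):
    return np.sum(p * np.log(p))

def bregman(p, q):
    # D(p, q) = w(p) - w(q) - <grad w(q), p - q>, with grad w(q) = 1 + log(q)
    return neg_entropy(p) - neg_entropy(q) - np.dot(1.0 + np.log(q), p - q)

def kl(p, q):
    return np.sum(p * np.log(p / q))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])
assert np.isclose(bregman(p, q), kl(p, q))  # both equal KL(p || q)
```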
At each iteration $t$:
- Sample a state $s_t \sim \rho_t$, where $\rho_t$ is a possibly time-varying distribution over states.
- Update only the block $\pi(\cdot|s_t)$ via the mirror descent step
$\pi_{t+1}(\cdot|s_t) = \arg\min_{p \in \Delta_{|\mathcal{A}|}} \Big\{ \eta_t \big[\langle Q^{\pi_t}(s_t, \cdot),\, p \rangle + h(s_t, p)\big] + D^{p}_{\pi_t(\cdot|s_t)} \Big\},$
while $\pi_{t+1}(\cdot|s) = \pi_t(\cdot|s)$ for all $s \neq s_t$.
The update admits the closed form (via convex duality)
$\pi_{t+1}(\cdot|s_t) = \nabla W^{*}_{s_t}\big(\nabla w(\pi_t(\cdot|s_t)) - \eta_t\, Q^{\pi_t}(s_t, \cdot)\big),$
where $W_{s_t}(p) := \eta_t\, h(s_t, p) + w(p)$ (with the simplex constraint folded into its domain) and $W^{*}_{s_t}$ is the convex conjugate of $W_{s_t}$ (Lan et al., 2022).
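For negative-entropy regularization, $h(s, p) = \tau \sum_a p(a) \log p(a)$, the conjugate step reduces to an explicit softmax-type formula. The sketch below (illustrative names; it assumes $Q^{\pi_t}(s_t, \cdot)$ is available, e.g. from exact policy evaluation) performs one BPMD iteration under that assumption.

```python
import numpy as np

def bpmd_step(pi, Q_s, s, eta, tau):
    """One BPMD iteration for negative-entropy regularization h(s, p) = tau * sum_a p(a) log p(a).

    pi  : (S, A) current policy; Q_s : (A,) values Q^{pi_t}(s, .) at the sampled state s
    eta : step size; tau : regularization weight
    The mirror step min_p eta * [<Q_s, p> + tau * sum_a p(a) log p(a)] + KL(p || pi(.|s))
    is solved in closed form by a softmax of (log pi(.|s) - eta * Q_s) / (1 + eta * tau).
    """
    logits = (np.log(pi[s]) - eta * Q_s) / (1.0 + eta * tau)
    logits -= logits.max()                              # numerical stability
    pi_new = pi.copy()
    pi_new[s] = np.exp(logits) / np.exp(logits).sum()   # only the sampled block changes
    return pi_new

# illustrative usage with a uniformly sampled state and arbitrary Q-values
rng = np.random.default_rng(1)
S, A = 5, 3
pi = np.full((S, A), 1.0 / A)
s = rng.integers(S)
pi = bpmd_step(pi, rng.uniform(size=A), s, eta=1.0, tau=0.1)
```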
3. Sampling Schemes and Exploratory Distributions
BPMD allows for diverse state sampling policies during block updates:
- Uniform sampling: $\rho_t(s) = 1/|\mathcal{S}|$ for all $s \in \mathcal{S}$.
- On-policy sampling: $\rho_t = \nu^{\pi_t}$, the stationary state distribution induced by the current policy $\pi_t$.
- Hybrid schemes: use an approximation of $\nu^{*}$ for an initial burn-in phase, then switch to uniform sampling.
A distribution $\rho$ over states is *exploratory* if $\min_{s \in \mathcal{S}} \rho(s) > 0$, i.e., every state is sampled with strictly positive probability.
Exploratory sampling is necessary and sufficient for linear convergence rates. On-policy sampling converges only if the induced distributions $\nu^{\pi_t}$ remain exploratory with probability 1. Hybrid schemes can provide instance-dependent acceleration by initially prioritizing “heavy” states (those carrying most of the mass of $\nu^{*}$), then switching to uniform to guarantee global convergence (Lan et al., 2022).
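The choice of sampling scheme can be isolated in a small helper; the sketch below (function names and the ergodicity assumption are ours, not the paper's) returns $\rho_t$ for each of the three schemes, computing the current policy's stationary distribution for on-policy sampling.

```python
import numpy as np

def stationary_distribution(P, pi):
    """Stationary state distribution of the chain induced by pi (assumes ergodicity)."""
    P_pi = np.einsum("sa,sat->st", pi, P)
    evals, evecs = np.linalg.eig(P_pi.T)
    v = np.abs(np.real(evecs[:, np.argmin(np.abs(evals - 1.0))]))
    return v / v.sum()

def sampling_distribution(scheme, P, pi, t, nu_star_approx=None, burn_in=100):
    """Return rho_t for "uniform", "on-policy", or "hybrid" sampling (illustrative helper)."""
    S = pi.shape[0]
    if scheme == "uniform":
        return np.full(S, 1.0 / S)
    if scheme == "on-policy":
        return stationary_distribution(P, pi)
    if scheme == "hybrid":
        # prioritize an approximation of nu* during burn-in, then switch to uniform
        if t < burn_in and nu_star_approx is not None:
            return nu_star_approx
        return np.full(S, 1.0 / S)
    raise ValueError(scheme)
```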
4. Convergence Theory
BPMD achieves provable convergence rates under both strongly convex and non-strongly convex regularization:
- Strongly convex case ($\mu > 0$): for a fixed exploratory distribution $\rho$ with $\underline{\rho} := \min_{s} \rho(s) > 0$ and a suitable step size $\eta_t$, BPMD converges linearly: the expected optimality gap contracts geometrically at a rate governed by $\underline{\rho}$, $\mu$, and $1 - \gamma$, so the iteration count needed for $\epsilon$-accuracy depends only logarithmically on $1/\epsilon$ and on the initial conditions. The linear rate is ensured whenever the sampling distribution remains exploratory (Lan et al., 2022).
- Non-strongly convex case ($\mu = 0$):
- Using exponentially increasing step sizes $\eta_t$, linear convergence is retained.
- With a constant step size $\eta_t \equiv \eta$, sublinear (i.e., $\mathcal{O}(1/t)$) rates are obtained.
The analysis uses a generic recursion, summing the per-state descent inequality over states and exploiting the properties of the Bregman divergence. The contractive property of BPMD is established by a telescoping sum that mixes the expected suboptimality with an averaged Bregman distance term. Hybrid sampling improves practical performance when $\nu^{*}$ is “light-tailed,” by concentrating updates on frequently visited states early on (Lan et al., 2022).
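The two step-size regimes translate directly into simple schedules; a minimal sketch (the numerical constants are placeholders, not the paper's prescribed values):

```python
def constant_stepsize(eta0):
    """Constant step size: yields the sublinear O(1/t) rate without strong convexity."""
    return lambda t: eta0

def exponential_stepsize(eta0, growth):
    """Exponentially increasing step sizes eta_t = eta0 * growth**t, which retain linear
    convergence for non-strongly convex regularizers (growth > 1 is a placeholder)."""
    return lambda t: eta0 * growth ** t

eta_const = constant_stepsize(1.0)          # e.g. eta_const(10) == 1.0
eta_geom = exponential_stepsize(1.0, 1.05)  # e.g. eta_geom(10) == 1.05 ** 10
```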
5. Stochastic Extension (SBPMD)
When transition probabilities are unavailable but a generative model is provided, BPMD extends to the stochastic setting (SBPMD) via rollout-based estimates: the exact $Q^{\pi_t}(s_t, \cdot)$ in the mirror step is replaced by an estimate built from truncated trajectories of length $T$, generated by following the current policy $\pi_t$. For sufficiently large $T$, the estimator is unbiased up to a truncation bias of order $\gamma^{T}$ and has bounded variance.
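A truncated-rollout estimator of the regularized $Q$-function at the sampled state can be sketched as follows (the generative model is exposed here as a hypothetical `sample_next(s, a)` callable; names and the entropy-regularization convention are illustrative):

```python
import numpy as np

def estimate_Q_row(sample_next, c, pi, s, gamma, tau, T, rng):
    """Truncated-rollout estimate of the regularized Q^{pi}(s, a) for every action a.

    sample_next(s, a) -> s' : generative model of the transition kernel
    c : (S, A) cost; pi : (S, A) policy; tau : entropy-regularization weight
    Each length-T rollout following pi incurs an O(gamma^T) truncation bias.
    """
    A = c.shape[1]
    Q_hat = np.zeros(A)
    for a in range(A):
        s_cur, a_cur, discount = s, a, 1.0
        for t in range(T):
            # regularized one-step cost; the first action is fixed, so it is not regularized
            reg = tau * np.log(pi[s_cur, a_cur]) if t > 0 else 0.0
            Q_hat[a] += discount * (c[s_cur, a_cur] + reg)
            s_cur = sample_next(s_cur, a_cur)
            a_cur = rng.choice(A, p=pi[s_cur])
            discount *= gamma
    return Q_hat

# usage with an explicit kernel P of shape (S, A, S):
#   rng = np.random.default_rng(0)
#   sample_next = lambda s, a: int(rng.choice(P.shape[-1], p=P[s, a]))
#   Q_hat = estimate_Q_row(sample_next, c, pi, s=0, gamma=0.9, tau=0.1, T=50, rng=rng)
```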
Sample complexity results:
- Strongly convex case: on the order of $1/\epsilon$ samples (up to $|\mathcal{S}|$, $|\mathcal{A}|$, $1/(1-\gamma)$, and logarithmic factors) are required to reach $\epsilon$-optimality.
- Non-strongly convex case: on the order of $1/\epsilon^{2}$ samples (with the same hidden factors) suffice (Lan et al., 2022).
The step-size schedule differs between the two regimes and is chosen to balance the optimization error against the bias and variance of the rollout estimates (Lan et al., 2022).
6. Computational Complexity and Comparison to Batch Policy Gradient Methods
BPMD achieves major computational advantages relative to batch PG methods:
- Batch PG methods (PMD, NPG, etc.):
- Update policy at all states per iteration.
- Exact policy evaluation cost: $\mathcal{O}(|\mathcal{S}|^{3} + |\mathcal{S}|^{2}|\mathcal{A}|)$ (solving a linear system and forming $Q^{\pi}$); improvement step: $\mathcal{O}(|\mathcal{S}||\mathcal{A}|)$; total per-iteration cost: $\mathcal{O}(|\mathcal{S}|^{3} + |\mathcal{S}|^{2}|\mathcal{A}|)$.
- BPMD:
- Updates a single policy block; since only one row of $P^{\pi}$ changes, the matrix inverse used for policy evaluation can be maintained via a rank-one inversion update, giving $\mathcal{O}(|\mathcal{S}|^{2} + |\mathcal{S}||\mathcal{A}|)$ per iteration (see the sketch after this list).
- Iteration complexity is roughly a factor of $|\mathcal{S}|$ larger than batch PG, since only one state is updated per iteration.
- Overall, BPMD delivers an $\mathcal{O}(|\mathcal{S}|)$-fold per-iteration speedup, keeping its worst-case total computational complexity comparable to batch PG (Lan et al., 2022).
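The per-iteration saving referenced above comes from the fact that changing the policy at a single state perturbs only one row of $P^{\pi}$, so the inverse $(I - \gamma P^{\pi})^{-1}$ used for policy evaluation can be refreshed with a Sherman-Morrison rank-one update in $\mathcal{O}(|\mathcal{S}|^{2})$ time instead of being recomputed. A minimal sketch (helper names are ours):

```python
import numpy as np

def sherman_morrison_row_update(M, s, delta_row):
    """Update M = inv(I - gamma * P_pi) after row s of (I - gamma * P_pi) changes by delta_row.

    The matrix change is the rank-one term e_s @ delta_row^T, so
    M_new = M - (M e_s)(delta_row^T M) / (1 + delta_row^T M e_s), an O(S^2) operation.
    """
    Me = M[:, s]            # M @ e_s
    dM = delta_row @ M      # delta_row^T @ M
    return M - np.outer(Me, dM) / (1.0 + dM[s])

def refresh_inverse(M, P, pi_old_row, pi_new_row, s, gamma):
    p_old = pi_old_row @ P[s]               # old row of P_pi at state s
    p_new = pi_new_row @ P[s]               # new row of P_pi at state s
    return sherman_morrison_row_update(M, s, -gamma * (p_new - p_old))

# sanity check against a full recomputation on a random MDP
rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))
pi = rng.dirichlet(np.ones(A), size=S)
M = np.linalg.inv(np.eye(S) - gamma * np.einsum("sa,sat->st", pi, P))
pi_new = pi.copy()
pi_new[2] = rng.dirichlet(np.ones(A))       # policy changes only at state 2
M_fast = refresh_inverse(M, P, pi[2], pi_new[2], 2, gamma)
M_full = np.linalg.inv(np.eye(S) - gamma * np.einsum("sa,sat->st", pi_new, P))
assert np.allclose(M_fast, M_full)
```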
In stochastic settings, BPMD requires on the order of $|\mathcal{A}|$ rollouts per iteration (one per action at the sampled state) compared to $|\mathcal{S}||\mathcal{A}|$ for batch PG, yielding another $|\mathcal{S}|$-fold per-iteration speed advantage while maintaining identical total sample complexity scaling in $\epsilon$.
7. Practical Considerations and Implementation Insights
Key recommendations and properties for effective BPMD use include:
- Granularity and early stopping: Each iteration affects only one or a small subset of states, allowing early interruption and resource-efficient operation.
- Per-iteration efficiency: for large state spaces, the $\mathcal{O}(|\mathcal{S}|^{2} + |\mathcal{S}||\mathcal{A}|)$ per-iteration complexity is preferable to full batch evaluation.
- Parallelizable block updates: BPMD naturally extends to simultaneous updates at multiple sampled states, maintaining convergence rates (a combined sketch follows this list).
- Sampling strategies: if an approximate visitation distribution $\nu^{*}$ is accessible, it is advantageous to sample according to it early on, switching to uniform sampling afterwards to assure optimality.
- Regularizer selection: Negative-entropy regularization is compatible with KL-mirror descent, but other convex penalties are supported if they allow efficient mirror steps (Lan et al., 2022).
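Putting the pieces together, the following self-contained sketch runs BPMD with simultaneous updates on a small block of uniformly sampled states, using an exact regularized evaluation step for clarity (all names, constants, and the block size are illustrative placeholders rather than the paper's prescriptions):

```python
import numpy as np

def bpmd_multi_block(P, c, gamma=0.9, tau=0.1, eta=1.0, block_size=2, iters=1000, seed=0):
    """Tabular BPMD with entropy regularization and parallel block updates (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    S, A = c.shape
    pi = np.full((S, A), 1.0 / A)
    for _ in range(iters):
        # regularized policy evaluation (full solve for clarity; see the rank-one update above)
        P_pi = np.einsum("sa,sat->st", pi, P)
        cost = np.einsum("sa,sa->s", pi, c) + tau * np.einsum("sa,sa->s", pi, np.log(pi))
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, cost)
        # sample a small batch of states uniformly and update their blocks simultaneously
        for s in rng.choice(S, size=block_size, replace=False):
            Q_s = c[s] + gamma * P[s] @ V                        # Q^{pi}(s, .) under the regularized V
            logits = (np.log(pi[s]) - eta * Q_s) / (1.0 + eta * tau)
            pi[s] = np.exp(logits - logits.max())
            pi[s] /= pi[s].sum()                                 # closed-form KL mirror step
    return pi

# usage on a small random MDP
rng = np.random.default_rng(2)
S, A = 6, 3
P = rng.dirichlet(np.ones(S), size=(S, A))
c = rng.uniform(size=(S, A))
pi_hat = bpmd_multi_block(P, c)
```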
BPMD provides a block-coordinate mirror-descent framework for policy optimization in RL, achieving linear convergence under strong regularization, reduced computational and sample costs, and architectural flexibility for large-scale applications. The approach establishes a new perspective for block-coordinate descent in policy gradient optimization frameworks.