OBLR-PO: Optimal Baseline & LR Policy Optimization

Updated 1 December 2025
  • OBLR-PO is a structured framework that jointly optimizes baseline selection and learning-rate scheduling to improve convergence and training efficiency in DNN and RL settings.
  • It employs a multi-stage verification process including direct checks, database lookups, and auto-generation to dynamically compose optimal learning-rate policies.
  • Empirical results show significant improvements in accuracy and iteration reduction across benchmarks, validating the approach in both supervised and reinforcement learning tasks.

Optimal Baseline and Learning-Rate Policy Optimization (OBLR-PO) refers to a systematic suite of algorithms and theoretical tools designed to jointly select or adapt baselines and learning-rate schedules for efficient, stable, and high-performance optimization, particularly in deep neural network (DNN) training and reinforcement learning (RL) policy-gradient settings. Representative instances include the OBLR-PO framework for learning-rate selection, composition, and auto-verification for DNNs (Wu et al., 2022), and a general theoretical framework for variance-minimal baseline adaptation and SNR-governed learning-rate optimization for RL-based LLM post-training (Huang et al., 28 Nov 2025). OBLR-PO formalizes policy selection as a constrained minimization under accuracy and time constraints, characterizes the joint statistical and efficiency properties of gradient estimators, and provides principled mechanisms for both schedule composition and adaptive variance reduction.

1. Formal Problem Definition

In the supervised DNN training context, OBLR-PO casts learning-rate (LR) policy selection as a constrained optimization problem: given a model $F_\Theta$, an optimizer $O_\eta$ (SGD, Adam, etc.) parameterized by a candidate time-varying rate schedule $\eta(t)$, and datasets $X^{\mathrm{train}}$ and $X^{\mathrm{val}}$, the goal is

$$\hat\eta = \operatorname*{arg\,min}_{\eta \in P} \; L_{x \in X^{\mathrm{val}}}\bigl[x; F_{\Theta^*}\bigr]$$

subject to

$$\Theta^* = O_\eta(X^{\mathrm{train}}), \quad T_{\mathrm{train}} \le T_{\max}, \quad \mathrm{Acc}(F_{\Theta^*}, X^{\mathrm{val}}) \ge \mathrm{Acc}_{\mathrm{target}}$$

where $P$ denotes the finite set of admissible LR policies (fixed, decaying, cyclic), each parameterized by a small set of scalars (Wu et al., 2022).
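
Because $P$ is finite and each candidate is evaluated by training to completion, the selection loop itself is simple. The following Python sketch (the `train_and_evaluate` helper and `PolicyResult` record are hypothetical stand-ins, not the paper's API) keeps the lowest-validation-loss policy among those meeting the constraints:

```python
from dataclasses import dataclass

@dataclass
class PolicyResult:
    policy: object      # the candidate LR schedule eta(t)
    val_loss: float     # validation loss of F_{Theta*}
    val_acc: float      # Acc(F_{Theta*}, X^val)
    train_time: float   # T_train

def select_lr_policy(candidate_policies, train_and_evaluate,
                     acc_target, t_max):
    """Return the feasible policy with the lowest validation loss.

    `train_and_evaluate(eta)` is assumed to train the model with
    optimizer O_eta on X^train and return a PolicyResult.
    """
    best = None
    for eta in candidate_policies:
        result = train_and_evaluate(eta)
        feasible = (result.val_acc >= acc_target and
                    result.train_time <= t_max)
        if feasible and (best is None or result.val_loss < best.val_loss):
            best = result
    return best  # None if no candidate satisfies both constraints
```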

For RL policy optimization, OBLR-PO formalizes the Monte-Carlo policy-gradient estimator with baseline $b_\theta(q)$ as

$$\widehat{\nabla_\theta J(\theta)} = \frac{1}{N\,G} \sum_{j=1}^{N} \sum_{i=1}^{G} \nabla_\theta \log \pi_\theta(o_{i,j} \mid q_j)\; \bigl( F(q_j, o_{i,j}) - b_\theta(q_j) \bigr)$$

and seeks to minimize the optimization loss $L(\theta) = J(\theta^*) - J(\theta)$ subject to smoothness and boundedness conditions, while jointly optimizing baselines and step-sizes (Huang et al., 28 Nov 2025).
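
As a minimal illustration of this estimator (not the authors' implementation), the per-sample score vectors $\nabla_\theta \log \pi_\theta(o_{i,j} \mid q_j)$ are assumed to be precomputed, for example by an autodiff framework:

```python
import numpy as np

def policy_gradient_estimate(score_vecs, rewards, baselines):
    """Monte-Carlo policy-gradient estimate with per-question baselines.

    score_vecs : (N, G, D) array, grad_theta log pi(o_ij | q_j)
    rewards    : (N, G) array, F(q_j, o_ij)
    baselines  : (N,) array, b_theta(q_j)
    Returns the D-dimensional estimate
    (1 / (N G)) * sum_j sum_i score_ij * (F_ij - b_j).
    """
    n, g, _ = score_vecs.shape
    advantages = rewards - baselines[:, None]        # (N, G)
    weighted = score_vecs * advantages[:, :, None]   # (N, G, D)
    return weighted.sum(axis=(0, 1)) / (n * g)
```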

2. Baseline Verification, Tuning, and Improvement

A hallmark of OBLR-PO is the establishment and systematic verification of “baseline” policies as starting points for further optimization. For DNN training, baseline LR policies are those recommended by common frameworks (e.g., Caffe, TensorFlow, PyTorch). OBLR-PO employs a 3-stage acceptance procedure (Wu et al., 2022):

  1. Direct Check: A candidate LR policy $\eta_0$ is accepted if, upon a single training run, it exceeds the target accuracy $A^*$ within the budget $T^*$.
  2. Database Lookup: If the direct check fails, the candidate is compared against the top-$K$ best-performing historical policies for the same model/task and is retained only if it outperforms them.
  3. Auto-Generation: If not, parameter auto-tuning (grid/random) and dynamic schedule adjustment (e.g., Change-LR-On-Plateau) are invoked to discover or compose a superior policy.

Only the top-performing schedule(s) are retained as the “optimal baseline”, rejecting suboptimal defaults.
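
A control-flow sketch of this acceptance procedure is given below; the helpers `run_once`, `lookup_top_k`, and `auto_generate` are hypothetical stand-ins for the framework's components, not its published API.

```python
def verify_baseline(eta0, run_once, lookup_top_k, auto_generate,
                    acc_target, t_budget, k=5):
    """Three-stage acceptance: direct check, database lookup, auto-generation."""
    # Stage 1: direct check -- accept eta0 if one run meets the targets.
    acc0, time0 = run_once(eta0)
    if acc0 >= acc_target and time0 <= t_budget:
        return eta0

    # Stage 2: database lookup -- retain eta0 only if it outperforms the
    # top-K historical policies recorded for the same model/task.
    history = lookup_top_k(k)                 # [(policy, recorded_acc), ...]
    best_eta, best_acc = max(history, key=lambda item: item[1])
    if acc0 >= best_acc:
        return eta0

    # Stage 3: auto-generation -- grid/random parameter tuning and dynamic
    # adjustment (e.g. Change-LR-On-Plateau) to compose a superior policy.
    return auto_generate(start_from=best_eta, acc_target=acc_target,
                         t_budget=t_budget)
```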

In the RL-policy gradient scenario, the “variance-optimal baseline” is not a constant or average, but the gradient-weighted average

$$b^*(q) = \frac{E_{o \sim \pi}\bigl[\|\nabla \log \pi(o \mid q)\|^2\, F(q,o)\bigr]}{E_{o \sim \pi}\bigl[\|\nabla \log \pi(o \mid q)\|^2\bigr]}$$

which strictly minimizes gradient estimator variance under mild assumptions (Huang et al., 28 Nov 2025).
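
A per-question sample estimate of this baseline is a two-line computation once the squared score norms are available (a sketch, assuming per-example gradient norms can be obtained from the training framework):

```python
import numpy as np

def optimal_baseline(score_sq_norms, rewards):
    """Gradient-weighted baseline b*(q) for a single question q.

    score_sq_norms : (G,) array of ||grad_theta log pi(o_i | q)||^2
    rewards        : (G,) array of F(q, o_i)
    Estimates E[||grad log pi||^2 F] / E[||grad log pi||^2] over the
    G sampled responses.
    """
    return float(np.dot(score_sq_norms, rewards) / score_sq_norms.sum())
```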

3. Learning-Rate Policy Composition and Adaptation

OBLR-PO supports highly flexible rate-schedule design, including:

  • Homogeneous Multi-Policy: Piecewise-constant or multi-segment decaying schedules (e.g., NSTEP).
  • Heterogeneous Multi-Policy: Distinct policy families in time windows, for instance, a sequence of triangular, sinusoidal, and exponential phases. Schedules are dynamically composed:

$$\eta(t) = \begin{cases} \eta^{1}(t; P^1) & t_0 \le t < t_1 \\ \eta^{2}(t; P^2) & t_1 \le t < t_2 \\ \quad\vdots & \\ \eta^{m}(t; P^m) & t_{m-1} \le t \le t_m \end{cases}$$

Dynamic adaptation is realized via plateau detectors or algorithmic heuristics that adjust LR sub-policies in response to stagnating metrics during training (Wu et al., 2022).
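
A heterogeneous composition of this kind reduces to a piecewise lookup over sub-policies, each run on its own local clock. The sketch below uses illustrative triangular and exponential sub-policies with made-up parameters, not values from the paper:

```python
def triangular(t, k0, k1, period):
    """Cyclic triangular LR oscillating between k0 and k1."""
    phase = abs((t % period) / (period / 2) - 1.0)   # 1 -> 0 -> 1 per cycle
    return k1 - (k1 - k0) * phase

def exponential_decay(t, k0, gamma):
    """Exponentially decaying LR starting from k0."""
    return k0 * (gamma ** t)

def compose_schedule(segments):
    """segments: list of (t_start, t_end, eta_fn); returns eta(t)."""
    def eta(t):
        for t_start, t_end, fn in segments:
            if t_start <= t < t_end:
                return fn(t - t_start)     # sub-policy sees local time
        t_start, t_end, fn = segments[-1]  # clamp beyond the last window
        return fn(min(t, t_end) - t_start)
    return eta

# Example: a triangular phase followed by an exponential-decay phase.
eta = compose_schedule([
    (0,    5000,  lambda t: triangular(t, k0=0.01, k1=0.06, period=2000)),
    (5000, 10000, lambda t: exponential_decay(t, k0=0.06, gamma=0.9995)),
])
```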

For RL, the learning rate $\eta_t$ is set adaptively at each iteration by minimizing an upper bound on the optimization loss, producing the SNR-dependent formula

$$\eta_t^* = \frac{1}{\beta}\,\frac{N_t G_t\, \mathrm{SNR}(\theta_t)}{1 + N_t G_t\, \mathrm{SNR}(\theta_t)}, \qquad \mathrm{SNR}(\theta_t) = \frac{\|\nabla L(\theta_t)\|^2}{\mathrm{tr}\, H(\theta_t)}$$

(Huang et al., 28 Nov 2025).
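
Given estimates of the gradient norm and Hessian trace, the step-size rule is a one-line computation (a sketch; how those quantities are estimated in practice is left to the paper):

```python
def adaptive_learning_rate(grad_sq_norm, hessian_trace, n_t, g_t, beta):
    """SNR-governed step size eta_t* = (1/beta) * m*SNR / (1 + m*SNR),
    with m = N_t * G_t and SNR = ||grad L||^2 / tr H."""
    snr = grad_sq_norm / hessian_trace
    m_snr = n_t * g_t * snr
    return (1.0 / beta) * m_snr / (1.0 + m_snr)
```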

4. Auto-Tuning and Verification Methodologies

The OBLR-PO framework operationalizes policy optimization via tightly integrated auto-tuning mechanisms:

  • Change-LR-On-Plateau: Automatically switches to a larger or smaller rate policy when a monitored metric fails to improve beyond a threshold within a sliding window. If a plateau occurs early in training, the LR is increased to escape it; if it occurs late, the LR is decreased to promote convergence (a simplified sketch follows this list).
  • Empirical Verification: Newly constructed or externally proposed LR schedules are subject to the same measurement and competitive replacement process as baselines.
  • Database and Policy Synthesis: Accumulated policy performance data inform direct recommendations or composition of hybrid schedules for novel tasks or architectures (Wu et al., 2022).
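
A simplified version of the plateau rule referenced above might look as follows; the window size, threshold, and scaling factors are illustrative choices, not the framework's defaults.

```python
def change_lr_on_plateau(metric_history, lr, step, total_steps,
                         window=5, threshold=1e-3,
                         up_factor=2.0, down_factor=0.5):
    """Adjust the LR when a monitored metric (e.g. validation accuracy)
    stops improving.

    A plateau is declared when the best value inside the sliding window
    improves on the value just before the window by less than `threshold`.
    Early plateaus raise the LR to escape; late plateaus lower it to
    promote convergence.
    """
    if len(metric_history) <= window:
        return lr
    recent_best = max(metric_history[-window:])
    previous = metric_history[-window - 1]
    if recent_best - previous >= threshold:
        return lr                        # still improving: keep the policy
    if step < 0.5 * total_steps:
        return lr * up_factor            # early plateau: increase LR
    return lr * down_factor              # late plateau: decrease LR
```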

A similar adaptive process undergirds the RL instance, with online estimation of step size and variance-reducing baselines.

5. Theoretical Analysis and Convergence Guarantees

OBLR-PO comes with rigorous correctness and convergence analyses:

  • The unbiasedness of the policy-gradient estimator with a properly conditioned baseline is proven under standard regularity assumptions: the baseline may depend on $q$ but not on $o$, and the log-policy is differentiable (Huang et al., 28 Nov 2025).
  • Exact variance expressions are derived for the estimator and minimized with respect to baseline choice, leading to the gradient-weighted formula.
  • Loss upper bounds decompose into a true gradient-descent term, a curvature penalty, and sampling noise, explicitly controlled by batch size and SNR. The adaptive learning-rate update is obtained by directly minimizing this bound (see the sketch after this list).
  • With properly chosen schedules and baselines, the algorithm achieves $O(1/\sqrt{T})$ convergence in average squared gradient norm, ensuring that $\epsilon$-stationarity is reached in $O(\epsilon^{-2})$ steps (Huang et al., 28 Nov 2025).
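
As a hedged illustration of the bound-minimization step (a standard smoothness argument, not the paper's exact statement): if $L$ is $\beta$-smooth, the gradient estimate is unbiased, and its variance is proxied by $\operatorname{tr} H(\theta_t)/(N_t G_t)$, then one update $\theta_{t+1} = \theta_t - \eta_t \widehat{\nabla L(\theta_t)}$ satisfies

$$E\,[L(\theta_{t+1})] \le L(\theta_t) \;-\; \eta_t \|\nabla L(\theta_t)\|^2 \;+\; \frac{\beta}{2}\,\eta_t^2 \|\nabla L(\theta_t)\|^2 \;+\; \frac{\beta}{2}\,\eta_t^2\, \frac{\operatorname{tr} H(\theta_t)}{N_t G_t},$$

with the three added terms playing the roles of true descent, curvature penalty, and sampling noise. Minimizing the right-hand side over $\eta_t$ recovers exactly the SNR-dependent step size quoted in Section 3.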

For DNN training, convergence and superiority are established empirically via structured comparisons of default, tuned, and composed policies under fixed accuracy and iteration targets (Wu et al., 2022).

6. Empirical Results and Performance Impact

OBLR-PO demonstrates substantial improvements across standard supervised and RL benchmarks. Selected experimental highlights:

| Task / Model | Baseline Policy | OBLR-PO Best Policy | Acc. or Pass@1 | Iteration Reduction |
|---|---|---|---|---|
| MNIST / LeNet | FIX(0.01) | SIN2(k₀=0.01, k₁=0.06, l=2K) | 99.33% | 6.7× |
| CIFAR-10 / CNN3 | NSTEP (default) | SINEXP / Composite (TRI→TRI2→TRI2) | 82.53% | 5.08× |
| CIFAR-10 / ResNet-32 | NSTEP (default) | Composite, 3-stage TRI | 92.91% | not reported |
| ImageNet / ResNet-18 | STEP drop@30ep | FIX→SINEXP→FIX→FIX | 69.05% Top-1 | not reported |

(Wu et al., 2022)

For RL-based LLM post-training, OBLR-PO achieves highest or tied-best Pass@1 accuracy on Qwen3-4B-Base and Qwen3-8B-Base across OlympiadBench, GSM8K, AIME25, MATH500, and AMC23, surpassing GRPO, PPO, ReMax, and RLOO. Additionally, OBLR-PO exhibits smoother training loss and gradient-norm curves, as well as higher average advantage (Huang et al., 28 Nov 2025).

7. Connections and Extensions

OBLR-PO generalizes and unifies multiple strands of optimization research: constrained learning rate schedule selection; hybrid and dynamic schedule composition; variance-reduced policy-gradient techniques; and adaptive, SNR-based step-size control. The integration of baseline selection—ranging from practical defaults to computationally optimal, variance-minimizing functions—and learning-rate policy selection provides a theoretically justified and empirically validated framework for both supervised and reinforcement learning contexts.

A plausible implication is that further extension of OBLR-PO to joint optimizer adaptation—encompassing parameter initialization, regularization, and batch schedule—could yield additional performance and stability gains in complex training regimes.
