
Reinforcement Learning in Sequential Optimization

Updated 20 November 2025
  • The paper introduces an MDP-based framework that embeds optimization tasks, enabling non-myopic, data-efficient sequential decision-making.
  • It details a methodology where neural network policies and policy gradient methods (e.g., REINFORCE) select batch actions to outperform classical optimization routines.
  • Empirical results demonstrate enhanced sample efficiency and generalization in applications like function minimization and engineering design.

Reinforcement-learning-aided sequential optimization is a class of methodologies that casts iterative or batch decision-making, central to many optimization, control, design, and planning problems, as a Markov decision process (MDP) or partially observed MDP and solves it with reinforcement learning (RL) to directly optimize sequential performance. This framework enables the learning of policies that act in non-myopic, history-dependent, and data-efficient ways, especially in domains characterized by expensive black-box evaluations, combinatorial configuration spaces, constraints, or non-differentiable objectives. RL-aided sequential optimization has shown demonstrable benefits in Bayesian optimal experimental design, black-box function minimization, engineering systems, and combinatorial resource allocation, with empirical results indicating superior sample efficiency, generalization, or solution quality compared to classical heuristics or static optimization approaches.

1. Mathematical Formalism and MDP Architecture

At the core, RL-aided sequential optimization involves embedding an underlying optimization task within the MDP formalism. For instance, in batch Bayesian optimal experimental design, the process is specified as follows (Ashenafi et al., 2021):

  • State space: At step $i$, the state $s_i$ encodes salient summaries of all previous queries and outcomes. In BSSRL, $s_i$ comprises the most recent batch of queries $X^{(i)}$ and the GP posterior mean and standard deviation at the minimum $\tilde{x}_{\min}$, i.e., $s_i = \{X^{(i)}, m_i(\tilde{x}_{\min}), \sigma_i(\tilde{x}_{\min})\}$.
  • Action space: The agent selects the next batch of experiments $X^{(i+1)}$ by outputting the parameters of a sampling distribution (e.g., the mean and scale of a factorized Gaussian) from which $n$ candidate points are drawn.
  • Transition function: Execution of the batch action yields observations/future data, which update the Bayesian surrogate (e.g., GP posterior), thereby inducing the next state.
  • Reward function: The reward $r_i$ captures the progress from the previous to the current batch, typically combining reduction in posterior mean and uncertainty at the incumbent minimizer: $r_i = -\{ m_i(\tilde{x}_{\min}) - m_{i-1}(\tilde{x}_{\min}) + \alpha [\sigma_i(\tilde{x}_{\min}) - \sigma_{i-1}(\tilde{x}_{\min})] \}$, with $\alpha$ a trade-off parameter.
  • Episode structure: A fixed budget of $S = mn$ evaluations is split over $m$ sequential steps (each step corresponding to a batch of size $n$).

The RL objective is to maximize the cumulative discounted (or undiscounted) reward over the experimental or optimization horizon, directly aligning with long-term, non-myopic objectives.
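
To make this formalism concrete, the following Python sketch illustrates how a BSSRL-style state summary and per-step reward could be computed from a GP surrogate, following the definitions above. It is a minimal illustration, not the authors' implementation: the scikit-learn surrogate, the candidate-grid argmin, and the helper names (`gp_state`, `step_reward`, `transition`) are assumptions.

```python
# Minimal sketch of the state/reward quantities defined above (illustrative, not the
# reference implementation). The GP surrogate and helper names are assumptions.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor


def gp_state(gp, X_batch, candidates):
    """State s_i = {X^(i), m_i(x_min), sigma_i(x_min)} from the current GP posterior."""
    mean, std = gp.predict(candidates, return_std=True)
    j = int(np.argmin(mean))  # index of the incumbent minimizer over the candidate grid
    return {"X_batch": X_batch, "m_min": mean[j], "sigma_min": std[j]}


def step_reward(prev_state, curr_state, alpha=1.0):
    """r_i = -{[m_i - m_{i-1}] + alpha [sigma_i - sigma_{i-1}]} at the incumbent minimizer."""
    return -((curr_state["m_min"] - prev_state["m_min"])
             + alpha * (curr_state["sigma_min"] - prev_state["sigma_min"]))


def transition(X_all, y_all, X_batch, candidates):
    """Transition: refit the GP surrogate on all data gathered so far and summarize it."""
    gp = GaussianProcessRegressor(normalize_y=True).fit(X_all, y_all)
    return gp_state(gp, X_batch, candidates)
```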

2. Policy Parameterization and Learning Algorithms

Parametric policies for sequential optimization are most commonly realized as neural networks mapping states to action distributions. In BSSRL, the policy $\pi_\theta(a_i \mid s_i)$ is a factorized Gaussian whose mean and scale are produced by a two-layer multilayer perceptron (MLP) with 16 units per layer and ReLU activations (Ashenafi et al., 2021). Batch actions are sampled from this distribution, and post-processing (such as snapping to a discrete set) may be performed.
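
As an illustration of this parameterization, the sketch below implements a two-layer, 16-unit ReLU MLP whose heads output the mean and scale of a factorized Gaussian over the design space. The PyTorch framing, the log-standard-deviation head, and the class/argument names are assumptions for illustration, not details taken from the paper.

```python
# Sketch of a factorized-Gaussian policy with a two-layer, 16-unit ReLU MLP, as
# described above. PyTorch and the specific output heads are illustrative assumptions.
import torch
import torch.nn as nn


class BatchSamplingPolicy(nn.Module):
    def __init__(self, state_dim, design_dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, 16), nn.ReLU(),
            nn.Linear(16, 16), nn.ReLU(),
        )
        self.mean_head = nn.Linear(16, design_dim)     # mean of the factorized Gaussian
        self.log_std_head = nn.Linear(16, design_dim)  # log of its per-dimension scale

    def distribution(self, state):
        h = self.body(state)
        mean = self.mean_head(h)
        std = self.log_std_head(h).exp()
        # Factorized Gaussian over the design space; a batch of n candidate points is
        # drawn i.i.d. from it and optionally snapped to a discrete set afterwards.
        return torch.distributions.Independent(torch.distributions.Normal(mean, std), 1)
```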

Training is performed using policy gradient algorithms such as REINFORCE. The gradient estimator for the expected cumulative return $J(\theta)$ leverages the sampled trajectory of actions and states, employing a baseline and discounting as appropriate:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{i=0}^{m-1} \nabla_\theta \log \pi_\theta(a_i \mid s_i)\, G_i \right],$$

where $G_i$ is the sum of discounted future rewards from step $i$.

Closed-form parameter updates for both the mean and the scale are provided in the BSSRL algorithm, and key hyperparameters include the policy learning rate $\alpha$, the episode length $m$, the batch size $n$, and the total number of training episodes. Whereas classic optimization routines are open-loop or greedy, RL-trained policies are optimized directly for the full sequential, often non-myopic, setting.
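
A hedged sketch of one such REINFORCE update is given below, computing returns-to-go and subtracting a simple mean baseline before taking a gradient step; the baseline, the optimizer interface, and the summation over the points of a batch action are illustrative assumptions rather than the BSSRL closed-form updates.

```python
# Illustrative REINFORCE update matching the gradient estimator above; the mean
# baseline and optimizer choice are assumptions, not the paper's exact updates.
import torch


def reinforce_update(policy, optimizer, states, actions, rewards, gamma=1.0):
    """One policy-gradient step over a single episode of batch actions."""
    # Discounted returns-to-go G_i = sum_{k >= i} gamma^{k-i} r_k.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)
    returns = returns - returns.mean()  # simple baseline to reduce gradient variance

    log_probs = torch.stack([
        policy.distribution(s).log_prob(a).sum()  # sum over the n i.i.d. points in the batch
        for s, a in zip(states, actions)
    ])
    loss = -(log_probs * returns).sum()  # negative of the REINFORCE surrogate objective

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

A typical training loop would pair this with an optimizer such as `torch.optim.Adam(policy.parameters(), lr=...)` applied over many episodes.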

3. Task Structure, Reward Design, and Trade-offs

A defining characteristic of RL-aided sequential optimization is the explicit encoding of exploration–exploitation and budget/allocation trade-offs into the reward function. The choice of per-step reward (e.g., incremental reduction in GP minimum and variance) ensures the agent accounts for both local improvement and global knowledge acquisition (Ashenafi et al., 2021). The use of cumulative (potentially discounted) rewards over multiple sequential steps ensures the policy considers the full budget, not just immediate gains.

For batch queries, the episode structure ensures sequential adaptation: after each batch, the agent computes the updated state incorporating all new observations, maintaining the sequential essence even with batch execution. This protocol allows efficient utilization of limited experimental budgets when single evaluations are costly or parallelization is possible but non-myopic sequential adaptation is still essential.
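
The sketch below shows one way this batch-sequential protocol could be organized, reusing the hypothetical helpers from the earlier snippets (`transition`, `step_reward`, and the policy's `distribution` method). The random initial batch and the reduced two-dimensional state summary fed to the policy are simplifying assumptions; the full BSSRL state also carries the most recent batch of queries.

```python
# Illustrative batch-sequential episode: m adaptive batches of size n under a total
# budget S = m * n. Helper functions are the hypothetical ones sketched earlier.
import numpy as np
import torch


def run_episode(policy, objective, candidates, m, n, alpha=1.0):
    X_all, y_all = [], []
    states, actions, rewards = [], [], []

    # Initialize with a random batch so the surrogate has data to condition on.
    X_batch = candidates[np.random.choice(len(candidates), size=n, replace=False)]
    X_all.extend(X_batch)
    y_all.extend(objective(x) for x in X_batch)
    state = transition(np.array(X_all), np.array(y_all), X_batch, candidates)

    for _ in range(m - 1):
        # Reduced state summary fed to the policy (assumption: mean/std at the incumbent).
        s = torch.tensor([state["m_min"], state["sigma_min"]], dtype=torch.float32)
        a = policy.distribution(s).sample((n,))  # n candidate points for the next batch
        X_batch = a.numpy()                      # optionally snap to the discrete candidate set

        X_all.extend(X_batch)
        y_all.extend(objective(x) for x in X_batch)
        next_state = transition(np.array(X_all), np.array(y_all), X_batch, candidates)

        states.append(s)
        actions.append(a)
        rewards.append(step_reward(state, next_state, alpha))
        state = next_state

    return states, actions, rewards
```

The returned trajectory can then be fed to a policy-gradient step such as the `reinforce_update` sketch above.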

4. Applications and Empirical Performance

RL-aided sequential optimization has been empirically evaluated across synthetic and high-dimensional real-world problems (Ashenafi et al., 2021):

  • Function minimization and experimental design: On synthetic 2D function minimization tasks (e.g., Ackley and Booth functions), RL-based BSSRL achieves the global minimum in substantially fewer function evaluations compared to batch Bayesian optimization methods—finding the global minimizer in 20–30 queries (batch size 5 or 10) versus about three times as many queries for classical batch-BO with local penalization.
  • High-dimensional engineering design: In airfoil efficiency optimization (a 12-D problem with a 600-sample database), BSSRL policies trained on synthetic objectives generalize well, selecting high-efficiency designs in 10 fewer queries than greedy batch-BO. Batches of size 1 or 5 perform best; larger batch sizes show modest degradation, attributed to state discretization and the grid structure, while remaining competitive.
  • Generalization and meta-learning: RL agents can be trained on one class of optimization tasks and subsequently applied to others without retraining, suggesting meta-learning capability (Ashenafi et al., 2021).

Characteristic empirical signatures include:

  • Fast reduction of both posterior mean and uncertainty near the optimal design.
  • More space-filling exploration near the true minima, in contrast to the clustering behavior of batch-BO.
  • Improved sample efficiency as batch size increases.

5. Theoretical Connections and Generalization

The RL-aided sequential optimization paradigm interpolates between classical Bayesian optimal experimental design and meta-learning. By encoding Bayesian posterior summaries as the RL state and learning non-myopic policies, RL lifts myopic Bayesian one-step-ahead design to full-horizon, batch-interleaved optimization that respects both budget constraints and adaptation.

Formulations such as REINFORCE-OPT generalize this framework to generic (potentially infinite-dimensional) Hilbert spaces, with guaranteed almost-sure convergence to the set of policy parameter local maxima under standard stochastic approximation conditions (Xu et al., 2023).

The formalism also supports connections with classical regularization (e.g., the Tikhonov and iterative Landweber methods become special cases of particular RL policy families), linking uncertainty quantification, point estimation, and solution multiplicity detection to the RL policy output (Xu et al., 2023).
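
As one concrete reading of this connection, the classical Landweber iteration for a linear problem $Ax = b$ can be viewed as a fixed, deterministic policy that maps the current iterate (the state) to an additive, residual-based update (the action). The identification below is an illustrative sketch of that viewpoint, not notation taken from the cited work:

$$x_{k+1} = x_k + \underbrace{\omega\, A^{*}(b - A x_k)}_{\text{action } a_k = \pi(s_k)}, \qquad s_k = x_k,$$

where $\omega$ is a fixed step size; learned or iteration-dependent step sizes and update directions then correspond to richer RL policy families.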

6. Algorithmic Extensions, Limitations, and Future Directions

The RL-aided sequential optimization framework is extensible to other settings:

  • Hierarchical or multi-level batch selection,
  • Integration with safety constraints or combinatorial action spaces,
  • Deployment in stochastic, partially observed domains where Bayesian surrogates are intractable or unavailable.

Limitations and practical considerations include:

  • Degraded performance with very large batch sizes when discretization or grid effects impede exploration.
  • Requirement for careful tuning of policy network architectures and reward parameters to stabilize training.
  • The POMDP nature of the GP state encoding, which may create representational challenges as problem complexity grows.

Possible future directions include scaling to larger/hierarchical batch structures, principled uncertainty quantification in high-noise or ill-posed applications, and further exploration of meta-learning potentials for transfer across problem classes.


References

  • "Reinforcement Learning based Sequential Batch-sampling for Bayesian Optimal Experimental Design" (Ashenafi et al., 2021)
  • "Reinforcement-learning-based Algorithms for Optimization Problems and Applications to Inverse Problems" (Xu et al., 2023)