Reinforcement Learning Formulation

Updated 15 April 2026

Reinforcement learning formulation is the precise mathematical specification of agent-environment interactions using MDPs and extended frameworks.
It underpins both canonical single-objective setups and complex scenarios like multi-objective, risk-aware, and distributional RL for robust performance guarantees.
Well-designed RL models improve sample efficiency, convergence robustness, and transferability, impacting real-world applications and industrial controls.

Reinforcement learning (RL) formulation is the rigorous specification of the environment, objectives, and agent-environment interaction protocol that enables the mathematical analysis and algorithmic design of RL systems. The RL formulation determines not just the behavior and performance guarantees of agents but also their compatibility with optimization algorithms, scalability, sample efficiency, and robustness to real-world constraints. Modern research extends the standard expectation-maximizing paradigm to accommodate multiple objectives, temporal abstractions, safety constraints, distributional risk, reliability, and domain-specific requirements, each giving rise to distinct formal frameworks and solution methodologies.

1. Canonical Single-Objective RL Formulation

The classical RL setting is based on the Markov Decision Process (MDP) formalism:

$(\mathcal{S}, \mathcal{A}, P, \mu, \gamma, r)$

where $\mathcal{S}$ is the state space, $\mathcal{A}$ the action space, $P(s'|s,a)$ the Markovian transition kernel, $\mu$ the initial state distribution, $\gamma\in[0,1)$ the discount factor, and $r:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ the reward function. The agent executes a stationary or history-dependent policy $\pi(a|s)$, generating trajectories $\tau=(s_0,a_0,s_1,a_1,\dots)$.

The canonical objective is to maximize the expected (discounted or average) cumulative reward; in the discounted case, $J(\pi) = \mathbb{E}_{\pi, P}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t,a_t)\Big].$ The value function $V^\pi(s)$ and Q-function $Q^\pi(s,a)$ satisfy Bellman equations, supporting dynamic programming, temporal difference learning, and various actor-critic architectures (Li et al., 2019).
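As a minimal sketch of how the Bellman equations tie the value function to the objective, the snippet below evaluates a fixed policy exactly on a small tabular MDP by solving the linear system $V = r_\pi + \gamma P_\pi V$. The two-state MDP, the array layout, and the function name `policy_evaluation` are illustrative assumptions, not taken from the cited work.

```python
import numpy as np

def policy_evaluation(P, r, pi, gamma):
    """Solve V = r_pi + gamma * P_pi V exactly for a tabular MDP.

    P:  transition kernel, shape (S, A, S), P[s, a, s2] = P(s2 | s, a)
    r:  reward table, shape (S, A)
    pi: stochastic policy, shape (S, A), each row sums to 1
    """
    n_states = P.shape[0]
    r_pi = np.sum(pi * r, axis=1)            # expected one-step reward per state
    P_pi = np.einsum("sa,sat->st", pi, P)    # state-to-state kernel under pi
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    Q = r + gamma * (P @ V)                  # Q[s, a] = r(s, a) + gamma * E[V(s')]
    return V, Q

# Hypothetical 2-state, 2-action MDP: state 1 is absorbing and pays reward 1.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = 1.0   # state 0, action 0: stay in state 0
P[0, 1, 1] = 1.0   # state 0, action 1: move to state 1
P[1, :, 1] = 1.0   # state 1: absorbing under both actions
r = np.array([[0.0, 0.0], [1.0, 1.0]])
pi = np.array([[0.0, 1.0], [0.0, 1.0]])  # always take action 1
mu = np.array([1.0, 0.0])                # start in state 0

V, Q = policy_evaluation(P, r, pi, gamma=0.9)
J = float(mu @ V)                        # J(pi) = E_{s0 ~ mu}[V(s0)]
```

The exact solve is only practical for small state spaces; temporal difference and actor-critic methods approximate the same fixed point from samples.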

2. Extensions Beyond the Standard RL Objective

Multi-Objective RL (MORL)

In multi-objective RL, the reward function is a vector $r:\mathcal{S}\times\mathcal{A}\to\mathbb{R}^K$, and policy evaluation yields a return vector $J(\pi) = (J_1(\pi),\dots,J_K(\pi))$ representing the expected discounted reward for each component.

A prominent formulation is the max-min objective (Park et al., 2024): $\max_{\pi}\min_{1\le k\le K} J_k(\pi),$ ensuring fairness by seeking policies that optimize the worst-case reward component. The linear programming (LP) and Bellman operator formulations, alongside entropy regularization for unique policies and tractable optimization, underpin practical model-free algorithms. Key theoretical results include the existence, uniqueness, and convexity of the value function in weight-space, and piecewise linearity for the unregularized setting.
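To make the max-min criterion concrete, the following sketch selects, from a finite set of candidate policies with known return vectors, the one whose worst component is largest, and contrasts it with a fixed equal-weight scalarization. The return values and the helper `max_min_policy` are hypothetical. Restricting attention to a finite candidate set is a simplifying assumption, since the optimal max-min policy may be stochastic.

```python
import numpy as np

def max_min_policy(returns):
    """Pick the candidate whose worst objective component is largest.

    returns: array of shape (n_policies, K), one return vector per policy.
    """
    worst = returns.min(axis=1)              # worst-case component per policy
    return int(np.argmax(worst)), float(worst.max())

# Hypothetical return vectors J(pi) for three candidate policies, K = 2.
J = np.array([[10.0, 1.0],
              [4.0, 5.0],
              [6.0, 3.0]])

best, value = max_min_policy(J)

# A fixed equal-weight scalarization can prefer a different policy.
weights = np.array([0.5, 0.5])
best_weighted = int(np.argmax(J @ weights))
```

Here the max-min choice favors the balanced policy, while the equal-weight sum favors the policy that neglects the second objective.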

Micro-Objective RL

Micro-objective RL specifies the task via a collection of temporally and spatially localized "micro-objectives" (binary events that either occur or do not occur along a trajectory) and optimizes the vector of their long-run success probabilities under a user-specified partial order, encompassing risk-sensitive and temporally abstracted settings (Li et al., 2019).

Maximum Reward RL

The maximum reward RL formulation seeks to maximize the expected maximum (discounted) reward along a trajectory: $J(\pi) = \mathbb{E}_{\pi, P}\Big[\max_{t\ge 0} \gamma^t r(s_t,a_t)\Big],$ with novel Bellman operators of the form

$(\mathcal{T}Q)(s,a) = \mathbb{E}_{s'\sim P(\cdot|s,a)}\Big[\max\big(r(s,a),\ \gamma \max_{a'} Q(s',a')\big)\Big]$

that remain $\gamma$-contractions and support convergence of value iteration and Q-learning (Gottipati et al., 2020).
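A minimal tabular sketch of value iteration for this objective follows. It assumes the backup replaces the usual sum $r + \gamma \max_{a'} Q(s',a')$ with the maximum $\max\big(r(s,a), \gamma \max_{a'} Q(s',a')\big)$, averaged over next states. The three-state MDP and the function name are illustrative assumptions.

```python
import numpy as np

def max_reward_q_iteration(P, r, gamma, n_iters=200):
    """Iterate Q(s,a) <- E_{s'}[ max(r(s,a), gamma * max_a' Q(s',a')) ].

    P: transition kernel, shape (S, A, S); r: reward table, shape (S, A).
    """
    n_states, n_actions = r.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        next_best = gamma * Q.max(axis=1)                    # shape (S,)
        # max(r(s,a), gamma * max_a' Q(s',a')) for every (s, a, s') triple
        target = np.maximum(r[:, :, None], next_best[None, None, :])
        Q = np.sum(P * target, axis=2)                       # expectation over s'
    return Q

# Hypothetical deterministic chain: state 2 is absorbing with zero reward.
P = np.zeros((3, 2, 3))
P[0, 0, 2] = 1.0   # state 0, action 0: small reward now, then terminate
P[0, 1, 1] = 1.0   # state 0, action 1: no reward now, move to state 1
P[1, :, 2] = 1.0   # state 1: both actions lead to the absorbing state
P[2, :, 2] = 1.0
r = np.array([[1.0, 0.0],
              [5.0, 0.0],
              [0.0, 0.0]])

Q = max_reward_q_iteration(P, r, gamma=0.9)
greedy_action_in_state_0 = int(np.argmax(Q[0]))
```

In this example the greedy action in state 0 forgoes the immediate reward of 1 to reach the single larger reward of 5 one step later, which is discounted to 4.5.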

Distributional and Robust RL

Distributional RL maintains an explicit estimate of the return distribution $Z^\pi(s,a)$, beyond its mean, enabling risk-aware and robust optimization: $\max_{\pi}\ \mathbb{E}[Z^\pi] - \alpha\sqrt{\mathbb{V}[Z^\pi]},\quad \alpha>0.$ This risk-averse surrogate, motivated by robust control with $\Phi$-divergence constraints, penalizes return variability to generate policies resistant to model uncertainty (Clavier et al., 2022).
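The sketch below illustrates one way a risk-averse criterion can be computed from a distributional critic. It assumes the critic outputs equally weighted quantile estimates of the return for each action and scores actions by mean minus a multiple of the standard deviation. The weight `alpha`, the quantile values, and the function name are hypothetical.

```python
import numpy as np

def risk_averse_score(quantiles, alpha):
    """Score actions by mean return minus alpha times return standard deviation.

    quantiles: shape (n_actions, n_quantiles), equally weighted quantile
               estimates of the return distribution for each action.
    """
    mean = quantiles.mean(axis=1)
    std = quantiles.std(axis=1)
    return mean - alpha * std

# Hypothetical quantile estimates: action 0 is higher-mean but far riskier.
quantiles = np.array([[0.0, 10.0, 20.0],
                      [8.0, 9.0, 10.0]])

risk_neutral_action = int(np.argmax(risk_averse_score(quantiles, alpha=0.0)))
risk_averse_action = int(np.argmax(risk_averse_score(quantiles, alpha=0.5)))
```

With `alpha = 0` the rule reduces to the usual expected-return choice; increasing `alpha` shifts preference toward the narrower return distribution.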

3. Bellman Operators and Associated Fixed-Point Theory

The core of RL formulation is the Bellman operator, mapping value functions to their one-step expected improvements and governing the fixed points and convergence of classical and generalized RL algorithms. Variants include:

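Standard Bellman: $(\mathcal{T}V)(s) = \max_{a}\Big[r(s,a) + \gamma\,\mathbb{E}_{s'\sim P(\cdot|s,a)} V(s')\Big]$
Max-min Bellman: Incorporates the scalarized reward $w^\top r(s,a)$ for scalarization weights $w$ in the probability simplex $\Delta^K$, with a soft Bellman operator for entropy-regularized settings (Park et al., 2024)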
Reliability Bellman: For probability-of-success objectives, the operator propagates the probability that the return meets a threshold, and it can be recast as an augmented MDP with standard operators (Farhi, 20 Oct 2025)

The contraction property of these operators (with respect to the $\ell_\infty$ norm for scalar value functions and in appropriate metrics for distributions) ensures unique value functions and algorithmic convergence in classical, risk-averse, and max-min RL settings, with rare exceptions (see the discussion of non-contractiveness in Clavier et al., 2022).
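The contraction property can be checked numerically for the standard Bellman optimality operator. The sketch below builds a random tabular MDP (an arbitrary illustrative choice), verifies that one application of the operator shrinks the sup-norm distance between two value functions by at least a factor $\gamma$, and then iterates to the fixed point.

```python
import numpy as np

def bellman_optimality(V, P, r, gamma):
    """(T V)(s) = max_a [ r(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]."""
    return np.max(r + gamma * (P @ V), axis=1)

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Random MDP used purely for illustration.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)            # normalize rows into distributions
r = rng.random((n_states, n_actions))

# Contraction check: ||T V1 - T V2||_inf <= gamma * ||V1 - V2||_inf.
V1, V2 = rng.normal(size=n_states), rng.normal(size=n_states)
dist_before = np.max(np.abs(V1 - V2))
dist_after = np.max(np.abs(bellman_optimality(V1, P, r, gamma)
                           - bellman_optimality(V2, P, r, gamma)))

# Value iteration converges to the unique fixed point from any start.
V = np.zeros(n_states)
for _ in range(500):
    V = bellman_optimality(V, P, r, gamma)
```

Because the distance shrinks geometrically, the number of iterations needed for a given accuracy grows roughly like $1/(1-\gamma)$.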

4. Model-Free Algorithms and Training Schemes

RL formulations dictate algorithmic development:

Actor-Critic and Value-Based Methods: Applied to standard and multi-objective RL, with soft Q-learning for entropy-regularized cases and alternated weight learning for scalarizations (Park et al., 2024).
Distributional Methods: Quantile-based critics for risk-averse settings (Clavier et al., 2022).
Policy Gradient: Direct optimization of parameterized stochastic policies, e.g., REINFORCE or PPO, extendable to continuous action and high-dimensional state spaces (Peng et al., 2023, Costello et al., 2023).
State-Augmentation: Reliable RL augments the state with return-thresholds, enabling standard deep RL in a higher-dimensional space (Farhi, 20 Oct 2025).
Sample-Efficient Optimization: Sophisticated replay schemes for real-world data limitations, including random-anchoring and bootstrapped terminal-value estimation (Peng et al., 2023).
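The state-augmentation idea in the list above can be sketched with a generic threshold-tracking construction. Assume the goal is for the discounted return to reach a threshold $c$. After receiving reward $r_t$, the return still required from the next step on is $(c - r_t)/\gamma$, so carrying this quantity in the state makes the success event Markovian. The class and function names are hypothetical, and this is not claimed to be the exact construction of the cited work.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AugmentedState:
    state: int
    threshold: float   # discounted return still required from this step on

def augment_step(aug, reward, next_state, gamma):
    """Advance the augmented state: c' = (c - r) / gamma."""
    return AugmentedState(next_state, (aug.threshold - reward) / gamma)

def reaches_threshold(rewards, threshold, gamma):
    """Return True if the discounted return of a finite episode is >= threshold."""
    aug = AugmentedState(state=0, threshold=threshold)
    for t, reward in enumerate(rewards):
        aug = augment_step(aug, reward, next_state=t + 1, gamma=gamma)
    # With no rewards left, success means nothing more is required.
    return aug.threshold <= 0.0

# Hypothetical episode: discounted return is 1 + 0.5*2 + 0.25*3 = 2.75.
rewards = [1.0, 2.0, 3.0]
success_low = reaches_threshold(rewards, threshold=2.5, gamma=0.5)
success_high = reaches_threshold(rewards, threshold=3.0, gamma=0.5)
```

A standard deep RL agent can then be trained on the pair (state, threshold) with a terminal reward equal to the success indicator, at the cost of one extra continuous state dimension.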

5. Design of State, Action, and Reward Spaces

The formalization of RL environments can be critical for stability, real-world transfer, and interpretability:

Observations/Actions: Normalized and physically-motivated representations facilitate gradient-based optimization and hardware deployment (Schäfer et al., 26 Mar 2025, Molaei et al., 5 Oct 2025).
Reward Function Engineering: Dense guidance components, action-penalties, and goal bonuses are employed for effective credit assignment and hardware safety (Molaei et al., 5 Oct 2025, Schäfer et al., 26 Mar 2025).
Randomization and Generalization: State and action randomization, extended horizons, and varied initializations improve generalizability and robustness in high-fidelity tasks.
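The design choices listed above can be illustrated with a small sketch of observation normalization and a shaped reward combining a dense distance term, an action penalty, and a goal bonus. All coefficients, tolerances, and function names are hypothetical placeholders rather than values from the cited studies.

```python
import numpy as np

def normalize_obs(obs, low, high):
    """Map each observation component from [low, high] to [-1, 1]."""
    return 2.0 * (obs - low) / (high - low) - 1.0

def shaped_reward(pos, goal, action,
                  goal_tol=0.05, action_cost=0.01, goal_bonus=1.0):
    """Dense guidance minus action penalty, plus a bonus near the goal."""
    dist = float(np.linalg.norm(pos - goal))
    reward = -dist                                        # dense guidance term
    reward -= action_cost * float(np.sum(action ** 2))    # discourage large actions
    if dist < goal_tol:
        reward += goal_bonus                              # sparse success bonus
    return reward

obs_norm = normalize_obs(np.array([0.0, 5.0, 10.0]), low=0.0, high=10.0)
far = shaped_reward(np.array([0.0, 0.0]), np.array([3.0, 4.0]),
                    np.array([1.0, 1.0]))
at_goal = shaped_reward(np.array([3.0, 4.0]), np.array([3.0, 4.0]),
                        np.array([0.0, 0.0]))
```

The relative scale of the three terms usually needs tuning per task, since an oversized action penalty can stall exploration and an oversized bonus can dominate the dense signal.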

6. Problem-Specific RL Formulations

Domain-specific tasks are formulated via specialized MDP/MDP-like structures:

Polyhedral Optimization: RL as schedule-space exploration via sequential MDPs, enabling legal transformation synthesis with task-specific reward measuring speedup over baseline optimizers (Brauckmann et al., 2021).
Queue Stability and Lyapunov Optimization: RL-based drift–plus–penalty formulations using reward shaping to ensure queue stability without per-step convex programs (Bae et al., 2020).
Financial and Goal-Conditioned Tasks: State definitions as windows of historical metrics, reward as tracking error or utility, and transition dynamics that encode domain constraints (Peng et al., 2023, Alaluf et al., 2024).
Safe Exploration: Generalized safe RL (GSE) formalized as a CMDP with pointwise high-probability constraints, solved with the MASE meta-algorithm using uncertainty quantifiers and surrogate rewards (Wachi et al., 2023).
Minimum Discounted Reward for Reachability: Adapting discounted-reward value iteration for Hamilton–Jacobi PDEs and backward reachable set computation, enabling unique fixed points and model-free RL methods for control-theoretic safety (Akametalu et al., 2018).
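As one example from the list above, a drift-plus-penalty objective can be turned into a per-step reward. The sketch assumes the textbook quadratic Lyapunov function $L(Q) = \tfrac{1}{2}\sum_i Q_i^2$ and a reward equal to the negative of the one-step drift plus a weighted cost. The queue dynamics, the trade-off weight, and the function names are illustrative assumptions and may differ from the shaping used in the cited work.

```python
import numpy as np

def lyapunov(queues):
    """Quadratic Lyapunov function L(Q) = 0.5 * sum_i Q_i^2."""
    return 0.5 * float(np.sum(queues ** 2))

def queue_step(queues, service, arrivals):
    """Standard queue update: serve first, then add new arrivals."""
    return np.maximum(queues - service, 0.0) + arrivals

def drift_plus_penalty_reward(queues, next_queues, cost, tradeoff):
    """Reward = -(Lyapunov drift + tradeoff * cost)."""
    drift = lyapunov(next_queues) - lyapunov(queues)
    return -(drift + tradeoff * cost)

# Hypothetical two-queue step.
queues = np.array([2.0, 0.0])
next_queues = queue_step(queues,
                         service=np.array([1.0, 0.0]),
                         arrivals=np.array([0.0, 1.0]))
reward = drift_plus_penalty_reward(queues, next_queues, cost=0.25, tradeoff=2.0)
```

A larger trade-off weight emphasizes the cost term over backlog reduction, mirroring the usual role of the penalty parameter in Lyapunov optimization.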

7. The Impact of RL Formulation on Learning and Deployment

Precise RL formulation is pivotal for bridging research and practical deployment. Well-chosen design principles (normalization, episode randomization, action-penalization, randomized targets) can lead to orders-of-magnitude improvements in sample efficiency, convergence robustness, and transfer fidelity to physical hardware in industrial controls and complex manipulation (Schäfer et al., 26 Mar 2025, Molaei et al., 5 Oct 2025). For large language models (LLMs), new RL formulations provide theoretical foundations for token-level surrogate objectives and stabilization regimes that control discrepancies and policy staleness, supporting stable on-policy and off-policy RL training at scale (Zheng et al., 1 Dec 2025).
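For intuition about what a token-level surrogate objective looks like, the sketch below computes a generic PPO-style clipped surrogate over the tokens of a batch of sequences, using the ratio between the current policy and the policy that generated the data. This is a common formulation shown under stated assumptions and is not necessarily the objective analyzed in the cited work. The function name, the clipping range, and the toy inputs are hypothetical.

```python
import numpy as np

def token_level_clipped_surrogate(logp_new, logp_old, advantages, mask,
                                  clip_eps=0.2):
    """Masked mean of min(ratio * A, clip(ratio) * A) over response tokens.

    All inputs have shape (batch, seq_len); mask is 1 for real tokens, 0 for padding.
    """
    ratio = np.exp(logp_new - logp_old)          # per-token importance ratio
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    per_token = np.minimum(unclipped, clipped)
    return float((per_token * mask).sum() / mask.sum())

# Toy batch: identical policies give ratio 1, so the objective is the mean advantage.
logp_old = np.log(np.full((2, 3), 0.5))
advantages = np.array([[1.0, 1.0, 0.0], [-1.0, -1.0, -1.0]])
mask = np.array([[1.0, 1.0, 0.0], [1.0, 1.0, 1.0]])
on_policy_value = token_level_clipped_surrogate(logp_old, logp_old,
                                                advantages, mask)

# A stale behavior policy: ratio 2 with positive advantage is clipped to 1.2.
stale_value = token_level_clipped_surrogate(
    np.log(np.array([[0.5]])), np.log(np.array([[0.25]])),
    np.array([[1.0]]), np.array([[1.0]]))
```

The ratio drifts away from 1 as the data-generating policy becomes stale, which is why clipping or similar controls matter for off-policy updates.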

In summary, the RL formulation not only specifies the learning problem but critically determines theoretical tractability, algorithmic viability, and empirical success. Current research consistently demonstrates that rethinking and formalizing the RL problem to match task realities is a central driver for progress, from multi-objective fairness and distributional robustness to reliable performance and safe deployment (Park et al., 2024, Clavier et al., 2022, Farhi, 20 Oct 2025, Schäfer et al., 26 Mar 2025, Wachi et al., 2023, Zheng et al., 1 Dec 2025).