Average-reward reinforcement learning in semi-Markov decision processes via relative value iteration (2512.06218v1)
Abstract: This paper applies the authors' recent results on asynchronous stochastic approximation (SA) in the Borkar-Meyn framework to reinforcement learning in average-reward semi-Markov decision processes (SMDPs). We establish the convergence of an asynchronous SA analogue of Schweitzer's classical relative value iteration algorithm, RVI Q-learning, for finite-space, weakly communicating SMDPs. In particular, we show that the algorithm converges almost surely to a compact, connected subset of solutions to the average-reward optimality equation, with convergence to a unique, sample path-dependent solution under additional stepsize and asynchrony conditions. Moreover, to make full use of the SA framework, we introduce new monotonicity conditions for estimating the optimal reward rate in RVI Q-learning. These conditions substantially expand the previously considered algorithmic framework and are addressed through novel arguments in the stability and convergence analysis of RVI Q-learning.
Explain it Like I'm 14
What this paper is about (big picture)
This paper studies how to teach a computer to make good decisions over a long time when actions take different amounts of time to complete. Think of running a factory: some actions are quick, others take longer. The goal is to maximize the “average reward per unit time” in the long run (like average profit per hour), not just the total points or short-term gains.
The authors focus on a type of learning called reinforcement learning (RL) for semi-Markov decision processes (SMDPs). SMDPs are like regular Markov decision processes (MDPs), but they also keep track of how long each action takes. They analyze and extend a classic method called relative value iteration (RVI) and its model-free version, RVI Q-learning. Their main achievement is proving that a generalized RVI Q-learning method really does converge (i.e., it settles on good answers) under broad conditions—even when the problem is complicated and the best solution isn’t unique. They also give practical rules to make it converge to a single best answer.
The main questions the paper asks
- How can we design a model-free RL algorithm that optimizes average reward per unit time in SMDPs, where actions have different durations?
- Can we prove that this algorithm is stable (doesn’t blow up) and converges (settles down), even when not all parts of the system are updated at every step (asynchronous updates)?
- Can we allow a much wider class of “average-reward estimators” inside the algorithm and still guarantee convergence?
- Under what extra, practical conditions does the algorithm converge to a single, specific optimal solution (not just somewhere within a set of good solutions)?
How they approached it (methods explained simply)
To understand the methods, it helps to picture the learning process as tuning many dials (numbers) over time:
- Q-values: These are scores for how good it is to take an action in a state.
- T-values: These estimate how long each action typically takes (the “holding time”).
The algorithm (a minimal code sketch follows this list):
- Keeps two tables: Q for value estimates and T for time estimates.
- At each step, it updates only some entries (asynchronously), depending on which data arrived—like updating only the dials you got new info about.
- For a picked state-action pair (s, a), it:
- Uses fresh data: the next state, the time taken, and the reward observed.
- Updates T(s, a) by averaging in the new observed time.
- Updates Q(s, a) by nudging it toward “reward + best future Q − current Q,” but crucially divides by the current time estimate T(s, a) (so actions that take longer are treated differently).
- Subtracts a special number f(Q) that acts like the current guess of the long-run average reward rate. This subtraction makes the process “self-regulating,” preventing Q from drifting away.
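As a concrete illustration, here is a minimal Python sketch of the update just described. It follows the summary above rather than the paper's exact recursion: the class name RVIQLearningSMDP, the per-pair count-based stepsizes, and the fixed lower bound eta (which the actual algorithm lets vanish over time) are simplifying assumptions.

```python
import numpy as np

# Minimal sketch of the generalized RVI Q-learning update for SMDPs,
# following the informal description above (not the paper's exact recursion).
# States and actions are assumed to be indexed 0..n_states-1, 0..n_actions-1.

class RVIQLearningSMDP:
    def __init__(self, n_states, n_actions, f, eta=1e-3, A=1.0):
        self.Q = np.zeros((n_states, n_actions))   # Q-value estimates
        self.T = np.ones((n_states, n_actions))    # expected holding-time estimates
        self.counts = np.zeros((n_states, n_actions), dtype=int)
        self.f = f      # SISTr reward-rate estimator, e.g. lambda Q: Q.mean()
        self.eta = eta  # lower bound on T; the paper lets this vanish over time
        self.A = A      # stepsize scaling constant (class-1 schedule 1/(A n))

    def update(self, s, a, r, tau, s_next):
        """Asynchronous update of the single pair (s, a) from one fresh
        transition: reward r, holding time tau, next state s_next."""
        self.counts[s, a] += 1
        n = self.counts[s, a]
        alpha = 1.0 / (self.A * n)   # stepsize for the Q-update
        beta = 1.0 / n               # stepsize for the holding-time estimate

        # Average the newly observed holding time into T(s, a).
        self.T[s, a] += beta * (tau - self.T[s, a])

        # "Reward + best future Q - current Q", scaled by the holding-time
        # estimate and re-centered by the current reward-rate estimate f(Q).
        td = r + self.Q[s_next].max() - self.Q[s, a]
        denom = max(self.T[s, a], self.eta)
        self.Q[s, a] += alpha * (td / denom - self.f(self.Q))
```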
Key ideas behind the math:
- Asynchronous stochastic approximation: Not all components are updated every time, but the math ensures that over time, every part gets updated often enough, and the step sizes shrink at the right speed.
- ODE method (ordinary differential equations): Imagine zooming out so far that the noisy updates look like a smooth curve following a system of differential equations. If that smooth system settles to the right place, the original noisy algorithm will too.
- Noise terms: Randomness in rewards and transitions is handled as “noise.” There’s also “biased noise” from time estimates because T is learned from data; the authors design conditions so this bias fades fast enough.
- New monotonicity condition (SISTr): The function f that estimates the average reward rate must be strictly increasing if you add the same number to all Q-values. In plain words: if you lift all Qs up, f(Q) must also go up. This ensures the “thermostat-like” self-regulation works.
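The following toy snippet illustrates a few choices of f that satisfy this SISTr property; these particular functions are examples consistent with the summary above, not a list taken from the paper.

```python
import numpy as np

# Illustrative f's that are strictly increasing under scalar translation
# (SISTr): lifting every Q-value by the same c > 0 strictly increases f(Q).
f_single = lambda Q: Q[0, 0]                        # one reference state-action pair
f_mean   = lambda Q: Q.mean()                       # average of all Q-values
f_max    = lambda Q: Q.max()                        # maximum Q-value
f_mix    = lambda Q: 0.5 * Q.max() + 0.5 * Q.min()  # monotone combination

Q = np.random.randn(4, 3)   # a toy Q-table with 4 states and 3 actions
c = 1.0
for f in (f_single, f_mean, f_max, f_mix):
    assert f(Q + c) > f(Q)  # the SISTr property under an upward shift
```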
Two practical step-size schedules are analyzed:
- Class 1: learning rate ~ 1/(A n)
- Class 2: learning rate ~ 1/(A n ln n)
Here, A is a scaling constant. Both choices ensure learning slows down in a controlled way (a short code sketch of the two schedules follows).
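A minimal sketch of the two schedules, assuming a placeholder value for the scaling constant A (the paper ties the admissible values of A to the Lipschitz constant Lh, which is not modeled here).

```python
import math

A = 2.0  # scaling constant; the paper requires A above thresholds tied to the
         # (generally unknown) Lipschitz constant L_h; this value is a placeholder

def alpha_class1(n):
    """Class-1 schedule: proportional to 1/(A n)."""
    return 1.0 / (A * n)

def alpha_class2(n):
    """Class-2 schedule: proportional to 1/(A n ln n); use n >= 2 so ln n > 0."""
    return 1.0 / (A * n * math.log(n))
```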
What they found and why it matters
Main results:
- General convergence under broad conditions
- The generalized RVI Q-learning method (with the new, more flexible f) converges almost surely (with probability 1) in finite, weakly communicating SMDPs. “Weakly communicating” means the system’s structure guarantees the long-run average reward rate is the same no matter where you start.
- Even when the exact best Q-values aren’t unique (which often happens in these problems), the algorithm converges to a compact, connected subset of the optimal solutions to the average-reward optimality equation (AOE). In simple terms: it settles into the right region.
- Convergence to a unique solution (with extra conditions)
- If you choose step sizes and update frequencies carefully, and the small biases from time estimation shrink fast enough, the algorithm converges to a single (path-dependent) optimal solution. Path-dependent means: due to randomness, different runs might lock onto different points in the optimal set, but each run still picks one point and sticks with it.
- Much more flexible average-reward estimation
- The new SISTr condition lets f be very general: not just simple averages or maxima, but combinations (like max of several, or a monotone function of several estimates). This widens what algorithm designers can do while still retaining convergence guarantees.
- Stronger theory and fixed earlier gaps
- Earlier RVI Q-learning proofs had some technical gaps around stability in the asynchronous setting. The authors use strengthened results from their recent theory of asynchronous stochastic approximation to close those gaps and extend the guarantees to SMDPs.
Why this is important:
- Average-reward problems fit many real-world tasks better than discounted-reward ones when we care about ongoing performance (e.g., average throughput, average energy efficiency).
- SMDPs are realistic because actions can take different amounts of time (think: robotics tasks, networking, operations).
- Having rigorous guarantees for a flexible, model-free method means practitioners can design more robust algorithms with confidence they’ll work.
What this means going forward
- Practical impact: The results support reliable, model-free learning in systems with variable action durations—common in robotics, operations research, and hierarchical control—where steady long-term performance is key.
- Design freedom: Engineers can craft the average-reward estimator f from multiple signals (e.g., max, min, or other monotone combinations) while keeping stability and convergence.
- Better theory for RL: The paper strengthens the mathematical foundation for average-reward RL with asynchronous updates, giving clear, implementable conditions for convergence—sometimes even to a single best solution.
In short, this work makes average-reward RL for time-sensitive tasks both more powerful and more trustworthy.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a single, focused list of what remains missing, uncertain, or unexplored, phrased to be concrete and actionable for future research.
- Finite-sample performance: No non-asymptotic (finite-time) convergence rates, sample complexity bounds, or iteration complexity are provided for the generalized RVI Q-learning algorithm under the presented SA framework.
- Practical verification of the additional noise condition (Assumption cond-mns): The requirement that the biased noise terms decay at an exponential rate in ODE-time (via ln(δn)/Σαk ≤ μδ < −Lh) is strong and not shown to be implied by the proposed holding-time estimator. A derivation of sufficient conditions (on βn, ηn, the distribution of τ, and the estimator) under which this assumption provably holds is missing.
- Adaptive stepsize scaling without Lh: The unique-convergence theorem requires stepsize classes where A exceeds thresholds involving the (unknown) Lipschitz constant Lh. A practical method to estimate or upper-bound Lh online, or design adaptive stepsizes that guarantee the required threshold without prior model knowledge, is not developed.
- Choice and design of f beyond SISTr: While strict monotonicity under scalar translation (SISTr) broadens the admissible class of f, the paper does not characterize the minimal monotonicity/regularity conditions needed for stability and convergence. A complete taxonomy of permissible f (and ψ compositions) and how different choices affect convergence speed, stability margins, and the selected limit point remains open.
- Selection principle for the limit point: In weakly communicating SMDPs where the AOE solution set has multiple degrees of freedom, the algorithm converges (under extra conditions) to a sample-path-dependent unique solution, but the mechanism that determines which solution is selected is not characterized. How exploration schedules, initialization, function f, or stepsizes influence the selection—and how to steer the algorithm to a desired solution (e.g., one minimizing a norm or satisfying auxiliary criteria)—is unresolved.
- Generative-model versus online interaction: The algorithm’s update rule assumes “freshly generated data” for each chosen (s, a) in Yn. A rigorous analysis for the standard online RL setting (single transition per step, dependent trajectories, on-policy or off-policy sampling without a generative model) is missing, including how to meet the asynchrony coverage requirements and maintain the martingale difference noise properties.
- Exploration and coverage in large spaces: The asynchrony conditions require each component to be updated with positive asymptotic frequency and balanced stepsize exposure. Concrete exploration strategies that ensure these conditions in large state-action spaces (without a generative model) and their impact on convergence guarantees are not specified.
- Robustness to holding-time distributions: The analysis assumes E[τ²] < ∞ and leverages ηn as a vanishing lower bound. The algorithm’s robustness and required modifications under heavy-tailed holding times, near-zero holding times, or unknown/variable lower bounds (risking large update magnitudes) have not been studied.
- Two-timescale dynamics (αn vs βn): The interaction between the Q-updates (αn) and holding-time estimation (βn) is not analyzed beyond standard SA conditions. Whether timescale separation (e.g., βn faster than αn) is necessary or beneficial, and how it affects bias/variance trade-offs and convergence speed, is an open question.
- Extension beyond finite spaces: The results are limited to finite state/action SMDPs. Extensions to continuous or large-scale settings with function approximation (linear or nonlinear), stability under approximation error, and compatibility with off-policy learning are not addressed.
- Multichain SMDPs with non-constant r*: The paper assumes weakly communicating SMDPs (constant optimal reward rate). Extending the analysis to multichain settings where r* depends on the initial state, including convergence, stability, and solution selection, remains open.
- Empirical evaluation and practical guidance: No experimental validation, sensitivity analyses, or implementation guidelines (e.g., choosing f and ψ, tuning A, βn, ηn, or exploration policies) are provided. Practical heuristics for balancing convergence speed, stability, and solution selection in real tasks are missing.
- Continuous-time control beyond standard SMDPs: Although the paper notes relevance to continuous-time control via Markov chain approximation, a formal treatment showing how the SA-based RVI Q-learning extends to controlled diffusions or continuous-time models with actions changing between transitions is not provided.
- Relaxations of SISTr: The paper remarks that SISTr can be relaxed but does not specify the exact weaker conditions under which stability and convergence proofs still go through. Identifying and proving the broadest admissible monotonicity conditions is an open theoretical gap.
- Impact of nonexpansivity/noncontractivity: Average-reward mappings lack contraction properties; while the SA framework handles this asymptotically, guidance on algorithmic mechanisms (beyond partial asynchrony) that could improve stability or speed (e.g., regularization, averaging, damping) without violating the SA assumptions is not explored.
Glossary
- Average-reward optimality equation (AOE): The Bellman-type equation characterizing optimal average reward and differential values in MDPs/SMDPs. "the average-reward optimality equation (AOE) admits a unique solution (up to an additive constant)"
- Asynchronous stochastic approximation (SA): A class of iterative algorithms that update components irregularly over time to track solutions of associated ODEs. "asynchronous stochastic approximation (SA)"
- Autonomous ODE: An ordinary differential equation whose dynamics do not explicitly depend on time. "associated (autonomous and non-autonomous) ODEs"
- Borel probability measure: A probability measure defined on the Borel σ-algebra of a topological space (here, on state–time–reward spaces). "given by a (Borel) probability measure"
- Borkar–Meyn framework: A theoretical framework for analyzing stability and convergence of stochastic approximation via ODE methods. "within the Borkar–Meyn framework"
- Borkar–Meyn stability criterion: A condition ensuring SA iterates remain bounded by examining the limiting ODE under scaling. "the Borkar–Meyn stability criterion"
- Closed communicating class: A set of states mutually reachable under some policy and not exited under any policy. "a unique closed communicating class---a set of states such that starting from any state in the set, every state in it is reachable with positive probability under some policy, but no states outside it are ever visited under any policy"
- Dynamical system approach: An analysis method relating SA trajectories to dynamical systems to study convergence and stability. "the dynamical system approach of Hirsch and Benaïm"
- Equilibrium set: The set of points where the drift function of the associated ODE vanishes. "equilibrium set $E_h = \{ x \in \mathbb{R}^d \mid h(x) = 0 \}$."
- Globally asymptotically stable equilibrium: An equilibrium point to which all solutions converge and which is Lyapunov stable. "has the origin as its unique globally asymptotically stable equilibrium."
- Hierarchical control: A control framework decomposing decision-making into multiple levels or subtasks. "hierarchical control in average-reward MDPs"
- Holding time: The random duration until the next state transition in an SMDP. "known as the holding time"
- Law of the iterated logarithm (LIL): A probabilistic result describing almost sure fluctuations of partial sums and empirical frequencies. "by the law of the iterated logarithm"
- Lipschitz constant: The smallest constant bounding how fast a function can change relative to changes in its input under a given norm. "Let $L_h$ be the Lipschitz constant of $h$ under $\|\cdot\|$."
- Lipschitz continuous: A regularity property ensuring a function does not change faster than linearly with its input. "$f$ is a Lipschitz continuous function"
- Martingale difference noise terms: Zero-mean noise components in SA satisfying conditional moment bounds. "the standard condition on the martingale difference noise terms"
- Markov chain approximation method: A numerical approach approximating continuous-time control by discrete-time Markov chains. "through the Markov chain approximation method"
- Non-autonomous ODE: An ordinary differential equation with dynamics explicitly depending on time. "associated (autonomous and non-autonomous) ODEs"
- Nonexpansive mappings: Mappings that do not increase distances, often used in fixed-point iteration analyses. "involving nonexpansive mappings"
- Ordinary differential equation (ODE): A continuous-time equation describing the evolution of a state under a drift function. "Consider the ODE $\dot{x}(t) = h(x(t))$"
- Partial asynchrony mechanism: A scheduling condition ensuring asynchronous updates mimic averaged synchronous behavior asymptotically. "together they establish a partial asynchrony mechanism"
- Positively homogeneous: A function property where scaling the input by a nonnegative factor scales the output by the same factor. "positively homogeneous (i.e., $g(cx) = c\,g(x)$ for all $c \ge 0$)"
- Relative value iteration (RVI): An algorithm that iteratively refines value estimates relative to a baseline to solve average-reward problems. "relative value iteration (RVI)"
- Renewal theory: A probabilistic framework for processes with repeated, IID cycles, used to justify average limits. "according to renewal theory (see [Ros70])."
- RVI Q-learning: A model-free, asynchronous RL algorithm using RVI principles to solve average-reward MDPs/SMDPs. "RVI Q-learning"
- Sample path-dependent: A property where the limit or outcome can vary depending on the realized stochastic trajectory. "depends on the sample path"
- Scaling limit: The limit of a sequence of rescaled functions describing asymptotic behavior under dilation. "the existence of the scaling limit $h_\infty$"
- Semi-Markov decision process (SMDP): A generalization of MDPs where transitions occur after random holding times. "semi-Markov decision processes (SMDPs)"
- Shadowing properties: The phenomenon where noisy or discretized trajectories closely track true ODE solutions asymptotically. "analysis of shadowing properties of asynchronous SA"
- Sigma-fields (σ-fields): Collections of sets closed under complement and countable unions, supporting probability measures. "an increasing family of σ-fields"
- Stochastic kernel: A measurable mapping that assigns a probability distribution over actions given current information. "a Borel-measurable stochastic kernel"
- Strictly increasing under scalar translation (SISTr): A monotonicity property where adding a scalar to all components strictly increases the function value. "We call $f$ strictly increasing under scalar translation (SISTr)"
- Unichain SMDP: An MDP/SMDP whose induced Markov chain under any stationary policy has a single recurrent class. "unichain MDPs or SMDPs"
- Uniform convergence on compact subsets: Convergence of functions uniformly over every compact subset of the domain. "uniform convergence on compact subsets of the domain"
- Weakly communicating SMDP: An SMDP with a unique closed communicating class and possibly transient states, yielding a constant optimal reward rate. "weakly communicating SMDPs"
Practical Applications
Immediate Applications
The following applications can be deployed now using the paper’s algorithms, conditions, and analysis. They are most effective in continuing, event-driven decision problems where actions are applied at transition times and durations vary (SMDPs), and they also specialize cleanly to MDPs (unit holding times).
- Average-reward SMDP control for event-driven operations (manufacturing, call centers, cloud autoscaling, wireless scheduling) — Industry, Operations Research
- What: Optimize long-run throughput or service level per unit time when job/service durations vary. Replace heuristics or discounted RL with model-free RVI Q-learning that directly targets average reward rate.
- Tools/workflow (a usage sketch follows below):
- Integrate an RVI Q-learning module that maintains Q-values and expected holding-time estimates T(s,a).
- Use a strictly increasing-under-scalar-translation f(·) to “self-regulate” the average reward estimate (e.g., f as max/min/affine combinations or composite monotone aggregators of multiple critics/signals as enabled by the paper).
- Stepsizes: diminishing αn, βn; ensure asynchronous updates visit all (s,a) sufficiently often (e.g., behavior policy with persistent exploration).
- Assumptions/dependencies: Finite S, A; weakly communicating SMDP (constant optimal average reward across initial states); stationary dynamics; observed (s,a,τ,r,s′) tuples with finite second moments; exploration meets the asynchrony conditions; unbiased (martingale difference) transition noise; decaying bias in holding-time estimates via βn and ηn.
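Below is a hypothetical usage loop showing how logged (s, a, r, τ, s′) tuples could drive asynchronous updates, reusing the RVIQLearningSMDP sketch from the earlier section; the synthetic data generator merely stands in for a real event log and exploratory behavior policy.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 50, 5
agent = RVIQLearningSMDP(n_states, n_actions, f=lambda Q: Q.mean())

# Synthetic stand-in for logged (s, a, r, tau, s') tuples; a real deployment
# would stream these from the event log of the system being controlled,
# under a behavior policy with persistent exploration.
for _ in range(10_000):
    s = rng.integers(n_states)
    a = rng.integers(n_actions)              # exploratory: every (s, a) gets visited
    r = float(rng.normal())                  # observed reward over the transition
    tau = float(rng.exponential(1.0)) + 0.1  # observed holding time
    s_next = rng.integers(n_states)
    agent.update(s, a, r, tau, s_next)

greedy_policy = agent.Q.argmax(axis=1)  # extract a greedy policy after training
```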
- Patient-flow and bed-management optimization with variable lengths of stay — Healthcare
- What: Learn admission/scheduling policies maximizing average discharges or utility per unit time with stochastic lengths of stay.
- Tools/workflow: Connect hospital simulation/EMR-derived transition logs (s,a,τ,r,s′) to RVI Q-learning; estimate T(s,a) on-line; use f to stabilize and combine multiple signals (e.g., average of top-k Q’s).
- Assumptions/dependencies: Approximately stationary transition/length-of-stay distributions; sufficient coverage over actions; adequate data quality for second-moment bounds.
- Traffic signal timing and emergency response dispatch with event-driven durations — Public Sector, Mobility
- What: Optimize average delay per unit time or average service level in systems where events (arrivals, incidents) trigger actions with random handling times.
- Tools/workflow: RVI Q-learning applied to traffic simulators or dispatch simulators; asynchronous updates aligned with event logs; use ηn to lower-bound holding-time scaling; function f configured to be SISTr and Lipschitz (e.g., max/min/affine).
- Assumptions/dependencies: Weakly communicating or single recurrent class under exploratory policy; data or high-fidelity simulators to meet SA noise/stepsize conditions.
- Options and hierarchical RL in continuing tasks — Robotics, Industrial Automation
- What: Use the algorithm for semi-Markov options with average-reward optimization at the higher level (average reward per unit time), leveraging the paper’s convergence guarantees for SMDPs and weakly communicating settings.
- Tools/workflow: Train options/policies with semi-Markov terminations; maintain T(s,option); choose f to integrate multiple critics or signals; apply partial asynchrony via schedule/experience replay that meets the required update-frequency ratios.
- Assumptions/dependencies: Reliable termination-time estimation; sufficient exploration across options; stability validated via the Borkar–Meyn criterion met by the induced h.
- Average-reward learning for continuing software systems — Recommenders, Ad serving, Routing
- What: Optimize average clicks/engagement per unit time (or revenue per unit time), treating inter-event times as holding times (SMDP), or apply MDP specialization (unit times).
- Tools/workflow: Logged bandit/RL data streams as asynchronous updates; βn for holding-time estimation (if durations vary); function f as a configurable monotone combiner (e.g., max or weighted average), enabling robust reward-rate tracking.
- Assumptions/dependencies: Stationary or slowly drifting environment; exploration ensures that all (s,a) pairs are updated; for MDPs, εn=0 simplifies requirements.
- Stability and convergence checklists for production RL — MLOps, Safety Engineering
- What: Use the paper’s assumptions and ODE-time shadowing insights to design/update policies with verifiable convergence in average-reward settings.
- Tools/workflow:
- Stepsize templates (class 1: ~1/(A n); class 2: ~1/(A n ln n)); ensure the scaling constant A exceeds the thresholds tied to a computable Lipschitz bound Lh (obtained from a known lower bound on the minimum expected holding time).
- Asynchrony instrumentation: verify per-component update frequencies and ratio conditions; log δn for biased-noise decay in SMDPs (from holding-time estimation); a minimal instrumentation sketch follows below.
- Assumptions/dependencies: Ability to bound Lh (or use an effective upper bound derived from a lower bound on holding times); enforce asynchronous update regularity.
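A minimal sketch of this kind of asynchrony instrumentation; the AsynchronyMonitor class, its thresholds, and the frequency-ratio check are illustrative choices, not conditions prescribed verbatim by the paper.

```python
from collections import Counter

class AsynchronyMonitor:
    """Track per-(s, a) update counts and flag pairs whose empirical update
    frequency looks too low or too unbalanced. Thresholds are illustrative."""

    def __init__(self, n_pairs, min_freq=1e-3, max_ratio=50.0):
        self.n_pairs = n_pairs
        self.min_freq = min_freq
        self.max_ratio = max_ratio
        self.counts = Counter()
        self.total = 0

    def record(self, s, a):
        self.counts[(s, a)] += 1
        self.total += 1

    def report(self):
        freqs = {k: c / self.total for k, c in self.counts.items()}
        ratio = (max(freqs.values()) / min(freqs.values())) if freqs else float("inf")
        return {
            "uncovered_pairs": self.n_pairs - len(freqs),  # pairs never updated
            "starved_pairs": [k for k, v in freqs.items() if v < self.min_freq],
            "max_min_frequency_ratio": ratio,
            "ratio_ok": ratio <= self.max_ratio,
        }
```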
- Academic use: reproducible average-reward RL labs and proofs — Education, Research
- What: Immediate teaching labs in Python/Julia implementing average-reward RVI Q-learning for MDPs/SMDPs; exercises on stepsize/asynchrony design; demonstrations of convergence to a solution set vs. unique point under the paper’s strengthened conditions.
- Tools/workflow: Open-source reference implementation; syllabus modules on Borkar–Meyn, asynchronous SA, and average-reward RVI; diagnostic plots in ODE-time to visualize shadowing behavior.
- Assumptions/dependencies: Finite problems; simulated environments; simple f choices (e.g., affine or max) to meet SISTr and Lipschitz.
- Model-free analysis and benchmarking for average-reward problems — Academia, Industry R&D
- What: Use generalized f design to compare multiple estimators/critics as monotone composites to stabilize average-reward learning; benchmark across queuing/scheduling suites where SMDPs are natural.
- Tools/workflow: Compose f via ψ(g1,…,gm) as in the paper, enabling richer, domain-informed reward-rate aggregation without losing convergence properties; an illustrative composite is sketched below.
- Assumptions/dependencies: ψ and gi satisfy Lipschitz and strict monotonicity; scaling limits exist; data supports second-moment bounds.
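An illustrative composite f(Q) = ψ(g1(Q), …, gm(Q)) in the spirit described above; the particular gi and weights are assumptions, chosen only so that each gi is Lipschitz and SISTr and ψ is strictly increasing and Lipschitz.

```python
import numpy as np

# Example composite reward-rate estimator f(Q) = psi(g1(Q), ..., gm(Q)):
# each g_i below is Lipschitz and SISTr (adding c to all Q-values adds c to
# its output), and psi is a strictly increasing, Lipschitz aggregator.
g_list = [
    lambda Q: float(Q.mean()),     # g1: average over all state-action pairs
    lambda Q: float(Q.max()),      # g2: optimistic estimate
    lambda Q: float(np.median(Q))  # g3: robust estimate
]

def psi(values, weights=(0.5, 0.3, 0.2)):
    # A positive weighted sum is strictly increasing in each argument.
    return float(np.dot(weights, values))

def f_composite(Q):
    return psi([g(Q) for g in g_list])
```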
Long-Term Applications
These opportunities need further theoretical extension, scaling, or integration before routine deployment.
- Deep function approximation for large/continuous SMDPs — Robotics, Autonomy, Telecom, Energy
- What: Extend RVI Q-learning to neural approximators for Q and T in high-dimensional or continuous spaces with event-driven dynamics, preserving SA-style stability.
- Potential products/workflows:
- Deep RVI modules with target networks and regularizers tuned to approximate Lipschitz behavior and bounded bias in T estimates.
- Automated “Lh budgeting” via conservative Lipschitz regularization (e.g., spectral normalization, gradient penalties).
- Dependencies/assumptions: New proofs for function approximation; exploration management (coverage) at scale; robust estimation of holding times; stability under replay and nonstationarity.
- Real-time continuous-time control where actions change within holding times — Cyber-Physical Systems, Healthcare ICU control, Traffic flow
- What: Move beyond standard SMDPs to settings requiring action updates between transitions; use Markov chain approximation methods with piecewise-constant actions and quantized controls.
- Tools/workflow: Pipelines that discretize continuous-time control into SMDP approximations; average-reward HRL layers for supervisory decisions with variable durations.
- Dependencies/assumptions: Approximation accuracy; policy realization constraints; validation that the derived SMDP retains weak-communication and AOE solution-structure akin to Lemma 1.
- Multi-agent average-reward SMDPs — Distributed Systems, Logistics, Smart Grids
- What: Apply asynchrony-aware RVI learning across agents with interacting durations (τ) and local rewards; coordinate to optimize global average reward rate.
- Products/workflows: Federated/asynchronous updates with per-agent stepsize and update-ratio governance; ψ-based f to aggregate multiple agent-level critics.
- Dependencies/assumptions: Nonstationarity-induced violations of SA assumptions; need for stability analysis with moving targets; communication constraints.
- Regulatory standards and procurement guidance for RL in public infrastructure — Policy
- What: Create procurement checklists requiring convergence evidence for average-reward event-driven systems; mandate asynchrony/stepsize/instrumentation plans and verifiable lower bounds on holding times.
- Tools/workflow: Compliance templates tied to Borkar–Meyn criteria; ODE-time shadowing diagnostics; minimum data-quality standards for (s,a,τ,r,s′) logging.
- Dependencies/assumptions: Institutional capacity to audit RL systems; domain-specific definition of acceptable exploration and stationarity windows.
- Automated verification and monitoring suites for average-reward RL — MLOps, Safety
- What: Build tooling that validates Assumptions (stepsize decay, update ratios, noise bounds), estimates Lh, tracks δn decay, and monitors convergence neighborhoods in ODE-time.
- Products/workflows: SDKs that plug into RL training loops to certify that the run satisfies the conditions guaranteeing convergence to a set or a unique point.
- Dependencies/assumptions: Reliable estimation of holding-time biases; accurate logging of per-(s,a) update counts; sustained stationarity during monitoring windows.
- Option discovery and hierarchical scheduling with guaranteed average-reward improvement — HRL Platforms
- What: Use the paper’s RVI backbone to evaluate and improve semi-Markov options; derive option-level policies with per-unit-time optimality.
- Products/workflows: HRL toolkits with option termination-time estimation and average-reward evaluators; ψ-based f to integrate intra-/inter-option critics.
- Dependencies/assumptions: Stable discovery under partial observability; guarantees under function approximation; sufficient option exploration.
- Finance: time-aware execution and market-making policies — Trading Systems
- What: Optimize average return per unit time where event inter-arrivals (trades, quotes) are random; exploit SMDP framing for microstructure-aware policies.
- Products/workflows: Backtesting platforms using RVI Q-learning with holding-time estimation from tick data; compositional f combining PnL/QoS critics.
- Dependencies/assumptions: Market nonstationarity; risk constraints not modeled in average reward; strong testing and guardrails; possible need to combine with risk-sensitive criteria.
- Certification for embedded RL in time-sensitive controllers — Edge/IoT, Real-time OS
- What: Enforce provable stability for learning controllers that adapt to variable task durations (schedulers, power managers) and target average throughput or efficiency.
- Products/workflows: “Certifiable RL” profiles with fixed stepsize families, measurable asynchrony schedules, and conservative Lh bounds derived from known minimum task durations.
- Dependencies/assumptions: Hard real-time constraints; certification overhead; bounded disturbances; auditability of learning traces.
Cross-cutting assumptions and dependencies
- Model class: Finite state and action spaces (paper’s convergence proofs); extension to function approximation is a research frontier.
- Communication structure: Weakly communicating SMDPs (constant optimal average reward across initial states) or single recurrent class under policies that try every action with positive probability.
- Data/observability: Access to (s,a,τ,r,s′); finite second moments; ability to estimate expected holding times with decreasing bias (βn schedule, ηn→0).
- Algorithmic schedules: Diminishing stepsizes (αn, βn), partial asynchrony with sufficient per-(s,a) update frequency; for unique-solution convergence, additional constraints on stepsize class and update-ratio regularity.
- Environment stability: Assumptions are asymptotic; strong distributional shifts or nonstationarity can violate guarantees; monitoring and revalidation recommended.
- Tuning: Thresholds involving the Lipschitz constant Lh must be respected; in SMDPs, a practical upper bound for Lh can be computed from a known lower bound on the minimum expected holding time; in MDPs (unit τ), εn=0 simplifies requirements.