
Hierarchical DRL-based Multi-Timescale Scheduling

Updated 28 January 2026
  • The paper demonstrates that hierarchical DRL strategies decompose complex scheduling tasks into manageable subproblems, reducing action space and speeding convergence.
  • It employs a multi-level MDP formulation with adaptive, time-sensitive policies to enforce safety and optimal resource allocation in dynamic environments.
  • Experimental results indicate significant improvements in task completion rates, makespan reduction, and cumulative rewards compared to baseline methods.

Hierarchical deep reinforcement learning (DRL)-based multi-timescale scheduling refers to the class of algorithms and frameworks that decompose complex, dynamic scheduling problems—often arising in multi-agent or multi-resource systems—into a hierarchy of policies or controllers, each operating at a distinct temporal and/or spatial resolution. These systems employ DRL to learn effective decision policies at each level, achieving scalable, flexible, and often safe scheduling across environments characterized by high-dimensional state/action spaces, stochastic dynamics, and partial observability (Carvalho et al., 2022, Hao et al., 2024, Ramezani et al., 2023).

1. Hierarchical Problem Formulation and Multi-Timescale Decomposition

Hierarchical DRL-based multi-timescale scheduling exploits the natural presence of temporal and spatial structure in scheduling problems, using separate DRL agents or meta-controllers for decision-making at different abstraction levels. Each level is typically formalized as a Markov decision process (MDP) or, in multi-agent contexts, as a partially observable Markov game (POMG or Dec-POMDP).

Canonical Decompositions

  1. Warehouse Task Scheduling: The high-level MDP controls agent-to-task assignment and task scheduling at a coarse timescale, while low-level agents execute schedules or respond to local contingencies at finer granularity (Carvalho et al., 2022).
  2. Mobile Edge Computing (MEC): Three-layer DRL frameworks divide the decision workload into (1) long-term service placement (e.g., cloud-resource migration), (2) mid-term task offloading/routing (e.g., among edge/cloud nodes), and (3) short-term resource (CPU) allocation within edge nodes (Hao et al., 2024).
  3. Satellite Constellations: High-level global schedulers distribute tasks among CubeSats, while low-level safety controllers make frequent energy-aware reallocation or abort decisions (Ramezani et al., 2023).

This decomposition both reduces the size of the feasible action space at each level, limiting combinatorial explosion, and separates concerns: global efficiency/objective at the top; local adaptation, constraints, or safety at lower levels.
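The scale of this action-space reduction can be illustrated with a toy calculation (the agent/task counts below are hypothetical, chosen only for illustration):

```python
# Toy illustration of why factoring the decision across levels helps:
# a flat scheduler must choose agent-task assignments AND per-agent
# primitive actions jointly; the hierarchy splits these choices.
n_agents, n_tasks, n_primitives = 8, 20, 5  # hypothetical sizes

# Flat joint action space: every assignment combined with every
# joint primitive action.
flat_actions = (n_agents ** n_tasks) * (n_primitives ** n_agents)

# Hierarchical: the high level only assigns tasks (coarse timescale);
# each low-level agent independently picks one of a few primitives.
high_level_actions = n_agents ** n_tasks
low_level_actions_per_agent = n_primitives

print(f"flat joint action space:     {flat_actions:.3e}")
print(f"high-level action space:     {high_level_actions:.3e}")
print(f"per-agent low-level actions: {low_level_actions_per_agent}")
```

Each level thus searches a strictly smaller space than the flat formulation, which is the mechanism behind the faster convergence reported above.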

2. Mathematical Definitions and Policy Structure

Formally, a two-level system comprises:

  • High-level MDP: $M_H = (\mathcal{S}_H, \mathcal{A}_H, P_H, r_H, \gamma_H)$, with state $s_H$ encoding global or aggregate environment status, actions corresponding to schedules or assignments, and reward signals linked to global performance (e.g., task delay, system throughput, resource costs).
  • Low-level MDP/Markov game/Dec-POMDP: $MG_L = ([n], \mathcal{X}_L, \mathcal{U}, P_L, \{r_{L,i}\}, \gamma_L, \mathcal{Z}, O)$, where each agent $i$ acts on local observation $z^i$ and executes primitive actions (e.g., move, allocate, reassign) with local or shared rewards.

In option-style hierarchies, the high-level action defines an "option" (sub-policy or subgoal) for the lower level, which persists for several steps and is then re-evaluated, naturally enforcing a multi-timescale rollout (Carvalho et al., 2022).
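An option-style rollout can be sketched as follows; the environment, policies, and interval below are placeholders (not any paper's implementation), meant only to show the two interacting timescales:

```python
import random

K = 4  # high-level decision interval, mirroring the k = 1, 2, 4 settings

def high_level_policy(global_state):
    # Placeholder: assign one of 3 options (subgoals) to each agent.
    return {agent: random.randrange(3) for agent in global_state["agents"]}

def low_level_policy(local_obs, option):
    # Placeholder: primitive action conditioned on the persistent option.
    return (option + local_obs) % 5

def rollout(n_steps=12):
    global_state = {"agents": ["a0", "a1"]}
    options, trace = {}, []
    for t in range(n_steps):
        if t % K == 0:                        # coarse timescale: re-evaluate options
            options = high_level_policy(global_state)
        for agent in global_state["agents"]:  # fine timescale: primitive actions
            action = low_level_policy(local_obs=t, option=options[agent])
            trace.append((t, agent, action))
    return trace

trace = rollout()
```

The option persists across the $k$ intermediate steps, which is exactly what makes the rollout multi-timescale.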

Policy architectures are customized per level:

  • The high-level scheduler typically operates via recurrent or convolutional neural networks with access to global state summaries.
  • Low-level policies often use parameter sharing (shared-experience PPO, DQN) for scalable learning in multi-agent scenarios and may be conditioned on high-level options as well as local observations (Carvalho et al., 2022, Ramezani et al., 2023).

In some applications, further structure or attention-based encoders are incorporated to encode priorities or forecast resource consumption—e.g., Similarity Attention-based Encoder (SABE) and MLPs in CubeSat scheduling (Ramezani et al., 2023).
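As a rough sketch of similarity-based attention pooling in this spirit (a generic scaled dot-product rendering; the actual SABE architecture is specified in Ramezani et al., 2023, and this is not that implementation):

```python
import numpy as np

def similarity_attention(query, keys, values):
    """Pool `values` by softmax-normalized similarity of `keys` to `query`."""
    d = query.shape[-1]
    scores = keys @ query / np.sqrt(d)   # similarity of each task to the query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax over candidate tasks
    return weights @ values              # attention-pooled summary vector

rng = np.random.default_rng(0)
tasks = rng.normal(size=(6, 8))   # 6 candidate tasks, 8 features each (made up)
sat_state = rng.normal(size=8)    # one satellite's state as the query
summary = similarity_attention(sat_state, tasks, tasks)
print(summary.shape)  # (8,)
```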

3. Core Algorithms, Training Paradigms, and Constraints

Training is primarily centralized (access to global state, joint rewards) with decentralized execution (agents act only on partial/local information). Policy gradient methods dominate, with implementations favoring Proximal Policy Optimization (PPO), actor-critic, DQN, or MADDPG variants tailored to the level/scale:

  • High-level: On-policy PPO, actor-critic (with separate value and policy heads) (Carvalho et al., 2022, Hao et al., 2024).
  • Mid/low-level: Parameter-shared PPO, independent PPO, DQN (for binary or categorical actions), or multi-agent centralized-critic actor-critic (Carvalho et al., 2022, Ramezani et al., 2023).
  • Adaptive timescales: At each layer, the policy can output both a concrete decision and a binary "update now" signal, regulating its own invocation frequency and effectively realizing adaptive control intervals (Hao et al., 2024).

Safety and feasibility are often enforced via action masking: infeasible actions (e.g., those violating task preconditions, resource capacities, or safety buffers) are masked by setting their logits to $\ll 0$, ensuring constraint satisfaction without extrinsic penalties (Hao et al., 2024, Ramezani et al., 2023).
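The masking trick is framework-agnostic; a minimal NumPy sketch (real systems apply the same operation to PPO/DQN logits inside PyTorch or TensorFlow) looks like:

```python
import numpy as np

NEG_INF = -1e9  # "logit << 0": drives softmax probability to ~0

def masked_policy(logits, feasible_mask):
    """Zero out infeasible actions by masking their logits before softmax."""
    masked = np.where(feasible_mask, logits, NEG_INF)
    exp = np.exp(masked - masked.max())  # numerically stable softmax
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.5, 3.0])
mask = np.array([True, False, True, False])  # e.g. capacity/safety violations
probs = masked_policy(logits, mask)
print(probs.round(3))  # infeasible actions receive ~0 probability
```

Because infeasible actions are never sampled, no penalty term is needed in the reward, and the gradient only flows through feasible choices.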

4. Multi-Timescale Dynamics and Decentralization under Partial Observability

Operation at multiple timescales is realized in several modes:

  • Hierarchical options: The high-level scheduler updates only every $k$ low-level steps, with each update triggering execution of a new batch of options or assignments.
  • Learned adaptive intervals: Each policy at each layer autonomously decides when to update, based on workload, load variance, or delay signals, allowing for fully asynchronous and context-sensitive operation without pre-fixed ratios (Hao et al., 2024).
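The learned-adaptive-interval mode can be sketched with a stub policy (the threshold rule below stands in for a learned "update now" head and is purely illustrative):

```python
import random

random.seed(0)

def policy(obs, prev_decision):
    """Return (decision, update_now); the decision is refreshed only
    when the update flag fires, so the layer sets its own cadence."""
    update_now = obs["load_variance"] > 0.5  # stand-in for a learned binary head
    decision = obs["load"] if update_now else prev_decision
    return decision, update_now

decision, updates = 0, 0
for t in range(10):
    obs = {"load": t, "load_variance": random.random()}
    decision, fired = policy(obs, decision)
    updates += fired
print(f"policy refreshed {updates}/10 steps")
```

Unlike the fixed-$k$ option scheme, the invocation frequency here varies with the observed signals, which is what allows fully asynchronous operation across layers.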

In all settings, decentralization is supported at the lower levels: agents rely only on partial observations (a local $v \times v$ grid, resource state, buffer/task queue), with no online communication during inference. When central scheduling is ablated, agents may share a single policy for distributed inference, but these variants typically plateau at lower performance (Carvalho et al., 2022, Hao et al., 2024).

Partial observability is handled without recurrent neural networks by leveraging the Markovian assumption over the observation window plus the persistent effect of assigned subgoals (e.g., the explicit option/schedule) as a working memory (Carvalho et al., 2022).

5. Key Experimental Protocols, Baselines, and Quantitative Results

Summary of Experimental Settings

Setting | Environment | Agents | Timescales | Policy Types
--- | --- | --- | --- | ---
Warehouse scheduling | $10 \times 10$ grid | 2–8 | $k = 1, 2, 4$ | PPO; hierarchical options
MEC | EdgeTimer | 4–12 (K8s) clusters per edge | 3 layers (adaptive) | Actor-critic; action masking
CubeSat constellation | Satellite constellation | 3–5 CubeSats | Per-CubeSat assignment vs. safety loop | MADDPG (high), DQN (low), SABE, MLP

Baselines include random scheduling, single-level DRL (MADDPG), static single- and multi-timescale rules, delay/workload-triggered updates.

Key outcomes:

  • Hierarchical DRL outperforms single-level and random strategies, with statistically significant gains in task completion rate, makespan, and cumulative return. For example, CubeSat HierRL achieved 93–95% completion versus 72–87% for baselines, and a 10–20% reduction in makespan (Ramezani et al., 2023).
  • In EdgeTimer, adaptive-timescale policies achieved profit improvements of $1.3\times$–$9.1\times$ over static baselines, without sacrificing task-delay guarantees ($\geq 99\%$ deadline adherence) (Hao et al., 2024).
  • Shared-experience policies at low level enable both faster convergence and higher average reward compared to independent learning modes (Carvalho et al., 2022).
  • Pre-training low-level policies is critical: high-level scheduler learning collapses in the presence of non-stationary low-level behaviors (Carvalho et al., 2022).
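The shared-experience mechanism behind the third outcome can be sketched framework-free (the policy below is a trivial placeholder, not a neural network):

```python
class SharedPolicy:
    """One policy shared by all low-level agents; transitions from every
    agent land in a single buffer, so each gradient step would see
    n_agents times more data than independent learning."""
    def __init__(self):
        self.buffer = []  # shared rollout/replay buffer

    def act(self, obs):
        return hash(obs) % 5  # placeholder for a neural policy forward pass

    def store(self, transition):
        self.buffer.append(transition)

policy = SharedPolicy()
agents = ["a0", "a1", "a2"]
for t in range(4):            # 4 environment steps
    for agent in agents:      # every agent uses the SAME policy object
        obs = (agent, t)
        action = policy.act(obs)
        policy.store((obs, action))
print(len(policy.buffer))  # 12: 3 agents x 4 steps pooled into one buffer
```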

6. Design Principles and Practical Considerations

Several design principles emerge from the comparative analyses:

  • Decompose the action space across timescales so that each level faces a tractable subproblem, separating global objectives from local constraints.
  • Enforce safety and feasibility via action masking rather than extrinsic reward penalties.
  • Share parameters and experience across low-level agents to accelerate convergence and raise average reward.
  • Pre-train low-level policies before training the high-level scheduler, since non-stationary low-level behavior destabilizes high-level learning.

A plausible implication is that such frameworks may generalize with minimal adaptation to domains exhibiting hierarchical temporal or control structure—for instance, data center task management, UAV/robotic fleet control, and beyond.

7. Limitations, Ablations, and Open Directions

While hierarchical DRL multi-timescale scheduling demonstrates substantial benefits, the literature reports several limitations:

  • High-level scheduler learning is sensitive to non-stationary low-level behavior, making pre-training of low-level policies effectively mandatory.
  • Fully decentralized variants without central scheduling typically plateau at lower performance.
  • Memoryless policies rely on a Markovian observation window plus persistent options, which may not suffice under stronger partial observability.

Future research may investigate more flexible hierarchical decompositions, explicit memory-augmented policies, and unified safety/reliability guarantees under adversarial or highly non-stationary workload conditions.
