
SchedTwin: Adaptive Digital Twin Scheduling

Updated 28 December 2025
  • SchedTwin is a digital twin framework that integrates real-time adaptive scheduling with high-fidelity system modeling for HPC and edge computing.
  • It employs parallel what-if simulations to evaluate candidate scheduling policies using concrete performance metrics, thereby optimizing resource allocation.
  • The framework is extensible, supporting energy-aware and multi-objective scheduling across diverse applications, including vehicular edge computing.

SchedTwin is a class of digital twin frameworks that unify high-fidelity system modeling with real-time or simulated adaptive scheduling mechanisms, primarily for high-performance computing (HPC) and edge computing contexts. Distinct from conventional static scheduling policies, SchedTwin architectures employ event-driven digital twins that mirror the system state, evaluate multiple candidate scheduling policies via parallel what-if simulations, and select the policy that optimizes a composite utility metric subject to dynamic workloads and administrator objectives. SchedTwin has been realized in HPC batch-scheduling (notably as a PBS-integrated real-time twin) (Zhang et al., 21 Dec 2025), as a meta-simulation and scheduling platform driving energy- and cooling-aware infrastructure analysis (Maiterth et al., 27 Aug 2025), and in digital-twin-assisted vehicular edge computing for joint scheduling of twin-maintenance and computation (Xie et al., 10 Jul 2024).

1. Architectural Components and Framework Variants

At its core, SchedTwin couples a physical or simulated scheduler (the production or trace-replayed system) to a parallel framework of digital twins:

  • Cluster-based SchedTwin for Adaptive Batch Scheduling (PBS Integration):
    • Physical Scheduler: Production PBS daemon manages job queues and node assignments.
    • Event Streamer: Hooks on queuejob, runjob, and jobobit events generate structured metadata (job IDs, timestamps, resource vectors) forwarded to a Redis event stream.
    • SchedTwin Controller: Maintains an in-memory mirror of the cluster state; orchestrates periodic scheduling cycles on event triggers.
    • Predictive Simulator Pool: Multiple CQSim-based discrete-event simulators, each configured with different candidate policies.
    • Decision Feedback Module: Direct job launches via qrun commands, without altering scheduler internals (Zhang et al., 21 Dec 2025).
  • Meta-Twin Simulation Framework (S-RAPS/ExaDigiT):
    • Incorporates transient thermo-fluid cooling models (Modelica), resource-allocator/power simulator, dataloaders for historical and synthetic job traces, scheduler APIs for native and external policy plugins, and comprehensive accounting (Maiterth et al., 27 Aug 2025).
  • Vehicular Edge Computing SchedTwin (DT-VEC):

These architectures enable both real-time closed-loop scheduling and simulation-anchored what-if policy analysis.
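The PBS-integrated data flow (hook event in, qrun decision out) can be pictured with a minimal sketch. The source states only that hooks emit job IDs, timestamps, and resource vectors, and that decisions are applied with qrun; the field names and the helpers `make_event`, `encode_event`, and `qrun_command` below are hypothetical and not taken from the SchedTwin codebase.

```python
import json
import time

# Hook event names come from the text; everything else is illustrative.
HOOK_EVENTS = ("queuejob", "runjob", "jobobit")


def make_event(kind, job_id, resources, timestamp=None):
    """Build the structured record a hook might forward to the event stream."""
    if kind not in HOOK_EVENTS:
        raise ValueError(f"unknown hook event: {kind}")
    return {
        "type": kind,
        "job_id": job_id,
        "timestamp": time.time() if timestamp is None else timestamp,
        "resources": dict(resources),  # assumed shape, e.g. {"nodes": 4}
    }


def encode_event(event):
    """Serialize the record to JSON so it can travel as one stream field."""
    return json.dumps(event, sort_keys=True)


def qrun_command(job_id):
    """Argument vector for launching the chosen job (built here, not executed)."""
    return ["qrun", job_id]
```

Building the command as an argument list rather than a shell string keeps the feedback path free of quoting issues if it is later passed to a process launcher.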

2. Data Ingestion and Digital Twin State Synchronization

SchedTwin frameworks depend on robust, low-latency event ingestion and digital twin state fidelity:

  • Event Typology & Temporal Semantics:
    • All frameworks process time-stamped events corresponding to job arrivals, dispatches, and completions (in HPC) or task/twin-update arrivals and transmissions (in VEC).
    • Each event is structured as a tuple, e.g., $E = (\mathrm{type}, \mathrm{job~ID}, \mathrm{timestamp}, \vec{r})$ with $\vec{r}$ the resource vector, and is streamed to in-memory queues or Redis, ensuring rapid and ordered state reflection (Zhang et al., 21 Dec 2025).
  • State Adjustment Mechanisms:
    • On receipt of actual job end events or vehicle twin task completions, digital twin models reconcile predicted end times with the observed ones, accounting for user runtime misestimation and system-level latencies (Zhang et al., 21 Dec 2025, Xie et al., 10 Jul 2024).
  • Synchronized Resource Views:
    • Node (or resource) availability, queue lengths, and running job/task sets are mirrored with fine-grained accuracy to support predictive simulation and optimality of scheduling decisions (Zhang et al., 21 Dec 2025).
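The synchronization steps above can be sketched as a small in-memory mirror that applies ordered events and reports how far a predicted end time was from the observed one. This is an illustration under stated assumptions, not the SchedTwin controller: the `Event` and `ClusterMirror` types are hypothetical, and the resource vector is reduced to a node count for brevity.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    kind: str              # "queuejob", "runjob", or "jobobit"
    job_id: str
    timestamp: float
    nodes: int = 0         # resource vector reduced to a node count
    walltime: float = 0.0  # user-requested runtime (often an overestimate)


@dataclass
class ClusterMirror:
    total_nodes: int
    free_nodes: int = field(init=False, default=0)
    waiting: dict = field(default_factory=dict)  # job_id -> queue Event
    running: dict = field(default_factory=dict)  # job_id -> (nodes, predicted_end)
    clock: float = 0.0

    def __post_init__(self):
        self.free_nodes = self.total_nodes

    def apply(self, ev):
        """Apply one event; on job end, return predicted_end - actual_end."""
        if ev.timestamp < self.clock:
            raise ValueError("events must be applied in timestamp order")
        self.clock = ev.timestamp
        if ev.kind == "queuejob":
            self.waiting[ev.job_id] = ev
        elif ev.kind == "runjob":
            queued = self.waiting.pop(ev.job_id)
            self.free_nodes -= queued.nodes
            predicted_end = ev.timestamp + queued.walltime
            self.running[ev.job_id] = (queued.nodes, predicted_end)
        elif ev.kind == "jobobit":
            nodes, predicted_end = self.running.pop(ev.job_id)
            self.free_nodes += nodes
            return predicted_end - ev.timestamp
        else:
            raise ValueError(f"unknown event kind: {ev.kind}")
        return None
```

A positive return value from a `jobobit` event means the job finished earlier than the twin had assumed, which is the case a simulator would need to correct for before the next scheduling cycle.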

3. Predictive Simulation and Policy Evaluation

SchedTwin’s distinguishing feature is parallel, high-fidelity simulation of future system trajectories under multiple candidate scheduling policies, conducted at each scheduling point:

  • Discrete-Event Simulation Core:
    • State machines capture the waiting queue $Q(t)$, running jobs $R(t)$, and free resources $A(t)$.
    • Event queue sorting and time-skipping ensure computational efficiency (Zhang et al., 21 Dec 2025).
  • Policy Parametrization:
  • Performance Metrics:
    • Per-policy metrics include wait time $WT_P(j)$, turnaround time $TT_P(j)$, slowdown $SD_P(j)$, and system utilization $U_P$ (Zhang et al., 21 Dec 2025).
    • VEC SchedTwin analogously models data transmission and computation delays for both twin maintenance and computing tasks, including uplink channel modeling, task CPU requirements, maximum tolerable deadlines, and weighted satisfaction functions (Xie et al., 10 Jul 2024).
  • Composite Scoring and Selection:
    • For HPC, a weighted sum score:

    $$\mathrm{Score}(P) = w_1\,\mathrm{maxWT}(P) + w_2\,\mathrm{maxSD}(P) + w_3\,\mathrm{avgWT}(P) + w_4\,\mathrm{avgSD}(P)$$

    with administrator-definable weights $w_i$.
    • The policy with the optimal score, $P^*$, is selected; ties are broken in a fixed order (WFP >> FCFS >> SJF) (Zhang et al., 21 Dec 2025).

  • Parallelization and Overhead:

    • All k policies are evaluated in parallel, with decision latency limited by the slowest simulation branch; measured overhead is only a few seconds per cycle on modern hardware (Zhang et al., 21 Dec 2025).
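The evaluation cycle described in this section (simulate each candidate policy from the current snapshot, score it, pick one) can be sketched as follows. Several things here are assumptions rather than details from the source: the simulator is a toy without backfilling instead of CQSim, every job in the snapshot is already queued, the WFP priority uses the commonly reported form (wait/runtime)^3 × nodes, threads stand in for separate simulator instances, and "optimal" is read as the lowest score because all four terms are costs.

```python
import heapq
from collections import namedtuple
from concurrent.futures import ThreadPoolExecutor

Job = namedtuple("Job", "job_id submit nodes runtime")

# Lower key = scheduled first. The WFP form is an assumed, commonly cited variant.
POLICIES = {
    "WFP": lambda j, now: -(((now - j.submit) / j.runtime) ** 3) * j.nodes,
    "FCFS": lambda j, now: j.submit,
    "SJF": lambda j, now: j.runtime,
}
TIE_BREAK = ["WFP", "FCFS", "SJF"]  # precedence order given in the text


def simulate(policy, jobs, total_nodes, now=0.0):
    """Toy discrete-event run: start the head job or skip to the next completion."""
    if any(j.nodes > total_nodes or j.runtime <= 0 for j in jobs):
        raise ValueError("job cannot run on this cluster")
    key = POLICIES[policy]
    queue, running, free = list(jobs), [], total_nodes
    waits, slowdowns = [], []
    while queue:
        queue.sort(key=lambda j: key(j, now))
        head = queue[0]
        if head.nodes <= free:
            queue.pop(0)
            free -= head.nodes
            heapq.heappush(running, (now + head.runtime, head.nodes))
            wait = now - head.submit
            waits.append(wait)
            slowdowns.append((wait + head.runtime) / head.runtime)
        else:
            end, nodes = heapq.heappop(running)  # time-skip, no idle ticking
            now = max(now, end)
            free += nodes
    n = len(waits)
    return {"max_wt": max(waits), "max_sd": max(slowdowns),
            "avg_wt": sum(waits) / n, "avg_sd": sum(slowdowns) / n}


def score(m, w):
    """Weighted sum from the text: w1*maxWT + w2*maxSD + w3*avgWT + w4*avgSD."""
    return w[0] * m["max_wt"] + w[1] * m["max_sd"] + w[2] * m["avg_wt"] + w[3] * m["avg_sd"]


def select_policy(jobs, total_nodes, now=0.0, weights=(0.25, 0.25, 0.25, 0.25)):
    """Evaluate all policies concurrently; return (best_name, scores)."""
    names = list(POLICIES)
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        results = list(pool.map(lambda n: simulate(n, jobs, total_nodes, now), names))
    scores = {n: score(m, weights) for n, m in zip(names, results)}
    best = min(names, key=lambda n: (scores[n], TIE_BREAK.index(n)))
    return best, scores
```

For example, `select_policy([Job("A", 0, 4, 100), Job("B", 0, 4, 10), Job("C", 0, 4, 1)], total_nodes=4)` compares the three policies on a queue where one long job blocks two short ones. Because the branches run concurrently, the decision latency of such a loop is bounded by the slowest branch, as the text notes.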

4. Extensions to Incentive Structures, Sustainability, and VEC Resource Coordination

SchedTwin platforms provide rich extensibility for advanced scheduling objectives.

  • HPC Infrastructure Simulation:
    • SchedTwin (S-RAPS) integrates power and cooling system models, enabling not just job-level, but facility and energy-aware what-if scheduling studies.
    • Incentive structures supported include account-based power usage rewards, energy-delay product (EDP) minimization, and emulation of real-priority and point schemes (e.g., Fugaku PTS).
    • ML-guided scheduling employs multi-objective scoring functions on predicted/statistical job features with tunable priorities:

    $$S(X_i) = \sum_{j=1}^{K} \alpha_j \exp\left(\frac{1}{\sqrt{X_i^j}+1}\right)$$

    (Maiterth et al., 27 Aug 2025).
    • Physical model variations (power, cooling, carbon, cost) provide multidimensional evaluation and optimization capabilities.

  • Vehicular Edge SchedTwin (MADRL-CSTC):

    • Formalizes the joint scheduling of DT maintenance and compute task simulation as a multi-agent Markov Decision Process (MDP).
    • Each vehicle agent selects CPU frequency allocations for DT versus task processing and vehicle transmit powers; reward is the aggregate satisfaction $U_i = \alpha Q_i^{\rm dt} + (1-\alpha) Q_i^{\rm tk}$ (Xie et al., 10 Jul 2024).
    • The MADRL-CSTC algorithm employs centralized training with decentralized execution (CTDE), actor-critic networks, experience replay, soft target updates, and converges to high resource utilization and superior per-vehicle utility relative to single-agent and random baselines.
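The two scoring formulas in this section are simple enough to evaluate directly. The sketch below reads the superscript in $X_i^j$ as the index of the j-th feature of job i, which is an interpretation rather than something the text states. The function names and the range checks are illustrative assumptions.

```python
import math


def ml_priority_score(features, alphas):
    """S(X_i) = sum_j alpha_j * exp(1 / (sqrt(X_i^j) + 1)).

    `features` holds the K feature values of one job (assumed non-negative),
    `alphas` the K tunable priorities.
    """
    if len(features) != len(alphas):
        raise ValueError("need one weight per feature")
    if any(x < 0 for x in features):
        raise ValueError("features must be non-negative for the square root")
    return sum(a * math.exp(1.0 / (math.sqrt(x) + 1.0))
               for x, a in zip(features, alphas))


def vehicle_utility(q_dt, q_tk, alpha):
    """U_i = alpha * Q_i^dt + (1 - alpha) * Q_i^tk, with alpha assumed in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * q_dt + (1.0 - alpha) * q_tk
```

Each term of the priority score falls from e at a feature value of 0 toward 1 as the feature grows, so with positive weights smaller feature values yield higher scores. In the utility, alpha sets the trade-off between twin-maintenance satisfaction and computing-task satisfaction.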

5. Empirical Evaluation, Baselines, and Key Outcomes

SchedTwin frameworks have been empirically validated across heterogeneous scenarios:

| Setting | Baselines | SchedTwin Outcome |
| --- | --- | --- |
| PBS HPC Cluster (Zhang et al., 21 Dec 2025) | FCFS/EASY, WFP, SJF | 11.4% performance improvement over WFP, lowest wait/slowdown, multi-policy adaptive behavior |
| S-RAPS/ExaDigiT (Maiterth et al., 27 Aug 2025) | FCFS, SJF, LJF, Priority (+ backfill, ML) | ML-guided DT yields 20% lower average wait, 12% lower turnaround, 8% lower EDP; energy/cost/ramp rate metrics accessible |
| DT-VEC (Xie et al., 10 Jul 2024) | SAC, PPO, Random | 95–100% resource utilization, 20–30% higher per-vehicle utility, all delays within deadlines |

Experimental designs range from synthetic phased workloads and public production traces (Frontier, Marconi100, Fugaku, Lassen, Adastra) to high-mobility vehicular edge scenarios with real communication and mobility models. In all cases, SchedTwin adapts policy selection to evolving workload and system dynamics, outperforming static policies and providing insight into trade-offs (e.g., utilization vs. fairness, energy vs. delay) (Zhang et al., 21 Dec 2025, Maiterth et al., 27 Aug 2025, Xie et al., 10 Jul 2024).

6. Implementation Characteristics and Open Challenges

  • Codebases and Integration:
    • SchedTwin for PBS builds atop CQSim (∼1,000 LoC extension), Python/Redis for event streaming, bash/Python hooks for state mirroring (Zhang et al., 21 Dec 2025).
    • SchedTwin S-RAPS provides plugin APIs for custom schedulers (ScheduleFlow, FastSim), empirical power/cooling profiles, and trace dataloaders.
    • DT-VEC code instantiates per-agent actor/critic networks with contemporary deep RL architectures and adheres to an experience-replay, minibatch-based update protocol.
  • Practical Challenges:
    • Runtime estimation inaccuracies require dynamic correction in simulators.
    • The digital twin must stay aligned with the physical or historical state under node outages and delayed cleanups.
    • Real-time synchronization adds a wall-clock overhead of only seconds per cycle, but synchronous operation with certain external simulators can introduce blocking.
    • SchedTwin does not yet generally support advance reservations, workflow dependencies, or closed-loop feedback with physical telemetry in all use cases.
  • Open Source and Extensibility:
    • Core SchedTwin code, including event hooks and scheduler extensions, is available under a permissive license (SPEAR Lab/CQSim, “adaptive_scheduler” branch) (Zhang et al., 21 Dec 2025).

7. Research Directions and Limitations

SchedTwin architectures are establishing a new paradigm for joint scheduling and digital twin-based operation, yet several open directions remain:

  • Achieving online, closed-loop integration of physical telemetry for continuous digital twin–guided adaptive control.
  • Generalizing to graph/DAG-based and workflow-aware scheduling for HPC and edge contexts.
  • Leveraging renewable energy and grid-aware scheduling via forecast-informed digital twin coupling.
  • Further enhancement of job-power fingerprinting and runtime prediction to mitigate data sparsity in historical traces.
  • Providing REST/gRPC APIs for scalable, remote access to digital twin–driven schedule prototyping (Maiterth et al., 27 Aug 2025).

A plausible implication is that SchedTwin methodologies will underpin HPC and edge resource management systems that are dynamic, energy-aware, and robust to workload heterogeneity and infrastructure variability, enabling exploration—and, potentially, runtime realization—of multi-objective scheduling policies not accessible with traditional methods.
