Program-Centric Scheduling
- Program-centric scheduling is an approach that treats program-level structures—such as packet flows, loop nests, and quantum circuits—as first-class entities to drive proactive and precise resource allocation.
- It utilizes static analysis, behavioral modeling, and runtime instrumentation to extract actionable features, enabling anticipatory scheduling decisions before resource contention occurs.
- Empirical implementations show significant gains in throughput, latency reduction, and fairness across diverse domains like networking, many-core systems, quantum computing, and LLM inference.
Program-centric scheduling is an approach wherein scheduling decisions are tightly informed by the explicit structure, semantics, or predicted behavior of program-level entities—such as packets, agentic workflows, computational loops, quantum circuits, or GPU kernels—rather than relying solely on resource-level signals (e.g., performance counters) or simple, oblivious job queues. By elevating the program itself to a first-class scheduling object, program-centric schedulers can exploit deep application knowledge for optimized throughput, resource fairness, latency minimization, and dynamic adaptation in heterogeneous and distributed environments.
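As a minimal, hypothetical illustration of this contrast (the job records, field names, and shortest-predicted-first rank below are illustrative assumptions, not any surveyed system's policy), a program-centric scheduler orders work by a rank derived from program features rather than by arrival order:

```python
import heapq

# Hypothetical job records: program-level features (here, a predicted
# runtime extracted from program analysis) drive the scheduling rank.
jobs = [
    {"name": "A", "arrival": 0, "predicted_ms": 900},
    {"name": "B", "arrival": 1, "predicted_ms": 50},
    {"name": "C", "arrival": 2, "predicted_ms": 200},
]

# Oblivious FIFO baseline: serve strictly in arrival order.
fifo_order = [j["name"] for j in sorted(jobs, key=lambda j: j["arrival"])]

# Program-centric: rank by predicted demand (shortest-predicted-first).
heap = [(j["predicted_ms"], j["name"]) for j in jobs]
heapq.heapify(heap)
centric_order = [heapq.heappop(heap)[1] for _ in range(len(jobs))]

print(fifo_order)     # ['A', 'B', 'C']
print(centric_order)  # ['B', 'C', 'A']
```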
1. Formalizations and Canonical Abstractions
The program-centric paradigm manifests through explicit, domain-specific abstractions:
- Network Scheduling: The PIFO (push-in first-out queue) abstraction formalizes packet scheduling. Each packet is enqueued with a scalar rank—computed by a programmable ordering function—enabling a scheduler to realize arbitrary per-packet logic in hardware without redesign (Sivaraman et al., 2016).
- LLM Agent Systems: The “LLM Program” abstraction encapsulates all state required for a multi-turn agentic workflow—context length, per-tool environments, phase, and executor bindings—allowing global, program-aware scheduling and reclamation of both GPU and tool resources (Kang et al., 14 Feb 2026).
- Many-core Throughput Scheduling: Loop nests and program regions, classified via static analysis and instrumented “beacons,” serve as program-level scheduling units for proactive decision-making, supplanting reactive hardware-centric signals (Mururu et al., 2021).
- Quantum Job Scheduling: Each quantum job is modeled as a triplet (circuit, shots, submission time), with scheduling guided by program-derived features such as circuit width and resource connectivity constraints (Wu et al., 2024).
- Heterogeneous Core and Multi-GPU Scheduling: Compiler-extracted program phases (e.g., function-level regions) or high-level parallel constructs (e.g., kernel launches) are scheduled based on static or learned behavioral features (Novaes et al., 2019, Knorr et al., 13 Mar 2025).
These abstractions unify the representation of heterogeneous workloads, enabling scheduling decisions that transcend single-resource or per-request policies.
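Of these abstractions, PIFO admits a particularly compact sketch. The following Python model is illustrative only (real PIFOs are hardware pipelines, and the packet fields here are invented); it shows how a single programmable rank function—supplied at configuration time—realizes a scheduling discipline without changing the queue itself:

```python
import heapq
from itertools import count

class PIFO:
    """Sketch of a push-in first-out queue: entries may be inserted at any
    position according to a programmable rank, but are dequeued only from
    the head (smallest rank first; FIFO among equal ranks)."""
    def __init__(self, rank_fn):
        self.rank_fn = rank_fn      # programmable ordering function
        self._seq = count()         # tie-breaker preserves arrival order
        self._heap = []

    def enqueue(self, pkt):
        heapq.heappush(self._heap, (self.rank_fn(pkt), next(self._seq), pkt))

    def dequeue(self):
        return heapq.heappop(self._heap)[2]

# Strict priority: rank is simply the packet's priority class.
pifo = PIFO(rank_fn=lambda p: p["prio"])
for p in [{"id": 1, "prio": 2}, {"id": 2, "prio": 0}, {"id": 3, "prio": 1}]:
    pifo.enqueue(p)
print([pifo.dequeue()["id"] for _ in range(3)])  # [2, 3, 1]
```

Swapping in a different rank function (e.g., a virtual finish time) yields WFQ-style behavior from the same queue structure.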
2. Mechanisms for Program-Centric Analysis and Instrumentation
Central to program-centric scheduling is advanced analysis or instrumentation to extract actionable program features:
- Static Feature Extraction: Compilers partition code into regions or phases with similar syntactic densities—such as memory, compute, or IO intensity—enabling coarse or fine-grained phase tagging (Novaes et al., 2019). Loop classification further distinguishes between streaming and reuse patterns for cache or bandwidth-sensitive scheduling (Mururu et al., 2021).
- Behavioral Modeling: Predictive models (e.g., tree classifiers for loop trip count, linear regressors for execution times, MLQ tables for phase-action mappings) are trained offline and evaluated online to anticipate resource consumption or episode duration (Mururu et al., 2021, Novaes et al., 2019).
- Runtime Instrumentation: Lightweight beacons log program events (loop entry/exit, predicted memory footprint, etc.) to a scheduler, while LLM agent systems track execution context, tool state, and phase for each workflow (Kang et al., 14 Feb 2026, Mururu et al., 2021).
- Graph-based Intermediate Representations: For distributed high-level parallel programs, instruction graphs (IDAGs) decompose operations into fine-grained instructions annotated with dependencies, buffer regions, and concurrency structure, replacing monolithic command DAGs (Knorr et al., 13 Mar 2025).
The above mechanisms provide the scheduler with real-time, high-fidelity signals tailored to the intrinsic structure and dynamics of the actual program workload.
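A beacon-style instrumentation hook can be sketched as a decorator that emits enter/exit events, tagged with statically predicted features, to a scheduler's inbox. The event fields, prediction values, and region names below are illustrative assumptions, not the instrumentation interface of any surveyed system:

```python
from dataclasses import dataclass
import queue

# Hypothetical beacon event carrying statically derived region features.
@dataclass
class Beacon:
    region: str
    event: str            # "enter" or "exit"
    predicted_mb: float   # predicted memory footprint of the region
    pattern: str          # "reuse" or "stream" (static loop classification)

scheduler_inbox = queue.SimpleQueue()

def beacon(region, predicted_mb, pattern):
    """Emit matching enter/exit beacons around an instrumented region."""
    def wrap(fn):
        def inner(*args, **kwargs):
            scheduler_inbox.put(Beacon(region, "enter", predicted_mb, pattern))
            try:
                return fn(*args, **kwargs)
            finally:
                scheduler_inbox.put(Beacon(region, "exit", predicted_mb, pattern))
        return inner
    return wrap

@beacon("hot_loop", predicted_mb=512.0, pattern="stream")
def hot_loop(xs):
    return sum(x * x for x in xs)

hot_loop(range(1000))
events = [scheduler_inbox.get_nowait() for _ in range(2)]
print([(e.region, e.event) for e in events])
# [('hot_loop', 'enter'), ('hot_loop', 'exit')]
```

In a real deployment the compiler inserts such hooks and the scheduler consumes the event stream asynchronously; here a single queue stands in for that channel.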
3. Scheduling Algorithms and Control Policies
Program-centric schedulers leverage the extracted features using a variety of algorithmic strategies:
- Programmable Function-based Scheduling: In network switches, each packet's scheduling order and eligibility time are computed by operator-programmed functions, enabling implementation of strict priority, WFQ, token-bucket shaping, HPFQ, LSTF, or custom algorithms at line rate via the PIFO abstraction (Sivaraman et al., 2016).
- Proactive, Predictive Multiprogramming: Many-core schedulers couple beacon predictions with a two-mode (reuse/stream) scheduler to avoid last-level cache thrashing or DRAM bandwidth overload. Admission and preemption decisions are made before actual contention occurs, based on anticipated region demands (Mururu et al., 2021).
- Reinforcement-Learning Adaptive Scheduling: Hardware phase and program phase signals jointly determine state; RL agents learn over Q-tables or neural nets to select optimum hardware configurations for each state, balancing throughput and energy dynamically (Novaes et al., 2019).
- Prioritized, Resource-constrained Packing: Quantum job scheduling assigns a scalar priority to each job as a weighted sum of normalized circuit width, number of shots, and arrival age, then packs jobs by maximizing qubit utilization while observing fidelity and connectivity constraints (EPST) (Wu et al., 2024).
- Instruction Graph Out-of-Order Execution: Multi-GPU systems maintain IDAGs of micro-ops (allocation, copy, kernel, send/recv), allowing concurrent scheduling of independent operations, with graph lookahead to optimize memory allocation and minimize critical path idle time (Knorr et al., 13 Mar 2025).
- Global Program-Aware Queueing: ThunderAgent’s scheduler maintains a global program-aware queue, proactively pausing or restoring LLM programs to minimize KV-cache thrashing and disk/memory waste. Scheduling scores prioritize minimal recomputation and high-throughput rollout (Kang et al., 14 Feb 2026).
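The prioritized packing step for quantum jobs can be sketched as follows. The weights, normalization, and greedy packing rule are illustrative stand-ins for the published scheme, and connectivity/fidelity checks are elided:

```python
# Hypothetical weights and job fields; the exact normalization differs
# in the published scheduler.
def priority(job, max_width, max_shots, now, w=(0.5, 0.3, 0.2)):
    """Weighted sum of normalized circuit width, shots, and arrival age."""
    return (w[0] * job["width"] / max_width
            + w[1] * job["shots"] / max_shots
            + w[2] * (now - job["submit"]))

def pack(jobs, qpu_qubits, now):
    """Greedily co-schedule highest-priority jobs while qubits remain."""
    mw = max(j["width"] for j in jobs)
    ms = max(j["shots"] for j in jobs)
    batch, free = [], qpu_qubits
    for j in sorted(jobs, key=lambda j: -priority(j, mw, ms, now)):
        if j["width"] <= free:   # connectivity/fidelity checks elided
            batch.append(j["name"])
            free -= j["width"]
    return batch

jobs = [
    {"name": "q1", "width": 5, "shots": 1000, "submit": 0},
    {"name": "q2", "width": 12, "shots": 4000, "submit": 2},
    {"name": "q3", "width": 3, "shots": 500, "submit": 1},
]
print(pack(jobs, qpu_qubits=16, now=10))  # ['q2', 'q3']
```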
Policy implementation spans from atomic hardware pipelines (packet switches (Sivaraman et al., 2016)) to software-based feedback loops and global distributed queues (agentic inference (Kang et al., 14 Feb 2026), multi-GPU runtimes (Knorr et al., 13 Mar 2025)).
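As one software-feedback-loop sketch, a Q-table scheduler over joint (hardware phase, program phase) states might look like the following. The state encoding, two-action set, and synthetic reward signal are illustrative assumptions rather than the published design:

```python
import random
from collections import defaultdict

ACTIONS = ["big_core", "little_core"]

class QScheduler:
    """Tabular Q-learning over joint hardware/program phase states."""
    def __init__(self, alpha=0.5, gamma=0.9, eps=0.1):
        self.q = defaultdict(float)          # Q[(state, action)]
        self.alpha, self.gamma, self.eps = alpha, gamma, eps

    def choose(self, state):
        if random.random() < self.eps:       # epsilon-greedy exploration
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        best_next = max(self.q[(next_state, a)] for a in ACTIONS)
        key = (state, action)
        self.q[key] += self.alpha * (reward + self.gamma * best_next - self.q[key])

# Toy environment: compute-heavy phases reward the big core, memory-bound
# phases the little core (reward stands in for throughput/energy feedback).
def reward(phase, action):
    return 1.0 if (phase == "compute") == (action == "big_core") else -1.0

random.seed(0)
sched = QScheduler()
phases = ["compute", "memory"] * 200
for i, ph in enumerate(phases[:-1]):
    state = ("hw_stable", ph)                # joint hardware/program state
    a = sched.choose(state)
    sched.update(state, a, reward(ph, a), ("hw_stable", phases[i + 1]))

sched.eps = 0.0                              # evaluate greedily
print(sched.choose(("hw_stable", "compute")))  # big_core
```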
4. System Implementations and Hardware Realizations
Program-centric concepts have been deployed across hardware and software platforms:
- Programmable Switch Hardware: PIFO-based packet schedulers are realized as pipelined blocks, each operating at up to 1 GHz line rate and organizing per-flow ranks in prioritized SRAM/flip-flop structures with a crossbar interconnect. The area overhead of a five-level hierarchy is under 4% of a 200 mm² switch ASIC, supporting line rate for 64×10 GbE ports with sub-10 ns added latency (Sivaraman et al., 2016).
- Compiler/OS Co-Design for Heterogeneous Systems: Compiler-inserted phase tags and lightweight runtime instrumentation enable tight coupling with the scheduler. Reinforcement learning agents and beacon aggregators operate with under 5% runtime overhead and a binary footprint increase of 50 KB, converging within a few dozen decision checkpoints (Novaes et al., 2019).
- Distributed Runtime and Concurrency Architecture: For accelerator clusters, scheduler threads construct instruction graphs in parallel with executor threads. The virtualized buffer model handles dynamic access patterns via lookahead and resource pooling, sustaining high concurrency and out-of-order dispatch even in the presence of complex dataflow (Knorr et al., 13 Mar 2025).
- Quantum Control Orchestrators: Noise-aware multi-program scheduling frameworks integrate with real superconducting QPU platforms (e.g., QuantumCTek Xiaohong), tightly coupling noise calibration data, hardware connectivity, and job features to schedule and map circuits in ALAP execution batches (Wu et al., 2024).
- LLM Inference and Tool Orchestration: ThunderAgent’s unified “LLM Program” level abstraction enables integrated scheduling of GPU context, KV cache residency, and tool sandbox resources. Global scheduling policies and hook-based reclamation prevent resource leakage and maximize throughput (Kang et al., 14 Feb 2026).
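The out-of-order dispatch idea behind instruction graphs can be sketched as ready-set scheduling over a dependency DAG: any instruction whose dependencies have completed may run, and independent instructions run concurrently. The node names and edges below are illustrative:

```python
from collections import defaultdict

# Illustrative micro-op dependency graph (alloc -> copy -> kernel -> send).
deps = {
    "alloc_a": [],
    "alloc_b": [],
    "copy_a": ["alloc_a"],
    "copy_b": ["alloc_b"],
    "kernel": ["copy_a", "copy_b"],
    "send": ["kernel"],
}

def dispatch_waves(deps):
    """Group instructions into waves; all members of a wave may run
    concurrently because their dependencies completed in earlier waves."""
    indeg = {n: len(ds) for n, ds in deps.items()}
    dependents = defaultdict(list)
    for n, ds in deps.items():
        for d in ds:
            dependents[d].append(n)
    ready = sorted(n for n, k in indeg.items() if k == 0)
    waves = []
    while ready:
        waves.append(ready)
        nxt = []
        for n in ready:
            for m in dependents[n]:
                indeg[m] -= 1
                if indeg[m] == 0:
                    nxt.append(m)
        ready = sorted(nxt)
    return waves

print(dispatch_waves(deps))
# [['alloc_a', 'alloc_b'], ['copy_a', 'copy_b'], ['kernel'], ['send']]
```

A real runtime would dispatch each instruction the moment its in-degree hits zero rather than in synchronized waves; the wave grouping here just makes the exposed concurrency visible.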
5. Performance Outcomes and Comparative Evaluation
Empirical results across domains demonstrate the practical impact of program-centric scheduling:
- Throughput and Fairness: STFQ/HPFQ in programmable switches achieve ideal max-min fairness (Jain’s index ≈1), no packet drops, and line-rate operation for up to 1,024 flows with only a small hardware penalty (Sivaraman et al., 2016).
- Resource Efficiency: Compiler-guided throughput scheduling achieves +76.78% mean throughput gain and up to 3.3× improvement over CFS across 45 many-core benchmarks; cache/bandwidth contention is proactively avoided (Mururu et al., 2021). Astro outperforms ARM GTS in both runtime and energy on most benchmarks (Novaes et al., 2019).
- Latencies and Scaling: IDAG-based multi-GPU runtime scales to 128 GPUs (N-body: 27% improvement over baseline, RSim: 460% improvement), with scheduling latency rendered negligible compared to parallel execution (Knorr et al., 13 Mar 2025).
- Quantum Resource Turnaround: NAQJS achieves 81–84% QPU time reduction and ~90% average turnaround savings on Xiaohong hardware, with only a small decrease in fidelity (≤5%) and seven-fold reduction in required shots (Wu et al., 2024).
- Agentic Inference: ThunderAgent delivers 1.5–4× throughput improvement in LLM agentic workloads, up to 4.2× disk memory savings relative to vLLM+K8s, and near-100% cache hit rates on predictable workloads (Kang et al., 14 Feb 2026).
Concrete benefits accrue from combining program-level insights with dynamic scheduling: program-centric approaches consistently outperform resource-agnostic or purely reactive baselines.
6. General Principles, Limitations, and Broader Implications
Program-centric scheduling encapsulates several key principles:
- Abstraction Unification: Exposing programs or operations as first-class objects allows resource managers to reason over more meaningful scheduling units, accommodating both semantic goals and hardware realities.
- Predictive and Proactive Control: By anticipating (rather than reacting to) resource contention, program-centric schedulers can minimize detection lag, avoid pathological interference, and better balance throughput, fairness, energy, or fidelity.
- Programmability and Verifiability: Domain-specific languages (e.g., Domino for packet schedulers), compiler passes, or ML policies empower operators to realize and evolve new scheduling algorithms without hardware redesign or low-level hacks.
- Efficient Resource Reclamation: Directly coupling program life cycles (e.g., LLM workflow state, quantum job completion) to resource hooks prevents leakage and mediates contention at fine granularity (Kang et al., 14 Feb 2026).
- Overhead and Scalability: While added instrumentation or scheduling logic incurs some runtime and binary bloat, all reviewed systems keep this overhead under 5%, either via hardware pipelining or selective event logging.
Limitations and caveats include dependence on representative training data for predictive models, potential mispredictions on highly irregular workloads, convergence time for RL schedulers, and challenges in extending fine-grained analysis to data-oblivious regions or GPU-based code (Mururu et al., 2021, Novaes et al., 2019). Most frameworks target CPU or accelerator clusters, with only emerging exploration for other domains (quantum, agentic inference).
The overall implication is that tight integration of program semantics and dynamic resource scheduling—across networking, HPC, datacenter, language-model inference, and quantum computing domains—provides a coherent methodology for maximizing efficiency, adaptability, and fairness in heterogeneous, large-scale environments.