
Preemptive-Priority Queuing System

Updated 22 January 2026
  • Preemptive-Priority Queuing System is a model that assigns jobs varying priority levels, allowing higher-priority tasks to preempt lower ones for immediate service.
  • It is analyzed with methodologies such as M/G/1 models and continuous-priority distributions to quantify waiting times under preemptive-resume or preemptive-repeat rules.
  • Its practical applications in inference serving, GPU scheduling, and telecommunications help achieve significant reductions in latency and enhanced service differentiation.

A preemptive-priority queuing system is a class of queueing models in which jobs are assigned varying priorities at arrival, and preemption mechanisms ensure that the highest-priority jobs in the system always receive immediate service. Lower-priority tasks may be interrupted and either resume service later or be discarded, depending on the specific model. This mechanism is fundamental to latency-sensitive service differentiation in computer systems, telecommunications, inference engines, and control of shared compute resources such as GPUs. Theoretical and applied analyses encompass M/G/1 and M/G/c structures, Markovian and general service times, discrete and continuous priority spectra, retrial and multi-server settings, and strategic customer behavior.

1. System Architectures and Fundamental Models

Preemptive-priority systems instantiate a broad taxonomy, ranging from classical two-class (high/low) priorities to infinite-dimensional continuous-priority spaces.

  • Classical (discrete, finite-class) models: In the canonical M/G/1, M/M/c, or M/M/1+GI formats, jobs are partitioned into high-priority (class 1) and low-priority (class 2). Arrivals are Poisson (rate λ_i), service times are typically exponential (rate μ_i) but more generally i.i.d. from a general distribution G_i, and at every instant the server serves a customer from the lowest-indexed (highest-priority) nonempty class (Chamberlain et al., 2020, Selen et al., 2016, Chen et al., 2015, Tatashev et al., 2022).
  • Continuous-priority models: Upon arrival, each job is assigned a priority sampled from a continuous distribution (usually uniform on [0,1]), yielding a measure-valued process for the system's state. Scheduling always serves the currently present job with maximal priority; arrivals of strictly higher priority preempt the job in service, whose residual service remains memoryless when service times are exponential. The process is then infinite-dimensional, modeled as, for instance, x_t(B): the count of jobs at time t whose priorities lie in a set B ⊂ [0,1] (Master et al., 2016, Master et al., 2016).
  • Multi-server and retrial models: Systems with s servers (M/M/s) extend the rule: at all times the s jobs of highest priority are in service. Under retrial and preemptive-repeat semantics, preempted jobs may join an orbit (infinite retrial queue) or be lost, and arrivals may follow Markov-modulated (MMAP) processes (Raj et al., 2022, Raj et al., 2021).
  • Job structure and service phases: Jobs may be composed of phases (e.g., prefill and decode in MoE inference), with the system managing separate FIFO queues per phase and priority (Siavashi et al., 12 Mar 2025). In real-time GPU scheduling, tasks alternate between CPU and GPU segments, with per-segment priority preemption enforced at the scheduler or driver level (Wang et al., 2024).
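To make the classical two-class model concrete, here is a minimal event-driven simulation (a sketch with illustrative rates, not drawn from any cited paper). With exponential service and preemptive-resume, the high-priority class behaves as its own M/M/1, so its time-average queue length should approach rho_1/(1 - rho_1):

```python
import random

# Event-driven CTMC simulation of a two-class preemptive-resume M/M/1
# queue (sketch; rates lam1, lam2, mu are illustrative). By memorylessness,
# the departure rate is mu whenever anyone is present, and the server
# always works on class 1 first, so class 1 is unaffected by class 2.
def simulate(lam1, lam2, mu, horizon, seed=0):
    rng = random.Random(seed)
    t, n1, n2, area1 = 0.0, 0, 0, 0.0
    while t < horizon:
        rate = lam1 + lam2 + (mu if (n1 or n2) else 0.0)
        dt = rng.expovariate(rate)
        area1 += n1 * dt                 # integrate n1 over time
        t += dt
        u = rng.random() * rate
        if u < lam1:
            n1 += 1                      # high-priority arrival (preempts)
        elif u < lam1 + lam2:
            n2 += 1                      # low-priority arrival
        elif n1 > 0:
            n1 -= 1                      # server works on class 1 first
        else:
            n2 -= 1                      # class 2 served only when n1 == 0
    return area1 / t

mean_n1 = simulate(lam1=0.3, lam2=0.4, mu=1.0, horizon=200_000)
# theory: rho1 / (1 - rho1) = 0.3 / 0.7 for the high-priority class
```

Because preempted low-priority work is memoryless here, resume and repeat coincide; distinguishing them requires general (non-exponential) service times.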

2. Preemption Rules and Scheduling Algorithms

The preemptive discipline modifies classical queue operations:

  • Preemptive-resume: On arrival of a priority-p job, if the server is busy with a job of strictly lower priority, the current task is preempted and stored with its residual service. Service resumes once no higher-priority jobs remain; in M/G/1 preemptive-resume, the service-time distribution does not reset, and service continues exactly from the point of interruption (Chamberlain et al., 2020, Selen et al., 2016, Tatashev et al., 2022, Siavashi et al., 12 Mar 2025).
  • Preemptive-repeat: Rather than resuming from the preemption point, preempted jobs restart with a fresh sample from the service-time distribution upon their eventual return to service (Raj et al., 2021, Raj et al., 2022).
  • Omission or discard: In some M/G/1/1 or age-of-information scenarios, preemption leads to outright loss of the preempted low-priority job (Najm et al., 2018).
  • Algorithmic implementation: In practical settings such as mixture-of-experts LLM serving, fine-grained expert-layer FIFO queues and task-tracking structures are essential. For MoE, QLLM introduces four logical queues per job class and phase, with expert-level subqueues; priority-aware preemption is implemented by checkpointing in-flight lower-priority experts and re-enqueueing their state (Siavashi et al., 12 Mar 2025).

// Example (sketch): QLLM main loop, batch scheduler
while true:
    event = wait_for_event()
    if event == new_arrival:
        if new_job.priority == LS:
            enqueue(LS_Prefill, new_job)
            engine.preempt_lower_priority()  // checkpoint in-flight BE work
        else:
            enqueue(BE_Prefill, new_job)     // BE jobs wait their turn
    if engine.is_idle():
        batch = GetNextBatch()               // drains higher-priority queues first
        if batch is not None:
            engine.launch(batch)
// See detailed pseudocode and queue structures in [2503.09304]

  • GPU/process scheduling: In real-time systems, preemptive priorities are controlled either by driver-level kernel threads (polling and runlist modification) or by explicit IOCTL calls at task-segment boundaries, both of which update the set of active contexts to reflect current priorities (Wang et al., 2024).

3. Performance Analysis and Queueing Theoretic Results

The performance and stability of preemptive-priority queues are addressed rigorously.

  • M/G/1 preemptive-resume (two-class): Mean waiting times for class 1 (high) and class 2 (low) follow from standard formulas (Chamberlain et al., 2020):

E[W_1] = λ E[S²] / (2(1-ρ)),   E[W_2] = λ E[S²] / (2(1-ρ)(1-ϕρ))

Here, ϕ is the fraction of premium customers, λ is the aggregate arrival rate, and ρ = λ/μ.
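These formulas are easy to sanity-check numerically. The sketch below assumes exponential service (so E[S²] = 2/μ²) and illustrative parameter values, neither of which comes from the cited paper:

```python
# Numeric check of the two-class M/G/1 preemptive-resume waiting-time
# formulas. Assumes exponential service, E[S^2] = 2/mu^2; lam, mu, phi
# are illustrative values.
def mg1_preemptive_waits(lam, mu, phi, es2=None):
    """Return (E[W1], E[W2]) for aggregate arrival rate lam, service
    rate mu, premium fraction phi, and second moment es2 of service."""
    rho = lam / mu
    if es2 is None:
        es2 = 2.0 / mu**2               # exponential-service second moment
    w1 = lam * es2 / (2.0 * (1.0 - rho))
    w2 = lam * es2 / (2.0 * (1.0 - rho) * (1.0 - phi * rho))
    return w1, w2

w1, w2 = mg1_preemptive_waits(lam=0.6, mu=1.0, phi=0.5)
# Low-priority waits are inflated by the extra factor 1/(1 - phi*rho).
```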

  • Preemptive-priority MoE scheduling: The scheduler is modeled as a preemptive-priority M/G/1 queue, where LS jobs correspond to "high priority" and BE to "low priority." For LS jobs (class H), the queueing delay collapses to:

E[W_H] = λ_L E[S_H²] / (2(1-ρ_H))

For BE jobs:

E[W_L] = ρ_H E[S_L] / (1-ρ_H) + λ_B E[S_B²] / (2(1-ρ_H))

where ρ_H = λ_L E[S_L] and B indicates batch size (Siavashi et al., 12 Mar 2025).
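A direct transcription of the two delay expressions, following the text's notation (ρ_H = λ_L E[S_L]); every numeric value below is hypothetical:

```python
# Transcription of the LS/BE (high/low) delay formulas above, following
# the notation in the text; all parameter values here are hypothetical.
def moe_waits(lam_L, lam_B, es_L, es_H2, es_B2):
    """Return (E[W_H], E[W_L]) given LS arrival rate lam_L, BE batch
    arrival rate lam_B, and the indicated service-time moments."""
    rho_H = lam_L * es_L                      # LS load, per the text
    w_H = lam_L * es_H2 / (2.0 * (1.0 - rho_H))
    w_L = (rho_H * es_L / (1.0 - rho_H)
           + lam_B * es_B2 / (2.0 * (1.0 - rho_H)))
    return w_H, w_L

w_H, w_L = moe_waits(lam_L=0.5, lam_B=0.2, es_L=1.0, es_H2=2.0, es_B2=2.0)
# BE jobs pay both the LS-backlog term and their own batch term.
```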

  • Continuous-priority, infinite-dimensional (M/M/1): For a job of priority p ∈ (0,1) in a system with aggregate load ρ = λ/μ, the stationary expected sojourn time and waiting time are:

s(p) = 1 / [1 - (1-p)ρ]²,   w(p) = s(p) - 1/μ

In overload (λ > μ), low-priority levels p < p* = 1 - μ/λ see unbounded delay (Master et al., 2016).
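A minimal numeric sketch of these continuous-priority formulas, taking μ = 1 so time is measured in mean service times (the ρ and p values are illustrative):

```python
# Continuous-priority M/M/1 delay formulas, with mu = 1 so time is in
# units of the mean service time; rho and p values are illustrative.
def sojourn(p, rho):
    """Stationary expected sojourn time s(p) = 1 / [1 - (1-p)*rho]^2."""
    return 1.0 / (1.0 - (1.0 - p) * rho) ** 2

def waiting(p, rho, mu=1.0):
    """Waiting time w(p) = s(p) - 1/mu."""
    return sojourn(p, rho) - 1.0 / mu

rho = 0.9
assert sojourn(1.0, rho) == 1.0               # top priority: one bare service time
assert sojourn(0.1, rho) > sojourn(0.9, rho)  # lower priority waits longer
# In overload (lam > mu, i.e. rho > 1), s(p) diverges as p falls toward
# p* = 1 - mu/lam; below p*, delay is unbounded.
```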

  • Multi-server, finite and infinite-dimensional: For M/M/s with continuum priorities, the expected delay for a priority-p job can be computed explicitly from the M/M/s sub-system with reduced arrival rate (1-p)λ (Master et al., 2016).
  • Retrial and preemptive-repeat metrics: In retrial systems with preemptive-repeat, steady-state probabilities for system states and orbit-size distributions are obtained from the stationary vector of the system's block-structured Q-matrix, amenable to matrix-analytic methods and metaheuristic optimization (Raj et al., 2021, Raj et al., 2022).
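One plausible concretization of the reduced-arrival-rate observation for the multi-server case: a priority-p job competes only with the (1-p)λ stream of equal-or-higher priorities, so the Erlang-C delay of that thinned M/M/s subsystem supplies the relevant queueing quantity. This is a sketch under that reading; the parameter values are hypothetical:

```python
from math import factorial

# Erlang-C mean queueing delay for an M/M/s queue, applied to the
# thinned subsystem with arrival rate (1-p)*lam. A sketch under the
# reduced-arrival-rate reading; all parameter values are hypothetical.
def erlang_c_wait(lam, mu, s):
    a = lam / mu                              # offered load in Erlangs
    rho = a / s
    assert rho < 1, "subsystem must be stable"
    norm = (sum(a**k / factorial(k) for k in range(s))
            + a**s / (factorial(s) * (1 - rho)))
    p_wait = (a**s / (factorial(s) * (1 - rho))) / norm  # P(job queues)
    return p_wait / (s * mu - lam)            # mean wait in queue

def priority_wait(lam, mu, s, p):
    """Delay ingredient for a priority-p job: Erlang-C wait of the
    subsystem containing only priorities >= p."""
    return erlang_c_wait((1.0 - p) * lam, mu, s)

# Higher priority sees a more lightly loaded thinned subsystem:
assert priority_wait(3.5, 1.0, 4, 0.8) < priority_wait(3.5, 1.0, 4, 0.2)
```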

4. Applications and System-Level Implications

Preemptive-priority queues underpin critical infrastructure in both theoretical analysis and real-world deployments:

  • Inference serving and LLMs: Fine-grained, preemptive-priority schedulers for inference workloads (QLLM) enable LS jobs to preempt BE jobs at expert/layer boundaries in MoE architectures, vastly reducing time-to-first-token (TTFT) and improving SLO compliance while only marginally impacting BE throughput. Empirically, average LS TTFT improved by 65.5×, and the system maintained SLOs at 7 req/s, outperforming FCFS baselines (Siavashi et al., 12 Mar 2025).
  • GPU/heterogeneous computing: Real-time preemptive-priority resource scheduling on GPUs requires driver-level control (runlist manipulation), with demonstrated 40% gains in schedulability on embedded platforms such as Nvidia Jetson when replacing fixed round-robin or non-preemptive critical sections (Wang et al., 2024).
  • Telecom, cognitive radio, and retrial systems: In underlay cognitive-radio networks, preemptive transmission with classwise impatience enables systematic control of mean waiting times and reneging probabilities for priority users, optimizing MAC performance (Chen et al., 2015, Azarfar et al., 2012).
  • Networks with traffic differentiation: Preemptive-resume disciplines in multi-class traffic environments yield superior service differentiation but worsen lower-priority metrics, especially under interruptions or retrials, as confirmed in OSA and catastrophic-recovery scenarios (Azarfar et al., 2012, Raj et al., 2022).

5. Social Welfare, Strategic Behavior, and Optimization

Preemptive-priority systems have significant implications for welfare and strategic optimization:

  • Equilibrium and Price of Anarchy: When queue participation is strategic (premium class with fee C), the joining fraction ϕ and operator fee can be optimized for social welfare or revenue. For the two-class M/G/1 preemptive-resume queue, a unique, stable mixed Nash equilibrium exists under service-time variability c_v² > 1 and loads below a threshold (Chamberlain et al., 2020). The Price of Anarchy (PoA) is proven to be bounded by 4/3 at equilibrium, meaning system inefficiency due to selfish behavior is strictly limited under all parameter regimes.
  • Revenue dominance: Preemptive-resume outperforms non-preemptive policies in provider revenue: the maximum achievable revenue under PR exceeds that under NP for all loads and service-time variances, and the optimal fee may induce only a fraction of arrivals to upgrade (Chamberlain et al., 2020).
  • Multi-objective resource optimization: Systems with retrials and multiple preemption levels (including disaster modes) are optimized via multi-objective heuristics (e.g., NSGA-II), balancing high-priority loss rates, backup capacity, and preemption aggressiveness (Raj et al., 2022). Metaheuristics (PSO, SA, direct search) optimize server count and handoff acceptance in telecom retrial queues (Raj et al., 2021).

6. Advanced Topics and Theoretical Insights

  • Exact tail asymptotics: In discrete-time preemptive-priority models, kernel and generating-function methods yield explicit asymptotics for the tails of the stationary distribution, with low-priority queue lengths decaying geometrically or subgeometrically while high-priority marginals often retain geometric tails. These results directly quantify rare-event and extreme-queue-length probabilities (Song et al., 2014).
  • Continuous priority, heavy traffic, and bifurcation phenomena: As load approaches criticality, preemptive-priority systems exhibit bifurcations in which only top-priority levels maintain finite delay while lower-priority jobs see diverging sojourn times. This boundary is explicit in continuous-priority systems (p* = 1 - μ/λ) (Master et al., 2016, Master et al., 2016).
  • Impact of preemptive structure on information freshness: For age-of-information applications, while preemption always minimizes age for high-priority streams, it can degrade age for low-priority streams relative to FCFS, particularly in coupled or small-buffer systems (Najm et al., 2018).

In sum, the preemptive-priority queuing paradigm is mathematically mature and operationally foundational for systems requiring fine-grained, strict prioritization under heterogeneous workloads, ranging from ML inference to telecom control. Its performance, welfare, and stability properties are fully determined by the preemption discipline, arrival and service characteristics, retrial/abandonment processes, and, in strategic domains, user equilibrium. Advanced analytical and numerical methods now allow accurate prediction, robust control, and welfare-optimized operation over diverse preemptive-priority systems.
