Shared Task Queues in Parallel Systems

Updated 11 January 2026
  • Shared task queues are data structures that store and manage computation tasks for concurrent execution, balancing competing performance goals such as throughput, load balance, and data locality.
  • They employ per-domain, work-stealing, and distributed designs to optimize throughput, load balancing, and data locality in parallel and distributed systems.
  • Analytical models and benchmarks show that careful tuning of scheduler parameters can achieve near-optimal scalability with minimal task overhead.

A shared task queue is a data structure or execution paradigm in which a collection of tasks—computation jobs, message units, or resource requests—are stored and managed in a manner that allows multiple processing entities (threads, processes, agents, nodes) to concurrently dequeue and execute them according to a specified policy. Shared task queues are foundational in parallel computing, high-performance scientific applications, distributed cloud services, operating systems, manufacturing systems, and modern data-center orchestration. Their design trades off properties such as throughput, load balancing, data locality, synchronization overhead, fairness, and order guarantees. Approaches range from tightly coupled thread-level queues to fully distributed, sequentially consistent multi-node protocols.

1. Architectures and Core Principles of Shared Task Queues

The architecture of shared task queues distinguishes between centralized and distributed schemes, single versus multiple queues, the handling of affinity/locality, and the enforcement, relaxation, or omission of task-ordering guarantees.

  • In shared-memory multicore systems, locality domains (e.g., NUMA regions) motivate the use of per-domain queues to conserve memory bandwidth and minimize cross-domain access times. Threads are statically mapped to these domains, and dequeue operations first sample the local queue before opportunistically "stealing" non-local work for global balance (0902.1884).
  • HPC batch systems use "meta-schedulers" or backfilling agents to maintain a shared, persistent task pool; multiple agents concurrently launch jobs as soon as resources become available, dynamically filling idle nodes (Berkowitz et al., 2017).
  • In distributed or asynchronous environments, fully replicated or sharded task queues manage synchronization and ordering via scalar clocks, vector timestamps, or distributed hashing, encountering trade-offs between protocol overhead and global task-order semantics (Baldwin et al., 4 Mar 2025, Feldmann et al., 2018).

A central design axis is the degree of relaxation: whether the queue is strict FIFO (first-in–first-out), priority-ordered, k-relaxed, or almost unordered. Relaxed semantics allow large gains in throughput and scalability but must be bounded to preserve computational efficiency in many applications (Baldwin et al., 4 Mar 2025, Wimmer et al., 2013, Postnikova et al., 2021).

2. Implementation Strategies: Structures and Algorithms

Per-Domain and Work-Stealing Queues

On cache-coherent NUMA (ccNUMA) systems, a vector of FIFO queues, queues[0..num_of_lds-1], is maintained, one for each memory locality domain, with explicit thread-to-domain mapping functions (such as ld_ID[·]). Tasks are enqueued based on first-touch (memory placement) heuristics, and dequeuers lock and pop from their local queue, sampling others only on starvation (0902.1884).
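
A minimal sketch of this layout, using Python's thread-safe queue module and illustrative names (NUM_DOMAINS, ld_id, and the random victim order are assumptions, not the scheme of 0902.1884):

```python
import queue
import random

NUM_DOMAINS = 4  # e.g., one queue per NUMA locality domain

# One FIFO queue per locality domain, mirroring queues[0..num_of_lds-1].
domain_queues = [queue.Queue() for _ in range(NUM_DOMAINS)]

def ld_id(thread_id: int) -> int:
    """Static thread-to-domain mapping (stand-in for ld_ID[.])."""
    return thread_id % NUM_DOMAINS

def enqueue_task(task, domain: int) -> None:
    """Place a task in the queue of the domain that first touched its data."""
    domain_queues[domain].put(task)

def dequeue_task(thread_id: int):
    """Pop from the local queue first; steal from another domain only on starvation."""
    local = ld_id(thread_id)
    try:
        return domain_queues[local].get_nowait()
    except queue.Empty:
        pass
    # Starvation: sample the other domains in random order and steal one task.
    for victim in random.sample(range(NUM_DOMAINS), NUM_DOMAINS):
        if victim == local:
            continue
        try:
            return domain_queues[victim].get_nowait()
        except queue.Empty:
            continue
    return None  # no work anywhere
```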

Dynamic Task Pools and Backfilling

In large-scale clusters, shared task queues often materialize as directory-based pools of scripts (METAQ) or as globally managed YAML/task lists with partitioned resource pools (mpi_jm). Each backfiller maintains in-memory or in-process resource state, launching eligible tasks as soon as they fit current node/GPU availability. The queue's lifetime and contents may extend beyond any single batch allocation (Berkowitz et al., 2017).
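
A backfilling pass can be caricatured as below; the Task fields and the resource counts are hypothetical and stand in for METAQ's script tags and mpi_jm's partitioned resource pools:

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    nodes: int   # nodes required
    gpus: int    # GPUs required

def backfill(pending: list[Task], free_nodes: int, free_gpus: int) -> list[Task]:
    """Launch every pending task that fits the currently idle resources."""
    launched = []
    for task in list(pending):
        if task.nodes <= free_nodes and task.gpus <= free_gpus:
            free_nodes -= task.nodes
            free_gpus -= task.gpus
            launched.append(task)
            pending.remove(task)   # the task pool persists across batch allocations
    return launched

# Example: a 16-node, 8-GPU allocation picks up whatever fits right now.
pool = [Task("hmc_stream_a", 8, 4), Task("analysis_b", 16, 0), Task("io_c", 2, 0)]
print([t.name for t in backfill(pool, free_nodes=16, free_gpus=8)])
```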

Distributed and Relaxed Queues

  • In asynchronous, message-passing settings, a replicated queue is often implemented by broadcasting enqueues and dequeues with timestamps and awaiting global acknowledgments. Relaxed variants, such as k-out-of-order queues, enable fast (local-only) dequeues for most operations by labeling batches of items per process; periodic synchronizations enforce global invariants (Baldwin et al., 4 Mar 2025). A schematic sketch of the relaxed variant follows this list.
  • Fully distributed protocols like Skueue use hashed position spaces, overlay DHT structures, and aggregation trees to batch requests, achieving O(log n) operation latency with sequential consistency (Feldmann et al., 2018).
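
A schematic, single-process stand-in for the relaxed replicated queue is sketched below; the batch handling and the fetch_next_batch stub are assumptions used for illustration, not the protocol of Baldwin et al.:

```python
class RelaxedQueueReplica:
    """One process's view: dequeue locally from a batch of items labeled for this
    process, and resynchronize globally only when that batch is exhausted."""

    def __init__(self, process_id: int, batch_size: int):
        self.process_id = process_id
        self.batch_size = batch_size
        self.local_batch = []      # items already labeled for this process

    def dequeue(self):
        if not self.local_batch:
            self._synchronize()    # rare global step: broadcast, collect acks, relabel
        return self.local_batch.pop(0) if self.local_batch else None

    def _synchronize(self):
        # Placeholder for the expensive global round: in a real protocol this
        # exchanges timestamped requests and assigns the next batch of queue
        # positions to this process, bounding how far dequeues may reorder.
        self.local_batch = fetch_next_batch(self.process_id, self.batch_size)

def fetch_next_batch(process_id: int, batch_size: int) -> list:
    """Stub for the coordination layer (message passing omitted)."""
    return [f"item-{process_id}-{i}" for i in range(batch_size)]
```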

Prioritized and Multi-Queue Schedulers

Multi-Queue architectures maintain m ≥ n sequential priority queues for n worker threads, inserting into a random queue and removing via "two-choice" selection or NUMA/affinity-localized heuristics. Stealing Multi-Queues (SMQ) further combine thread-locality (CAS-free access), probabilistic work-stealing, batching, and cache-aware sampling for high throughput and bounded rank loss (Postnikova et al., 2021).
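
A sequential two-choice MultiQueue sketch in this spirit (without SMQ's batching, stealing, or NUMA-aware sampling; class and parameter names are illustrative):

```python
import heapq
import random

class MultiQueue:
    """m >= n priority queues: insert into a random one, delete-min from the
    better of two randomly sampled queues."""

    def __init__(self, num_threads: int, c: int = 2):
        self.queues = [[] for _ in range(c * num_threads)]  # m = c * n heaps

    def insert(self, priority, task) -> None:
        q = random.choice(self.queues)
        heapq.heappush(q, (priority, task))

    def delete_min(self):
        # Two-choice: sample two queues, pop from the one with the smaller top.
        a, b = random.sample(self.queues, 2)
        candidates = [q for q in (a, b) if q]
        if not candidates:
            return None                      # both sampled queues empty; a real SMQ would retry or steal
        best = min(candidates, key=lambda q: q[0])
        return heapq.heappop(best)

mq = MultiQueue(num_threads=4)
for i in random.sample(range(100), 20):
    mq.insert(i, f"task-{i}")
print(mq.delete_min())   # usually, but not always, a small-priority task (bounded rank loss)
```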

Priority queuing with k-relaxation is implemented via centralized append-only arrays or hybrid local/global lists, with lock-free synchronization and per-thread spying; k tunes the tradeoff between ordering and concurrency (Wimmer et al., 2013).
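
A deliberately simplified caricature of such a hybrid local/global design, with ordinary binary heaps standing in for the lock-free structures of Wimmer et al.; here k only bounds each thread's uncoordinated buffer:

```python
import heapq

class KRelaxedPQ:
    """Each thread buffers up to k items locally and pops them without touching the
    shared structure; the global heap and a "spying" scan of other threads' buffers
    are fallback paths used only when the local buffer is empty."""

    def __init__(self, k: int):
        self.k = k
        self.global_heap = []   # shared; lock-free in the real design
        self.local = {}         # thread_id -> small per-thread heap

    def push(self, thread_id: int, priority, task) -> None:
        buf = self.local.setdefault(thread_id, [])
        if len(buf) < self.k:
            heapq.heappush(buf, (priority, task))                # fast, uncoordinated path
        else:
            heapq.heappush(self.global_heap, (priority, task))   # spill past k items

    def pop(self, thread_id: int):
        buf = self.local.setdefault(thread_id, [])
        if buf:
            return heapq.heappop(buf)          # no synchronization needed
        if self.global_heap:
            return heapq.heappop(self.global_heap)
        for other_buf in self.local.values():  # "spying" on other threads' buffers
            if other_buf:
                return heapq.heappop(other_buf)
        return None
```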

3. Queueing Theory and Shared Resource Modeling

At the analytical level, shared task queues have deep connections to polling systems, queueing networks, and modern stochastic models.

  • The single shared server problem encompasses N parallel queues serviced cyclically, with Poisson or recursive routing of jobs between queues. Steady-state waiting times are characterized exactly via Laplace-Stieltjes Transforms (LSTs) and linear systems of equations derived from branching process representations (Boon et al., 2014); a baseline cycle-time identity is sketched after this list.
  • In edge computing or multiple-source information systems, arrivals to shared servers may be superposed Markov-Modulated Poisson Processes (MMPP), necessitating high-dimensional or reduced-order absorbing Markov chain analysis to yield closed-form age-of-information (AoI) and latency statistics (Akar et al., 2024).
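
As general polling-theory background (not a result specific to the cited papers), the mean cycle time of a cyclic server admits a one-line derivation:

```latex
% Total offered load \rho = \sum_i \lambda_i \, \mathbb{E}[B_i] < 1, total mean
% switchover time \mathbb{E}[S] per cycle. Each cycle the server switches for
% \mathbb{E}[S] and works for \rho\,\mathbb{E}[C] on average, so
\mathbb{E}[C] = \mathbb{E}[S] + \rho\,\mathbb{E}[C]
\quad\Longrightarrow\quad
\mathbb{E}[C] = \frac{\mathbb{E}[S]}{1-\rho}.
```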

The interplay between internal and external arrivals, service disciplines (gated vs. exhaustive), and customer routing heavily influences waiting time, throughput, and fairness.

4. Task Allocation, Fairness, and Systemic Trade-Offs

In systems with multiple shared queues and agent-based servicing, allocation involves solving online assignment or matching problems:

  • Reactive, auction-based dispatchers collect agent bids, queue lengths, and task waiting times, posing a linear integer program at each scheduling point. Cost functions parametrized by queue penalty q and wait penalty τ control the trade-off between queue balancing and fairness (waiting-time minimization), with theoretical results guaranteeing perfect queue-length equalization for τ = 0, and Pareto-optimal tradeoff curves otherwise (Dahlquist et al., 2023); an illustrative assignment sketch follows below.
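
The trade-off can be illustrated with a Hungarian-algorithm stand-in for the integer program; all numbers and the exact cost form below are made up for illustration and do not reproduce Dahlquist et al.'s formulation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Illustrative inputs: 3 agents, 4 queued tasks.
bid = np.array([[4.0, 2.0, 6.0, 3.0],    # bid[a, t]: agent a's cost to serve task t
                [5.0, 1.0, 2.0, 4.0],
                [3.0, 3.0, 5.0, 2.0]])
queue_len = np.array([2, 0, 5])             # current queue length behind each agent
wait      = np.array([1.0, 8.0, 3.0, 0.5])  # how long each task has already waited

def assign(q: float, tau: float):
    """Assignment on a cost that trades queue balancing (q) against waiting-time fairness (tau)."""
    cost = bid + q * queue_len[:, None] - tau * wait[None, :]
    agents, tasks = linear_sum_assignment(cost)
    return list(zip(agents.tolist(), tasks.tolist()))

print(assign(q=1.0, tau=0.0))   # emphasize queue balancing
print(assign(q=0.0, tau=1.0))   # emphasize waiting-time minimization
```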

Key general principles:

  • Local sampling and work affinity minimize communication/synchronization costs.
  • Occasional global coordination (steals, auctions, or flushes) ensures eventual load balance or fairness.
  • Analytical tuning parameters (k, B, p_steal, q/τ) afford precise control of system behavior.

5. Practical Considerations: Performance, Scalability, and Workflow Integration

Empirical results across systems universally show:

  • Pure global queues or naive dynamic scheduling rapidly bottleneck due to synchronization or memory traffic, especially in NUMA or highly parallel deployments; parallel efficiency ε may drop to 10-20% under oversubscribed global queues (0902.1884).
  • Locality-aware, multi-queue, or hybrid designs restore nearly ideal scaling, delivering 95-99% of the bandwidth or throughput available to statically partitioned optimal schedules (0902.1884, Postnikova et al., 2021).
  • In priority-based or relaxed-queue schedulers, properly tuned relaxation/batching yields minimal wasted work (1–5% task overhead) and matches the practical throughput of the best heuristic schedulers (Wimmer et al., 2013, Postnikova et al., 2021).
  • Workflow integration is streamlined in modern systems: for example, METAQ backfilling requires only bash-script tags in a directory; mpi_jm inserts two handshakes in user MPI code; shared-memory/thread-based schedulers are mostly library-level drop-ins (Berkowitz et al., 2017).

A summary table of selected paradigms:

| Architecture/paradigm | Synchronization/Order | Scalability |
|---|---|---|
| Global FIFO/priority queue | strict, global | poor |
| NUMA locality queues | per-domain FIFO + steal | near-linear |
| Multi-Queue/SMQ | probabilistically ordered | optimal |
| Work-stealing | local-only; unordered | excellent |
| Distributed (Skueue, vector-clock) | seq. consistent, batched | O(log n) latency |
| k-relaxed/hybrid priority queue | bounded relaxation | adjustable |

6. Analytical and Modeling Tools for Shared Queue Systems

Mathematical formalizations are central to understanding and predicting shared task queue performance or reconstructing system observables:

  • Branching-process and generating-function techniques are used to derive joint queue-length and cycle-time distributions in polling networks (Boon et al., 2014).
  • Absorbing Markov Chain formulations, with reduced-state MMPP modeling, yield closed-form and computationally feasible AoI/loss metrics for edge-resource sharing (Akar et al., 2024).
  • Multi-entity partial-order event logs, PQR-systems (process, queue, resource triplets), and corresponding LPs can reconstruct unobserved or missing events and bound their possible timestamps, even where logs are incomplete (Fahland et al., 2021).

7. Open Questions, Extensions, and Future Directions

Current research highlights several directions:

  • Design of data structures offering “structural” (rather than temporal) k-ordering relaxation (Wimmer et al., 2013), potentially reducing synchronization without weakening service-quality guarantees.
  • Hierarchical or NUMA-aware variants for multi-socket architectures, and integrating application-level priorities or multi-objective (Pareto) queueing (Postnikova et al., 2021, Wimmer et al., 2013).
  • Efficient, fault-tolerant, and scalable queueing algorithms in cloud/distributed environments, with formal consistency under churn (joins/leaves, process failures) (Feldmann et al., 2018, Baldwin et al., 4 Mar 2025).
  • Pareto-efficient balancing between throughput, rank guarantees, and fairness in auction-allocated or agent-based shared queues (Dahlquist et al., 2023).

Systematic benchmarking, analytical modeling, and programmable control of relaxation, affinity, and batching parameters remain essential for realizing provable and practically efficient shared task queue deployments across scientific computing, cloud systems, and large-scale data orchestration.
