Pilot-Job and Pull Scheduling Overview
- Pilot-Job and Pull Scheduling are execution models that decouple resource acquisition from task scheduling to enhance scalability and efficiency in distributed environments.
- They use a late-binding approach where resource placeholders pull tasks from a central queue, enabling dynamic workload distribution and robust fault tolerance.
- Empirical results demonstrate improved resource utilization and reduced turnaround times, making them ideal for large-scale HPC and data-intensive applications.
A Pilot-Job is an execution abstraction that enables large-scale distributed computing by decoupling resource acquisition from task scheduling. The Pilot paradigm introduces resource placeholders—“pilot jobs”—that are first submitted to and activated on distributed computing resources, after which tasks are late-bound (scheduled) dynamically into those placeholders. Pull scheduling, also termed late-binding or agent-driven assignment, underpins most modern Pilot-Job systems. In this approach, once pilots are deployed and ready, they pull (request) available tasks from a centralized work or task queue, enabling scalable, robust, and elastic workload distribution across heterogeneous, dynamic, and high-throughput resource environments.
1. The Pilot-Job Abstraction
The Pilot-Job abstraction is characterized by a clear separation of three responsibilities: resource provisioning, workload management, and task execution. Implementations generally define three corresponding logical modules:
- Pilot Manager: Responsible for acquiring resources by submitting “pilot jobs” (large resource placeholders) to batch/resource managers such as SLURM, PBS, Condor-G, LoadLeveler, or grid schedulers. The pilot embodies a bundled allocation of nodes/cores/GPUs, masking the heterogeneity and latency of underlying batch queues (Merzky et al., 2018, Merzky et al., 2015, Luckow et al., 2012, Turilli et al., 2015).
- Workload Manager (sometimes named UnitManager or Work Queue): Accepts task descriptions (Compute Units, or CUs) and manages their lifecycles. Tasks are queued in a persistent broker/database (e.g., MongoDB, Redis) for future assignment to active pilots (Merzky et al., 2018, Merzky et al., 2015, Luckow et al., 2013).
- Task Manager (in-pilot agent): Bootstrapped within each pilot upon activation, responsible for (a) registering with the workload manager, (b) requesting/binding tasks to local resources, (c) staging data, launching user payloads, and monitoring execution status (Merzky et al., 2018, Turilli et al., 2015).
By decoupling resource acquisition (Pilot Manager) from task execution (Task Manager), the Pilot-Job abstraction enables dynamic late-binding: tasks are only scheduled when suitable resources are actually available, avoiding the queuing latencies and inefficiencies of pre-binding.
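The late-binding idea can be illustrated with a minimal sketch. It assumes an in-memory queue in place of a persistent broker, and the names used here (`ComputeUnit`, `WorkloadManager`, `submit`, `request`) are hypothetical; they are not the API of any of the cited systems.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Optional


@dataclass
class ComputeUnit:
    """A task description; the fields are illustrative only."""
    uid: str
    cores: int
    pilot: Optional[str] = None  # stays unset until a pilot pulls the task


@dataclass
class WorkloadManager:
    """Holds unbound tasks until an active pilot asks for work."""
    pending: Deque[ComputeUnit] = field(default_factory=deque)

    def submit(self, cu: ComputeUnit) -> None:
        # Submission only enqueues the task; no resource is chosen here.
        self.pending.append(cu)

    def request(self, pilot_id: str, free_cores: int) -> List[ComputeUnit]:
        """Bind as many queued tasks as fit into the caller's free cores."""
        bound: List[ComputeUnit] = []
        skipped: Deque[ComputeUnit] = deque()
        while self.pending:
            cu = self.pending.popleft()
            if cu.cores <= free_cores:
                cu.pilot = pilot_id      # binding happens at pull time
                free_cores -= cu.cores
                bound.append(cu)
            else:
                skipped.append(cu)       # left for a pilot with more room
        self.pending = skipped
        return bound


wm = WorkloadManager()
for i, cores in enumerate([4, 8, 2]):
    wm.submit(ComputeUnit(uid=f"cu.{i}", cores=cores))

# Pilots become active later, in whatever order the batch system starts them.
first = wm.request("pilot.a", free_cores=6)
second = wm.request("pilot.b", free_cores=8)
print([cu.uid for cu in first], [cu.uid for cu in second])
```

In this sketch the 8-core task is never tied to a pilot that cannot hold it. It simply waits in the queue until a pilot with enough free cores asks for work.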
2. Pull Scheduling: Principles and Mechanisms
Pull scheduling refers to the loop wherein pilot-side agents request (pull) tasks from the central work queue as soon as they have available capacity. This model contrasts with push scheduling, in which a central dispatcher pushes tasks to resources based on its view of resource states.
A typical pull-scheduling loop for each pilot agent follows these steps (Merzky et al., 2018, Merzky et al., 2015, Hasham et al., 2012, Luckow et al., 2012, Luckow et al., 2013, Turilli et al., 2015):
- Registration: Upon startup, each Task Manager informs the Workload Manager of its capacity.
- Polling: Agents periodically poll the central task queue/database (e.g., MongoDB, Redis) for ready tasks, requesting as many as they can concurrently service.
- Assignment: Upon retrieving a task, the agent examines resource requirements (core/rank count, affinity labels), performs data staging as needed, and launches the executable via system-specific launchers (ORTE, APRUN, SSH, MPIEXEC).
- Execution Tracking: The agent monitors task execution, completes required post-processing/data movement, and updates the central work queue with results and status.
- Repetition: On task completion or slot availability, the agent repeats the pull for as long as the pilot allocation (walltime or expiration) persists.
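The loop above can be sketched with Python threads standing in for pilot agents and a `queue.Queue` standing in for the central task store. This is an illustration under simplifying assumptions, not the implementation of any cited system: each agent has a single slot, registration and data staging are omitted, and the function name `agent_loop` and the idle-poll cutoff are hypothetical conveniences that let the demo terminate.

```python
import queue
import threading
import time
from typing import Callable, Dict, Tuple

Task = Tuple[str, Callable[[], int]]


def agent_loop(name: str,
               tasks: "queue.Queue[Task]",
               results: Dict[str, Tuple[str, int]],
               lock: threading.Lock,
               walltime_s: float,
               poll_s: float = 0.01) -> None:
    """Pull, execute, and report until the walltime expires or work runs dry."""
    deadline = time.monotonic() + walltime_s
    idle_polls = 0
    while time.monotonic() < deadline and idle_polls < 3:
        try:
            uid, payload = tasks.get_nowait()  # polling: ask for ready work
        except queue.Empty:
            idle_polls += 1
            time.sleep(poll_s)                 # nothing ready; poll again later
            continue
        idle_polls = 0
        value = payload()                      # assignment and execution
        with lock:
            results[uid] = (name, value)       # status update to the shared store
        tasks.task_done()


work: "queue.Queue[Task]" = queue.Queue()
for i in range(20):
    work.put((f"cu.{i}", lambda i=i: i * i))

results: Dict[str, Tuple[str, int]] = {}
lock = threading.Lock()
agents = [
    threading.Thread(target=agent_loop,
                     args=(f"pilot.{k}", work, results, lock, 2.0))
    for k in range(3)
]
for a in agents:
    a.start()
for a in agents:
    a.join()
print(len(results), "tasks completed")
```

Which agent runs which task is not fixed in advance. It depends only on which agent happens to ask first, which is the defining property of the pull model.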
Pull scheduling can be formalized with queueing models. For a system with $N$ pilots, pull frequency $f$ per pilot, and per-pilot service rate $\mu$, the effective dispatch rate is bounded by $N f$ and the aggregate service rate is $N \mu$. System throughput is $X = \min(\lambda, N f, N \mu)$, where $\lambda$ is the task (or Scheduling Unit) arrival rate (Luckow et al., 2012, Turilli et al., 2015).
Mean task dispatch delay is inversely proportional to the pull frequency: the more often agents poll, the sooner a newly arrived task is assigned (Luckow et al., 2012).
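A small numerical sketch makes the two relationships concrete. It rests on assumptions that are not taken from the cited papers: each pull retrieves at most one task, so the dispatch rate cannot exceed the number of pilots times the pull frequency, and the delay estimate considers a single agent polling at a fixed period with task arrivals spread evenly over that period. The function names are hypothetical.

```python
def throughput_bound(arrival_rate: float, n_pilots: int,
                     pull_freq: float, service_rate: float) -> float:
    """Upper bound on tasks/sec: the slowest of arrival, dispatch, and service."""
    dispatch_rate = n_pilots * pull_freq     # assumes one task per pull
    aggregate_service = n_pilots * service_rate
    return min(arrival_rate, dispatch_rate, aggregate_service)


def mean_poll_wait(pull_freq: float, samples: int = 10_000) -> float:
    """Average wait until the next poll, for arrivals spread evenly over one period."""
    period = 1.0 / pull_freq
    # Arrival k lands at the midpoint of the k-th slice of the period
    # and waits for the remainder of the period.
    waits = [period - (k + 0.5) * period / samples for k in range(samples)]
    return sum(waits) / samples


# Made-up operating point: 50 tasks/s arrive, 10 pilots poll at 2 Hz,
# and each pilot can service 8 tasks/s.
print(throughput_bound(50.0, 10, 2.0, 8.0))
print(mean_poll_wait(2.0), mean_poll_wait(4.0))
```

In this simplified model the pull rate, not the compute capacity, is the bottleneck at the chosen operating point, and doubling the pull frequency halves the average wait.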
3. Performance Models and Empirical Results
Pilot-Job and pull scheduling systems are evaluated using a range of performance metrics, typically task throughput, scheduler overhead, resource/core utilization, weak- and strong-scaling efficiency, and end-to-end turnaround (Merzky et al., 2018, Merzky et al., 2015, Hasham et al., 2012, Luckow et al., 2013).
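Key metrics, in their commonly used forms, include:
- Task throughput: $X = N_{\text{tasks}} / T$, the number of tasks completed per unit of wall-clock time $T$.
- Scheduling overhead: the time the system spends on scheduling and launching tasks rather than executing them.
- Weak-scaling efficiency: $E_{\text{weak}}(n) = T_1 / T_n$, where $T_n$ is the time to completion on $n$ resource units and the workload grows in proportion to $n$.
- Strong-scaling efficiency: $E_{\text{strong}}(n) = S(n) / n$ with $S(n) = T_1 / T_n$, for a fixed workload.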
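As a worked illustration, the helpers below compute these metrics using their standard textbook definitions: throughput as tasks per unit time, weak-scaling efficiency as the ratio of baseline to scaled runtime when the workload grows with the resources, and strong-scaling efficiency as speedup divided by the resource ratio for a fixed workload. The function names are hypothetical and the input numbers are made up for the example; they are not measurements from the cited studies.

```python
def task_throughput(n_tasks: int, makespan_s: float) -> float:
    """Tasks completed per second over the whole run."""
    return n_tasks / makespan_s


def weak_scaling_efficiency(t_base_s: float, t_scaled_s: float) -> float:
    """Workload grows with resources; ideal runtime stays constant."""
    return t_base_s / t_scaled_s


def strong_scaling_efficiency(t_base_s: float, t_scaled_s: float,
                              n_base: int, n_scaled: int) -> float:
    """Workload is fixed; ideal speedup equals the resource ratio."""
    speedup = t_base_s / t_scaled_s
    return speedup / (n_scaled / n_base)


# Made-up numbers for illustration only.
print(task_throughput(4096, 512.0))                     # tasks per second
print(weak_scaling_efficiency(100.0, 125.0))            # fraction of ideal
print(strong_scaling_efficiency(1000.0, 160.0, 1, 8))   # fraction of ideal
```

An efficiency below 1 in either mode indicates that overheads such as scheduling or launch latency grow faster than the useful work.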
Experimental validation on the Titan supercomputer with RADICAL-Pilot (RP) demonstrated scalable execution up to cores and concurrent 32-core MPI tasks. Weak scaling showed resource utilization up to cores but overheads grow with scale due to scheduling and launch latencies. Bottlenecks emerged from in-memory scheduling and, at the largest scales, ORTE launcher jitter. Introducing a specialized scheduler reduced per-task assignment from to CUs/sec (Merzky et al., 2018). In RP on Cray systems, throughput exceeded $100$ tasks/sec and concurrency $16K+$ tasks, but efficiency degradation occurred beyond $8K$ cores due to lock contention (Merzky et al., 2015).
In the CMS Tier0 production workflow, pilot-pull reduced end-to-end workflow turnaround by $25$– and data stage-in times by up to under heavy storage element (SE) load, largely by exploiting cache-aware scheduling and eliminating local queue latencies (Hasham et al., 2012).
4. Architectural Realizations and System Variants
Multiple implementations follow the Pilot-Job with pull scheduling paradigm and have been deployed at scale:
- RADICAL-Pilot (RP): Modular Python platform for supercomputers, using SAGA for pilot submission, MongoDB for work queueing, pull-loop agents with in-memory and specialized scheduling, and support for multiple parallel launchers (Merzky et al., 2018, Merzky et al., 2015).
- BigJob/Pilot-API: Implements pull-based scheduling over Redis with “PilotManager,” elasticity via resource and data “affinity,” and explicit support for co-placement of compute and data units (Pilot-Data) (Luckow et al., 2013).
- Condor Glidein and DIANE: Pilots, realized as “Startd” daemons or “WorkerAgents,” periodically poll central managers (Schedd, RunMaster) for available tasks; pilot registration and task execution are modeled in the P* abstraction (Luckow et al., 2012).
- CMS Pilot Infrastructure: On CERN Tier0/Tier2 clusters, pilots wrap task execution with intelligent local data caches and cache-aware task assignment, improving data locality via pull scheduling and LRU-based cache management (Hasham et al., 2012).
- Production Grids: Frameworks such as PanDA and GlideinWMS routinely deliver tasks/day and utilize $700$ million CPU hours/year, validated with pull-based pilot scheduling (Turilli et al., 2015).
A generalized architecture entails centralized Pilot/Workload Managers, persistent work queues (MongoDB, Redis), agents per pilot, and optional affinity or data locality constraints for optimized task-to-resource or data-to-compute co-placement (Luckow et al., 2013). The scheduling primitives—task enqueueing/pulling, agent polling, and status updates—are typically exposed via well-defined APIs (e.g., Pilot-API (Luckow et al., 2012)).
5. Data-Aware Pull Scheduling and Extensions
Pilot-Job systems often integrate cross-cutting concerns such as data locality, cache management, and affinity. In Pilot-Data, each pilot agent may manage both compute and storage slots, pulling CUs (compute units) and DUs (data units) from distinct, possibly global work queues. Task assignment can be guided by explicit resource or compute/data affinity functions that schedule tasks to minimize data movement subject to capacity constraints (Luckow et al., 2013). In distributed scientific workflows, the cost model incorporates queue, compute, staging, and transfer times, and the decision to move compute to data or data to compute rests on the relative magnitude of these terms.
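One way to picture affinity-guided pulling is sketched below. The affinity function used here, the fraction of a task's input data units already held at the requesting pilot, is an assumption chosen for illustration and is not the function defined in the Pilot-Data work. The names `affinity` and `pull_best` are hypothetical, and the capacity constraint is modeled simply by having a pilot pull only when it has a free slot.

```python
from typing import Dict, List, Optional, Set

Task = Dict[str, object]


def affinity(task_inputs: Set[str], pilot_data: Set[str]) -> float:
    """Hypothetical score: share of the task's inputs already local to the pilot."""
    if not task_inputs:
        return 1.0
    return len(task_inputs & pilot_data) / len(task_inputs)


def pull_best(pilot_data: Set[str], pending: List[Task]) -> Optional[Task]:
    """Remove and return the queued task needing the least data movement."""
    if not pending:
        return None
    # max() keeps the earliest task among equal scores, so ties stay FIFO.
    best = max(range(len(pending)),
               key=lambda i: affinity(pending[i]["inputs"], pilot_data))
    return pending.pop(best)


pending: List[Task] = [
    {"uid": "cu.0", "inputs": {"du.3"}},
    {"uid": "cu.1", "inputs": {"du.1", "du.2"}},
    {"uid": "cu.2", "inputs": {"du.1"}},
]

pilot_a_data = {"du.1", "du.2"}
pilot_b_data = {"du.3"}

order = [
    pull_best(pilot_a_data, pending)["uid"],  # pilot A holds both inputs of cu.1
    pull_best(pilot_a_data, pending)["uid"],  # then the remaining local task
    pull_best(pilot_b_data, pending)["uid"],  # pilot B takes the du.3 task
]
print(order)
```

A fuller cost model would weigh this score against queue, compute, staging, and transfer times before deciding whether to move the task or the data.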
Cache-aware pilot systems, such as deployed in CMS, further optimize task pulling and dispatch based on local peer cache sharing and data reuse policies (LRU eviction, waitForData), demonstrably increasing cache hit ratios and reducing turnaround times by more than under high loads (Hasham et al., 2012).
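The effect of LRU eviction and reuse-friendly task ordering on the cache hit ratio can be shown with a small sketch. It is a generic LRU cache built on `collections.OrderedDict`, not the CMS implementation; the class name `LRUCache`, its methods, and the access sequences are hypothetical, and policies such as waitForData and peer cache sharing are not modeled.

```python
from collections import OrderedDict
from typing import Iterable


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.items: "OrderedDict[str, bool]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def fetch(self, key: str) -> bool:
        """Return True on a cache hit; on a miss, stage the item in."""
        if key in self.items:
            self.items.move_to_end(key)      # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        self.items[key] = True               # stand-in for a stage-in transfer
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict the least recently used
        return False

    def hit_ratio(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def run(accesses: Iterable[str], capacity: int = 2) -> float:
    cache = LRUCache(capacity)
    for dataset in accesses:
        cache.fetch(dataset)
    return cache.hit_ratio()


# Same six tasks, two pull orders: interleaved versus grouped by input dataset.
interleaved = run(["a", "b", "a", "c", "b", "a"])
grouped = run(["a", "a", "a", "b", "b", "c"])
print(interleaved, grouped)
```

With the same tasks and the same cache size, pulling tasks that share an input back to back yields a higher hit ratio in this toy setup, which is the intuition behind cache-aware task assignment.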
6. Comparative Analysis: Pull vs. Push Scheduling
Pull scheduling offers distinct advantages over push strategies in distributed environments:
- Late Binding: Tasks are assigned to concrete resources only upon agent pull, reducing the risk of wasted scheduling to delayed or failed resource allocations (Merzky et al., 2015, Turilli et al., 2015, Luckow et al., 2012).
- Elastic Load Balancing: Faster or less burdened pilots pull more frequently, automatically balancing work without central state awareness; pilot failures do not entail task loss (Luckow et al., 2012, Turilli et al., 2015).
- Scalability: Because each agent requests only as much work as it can service, load regulates itself as the pilot pool scales; the central queue, however, must be engineered to absorb high rates of pull requests (Turilli et al., 2015).
- Adaptivity: Supports dynamic workflows, heterogeneous resources, and customizable scheduling (including data/compute affinity and co-placement) (Luckow et al., 2013, Merzky et al., 2015).
- Downsides: Dispatch latency that grows with the pull interval (that is, inversely with the pull frequency), risk of heavy load on the central queue/broker, and possible fairness limitations unless fairness is made explicit in the workload manager policy (Luckow et al., 2012, Turilli et al., 2015).
Push schedulers may reduce per-task assignment lag, but they complicate fault tolerance and state tracking. This suggests that pull scheduling is favored where robustness, elasticity, and resource dynamism matter, in execution environments spanning supercomputers, cloud, and grid platforms (Turilli et al., 2015, Luckow et al., 2012).
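The claim that pilot failures need not cause task loss can be illustrated with a lease-based queue. The cited sources do not specify this mechanism; a lease (or visibility timeout) is one common technique, used here as an assumption. The class `LeaseQueue` and its methods are hypothetical, and time is passed in explicitly so the example is deterministic.

```python
from collections import deque
from typing import Deque, Dict, Optional, Set, Tuple


class LeaseQueue:
    """Central queue where a pulled task returns to the queue if its lease expires."""

    def __init__(self, lease_s: float) -> None:
        self.lease_s = lease_s
        self.ready: Deque[str] = deque()
        self.leased: Dict[str, Tuple[str, float]] = {}  # uid -> (pilot, expiry)
        self.done: Set[str] = set()

    def put(self, uid: str) -> None:
        self.ready.append(uid)

    def _reclaim(self, now: float) -> None:
        # Tasks held by pilots that went silent become available again.
        for uid, (_, expiry) in list(self.leased.items()):
            if expiry <= now:
                del self.leased[uid]
                self.ready.append(uid)

    def pull(self, pilot: str, now: float) -> Optional[str]:
        self._reclaim(now)
        if not self.ready:
            return None
        uid = self.ready.popleft()
        self.leased[uid] = (pilot, now + self.lease_s)
        return uid

    def complete(self, uid: str, pilot: str) -> bool:
        """Accept a result only from the pilot that currently holds the lease."""
        holder = self.leased.get(uid)
        if holder is None or holder[0] != pilot:
            return False
        del self.leased[uid]
        self.done.add(uid)
        return True


q = LeaseQueue(lease_s=10.0)
q.put("cu.0")
q.put("cu.1")

t0 = q.pull("pilot.a", now=0.0)        # pilot.a takes cu.0, then fails silently
t1 = q.pull("pilot.b", now=1.0)        # pilot.b takes cu.1
ok1 = q.complete(t1, "pilot.b")
early = q.pull("pilot.b", now=5.0)     # cu.0 is still leased to pilot.a
retry = q.pull("pilot.b", now=12.0)    # lease expired; cu.0 is handed out again
stale = q.complete("cu.0", "pilot.a")  # a late report from pilot.a is rejected
ok0 = q.complete(retry, "pilot.b")
print(sorted(q.done))
```

The central queue never needs to detect the failure directly. It only needs a timeout, which keeps the fault-handling logic independent of the number of pilots.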
7. Scalability, Robustness, and Limitations
Pilot-Job with pull scheduling has exhibited production-grade scaling and robustness in extreme environments:
- In RADICAL-Pilot, control-path bottlenecks (scheduler and executor locks, launch subsystem jitter) define the strong scaling limits; specialized, O(1) scheduling algorithms and launch optimizations ameliorate these at large scales (Merzky et al., 2018, Merzky et al., 2015).
- System throughput is ultimately bounded by either agent-side pull rate or pilot service rate, and queueing models (M/M/C) provide closed-form predictions for throughput and average task waiting times (Luckow et al., 2012, Turilli et al., 2015).
- Shared queues or databases (MongoDB, Redis) can become single points of contention and may require sharding, back-off, or federation strategies in very large pilot pools (Merzky et al., 2015, Turilli et al., 2015).
- Cache-based, data-aware pull (CMS, Pilot-Data) enhances efficiency, but pilot churn and transient resource loss (pilot expiry, spot node failure) necessitate careful handling of work queue consistency and data locality (Hasham et al., 2012, Luckow et al., 2013).
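The closed-form predictions mentioned for the M/M/c model can be computed with the standard Erlang C formula. The sketch below assumes Poisson task arrivals, exponentially distributed service times, and `servers` identical execution slots. Mapping pilot slots onto queueing servers in this way is an illustrative assumption rather than a calibration from the cited work, and the function names are hypothetical.

```python
from math import factorial


def erlang_c(arrival_rate: float, service_rate: float, servers: int) -> float:
    """Probability that an arriving task has to wait in an M/M/c queue."""
    offered = arrival_rate / service_rate     # offered load in Erlangs
    rho = offered / servers                   # utilization per server
    if rho >= 1.0:
        raise ValueError("unstable: arrival rate must be below servers * service_rate")
    tail = (offered ** servers / factorial(servers)) / (1.0 - rho)
    head = sum(offered ** k / factorial(k) for k in range(servers))
    return tail / (head + tail)


def mean_queue_wait(arrival_rate: float, service_rate: float, servers: int) -> float:
    """Mean time a task spends waiting before service starts."""
    p_wait = erlang_c(arrival_rate, service_rate, servers)
    return p_wait / (servers * service_rate - arrival_rate)


# Made-up rates: 2 tasks/s arriving, each slot finishing 2 tasks/s, 2 slots.
print(erlang_c(2.0, 2.0, 2))
print(mean_queue_wait(2.0, 2.0, 2))
```

For a single server the formula reduces to the familiar M/M/1 results, which gives a quick sanity check before applying it to larger slot counts.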
The Pilot-Job paradigm with late-binding and pull scheduling is the dominant model for scalable many-task, data-intensive, and workflow-driven scientific computing on leadership-class clusters, grids, and cloud resources (Turilli et al., 2015, Luckow et al., 2012, Merzky et al., 2015, Merzky et al., 2018, Hasham et al., 2012, Luckow et al., 2013).