Malleable Scheduling
- Malleable scheduling is a model where tasks can dynamically change their resource allocation during execution, enabling improved throughput and adaptive power management.
- It leverages models like gang scheduling and LP-based assignment to address challenges such as makespan minimization and energy efficiency in distributed systems.
- Practical implementations in HPC, cloud computing, and real-time systems demonstrate significant gains in responsiveness, reduced waiting times, and improved resource utilization.
Malleable scheduling refers to a class of scheduling models and algorithms in which jobs (or tasks) may be executed in parallel on a variable number of processing units (cores, nodes, or arbitrary shares of a homogeneous/divisible resource), with the precise resource allocation possibly changing during execution. Unlike rigid or moldable jobs (whose resource allocation is fixed prior to execution), malleable jobs can expand or shrink their allocation dynamically. This approach directly addresses modern challenges in parallel and distributed systems by enabling more efficient use of resources, improved responsiveness, adaptive power/energy management, and enhanced throughput—at the cost of substantially increased scheduling complexity. Recent research has systematically extended both the theoretical understanding and system-level implementation of malleable scheduling in contexts ranging from classical combinatorial optimization and approximation algorithms to high-performance and cloud computing infrastructures.
1. Formal Models and Definitions
A malleable job is defined by a set of parameters that characterize its execution under parallelization: a workload $W_j$ (sequential work volume), allowed resource bounds, a duration that depends on the degree of parallelism, and possibly a concave, monotone speedup (processing) function $\sigma_j$. Given a system of (possibly heterogeneous) resources, a malleable job may at any time be assigned $p$ processing units, where $p$ may vary over the job's lifetime (subject to the parallelism bounds), and completes in time $t_j(p)$, which is typically a non-increasing, sublinear function of $p$ due to communication and synchronization overheads.
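To make these definitions concrete, the following minimal Python sketch (the notation, the power-law speedup $\sigma(p) = p^{0.8}$, and the piecewise-constant allocation profile are illustrative assumptions, not taken from any of the cited papers) simulates a malleable job whose allocation changes mid-execution: the job completes once the integrated processing rate covers its sequential work volume.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class MalleableJob:
    work: float                        # sequential work volume W_j
    speedup: Callable[[float], float]  # sigma(p): processing rate on p units
    p_min: float = 1.0                 # lower parallelism bound
    p_max: float = 64.0                # upper parallelism bound

def completion_time(job: MalleableJob,
                    allocation: List[Tuple[float, float]]) -> float:
    """Completion time under a piecewise-constant allocation profile.

    `allocation` is a list of (duration, p) pieces; the job finishes as soon
    as the accumulated work sum_i duration_i * sigma(p_i) reaches `job.work`.
    """
    done, elapsed = 0.0, 0.0
    for duration, p in allocation:
        p = max(job.p_min, min(p, job.p_max))   # clamp to parallelism bounds
        rate = job.speedup(p)
        if done + rate * duration >= job.work:  # job finishes inside this piece
            return elapsed + (job.work - done) / rate
        done += rate * duration
        elapsed += duration
    raise ValueError("allocation profile ends before the job completes")

# Concave power-law speedup sigma(p) = p**0.8: monotone but sublinear.
job = MalleableJob(work=100.0, speedup=lambda p: p ** 0.8)
# The job is shrunk from 16 to 4 units after 2 time units.
print(completion_time(job, [(2.0, 16.0), (1000.0, 4.0)]))
```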
Canonical models include:
- Gang scheduling with malleable tasks: Synchronous execution on varying processor sets, with all of a job's instances starting and completing together. The speedup vector of job $j$ encodes the rate at which its work is completed given the instantaneous processor allocation (Fisher et al., 2013).
- Generalized malleable scheduling: The execution time of a job $j$ is a set function of the (possibly job-dependent, heterogeneous) set $S$ of allocated resources (Fotakis et al., 2019, 2021), with concave, submodular, or $M^\natural$-concave structure of the processing-speed functions controlling schedulability and amenability to efficient approximation.
- Shared-resource and divisible-resource models: A total resource of capacity $R$ is continuously split among jobs, each job $j$ receiving a time-varying share $r_j(t)$ bounded above by its speedup limit $s_j$, capturing bandwidth, cloud resource slices, or cores (Damerius et al., 2023).
The malleable model generalizes classical single-machine and parallel machine scheduling, moldable tasks, and gang models, and it subsumes priorities, deadlines, and various forms of precedence constraints.
2. Algorithmic Techniques and Approximation Guarantees
Processor/Resource Assignment and Scheduling
The central challenge is to jointly determine, at each instant, which subset of jobs should execute, and with what resource allocations, so as to optimize objectives such as makespan, total completion time, energy, or total value (social welfare). The solution space is high-dimensional and, under concave or submodular speedup models, strongly non-convex.
Key approaches:
- Canonical schedule and polynomial-time selection: For gang-scheduled, real-time malleable systems, feasibility reduces to computing each task's processor requirement at a candidate operating frequency $f$ (driven by the task's utilization at that frequency) and checking that the total requirement does not exceed the number of available processors. The optimal frequency and per-task processor counts are found via binary search, exploiting convexity properties (Fisher et al., 2013).
- Continuous linear programming for total completion time: For divisible resource settings, the fractional schedule obtained via a continuous LP relaxation is rounded into an integral schedule, often using water-filling or flat-filling steps to minimize makespan or response time (Damerius et al., 2023); a minimal water-filling sketch follows this list.
- Assignment/schedule transformation: For concave, submodular, or $M^\natural$-concave speedup models, LP-based assignment algorithms are used. Any schedule can be approximated to within a constant or logarithmic factor (depending on the exact concavity assumption) by first solving a corresponding assignment problem and then transforming the result into a schedule with a "well-structured" allocation (Fotakis et al., 2021).
- LP rounding and sparsified schedules: For heterogeneous and unrelated machines, LP relaxations with constraints on effective total speeds are rounded using structural sparsity properties (pseudo-forest supports) to yield constant-factor (e.g., $2e/(e-1)$) makespan approximations (Fotakis et al., 2019).
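As a concrete illustration of the water-filling step mentioned above, the sketch below splits a divisible resource of capacity $R$ among jobs with individual speedup limits $s_j$: a common level rises until either the resource is exhausted or every job is capped. The function name and the simple iterative scheme are assumptions for illustration, not the exact procedure of Damerius et al. (2023).

```python
from typing import Dict

def water_fill(total: float, caps: Dict[str, float]) -> Dict[str, float]:
    """Split a divisible resource among jobs with individual upper bounds.

    Every uncapped job receives the same share (the "water level");
    jobs whose cap lies below that level are frozen at their cap and
    the leftover is redistributed among the remaining jobs.
    """
    alloc = {j: 0.0 for j in caps}
    active = set(caps)
    remaining = total
    while active and remaining > 1e-12:
        level = remaining / len(active)          # tentative equal share
        capped = {j for j in active if caps[j] - alloc[j] <= level}
        if not capped:                           # nobody hits a cap: done
            for j in active:
                alloc[j] += level
            return alloc
        for j in capped:                         # freeze capped jobs first
            remaining -= caps[j] - alloc[j]
            alloc[j] = caps[j]
        active -= capped
    return alloc

# Example: capacity 10 shared by three jobs with speedup limits 2, 5, and 8.
print(water_fill(10.0, {"a": 2.0, "b": 5.0, "c": 8.0}))
# -> job a is capped at 2; b and c split the remaining 8 equally (4 each).
```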
Approximation factor landscape:
- Approximation guarantees for general concave speedup functions, with improved factors for power-law speedups (Makarychev et al., 2014).
- Constant factors for matroid-rank or $M^\natural$-concave speeds (e.g., a 193-approximation under the broadest assumptions and a 5-approximation for special cases) (Fotakis et al., 2021).
- Lower bounds: no better than an $e/(e-1)$-approximation on unrelated machines; logarithmic or polynomially large inapproximability without concavity assumptions (Fotakis et al., 2019, 2021).
3. System-Level Implementations and Hierarchical Scheduling
Realization of malleable scheduling in modern systems necessitates tight integration across the programming model runtime, resource management layer, and the application itself:
- Feedback-driven hierarchy: Adaptive schemes like AC-DS employ local feedback at each scheduling quantum, aggregating resource "desires" up a scheduling hierarchy and allocating resources using dynamic partitioning (DEQ) policies. AC-DS achieves constant-factor competitiveness in makespan regardless of hierarchy depth (Cao et al., 2014).
- API and runtime support: Frameworks such as DMR API provide bidirectional communication between an application's runtime (e.g., OmpSs, Nanos++, invasive MPI) and the resource manager (e.g., Slurm) to shrink, expand, or leave allocations unchanged at runtime, supporting both expansion on spare resources and contraction to admit queueing jobs (Iserte et al., 2020, Chadha et al., 2020).
- Power- and performance-awareness: Dynamic performance data (e.g., MPI/compute ratios, power counters) is collected via lightweight handlers to trigger shrink/expand decisions for jobs, either to improve system throughput or satisfy power corridor constraints using LP reallocation (Chadha et al., 2020).
- Online job control: Primitives for dynamic resource management—dmr_check_status, shrink/expand, and adaptation windows—allow applications to (a) receive adaptation requests, (b) redistribute data (e.g., via MPI_Comm_spawn, data migration protocols), and (c) minimize reconfiguration overhead in production workloads (Iserte et al., 2020).
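The shrink/expand control flow described above can be summarized by the following self-contained Python sketch; the class names, the check_status protocol, and the toy expansion/contraction policy are hypothetical stand-ins for the primitives of a DMR/Slurm-style stack, not their actual APIs.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ResourceManager:
    """Toy resource manager that offers expansion or requests contraction."""
    free_nodes: int = 8

    def check_status(self, current: int) -> Tuple[str, int]:
        if self.free_nodes >= current:            # spare capacity: offer to expand
            return "expand", current * 2
        if self.free_nodes == 0 and current > 1:  # queue pressure: ask to shrink
            return "shrink", current // 2
        return "none", current

@dataclass
class MalleableApp:
    """Toy malleable application with ideal speedup, for illustration only."""
    work_left: float = 100.0
    nodes: int = 2
    history: List[int] = field(default_factory=list)

    def step(self) -> None:
        self.work_left -= float(self.nodes)       # one compute iteration
        self.history.append(self.nodes)

    def reconfigure(self, new_nodes: int) -> None:
        # A real code would spawn/release processes here (cf. MPI_Comm_spawn)
        # and redistribute its data to match the new node count.
        self.nodes = new_nodes

rm, app = ResourceManager(), MalleableApp()
while app.work_left > 0:
    app.step()
    action, n = rm.check_status(app.nodes)        # adaptation window
    if action != "none":
        rm.free_nodes += app.nodes - n            # return or take nodes
        app.reconfigure(n)
print("node counts over time:", app.history)
```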
Empirical results across diverse systems consistently show significant gains:
- Up to 60% reduction in total waiting time (Iserte et al., 2020)
- 7–19% reduction in makespan, up to 70% reduction in average slowdown, and material energy savings (D'Amico et al., 2020, Chadha et al., 2020)
- Improved resource utilization and instant start rates for on-demand jobs when malleable applications are present (Fan et al., 2021)
4. Theoretical Foundations: Concavity Properties and Complexity
The tractability and performance of malleable scheduling algorithms are heavily influenced by concavity properties imposed on the speedup function:
- Subadditivity and submodularity: These guarantee that the gains from allocating additional processors decrease (diminishing returns) and enable the transformation of any assignment into a near-optimal schedule with precise approximation factors (Fotakis et al., 2021); a brute-force submodularity check is sketched after this list.
- $M^\natural$-concavity: The strictest property considered, enabling constant-factor approximations via LP relaxations, matroid-based structures, and polymatroid intersection techniques (Fotakis et al., 2021).
- Hardness results: In general, scheduling with arbitrary or only supermodular speedup functions is hard to approximate within a polynomial or even polylogarithmic factor, reinforcing the need for structural speedup assumptions (Fotakis et al., 2019, Fotakis et al., 2021).
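The diminishing-returns property can be verified directly on small instances. The following brute-force check (a purely illustrative sketch over a toy ground set of machines) tests whether a candidate processing-speed function $f$ satisfies $f(S \cup \{e\}) - f(S) \ge f(T \cup \{e\}) - f(T)$ for all $S \subseteq T$ and $e \notin T$.

```python
from itertools import chain, combinations
from math import sqrt

def powerset(ground):
    return chain.from_iterable(combinations(ground, r)
                               for r in range(len(ground) + 1))

def is_submodular(f, ground) -> bool:
    """Brute-force check of diminishing returns on a small ground set."""
    subsets = [frozenset(s) for s in powerset(ground)]
    for S in subsets:
        for T in subsets:
            if not S <= T:
                continue
            for e in ground:
                if e in T:
                    continue
                # The marginal gain of e must not grow when the base set grows.
                if f(S | {e}) - f(S) < f(T | {e}) - f(T) - 1e-12:
                    return False
    return True

# Example speed function: f(S) = sqrt(total speed of the allocated machines).
speeds = {"m1": 4.0, "m2": 9.0, "m3": 1.0}
f = lambda S: sqrt(sum(speeds[i] for i in S))
print(is_submodular(f, set(speeds)))   # True: concave-of-additive is submodular
```

Swapping the concave square root for a convex gain (e.g., squaring the total speed) makes the check fail, which is precisely the supermodular regime covered by the hardness results above.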
Key formal relations include:
- The feasibility condition for homogeneous, malleable gang scheduling, which bounds the total processor requirement of all tasks at a candidate frequency by the number of available processors.
- The assignment-to-schedule transformation guarantee for XOS speed functions.
- The power-aware LP constraint coupling the node counts assigned to running and waiting jobs with the system's power corridor.
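Schematic renderings of the first and third relations, written in the generic notation of Section 1, are shown below; these are generic forms under stated assumptions, not the exact formulations of the cited papers.

```latex
% Gang feasibility at operating frequency f on m identical processors:
% each task i requires m_i(f) processors, driven by its utilization at f.
\sum_{i=1}^{n} m_i(f) \;\le\; m

% Power-corridor constraint: the node counts n_j chosen for the jobs must
% keep the total system power within the prescribed corridor.
P_{\min} \;\le\; \sum_{j} P_j(n_j) \;\le\; P_{\max}
```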
5. Applications in Real-Time, HPC, and Cloud Environments
Malleable scheduling models and algorithms have been deployed or evaluated in several critical application contexts:
- Real-time embedded multiprocessor systems: Optimization of homogeneous operating frequencies and dynamic core allocation for power minimization while guaranteeing deadline feasibility; malleability yields power savings of up to 60 W relative to non-parallel schedules (Fisher et al., 2013).
- Sparse linear algebra and scientific computing: Multifrontal factorizations with task dependency trees leverage malleable scheduling to dynamically partition work, achieving performance gains over non-malleable baselines for realistic kernels (Guermouche et al., 2014).
- Cloud and batch computing: Sufficient and necessary "boundary conditions" for scheduling malleable batch jobs enable precise feasibility checks, optimal resource utilization, improved social welfare, and machine minimization (Wu et al., 2015).
- Hybrid workload environments: Systems supporting on-demand, rigid, and malleable jobs achieve near-instant job start rates for on-demand tasks, better system utilization, and strong incentives for users to declare malleability (Fan et al., 2021).
- Adaptive, reinforcement learning-based scheduling: RL-driven frameworks like MARS select between pre-trained heuristic and learned scheduling policies according to dynamic workload conditions, attaining up to 60% performance increases (Baheri et al., 2020); a schematic policy selector is sketched after this list.
- Dynamic reallocation and elasticity: Online task insertion and deletion, efficient rescheduling (with provably tight reallocation cost per insertion in systems with slack), and the possibility of reducing this to one reallocation per operation in "aligned" or well-structured instances (Lim et al., 2015).
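To illustrate the policy-selection idea in the RL-driven item above, here is a minimal, hypothetical sketch; the policy names, the score callback standing in for a learned value estimate, and the toy workload are all assumptions for illustration, not the MARS implementation.

```python
from typing import Callable, Dict, List

Policy = Callable[[List[dict]], List[dict]]   # maps a job queue to an ordering

def fcfs(queue: List[dict]) -> List[dict]:
    return sorted(queue, key=lambda j: j["submit"])

def smallest_first(queue: List[dict]) -> List[dict]:
    return sorted(queue, key=lambda j: j["work"])

def select_policy(queue: List[dict], policies: Dict[str, Policy],
                  score: Callable[[str, List[dict]], float]) -> Policy:
    """Pick the policy with the best predicted score for the current workload.

    `score` stands in for a learned value estimate (e.g., predicted slowdown);
    in an RL setting it would be produced by a trained model.
    """
    best = max(policies, key=lambda name: score(name, queue))
    return policies[best]

# Toy example: prefer smallest-first when the queue is dominated by short jobs.
queue = [{"submit": 0, "work": 5}, {"submit": 1, "work": 1}, {"submit": 2, "work": 2}]
score = lambda name, q: float(
    (sum(j["work"] for j in q) / len(q) < 3) == (name == "smallest_first"))
chosen = select_policy(queue, {"fcfs": fcfs, "smallest_first": smallest_first}, score)
print([j["work"] for j in chosen(queue)])   # -> [1, 2, 5]
```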
6. Limitations, Practical Challenges, and Open Problems
- Implementation hurdles: Full malleability at the system/application/programming model level remains rare. Barriers include lack of malleable application designs (especially for MPI-dominated codes), limited runtime support for dynamic resource changes, and overhead management.
- Overhead and adaptation cost: While recent implementations demonstrate low overhead for resource expansion/shrinking and dynamic rescheduling, nontrivial costs can arise from data movement, checkpointing, and communication-topology changes in real workloads (Iserte et al., 2020).
- Approximation constants: State-of-the-art constant-factor approximation ratios for the most general concave models remain large (e.g., 193 (Fotakis et al., 2021)), while further tightening for submodular/XOS models is open.
- Dynamic or online malleable scheduling: While online policies such as water-filling are optimal in some settings (Damerius et al., 2023), the extension of these guarantees (especially for total completion time objectives or more general speedup models) remains an active direction.
- Resource augmentation and fairness: The black-box reduction between malleable scheduling and max-min fair allocation establishes a pathway to resource-augmented O(1)-approximate MMFA algorithms, but bridging the gap between scheduling efficiency and fair allocation in decentralized/hierarchical systems is ongoing (Fotakis et al., 2021).
7. Future Directions and Broader Impact
The formal advances in malleable scheduling theory are already reflected in the design of batch schedulers, HPC cluster resource managers, and application frameworks capable of exploiting dynamic elasticity. Looking forward:
- Integration with predictive and machine learning models for runtime performance, resource prediction, and cost-aware scheduling will enhance adaptation and efficiency (Baheri et al., 2020).
- Heterogeneous system support: Generalized speedup models accounting for CPUs, GPUs, and other accelerators, as well as complex memory/network hierarchies.
- Energy and power management: Joint power-aware and performance-aware dynamic resource reallocation, using malleability as a first-class scheduling primitive (D'Amico et al., 2020, Chadha et al., 2020).
- Dynamic workflows and hybrid workloads: Comprehensive frameworks for workflows requiring on-demand flexibility, fair access, and low tail latency, all within a malleable scheduling substrate (Fan et al., 2021).
- Robust approximation algorithms for new concavity classes: Constant-factor bounds for broader speedup function families remain a theoretical and practical open challenge.
In sum, malleable scheduling now encompasses an interdisciplinary domain bridging combinatorial optimization, online algorithms, resource management in real-world distributed systems, and emerging requirements in large-scale computational science and cloud operations.