Learning-Based Scheduling on Unrelated Machines
- The paper introduces learning-based strategies, including permutation predictors, deep neural networks, and reinforcement learning, to enhance scheduling on heterogeneous machines.
- It leverages advanced techniques such as PAC learning, Transformer encoders, and MAPPO to provide theoretical guarantees and robust performance across varying problem scales.
- Empirical studies demonstrate that these methods outperform classical algorithms, achieving near-optimal results and practical scalability even under uncertainty and dynamic constraints.
Learning-based scheduling on unrelated machines addresses the problem of allocating jobs to machines under heterogeneous processing capabilities, potentially with additional complications such as setup times, worker constraints, or uncertainty in job characteristics. The unrelated machine model—where each job-machine pair is associated with a specific processing time—poses substantial challenges for both online and offline algorithmic design. Recent research has focused on integrating learning-based methods, including deep neural network policies, permutation-based predictors, and multi-agent reinforcement learning, to overcome these challenges and achieve provable performance as well as practical scalability.
1. Formal Problem Setting and Complexity
The unrelated parallel machine scheduling problem (UPMS) is defined over a set of machines $i \in M$ and jobs $j \in J$, where each job $j$ has a machine-dependent processing time $p_{ij}$, a deadline $d_j$, and a weight $w_j$. Each machine may have its own deactivation deadline and weight, with $r_i$ denoting machine $i$'s initially occupied runtime. The canonical offline objective is to minimize a cost function that might combine the makespan $C_{\max}$, the total weighted job tardiness, and the machine tardiness, for example

$$\min \; C_{\max} + \sum_{j \in J} w_j T_j + \sum_{i \in M} w_i T_i, \qquad T_j = \max(0,\, C_j - d_j),$$

where $C_j$ denotes the completion time of job $j$.
The problem remains strongly NP-hard in most settings, notably when minimizing makespan or total weighted completion time, due to the combinatorial richness of job-machine assignments and the absence of any uniformity or ordering in the processing-time matrix $(p_{ij})$. Constraints such as sequence-dependent setup times and human workforce compatibility (as modeled by binary compatibility matrices and job-specific worker requirements) further increase the problem's dimensionality and practical significance (Zampella et al., 12 Nov 2024, Hitzges et al., 22 Dec 2025).
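To make the setting concrete, the following is a minimal, purely illustrative sketch (the instance data and the greedy rule are hypothetical, not drawn from the cited papers) of assigning jobs to unrelated machines and measuring the resulting makespan:

```python
# Toy unrelated-machines instance: p[i][j] = processing time of job j on machine i.
# Note that no machine dominates the others on every job, which is what makes
# the model "unrelated".
p = [
    [4, 2, 8],   # machine 0
    [3, 7, 1],   # machine 1
]

def greedy_min_load(p):
    """Assign each job to the machine whose load would increase least
    (a simple list-scheduling heuristic, used here only for illustration)."""
    m = len(p)
    loads = [0.0] * m
    assignment = []
    for j in range(len(p[0])):
        i = min(range(m), key=lambda i: loads[i] + p[i][j])
        loads[i] += p[i][j]
        assignment.append(i)
    return assignment, max(loads)  # makespan = maximum machine load

assignment, makespan = greedy_min_load(p)
print(assignment, makespan)  # → [1, 0, 1] 4
```

Even this toy heuristic exploits the heterogeneity of the $p_{ij}$ matrix; the learning-based methods surveyed below replace such hand-designed rules with learned policies or predictions.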
2. Learning-Augmented Online Scheduling
In the online, non-clairvoyant setting, each job's processing requirements and, in some cases, its most efficient machine are unknown a priori. The approach in "Permutation Predictions for Non-Clairvoyant Scheduling" (Lindermayr et al., 2022) advances the state of the art by extracting relative orderings ("permutations") of jobs for each machine as predictions, rather than regressing on job sizes directly. At each decision point, the prediction specifies an assignment and a within-machine order, reflecting the structure of a weighted shortest processing time (WSPT) schedule.
Prediction quality is encoded via the delay error, i.e., the sum over jobs of the marginal increase in weighted completion time induced by deviating from the predicted orderings. Permutation predictions minimizing this error are PAC-learnable via empirical risk minimization, in both offline and online learning settings. The Preferential Time Sharing (PTS) algorithm then mixes a "clairvoyant" algorithm that trusts the prediction with a robust non-clairvoyant fallback (proportional fairness). The resulting competitive ratio degrades gracefully between $5.83$ (with perfect predictions) and $128$ (worst case), the best known for non-clairvoyant unrelated machine scheduling (Lindermayr et al., 2022). Empirical results demonstrate the robustness of permutation-based learning relative to both classical and prior learning-augmented heuristics.
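The WSPT ordering that each per-machine permutation reflects can be sketched as follows (the job data are hypothetical):

```python
# WSPT sketch: on a single machine, sorting jobs by increasing ratio of
# processing time to weight minimizes total weighted completion time.
# A permutation prediction encodes exactly such an ordering per machine.
jobs = {"a": (3.0, 1.0), "b": (1.0, 2.0), "c": (2.0, 2.0)}  # name -> (p_ij, w_j)

def wspt_order(jobs):
    # Sort by p/w ascending (equivalently, w/p descending).
    return sorted(jobs, key=lambda j: jobs[j][0] / jobs[j][1])

print(wspt_order(jobs))  # → ['b', 'c', 'a']
```

The key insight of the permutation-prediction approach is that only this relative order, not the numeric job sizes, needs to be predicted accurately.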
3. Deep Neural Approaches for Variable-Dimension Scheduling
Offline, deterministic scheduling on unrelated machines with highly variable dimensions and objectives is addressed by neural architectures designed explicitly to process variable-size input tensors (Hitzges et al., 22 Dec 2025). The challenge arises because each job-machine pair carries its own features (e.g., processing time, deadline, and weight), and the size of a scheduling instance can be arbitrary.
The architecture proposed in (Hitzges et al., 22 Dec 2025) encodes input states as follows: machine-feature sequences are embedded per job via a BiLSTM, with urgency vectors passed through self-attention over all jobs, producing a job embedding matrix. This is followed by a Transformer encoder to enable global inter-job context exchange, and a pointer-network-style decoder that emits a probability distribution over all valid job assignments plus a "deactivate machine" action. Training is fully supervised, using optimal solutions generated via enumeration on small-scale instances, with a softmaxed-inverse target that emphasizes the ranking of optimal actions.
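The variable-dimension property hinges on operations such as self-attention, whose parameter count is independent of the number of jobs. A minimal NumPy sketch (with identity projections in place of learned Q/K/V weights, so this is an illustration of the shape behavior, not the paper's architecture) shows that the same module handles any instance size:

```python
import numpy as np

def self_attention(X):
    """Single-head scaled dot-product self-attention over a variable number
    of job embeddings X of shape (n_jobs, d). Identity Q/K/V maps are used
    for simplicity; a real model would learn these projections."""
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)                    # (n_jobs, n_jobs) pairwise scores
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)    # row-wise softmax
    return weights @ X                               # context-mixed job embeddings

# The same function applies unchanged to small and large instances,
# which is the structural property needed for cross-scale generalization.
small = self_attention(np.random.rand(5, 8))
large = self_attention(np.random.rand(100, 8))
print(small.shape, large.shape)  # → (5, 8) (100, 8)
```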
Generalization to substantially larger instances is demonstrated: when applied to up to 100 jobs and 10 machines, the neural network consistently outperforms a strong dispatching rule, with average costs only 2.51% above optimal on small test sets and 22.22% below those of the rule on larger problem sizes, confirming the approach's robust cross-scale effectiveness (Hitzges et al., 22 Dec 2025).
4. Reinforcement Learning and Multi-Agent Formulations
For sequential and resource-constrained variants of the UPMS, including setup times and human-resource compatibility, recent work has cast the problem as a Markov decision process (MDP). State representations encode per-machine remaining times, current workloads, available workers, and fixed-capacity job buffers (Zampella et al., 12 Nov 2024). Both single-agent (maskable PPO) and multi-agent (MAPPO) reinforcement learning frameworks have been deployed.
In the single-agent regime, Maskable-PPO uses an MLP with infeasibility-masked logits, accelerating convergence in settings with moderate numbers of machines and jobs. In multi-agent MAPPO, each machine acts as an independent agent with local observations, while a centralized critic enables joint training. Reward shaping penalizes infeasibility, repeated idle actions, and non-minimal resource usage while incentivizing feasibility, minimizing processing, and throughput.
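The infeasibility masking described above can be sketched as a masked softmax over action logits (a generic illustration of the technique, not the exact implementation of the cited work):

```python
import math

def masked_softmax(logits, feasible):
    """Action masking as used in maskable PPO variants: infeasible actions
    get a logit of -inf, so they receive exactly zero probability and the
    policy never samples them."""
    masked = [l if ok else float("-inf") for l, ok in zip(logits, feasible)]
    m = max(masked)
    exps = [math.exp(l - m) for l in masked]   # exp(-inf) == 0.0
    s = sum(exps)
    return [e / s for e in exps]

# Action 1 (say, assigning a job to an occupied machine) is infeasible.
probs = masked_softmax([2.0, 1.0, 3.0], [True, False, True])
print(probs)  # probs[1] == 0.0; the rest renormalize over feasible actions
```

Masking shrinks the effective action space per step, which is a large part of why convergence accelerates on constraint-rich instances.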
Empirical studies indicate that Maskable-PPO attains rapid convergence on small instances but suffers scalability issues (its input/output dimensions grow with the numbers of machines and jobs), whereas MAPPO, whose per-agent complexity scales linearly with the number of machines, maintains robustness and stable performance as the machine count increases. Thus, decentralized RL policies, especially those with centralized training, offer a scalable route to industrial-scale UPMS variants (Zampella et al., 12 Nov 2024).
5. Robustness to Prediction Error and Speed-Oblivious Learning
Research on speed-oblivious models reveals that precise knowledge of all job-machine speeds is unnecessary for near-optimal performance if (potentially noisy) predictions, or even simple machine orderings, are available (Lindermayr et al., 2023). When given predicted speeds with distortion factor $\mu$, learning-augmented algorithms achieve competitive ratios scaling linearly to quadratically in $\mu$, depending on whether preemption and migration are permitted. In purely speed-ordered models, round-robin and greedy policies still ensure constant-factor competitiveness.
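A speed-ordered policy can be illustrated with a small sketch: only the ordering of machines by speed is assumed known, and jobs are spread across that ordering, largest first (the instance data and this specific round-robin rule are illustrative, not the paper's algorithm):

```python
# Speed-ordered sketch: machine speeds are unknown, but their ranking
# (fastest first) is. Larger jobs are steered toward faster machines.
machines_fastest_first = ["m0", "m2", "m1"]          # known ordering only
job_sizes = {"j0": 5, "j1": 9, "j2": 2, "j3": 7}     # hypothetical sizes

def speed_ordered_greedy(order, jobs):
    assignment = {}
    for k, job in enumerate(sorted(jobs, key=jobs.get, reverse=True)):
        assignment[job] = order[k % len(order)]       # round-robin over speed order
    return assignment

print(speed_ordered_greedy(machines_fastest_first, job_sizes))
# → {'j1': 'm0', 'j3': 'm2', 'j0': 'm1', 'j2': 'm0'}
```

The point of the ordinal model is that such policies need no numeric speed estimates at all, yet remain constant-factor competitive.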
Empirical deployments on heterogeneous multicore processors confirm that robust, prediction-driven scheduling policies can closely match oracle performance for low to moderate prediction noise, and significantly outperform classical non-clairvoyant algorithms, especially under heavy load. As the speed distortion increases, graceful degradation toward established lower bounds is observed (Lindermayr et al., 2023).
6. Comparative Summary of Algorithmic Paradigms
| Methodology | Problem Scope | Key Guarantee / Result |
|---|---|---|
| Permutation-Predicting PTS (Lindermayr et al., 2022) | Online, non-clairvoyant, unrelated | $5.83$-competitive (with perfect permutations), $128$-competitive (robust, worst-case); error-sensitive bounds |
| Variable-Input Deep Net (Hitzges et al., 22 Dec 2025) | Offline, deterministic, variable size | 2.51% above optimal on small instances; 22.2% better than a strong dispatching rule on large instances |
| RL and MAPPO (Zampella et al., 12 Nov 2024) | Sequential, setup/resource constrained | Maskable-PPO excels on small/medium instances; MAPPO scales to large machine counts |
| Speed-Augmented Schedulers (Lindermayr et al., 2023) | Online, speed-oblivious | Competitive ratio scales linearly (or quadratically) with noise distortion |
Each approach leverages learning in a distinct manner: permutation-based predictors encode structural scheduling intuition; deep neural models process variable-dimension combinatorics; reinforcement learning accommodates temporal, resource, and sequential structure; speed-oblivious algorithms exploit prediction robustness. The techniques are complementary and can be selected or integrated according to the specific requirements of application settings, prediction availability, and instance scale.
7. Practical Implications and Prospects
Learning-based scheduling on unrelated machines has evolved to address both worst-case robustness and average-case empirical performance, across offline and online models, and spanning both deterministic and stochastic processing. The development of architectures capable of input-agnostic scheduling (variable jobs, machines, and features) (Hitzges et al., 22 Dec 2025), as well as the principled integration of predictions or ordinal knowledge (Lindermayr et al., 2022, Lindermayr et al., 2023), enables application at industrial and computational scale. Centralized and decentralized RL approaches provide a bridge to real-world resource constraints, with architectural innovations such as action masking, multi-agent decomposition, and tuned reward shaping ensuring tractable convergence and policy interpretability (Zampella et al., 12 Nov 2024).
A plausible implication is that future progress will likely combine these methods: neural architectures suited for variable and dynamic inputs, RL-based adaptation to constraint-rich, sequential environments, and learning-augmented online controllers robust to uncertainty in processing characteristics. The field continues to unify theoretical guarantees (competitive analysis, PAC learnability) with demonstrated cross-scale generalization and empirical superiority over classical rules, fundamentally reshaping scheduling practices on heterogeneous computational and manufacturing platforms.