
Cost-Guided Scheduling Algorithm

Updated 14 November 2025
  • Cost-guided scheduling algorithms use detailed, parameterized cost models to assign work units according to energy, time, monetary, or reliability metrics.
  • They employ offline cost modeling and a two-stage split-and-pack strategy, combining greedy assignment with load-balanced scheduling to optimize performance.
  • Empirical results demonstrate throughput gains of up to 18.9% over random mappings and energy-delay product reductions of up to 67% over an ANN-only baseline in hybrid architectures.

A cost-guided scheduling algorithm is any scheduling mechanism in which explicit, quantitative models of resource or application cost drive the mapping of units of work (e.g., neural network subgraphs, workflow tasks, virtual machines, or application columns) onto available computational components. Such algorithms assign scheduling decisions based on detailed, often parameterized, cost predictions—typically involving energy, time, monetary, or reliability metrics—with the objective of optimizing composite criteria such as energy-delay product (EDP), monetary expenditure, reliability, or a scalarization thereof. In this context, cost-guided scheduling is a central tool at the intersection of computer architecture, cloud computing, and large-scale workflow management, underpinning the efficient allocation of computational workloads within complex, constrained hardware and software platforms.

1. Principles and Mathematical Structure of Cost-Guided Scheduling

Cost-guided scheduling algorithms are characterized by the following technical principles:

  • Work Quantization: Granular decomposition of the workload (e.g., output columns in a DNN layer (Manjunath et al., 7 Nov 2025), DAG tasks (Tekawade et al., 2023), or serverless functions (Palma et al., 2023)) such that cost can be attributed to each unit independently.
  • Offline or Online Cost Modeling: Construction of per-unit cost models parameterized by system configuration, input statistics, or microbenchmarks. For example, in NeuroFlex, each output column $i$ is allocated a predicted number of nonzero dot-product matches $\tilde r_i$; core-specific parameters $(E_a, B_a, S_a, d_a)$ are estimated via microbenchmarks.
  • Assignment Optimization: The core problem is a combinatorial optimization: given $n$ units and $k$ scheduling "modes" (e.g., ANN core, SNN core), find an assignment $X$ that minimizes an objective function, often EDP, a convex scalarization, or a more general multi-criteria utility.
  • Surrogate Scalarization and Pareto Trade-off: Scalarized objectives such as $\Phi(X) = E(X) + \Lambda D(X)$ (with $\Lambda \gg 0$) capture the trade-off between competing metrics, e.g., energy and delay. $\Lambda$ is set to place the scheduler near the knee of the Pareto frontier.

The standard mathematical structure is:

$$
\begin{aligned}
&\text{Given cost models } \{e_a(i),\, l_a(i)\} \text{ for all units } i \text{ and modes } a,\\
&\text{define the assignment vector } X \in \{A, S\}^n.\\
&\text{Energy: } E(X) = \sum_{i \in X_S} e_S(i) + \sum_{i \in X_A} e_A(i)\\
&\text{Delay: } D(X) = \max\{T_A(X),\, T_S(X)\}\\
&\text{Objective: } \mathrm{EDP}(X) = E(X) \cdot D(X) \quad\text{or}\quad \Phi(X) = E(X) + \Lambda D(X)
\end{aligned}
$$
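To make the objective concrete, the following is a minimal Python sketch that evaluates $E(X)$, $D(X)$, EDP, and the scalarized surrogate $\Phi$ for a given assignment. All function and variable names are illustrative, and the delay model here treats each mode as a single serial lane; per-PE parallelism is handled by the Stage-2 packing described in Section 3.

```python
from typing import Dict, List

def energy(X: List[str], e: Dict[str, List[float]]) -> float:
    """E(X): total energy of assignment X, summing each unit's cost
    under its assigned mode ("A" or "S")."""
    return sum(e[mode][i] for i, mode in enumerate(X))

def delay(X: List[str], l: Dict[str, List[float]]) -> float:
    """D(X) = max{T_A(X), T_S(X)}: here each mode's time is modeled as
    the sum of its units' latencies (one serial lane per mode)."""
    totals = {mode: 0.0 for mode in l}
    for i, mode in enumerate(X):
        totals[mode] += l[mode][i]
    return max(totals.values())

def edp(X, e, l) -> float:
    """Energy-delay product EDP(X) = E(X) * D(X)."""
    return energy(X, e) * delay(X, l)

def phi(X, e, l, lam: float) -> float:
    """Scalarized surrogate Phi(X) = E(X) + Lambda * D(X)."""
    return energy(X, e) + lam * delay(X, l)

# Toy example with n = 2 units and hypothetical per-unit costs.
e = {"A": [1.0, 2.0], "S": [0.5, 3.0]}
l = {"A": [2.0, 1.0], "S": [4.0, 0.5]}
print(edp(["A", "S"], e, l), phi(["A", "S"], e, l, lam=10.0))
```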

2. Constructing and Using Offline Cost Models

Central to any cost-guided scheduler is an accurate, workload- and system-specific offline cost model:

  • Statistical Profiling: For each work unit, empirical distributions of workload parameters (e.g., number of MAC matches in a DNN column) are computed over a validation set. NeuroFlex uses the $q$-th quantile ($q=0.9$) $\tilde r_i$ over validation inputs (Manjunath et al., 7 Nov 2025).
  • Hardware Calibration: Architectural characteristics such as energy and time per operation ($E_a$, $B_a$), per-column setup overheads ($S_a$, $d_a$), memory bandwidth, and spike-generation penalties are measured using microbenchmarks.
  • Per-Unit Cost Synthesis: The cost to assign unit $i$ to mode $a$ is $e_a(i) = E_a \tilde r_i + S_a$ (energy) and $l_a(i) = B_a \tilde r_i + d_a$ (latency).

This enables predictive assignment before deployment, ensuring deterministic runtime and avoiding expensive online search or mode switches.
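As a hedged illustration of these three steps, the sketch below profiles per-unit match counts and synthesizes affine per-unit costs. The array `match_counts`, the calibration constants, and the function names are placeholders for this exposition, not artifacts from the paper.

```python
import numpy as np

def profile_matches(match_counts: np.ndarray, q: float = 0.9) -> np.ndarray:
    """Statistical profiling: match_counts has shape (num_inputs, n_units);
    returns the q-th quantile r~_i of nonzero matches for each unit i."""
    return np.quantile(match_counts, q, axis=0)

def synthesize_costs(r_tilde: np.ndarray,
                     E_a: float, B_a: float, S_a: float, d_a: float):
    """Per-unit cost synthesis for one mode a:
    e_a(i) = E_a * r~_i + S_a  (energy),
    l_a(i) = B_a * r~_i + d_a  (latency)."""
    return E_a * r_tilde + S_a, B_a * r_tilde + d_a

# Hypothetical calibration for one mode, standing in for microbenchmarks.
r_tilde = profile_matches(np.random.poisson(40.0, size=(256, 64)))
e_A, l_A = synthesize_costs(r_tilde, E_a=0.8, B_a=0.3, S_a=5.0, d_a=2.0)
```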

3. Two-Stage Scheduling Algorithms: Assignment and Packing

The canonical cost-guided algorithm in column-exact accelerator contexts follows a two-stage structure, exemplified by the NeuroFlex scheduler (Manjunath et al., 7 Nov 2025):

Stage 1: Split by Marginal Surrogate Cost

  • For each unit $i$, compute its marginal surrogate cost $S_a(i) = e_a(i) + \Lambda l_a(i)$ for every mode $a$.
  • Assign $i$ to $\arg\min_a S_a(i)$; this is a greedy, per-column policy minimizing first-order cost increases in the scalarized objective $\Phi$ (see the sketch after this list).
  • Complexity: $O(n)$ for $n$ units.
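A minimal sketch of this stage, assuming the per-unit cost arrays from Section 2 are available as Python dictionaries keyed by mode (all names illustrative):

```python
from typing import Dict, List

def split_by_surrogate_cost(e: Dict[str, List[float]],
                            l: Dict[str, List[float]],
                            lam: float) -> List[str]:
    """Stage 1: assign each unit i to argmin_a S_a(i), where
    S_a(i) = e_a(i) + lam * l_a(i). One O(n) pass over the units."""
    n = len(next(iter(e.values())))
    modes = list(e.keys())  # e.g., ["A", "S"]
    return [min(modes, key=lambda a: e[a][i] + lam * l[a][i])
            for i in range(n)]
```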

Stage 2: Packing for Load Balance (Longest Processing Time First—LPT)

  • For each mode $a$, sort assigned units by descending $l_a(i)$.
  • Assign units in order to the $P_a$ processing elements (PEs) for mode $a$, each time selecting the PE with the minimal cumulative assigned $l_a$.
  • This bin-packing heuristic balances computation across PEs, keeping the makespan low (a sketch follows this list).
  • Complexity: $O(n \log n + n P_a)$.
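The packing stage can be sketched with a min-heap keyed on each PE's accumulated latency, which makes each placement $O(\log P_a)$. This is an illustrative rendering of LPT, not the paper's implementation:

```python
import heapq
from typing import List, Tuple

def lpt_pack(unit_latencies: List[Tuple[int, float]],
             num_pes: int) -> List[List[int]]:
    """Stage 2 (LPT): place (unit_id, l_a(i)) pairs onto num_pes PEs,
    longest latency first, always onto the currently least-loaded PE."""
    order = sorted(unit_latencies, key=lambda u: u[1], reverse=True)
    heap = [(0.0, pe) for pe in range(num_pes)]  # (cumulative l_a, PE index)
    heapq.heapify(heap)
    bins: List[List[int]] = [[] for _ in range(num_pes)]
    for unit_id, lat in order:
        load, pe = heapq.heappop(heap)       # PE with minimal cumulative l_a
        bins[pe].append(unit_id)
        heapq.heappush(heap, (load + lat, pe))
    return bins
```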

Optional Local Refinement:

  • Iterate over units, propose core flips, and repack only the affected PEs; accept the best move if it reduces $\Phi$ (see the sketch after this list).
  • Typically terminates in 3–5 passes; each pass is $O(n)$.
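A simplified sketch of the refinement pass, assuming the two-mode assignment vector and a surrogate such as the `phi` function from the Section 1 sketch. Unlike the restricted repacking described above, this version re-evaluates the surrogate from scratch for clarity:

```python
def refine(X, e, l, lam, objective, max_passes: int = 5):
    """Propose flipping each unit's mode ("A" <-> "S") and keep a flip
    only if it lowers the scalarized objective; stop when a full pass
    yields no improvement (typically within a few passes)."""
    X = list(X)
    for _ in range(max_passes):
        improved = False
        for i in range(len(X)):
            candidate = X[:i] + ["S" if X[i] == "A" else "A"] + X[i + 1:]
            if objective(candidate, e, l, lam) < objective(X, e, l, lam):
                X, improved = candidate, True
        if not improved:
            break
    return X
```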

This structure ensures both global cost efficiency (via per-unit assignment) and maximal resource utilization through balanced packing.

4. Hardware and System-Level Constraints Incorporated

The cost-guided scheduling approach incorporates detailed architectural realities:

  • Quantized INT8 Dataflow: All activations and weights are stored in INT8; spike trains are generated dynamically only when (column, input) nonzero matches occur, leading to both storage and compute gains.
  • FiberCache Memory Hierarchy: 512 KB on-chip, 128 GB/s HBM, with deterministic access via column-aligned addressing and fixed, layer-specific bitmasks.
  • Sparse Prefix Logic: Logic for synchronizing input- and weight-sparsity through prefix summation ensures that only aligned nonzero pairs trigger compute/spike events, reducing energy wasted on zero elements.
  • Fixed Overhead Modeling: Per-column energy/time overheads ($S_a$, $d_a$) account for spike generation, thresholding, writeback, and compression logic, all parameterized and calibrated.

Determinism and Bit-Exactness: Because assignments, bitmasks, and packing permutations are computed entirely offline, runtime execution is fixed and bit-exact for every input.

5. Performance Outcomes and Empirical Impact

The efficacy of cost-guided scheduling is confirmed by comprehensive benchmarking (Manjunath et al., 7 Nov 2025):

  • Throughput Gains: Compared to random column-to-core mappings, throughput improvements range from 16% (GoogLeNet) to 18.9% (ResNet-34). This is attributed to enhanced load balance and optimal hybridization across processing modes.
  • EDP Reduction: Against a competitive ANN-only baseline (SparTen), EDP is lowered by 57–67% across all tested models (VGG-16, ResNet-34, GoogLeNet, BERT).
  • Resource Utilization: PE utilization sustains at >97% with cost-guided mapping (vs. ~92% with random and 60–83% with coarse layerwise mapping).
  • Competitive Baselines: NeuroFlex achieves up to $2.5\times$ speedup over LoAS (SNN-only) and $2.51\times$ energy reduction over SparTen. Against Prosperity (another SNN accelerator), speedup reaches $12\times$–$14\times$.

The empirical results indicate that integer-exact, column-level hybrid scheduling—not layerwise or single-mode—offers the best EDP/throughput tradeoff for sparse DNN/SNN edge workloads.

6. Theoretical Significance and Generalization

The cost-guided scheduling paradigm exemplified by NeuroFlex is significant as it:

  • Demonstrates that deterministic, microarchitecturally informed cost minimization at sublayer granularity directly outperforms both coarse and random mode assignments for hybrid architectures.
  • Establishes a general methodology for energy-delay optimization applicable to a broad variety of accelerator designs (beyond DNN/SNN hybridization) provided accurate, granular cost models are available.
  • Shows that scalarized, convex surrogate objectives enable efficient near-optimal search within an intractable assignment space ($2^n$ possible mappings).

A plausible implication is that as fine-grained workload and hardware variability increases (e.g., more pronounced sparsity, input-dependent execution), this cost-guided, data-driven scheduling will replace static or coarse-grained heuristics in next-generation high-performance and edge AI platforms.

7. Comparative Perspective and Implementation Trade-offs

Complexity:

  • Initial assignment and LPT packing operations scale as $O(n \log n + nP)$, well within the bounds for offline deployment. Local refinement adds minimal (sublinear) overhead due to restricted repacking.

Offline vs. Online Cost Model:

  • Offline models guarantee deterministic performance and do not incur runtime profiling costs, but rely on representative validation statistics and accurate microbenchmarks.
  • No dynamic adaptation is performed at runtime; any workload drift not accounted for at model-building time cannot be corrected in deployment.

Trade-offs:

  • Assigning units strictly by the current cost model may occasionally result in non-Pareto-optimal hybridizations when hardware parameters change. However, this is mitigated by the margin of improvement (up to 67% EDP reduction) over baselines.
  • The ability to guarantee bit-exact output equivalence to the reference ANN model is a direct consequence of integer-exact and deterministic scheduling, which is critical for deployment in safety- and correctness-sensitive applications.

Summary table (throughput gain vs. random mapping; EDP reduction vs. ANN-only baseline):

| Model | Throughput Gain vs. Random | EDP Reduction vs. ANN |
| --- | --- | --- |
| VGG-16 | 17.8% | 65.0% |
| ResNet-34 | 18.9% | 57.5% |
| GoogLeNet | 16.2% | 67.1% |
| BERT | 16.7% | 57.4% |

In summary, cost-guided scheduling algorithms enable optimal or near-optimal assignment of computational units to hardware modes based on fine-grained, empirically calibrated cost models. Such schedulers are fundamental to achieving energy, performance, and accuracy objectives in heterogeneous, hybrid, and fine-grained accelerator architectures. The NeuroFlex column-exact scheduler is a prototypical instantiation using a two-stage split-and-pack approach, offline statistics, and deterministic packing, resulting in large, empirically substantiated improvements over both random and static scheduling baselines (Manjunath et al., 7 Nov 2025).
