Cross-Layer Workload Archetypes
- Cross-layer workload archetypes are defined behavioral classes that capture resource usage, control-flow, and communication patterns across multiple system layers.
- They use multidimensional metrics and statistical models to quantitatively map specific workload behaviors to hardware, OS, and language-level design decisions.
- These archetypes drive co-design strategies by linking application features to resource provisioning, benchmarking, and optimization in diverse computational domains.
Cross-layer workload archetypes are rigorously defined behavioral classes that capture the distinctive resource utilization, operator structure, and system interactions manifested by workloads as traced across multiple architectural and software layers. These archetypes distill the diversity of modern computational tasks—ranging from exascale co-designed applications to XR kernels, HPC jobs, and deep learning recommendation systems—into parameterized models that drive hardware, OS, and language-level design decisions. Archetypal classification leverages multidimensional metric spaces (e.g., computation, communication, control-flow, locality, capacity, bottleneck sources) coupled to statistical and analytic models, enabling precise mapping from algorithmic fingerprint to system-level optimization and resource provisioning (Dhanasekar et al., 2018, Shi et al., 15 Jan 2026, Simakov et al., 2018, Hsia et al., 2020).
1. Archetypal Taxonomies by Domain and Metric
The foundational principle underlying cross-layer archetypes is the partitioning of workload behavioral space into distinct dimensions, each constituting a class with a specific taxonomy and associated quantitative model.
Many-core co-design workloads are described by four orthogonal “C-dimensions,” each constituting an archetype (Dhanasekar et al., 2018):
- Computation-complexity (C₂): Numeric, semi-numeric, non-numeric, and general operations, parameterized by node weight (algorithm size, arithmetic intensity).
- Communication-complexity (C₁): Local vs. global, fan-in/fan-out, with edge weight (bytes × depth), decomposed into CEF/CIF metrics per node and a global communication vector.
- Control-flow complexity (C₃): Sequencing vs. multi-branch, annotated via probability vectors on hybrid-graph nodes.
- Locality-of-reference (C₄): Spatio-temporal vs. random access loops, with randomized address/stride distributions.
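The sampling-based instantiation of these C-dimensions can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual generator from Dhanasekar et al.: the function name and distribution choices are hypothetical, and since Python's stdlib lacks a Poisson sampler, an exponential fan-out law stands in for the Poisson structure-generation law.

```python
import random

def sample_task_graph(n_nodes=16, mean_fanout=2.0, seed=0):
    """Sketch: instantiate a synthetic workload DAG whose node weights
    (computation cost, C2), branch vectors (control flow, C3), and edge
    weights (bytes x depth, C1) are drawn from simple distributions."""
    rng = random.Random(seed)
    nodes, edges = {}, []
    for i in range(n_nodes):
        probs = [rng.random() for _ in range(3)]
        total = sum(probs)
        nodes[i] = {
            # C2: arithmetic intensity, Normal-distributed node weight
            "compute_weight": max(1, int(rng.normalvariate(100, 25))),
            # C3: normalized branch-probability vector
            "branch_probs": [p / total for p in probs],
        }
    for i in range(n_nodes):
        # Fan-out per node (exponential stand-in for a Poisson law),
        # capped so edges only point forward in the DAG.
        fanout = min(n_nodes - i - 1, int(rng.expovariate(1.0 / mean_fanout)))
        for j in rng.sample(range(i + 1, n_nodes), fanout):
            # C1: edge weight = bytes x depth, Uniform-distributed
            edges.append((i, j, {"bytes_x_depth": int(rng.uniform(64, 4096))}))
    return nodes, edges
```

The forward-only edge constraint keeps the generated graph acyclic, mirroring task-graph generators that emit schedulable DAGs.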
HPC portfolio analysis via XDMoD yields a five-archetype taxonomy using joint distributions of CPU utilization, memory, I/O, network, and parallelism (Simakov et al., 2018):
- A) Compute-bound (high CPU-user, modest memory)
- B) Memory-intensive (high per-core RAM)
- C) I/O-bound (high Lustre traffic, low compute)
- D) Communication-heavy (dominant network bandwidth)
- E) Single-node/throughput (minimal shared-resource use)
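Archetype assignment of this kind reduces to joint threshold constraints over per-job metrics. A minimal sketch, assuming illustrative placeholder cutoffs (the actual thresholds in Simakov et al. differ and are derived from measured joint distributions):

```python
def classify_hpc_job(cpu_user, mem_per_core_gb, io_mb_s, net_mb_s, n_nodes):
    """Sketch: assign a job to one of five XDMoD-style archetypes (A-E)
    via joint threshold constraints. All cutoffs are illustrative."""
    if n_nodes == 1 and io_mb_s < 10 and net_mb_s < 10:
        return "E"  # single-node/throughput: minimal shared-resource use
    if io_mb_s > 200 and cpu_user < 0.5:
        return "C"  # I/O-bound: heavy parallel-FS traffic, low compute
    if net_mb_s > 500:
        return "D"  # communication-heavy: dominant network bandwidth
    if mem_per_core_gb > 4.0:
        return "B"  # memory-intensive: high per-core RAM footprint
    if cpu_user > 0.9:
        return "A"  # compute-bound: high CPU-user, modest memory
    return "unclassified"
```

Ordering the tests from most to least restrictive resolves jobs that satisfy several constraints at once, a practical necessity when archetype criteria overlap.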
XR workloads are mapped into four behavioral archetypes with phase alternation (Shi et al., 15 Jan 2026):
- I) Capacity-gated transform pipelines (working-set thresholds, LLC/DRAM sensitivity)
- II) Flat-response matching (no cache-scaling gain, irregularly indexed tensors)
- III) Balanced/cache-friendly (early capacity saturation, high locality)
- IV) Irregular/overhead-sensitive (control-dominated, synchronization and stalled execution)

A temporal overlay additionally captures workloads that alternate among these archetypes across execution phases.
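The per-phase classification plus temporal overlay can be sketched as a two-stage decision: label each phase from its KPIs, then flag the workload as phase-alternating when labels differ over time. Thresholds and KPI names here are illustrative assumptions, not the paper's classifier:

```python
def classify_xr_phase(llc_sensitivity, l2_hit, sm_util, stall_frac):
    """Sketch: map one execution phase's KPIs to an XR archetype (I-IV).
    All thresholds are illustrative placeholders."""
    if stall_frac > 0.5 and sm_util < 0.3:
        return "IV"  # irregular/overhead-sensitive: control- and sync-dominated
    if llc_sensitivity > 0.3:
        return "I"   # capacity-gated: strong LLC/DRAM working-set sensitivity
    if l2_hit > 0.9 and llc_sensitivity < 0.05:
        return "II"  # flat-response: no gain from extra cache capacity
    return "III"     # balanced/cache-friendly

def temporal_overlay(phase_labels):
    """Flag phase-alternating workloads: more than one archetype over time."""
    return "phase-alternating" if len(set(phase_labels)) > 1 else phase_labels[0]
```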
Deep recommendation inference is organized into four archetypes by operator and microarchitectural profile (Hsia et al., 2020):
- A) Compute-bound (multilayer FCs, high retiring fraction)
- B) Memory-bound (embedding lookups, DRAM stalls)
- C) Attention-bound (decoder/i-cache front end, branch pressure)
- D) Hybrid (mixed operators, balanced bottlenecks)
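One way to operationalize this bucketing is a dominance test over top-down pipeline-slot fractions: a model is Hybrid (D) when no slot clearly dominates. The 1.25× dominance margin and the three-fraction interface are illustrative assumptions, not the methodology of Hsia et al.:

```python
def classify_rec_model(retiring_frac, dram_stall_frac, frontend_stall_frac):
    """Sketch: bucket a recommendation model into archetypes A-D from its
    top-down pipeline-slot profile. The dominance margin is illustrative."""
    dominant = max(
        ("A", retiring_frac),        # compute-bound: FC-heavy, high retiring
        ("B", dram_stall_frac),      # memory-bound: embedding DRAM stalls
        ("C", frontend_stall_frac),  # attention-bound: decoder/i-cache pressure
        key=lambda kv: kv[1],
    )
    others = {"A": [dram_stall_frac, frontend_stall_frac],
              "B": [retiring_frac, frontend_stall_frac],
              "C": [retiring_frac, dram_stall_frac]}[dominant[0]]
    # Hybrid (D): no single slot exceeds the others by a clear margin.
    if dominant[1] < 1.25 * max(others):
        return "D"
    return dominant[0]
```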
2. Quantitative Modeling and Archetype Criteria
Each archetype is rigorously formalized by quantitative models, threshold criteria, and metrics:
- C-dimensions (synthetic graphs): Node and edge weights sampled from application-specified probability distributions. Statistical instantiation involves Poisson/Uniform/Binomial/Normal laws for structure generation and parameter assignment (Dhanasekar et al., 2018).
- HPC archetypes: Defined by formally stated threshold criteria expressed as joint constraints over metrics from multiple layers (Simakov et al., 2018).
- XR kernels: Classification based on empirical GPU/CPU KPI thresholds and simulated-annealing models optimizing compute, data traffic, and fragmentation, together with explicit analytic expressions and phase-classifier logic (Shi et al., 15 Jan 2026).
- Recommendation models: Feature-to-bottleneck correlations, multivariate regressions linking model features to pipeline slots, and detailed operator breakdowns (Hsia et al., 2020).
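The feature-to-bottleneck correlations above are ordinary statistical correlations between a model feature (e.g., lookups per table) and a pipeline-slot fraction. A self-contained Pearson-r sketch, using only arithmetic so the criterion stays reproducible across tools:

```python
def pearson_r(xs, ys):
    """Sketch: Pearson correlation between a model feature (e.g., lookups
    per table) and a pipeline-slot fraction across a set of models -- the
    kind of feature-to-bottleneck link that formalizes archetype criteria."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5
```

A value near +1 (feature up, stall fraction up) is what justifies assigning models with that feature to the corresponding bottleneck archetype.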
This formalism ensures that archetype assignment is both reproducible and actionable across tools, benchmarks, and simulators.
3. Workload Generation, Statistical Instantiation, and Surge Modeling
Synthetic workload generators provide user-driven configurability to instantiate archetypes with targeted complexity fingerprints and distributional variety (Dhanasekar et al., 2018). The process encompasses:
- Sampling graph depth, node counts, fan-in/fan-out degrees, and node/edge weights from specified distributions.
- Probabilistic modeling of control-flow (Dirichlet for branch selection) and loop locality (random variable address/stride).
- Parameterization of computational and communication surges using time-varying scaling factors applied to node and edge weights. These surges stress hardware elasticity, modeling bursty, non-uniform workload phases.
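A surge factor of this kind multiplies a base node (computation) or edge (communication) weight by a time-varying scale. The sinusoidal form below is an illustrative assumption standing in for the paper's scaling functions:

```python
import math

def surge_weight(base_weight, t, amplitude=2.0, period=50.0):
    """Sketch: apply a time-varying surge factor s(t) >= 1 to a base
    node/edge weight, modeling bursty, non-uniform workload phases.
    s(t) oscillates between 1 (quiet) and 1 + amplitude (peak burst)."""
    s_t = 1.0 + amplitude * 0.5 * (1.0 + math.sin(2.0 * math.pi * t / period))
    return base_weight * s_t
```

Sweeping `t` over a run yields alternating quiet and burst phases, which is what exercises a design's elasticity rather than its steady-state throughput.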
4. Cross-Layer Mapping and Co-Design Integration
Archetypal workload models serve as the linking substrate for cross-layer co-design, binding application features to language templates and architectural realizations (Dhanasekar et al., 2018, Shi et al., 15 Jan 2026, Hsia et al., 2020):
- Application layer: Extraction of (C₁, C₂) graphs plus control-flow annotations from user code.
- Parallel-language layer: Embedding hypervertex mapping into language constructs that encode locality and control-flow semantics.
- Architecture layer: ALFU clustering and core mapping, computation of adjacency matrices for network sizing, throughput, and interconnect topology. Algorithms cluster nodes by bytes-exchanged affinity and size mesh/torus interconnects according to measured byte flows.
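The bytes-exchanged affinity clustering above can be sketched as a greedy merge over the heaviest traffic edges, subject to a per-cluster core capacity. This is an illustrative stand-in under assumed names, not the paper's actual clustering algorithm:

```python
def greedy_affinity_clusters(n_tasks, traffic, capacity):
    """Sketch: greedily merge the task pairs with the highest
    bytes-exchanged affinity into clusters of at most `capacity` tasks.
    `traffic` maps (i, j) task pairs to bytes exchanged."""
    cluster_of = {i: i for i in range(n_tasks)}   # task -> cluster id
    members = {i: {i} for i in range(n_tasks)}    # cluster id -> task set
    # Consider the heaviest-traffic pairs first, so high-affinity tasks
    # end up co-located and inter-cluster byte flow is minimized.
    for (i, j), _bytes in sorted(traffic.items(), key=lambda kv: -kv[1]):
        ci, cj = cluster_of[i], cluster_of[j]
        if ci != cj and len(members[ci]) + len(members[cj]) <= capacity:
            for task in members[cj]:
                cluster_of[task] = ci
            members[ci] |= members.pop(cj)
    return members
```

The surviving inter-cluster byte flows then populate the adjacency matrix used for mesh/torus sizing.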
In XR, capacity thresholds, phase-aware scheduling, and elastic allocation mechanisms provide feedback loops from runtime/per-phase KPIs directly to resource provisioning, scratchpad staging units, and hardware-gather/scatter support (Shi et al., 15 Jan 2026).
In recommendation inference, archetype mapping guides both hardware (SIMD/FMA width, memory bandwidth, cache/decoder depth) and software (operator fusion, quantization, code layout, adaptive batching) optimizations, supporting batch-size-aware placement and resource allocation (Hsia et al., 2020).
5. Correlation Analysis and Optimization Guidance
Archetypal classification drives statistically grounded optimization strategies:
- Explicit correlation between model features (e.g., lookups per table, FC ratio) and pipeline bottlenecks quantifies hardware inefficiencies and enables targeted hardware/software co-optimization (Hsia et al., 2020).
- HPC center managers leverage archetype mapping to predict aggregate layer demands (memory, IOPS, bandwidth) and to tune system configuration—node types, queue policies, file system stripes—thus improving throughput and reducing backlog with minimal hardware investment (Simakov et al., 2018).
- XR design principles emphasize silicon allocation just sufficient to cross key capacity thresholds, flexible staging for algorithmic locality gaps, and latency-tolerant execution for irregular phases (Shi et al., 15 Jan 2026).
6. Archetype Tables for Comparative Reference
| Domain | Archetype | Dominant Metric(s) / Feature |
|---|---|---|
| Many-core (Dhanasekar et al., 2018) | C₂: Computation-complexity | Arithmetic/cycle cost (node weight) |
| | C₁: Communication-complexity | Bytes × depth (edge weight) |
| | C₃: Control-flow complexity | Branch probability vector |
| | C₄: Locality-of-reference | Randomized address/stride distributions |
| HPC (Simakov et al., 2018) | Compute-bound (A) | High CPU-user, low/mid memory, low I/O |
| | Memory-intensive (B) | High memory/core, low I/O, under-subscribed MPI |
| | I/O-bound (C) | High Lustre I/O, low CPU, many file opens |
| | Communication-heavy (D) | High IB bandwidth, deep MPI |
| | Single-node/throughput (E) | Minimal shared use, low/high CPU |
| XR (Shi et al., 15 Jan 2026) | Capacity-gated (I) | DRAM energy/latency “tipping point” |
| | Flat-response (II) | High L2 hit rate, minimal cache-scaling gain |
| | Balanced/cache-friendly (III) | Early LLC saturation, high SM utilization |
| | Irregular/overhead-sensitive (IV) | Control-dominated, low SM utilization |
| Recommendation (Hsia et al., 2020) | Compute-bound (A) | High FC ratio, high retiring IPC |
| | Memory-bound (B) | Many embeddings, DRAM stalls |
| | Attention-bound (C) | Decoder/i-cache bound, branching |
| | Hybrid (D) | Mixed GRU/FC/embedding, balanced stalls |
7. Cross-Layer Archetypes in Perspective
Rooted in multidimensional metric analysis and statistical modeling, cross-layer workload archetypes serve as the canonical units by which applications are mapped to system resources, hardware architectures, language constructs, and scheduling policies. Their emergence in domains spanning exascale co-design (Dhanasekar et al., 2018), HPC system operations (Simakov et al., 2018), XR pipeline hardware (Shi et al., 15 Jan 2026), and personalized recommendation inference (Hsia et al., 2020) underscores their centrality for both empirical characterization and design automation. The closed-form, parameterized, and probabilistic instantiation of these archetypes enables rigorous benchmarking, stress testing, and elastic resource allocation—guiding the next generation of systems toward maximal power-performance optimization and deterministic execution under dynamic workload conditions.