Orochi Abstraction Layer in HPC

Updated 11 March 2026

The Orochi Abstraction Layer (ORCHA) is a performance portability system for heterogeneous HPC environments that lets scientific simulations target CPUs, GPUs, and future accelerators efficiently.
It integrates a high-level control-flow mapper, a variant specification language, and a runtime orchestrator to map tasks to devices and optimize data movement between them.
Evaluations with Flash-X show that ORCHA can reduce execution time, with device mappings changed through recipe-based configuration and device-specific optimizations rather than source rewrites.

The Orochi Abstraction Layer ("ORCHA") is a performance portability system engineered to address the challenges imposed by heterogeneity in post-exascale high-performance computing (HPC) environments. Its design focuses on enabling scientific simulation codes—exemplified by Flash-X—to efficiently target platforms consisting of CPUs, GPUs, and potentially future accelerators or chiplets, by orchestrating control flow, data movement, and kernel specialization across diverse devices using a unified, composable toolchain (Lee et al., 12 Jul 2025).

1. Architectural Composition and Application Integration

ORCHA comprises three strictly independent components, each targeting a core abstraction layer vital for performance portability in heterogeneous simulation workflows:

High-Level Control-Flow Mapper (CG-Kit): Accepts a user-defined "recipe" specifying a sequence or partial order of tasks, device mappings (CPU/GPU), and inter-task dependencies. It emits an optimized directed-acyclic graph (DAG) of TaskFunctions and a Parameterized Source Tree (PST) for subsequent code generation.
Variant Specification Language (Macroprocessor): Expands macro-annotated, language-agnostic sources to device-specific realizations (e.g., differing array layouts, parallelization strategies, or arithmetic kernel variants), supporting inheritance and device-based overrides.
Runtime Orchestrator (Milhoja): At execution time, instantiates thread teams, schedules TaskFunctions per the DAG, orchestrates data movement (CPU↔GPU), and exploits explicit concurrency primitives (e.g., device streams, host threads) for latency hiding.

The interaction with application layer code is mediated through FlashX-RecipeTools—a thin Python interface that parses user recipes, manages mapping of Fortran subroutines (identified and annotated statically), invokes the control-flow mapper and macroprocessor, and handles instantiation and compilation of all generated artifacts into a single ORCHA-enabled binary.

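Process Flow:

user recipe → FlashX-RecipeTools (parses the recipe, maps annotated Fortran subroutines) → CG-Kit (TaskFunction DAG and PST) → Macroprocessor (device-specific sources) → compilation into a single ORCHA-enabled binary → Milhoja (scheduling and data movement at run time)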

2. Control-Flow Representation and Mapping Formalism

The core of ORCHA’s scheduling abstraction is a domain-specific language, expressed through Python APIs, that enables declarative specification of task orchestration. The formal syntax in Backus–Naur Form is:

$\begin{array}{rcl} \langle \mathit{recipe} \rangle &::=& \langle \mathit{stmt}\rangle^* \\ \langle \mathit{stmt}\rangle &::=& \langle \mathit{begin}\rangle\mid \langle \mathit{add\_work}\rangle \mid \langle \mathit{end}\rangle \\ \langle \mathit{begin}\rangle &::=& \texttt{begin\_orchestration(} \texttt{after=}\langle\mathit{node}\rangle\texttt{)} \\ \langle \mathit{add\_work}\rangle &::=& \texttt{add\_work(} \texttt{"}\langle\mathit{id}\rangle\texttt{",} \texttt{after=}\langle \mathit{nodelist}\rangle\texttt{,} \texttt{map\_to=}\langle\mathit{device}\rangle\texttt{)} \\ \langle \mathit{end}\rangle &::=& \texttt{end\_orchestration(} \texttt{begin\_node=}\langle\mathit{node}\rangle\texttt{, after=}\langle\mathit{node}\rangle\texttt{)} \\ \langle\mathit{device}\rangle &::=& \texttt{"CPU"} \mid \texttt{"GPU"} \mid \dots \end{array}$
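The grammar above can be exercised with a small mock. The sketch below is not the FlashX-RecipeTools implementation: the `Recipe` class, its fields, and the `schedule` method are hypothetical names chosen for illustration, and `begin_orchestration` accepts a missing `after` as a simplification. It only shows how the three statement forms could accumulate a DAG and a device assignment.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Recipe:
    """Toy stand-in for the recipe API (hypothetical, for illustration only)."""
    deps: dict = field(default_factory=dict)    # node -> set of predecessor nodes
    device: dict = field(default_factory=dict)  # task -> "CPU" | "GPU"

    def begin_orchestration(self, after=None):
        self.deps["begin"] = set() if after is None else {after}
        return "begin"

    def add_work(self, name, after, map_to):
        # `after` may be a single node or a list of nodes (the grammar's nodelist).
        preds = list(after) if isinstance(after, (list, tuple)) else [after]
        unknown = [p for p in preds if p not in self.deps]
        if unknown:
            raise ValueError(f"unknown predecessor(s): {unknown}")
        self.deps[name] = set(preds)
        self.device[name] = map_to
        return name

    def end_orchestration(self, begin_node, after):
        if begin_node not in self.deps or after not in self.deps:
            raise ValueError("end_orchestration refers to an unknown node")
        self.deps["end"] = {after}
        return "end"

    def schedule(self):
        # Any topological order of the DAG respects the declared dependencies.
        return list(TopologicalSorter(self.deps).static_order())


# GPU-only style recipe: Hydro, then EOS, both mapped to the GPU.
recipe = Recipe()
begin = recipe.begin_orchestration()
hydro = recipe.add_work("Hydro", after=begin, map_to="GPU")
eos = recipe.add_work("EOS", after=hydro, map_to="GPU")
recipe.end_orchestration(begin_node=begin, after=eos)
print(recipe.schedule())  # ['begin', 'Hydro', 'EOS', 'end']
```

Because the recipe holds only the graph and the mapping, switching a task from `"GPU"` to `"CPU"` changes one argument and leaves the task bodies untouched.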

Semantically, the system constructs a DAG of tasks $t_1, ..., t_n$, with device assignments $M: T \to D$ (where $T$ is the set of tasks and $D$ the set of devices) set by recipe-level annotation. The associated cost model, to be minimized over $M$, is:

$C_\mathrm{total}(M) =\sum_{i=1}^n \left[T_{\mathrm{compute}}\bigl(t_i, M(t_i)\bigr) + T_{\mathrm{transfer}}\bigl(t_i, \mathrm{pred}(t_i)\bigr)\right]$

where $T_{\mathrm{compute}}$ is the per-device kernel time and $T_{\mathrm{transfer}}$ accounts for inter-device data movement. A plausible implication is that such a cost model could drive automated mapping optimization, though practitioners may also choose the mapping by hand.
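As a rough illustration of how this cost model could be evaluated, the sketch below computes $C_\mathrm{total}(M)$ for a three-task chain and searches all mappings by brute force. Everything numeric here is hypothetical: the timings are made-up units, and the rule that a transfer is charged only when a predecessor ran on a different device is an assumption, not something the source specifies. Note that the model is purely additive, so it does not credit CPU/GPU overlap.

```python
from itertools import product


def total_cost(mapping, preds, t_compute, t_transfer):
    """Sum of compute time plus transfer time over all tasks, as in C_total(M)."""
    cost = 0.0
    for task, dev in mapping.items():
        cost += t_compute[task][dev]
        # Assumption: inputs move only if some predecessor ran on another device.
        if any(mapping[p] != dev for p in preds[task]):
            cost += t_transfer[task]
    return cost


def best_mapping(tasks, devices, preds, t_compute, t_transfer):
    """Exhaustive search over |devices|^|tasks| mappings; fine for tiny DAGs only."""
    candidates = (dict(zip(tasks, combo))
                  for combo in product(devices, repeat=len(tasks)))
    return min(candidates,
               key=lambda m: total_cost(m, preds, t_compute, t_transfer))


# Hypothetical chain Hydro -> EOS -> Burn with made-up timings.
tasks = ["Hydro", "EOS", "Burn"]
preds = {"Hydro": [], "EOS": ["Hydro"], "Burn": ["EOS"]}
t_compute = {
    "Hydro": {"CPU": 8.0, "GPU": 2.0},
    "EOS": {"CPU": 3.0, "GPU": 1.0},
    "Burn": {"CPU": 2.0, "GPU": 5.0},
}
t_transfer = {"Hydro": 0.0, "EOS": 1.0, "Burn": 1.0}

best = best_mapping(tasks, ["CPU", "GPU"], preds, t_compute, t_transfer)
print(best, total_cost(best, preds, t_compute, t_transfer))
# {'Hydro': 'GPU', 'EOS': 'GPU', 'Burn': 'CPU'} 6.0
```

With these numbers, moving Burn to the CPU pays one transfer (1.0) but saves 3.0 of compute, which is why the mixed mapping beats the all-GPU one (8.0).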

3. Macroprocessor and Data/Kernel Variant Abstraction

The Macroprocessor system enables rigorous encapsulation of data-structure and arithmetic-kernel variants across hardware targets. Macro invocations in source denote abstract data or control constructs, e.g., declarations and parallel loop delimiters.

For example, a macro for a device-dependent array declaration expands on the GPU target as follows:

$\mathrm{DATA\_DECL}_{\mathrm{GPU}}(\mathit{field},n_x,n_y,n_z) \;\longrightarrow\; \texttt{real, device, allocatable :: field\_d(n_x,n_y,n_z)}$

Macros for arithmetic kernels can wrap domain logic, while loop macros for device-specific parallelization (e.g., OpenACC on GPU) allow granular specialization without rewriting the computational core. Macro inheritance lets variants share a base implementation, with only the device-specific deltas kept in separate files.
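The inheritance idea can be sketched in a few lines. This is not the Flash-X macroprocessor: the `@NAME(args)` call syntax, the `BASE` and `DELTAS` tables, and the CPU form of `DATA_DECL` are assumptions made for illustration. Only the GPU expansion of `DATA_DECL` follows the mapping given in the text. Lookups fall through from a device delta to the base definitions, so a device file only needs the macros it changes.

```python
import re
from collections import ChainMap

# Base definitions: (parameter names, body template). Hypothetical content.
BASE = {
    "DATA_DECL": (("field", "nx", "ny", "nz"),
                  "real, allocatable :: {field}({nx},{ny},{nz})"),
    "LOOP_END": ((), "end do"),
}

# Device-specific deltas override only what differs from the base.
DELTAS = {
    "GPU": {
        "DATA_DECL": (("field", "nx", "ny", "nz"),
                      "real, device, allocatable :: {field}_d({nx},{ny},{nz})"),
    },
}

MACRO_CALL = re.compile(r"@(\w+)\(([^)]*)\)")


def variants_for(device):
    """Device delta first, base second: a missing override inherits the base."""
    return ChainMap(DELTAS.get(device, {}), BASE)


def expand(source, macros):
    """Replace every @NAME(args) call with its body for the chosen variant."""
    def substitute(match):
        name, raw_args = match.group(1), match.group(2)
        params, body = macros[name]
        args = [a.strip() for a in raw_args.split(",")] if raw_args.strip() else []
        if len(args) != len(params):
            raise ValueError(f"{name} expects {len(params)} argument(s)")
        return body.format(**dict(zip(params, args)))
    return MACRO_CALL.sub(substitute, source)


src = "@DATA_DECL(dens, nx, ny, nz)"
print(expand(src, variants_for("GPU")))  # real, device, allocatable :: dens_d(nx,ny,nz)
print(expand(src, variants_for("CPU")))  # real, allocatable :: dens(nx,ny,nz)
```

Here `LOOP_END` has no GPU override, so `expand("@LOOP_END()", variants_for("GPU"))` falls back to the base body.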

4. Runtime Orchestrator: Scheduling and Data Movement

At runtime, Milhoja instantiates one or more thread teams mapped to hardware resources. Each TaskFunction, instantiated from the DAG, is assigned to a specific team (e.g., pinned CPU core set, GPU + helper CPUs) and executed in dependency-respecting order. Datapaths are mediated by two main wrappers:

DataPacket: Aggregates multiple blocks (e.g., from an AMR mesh) into a contiguous GPU-transferable buffer, utilizing cudaMemcpyAsync and streams.
TileWrapper: Non-transferring wrapper for CPU tasks.
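The point of aggregating blocks is that one large contiguous buffer can be moved with a single transfer instead of one transfer per block. The sketch below illustrates only the packing step, in plain Python with the standard `array` module. The function names and the offset table are hypothetical and do not reflect Milhoja's actual DataPacket layout.

```python
from array import array


def pack_blocks(blocks):
    """Copy variable-length blocks into one contiguous buffer of doubles.

    Returns the buffer and a list of (start, length) pairs, one per block,
    so each block can be located again after the transfer.
    """
    buffer = array("d")
    offsets = []
    for block in blocks:
        offsets.append((len(buffer), len(block)))
        buffer.extend(block)
    return buffer, offsets


def unpack_block(buffer, offsets, index):
    """Recover a single block from the packed buffer."""
    start, length = offsets[index]
    return buffer[start:start + length].tolist()


# Three small blocks standing in for AMR mesh blocks.
blocks = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
buffer, offsets = pack_blocks(blocks)
print(offsets)                            # [(0, 2), (2, 1), (3, 3)]
print(unpack_block(buffer, offsets, 2))   # [4.0, 5.0, 6.0]
```

The number of blocks per packet is the tuning knob: larger packets amortize per-transfer overhead, at the cost of more staging memory.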

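Per-GPU task execution (pseudocode):

```
for each TaskFunction t assigned to GPU:
  stream = get_next_cuda_stream()
  cudaMemcpyAsync(dst=GPU_buffer, src=DataPacket.buffer, stream)
  kernel_launch<<<...,(stream)>>>(GPU_buffer, ...)
  // data remains resident on device until next needed transfer
```

GPU variant of an example TaskFunction (TF1):

```c++
#include <cuda_runtime.h>

// d_pkt is the device-side mirror of pkt; its declaration is not shown here.
extern "C" void run_TF1_cuda(
    DataPacket_TF1 const& pkt,
    Milhoja::TeamHandle   team,
    cudaStream_t          stream) {
  // Queue the host-to-device copy of the packet on the stream.
  cudaMemcpyAsync(d_pkt.density, pkt.density, pkt.nBytes, cudaMemcpyHostToDevice, stream);
  dim3 grid = makeGrid(pkt.numElements);
  dim3 block = makeBlock();
  // Same stream, so the kernel starts only after the copy above has completed.
  TF1_kernel<<<grid, block, 0, stream>>>(d_pkt.density, d_pkt.energy, /*…*/);
  // Also stream-ordered: results are copied back after the kernel finishes.
  cudaMemcpyAsync(pkt.density, d_pkt.density, pkt.nBytes, cudaMemcpyDeviceToHost, stream);
  // No synchronization here: the caller must wait on the stream before reading pkt.density.
}
```

CPU variant of the same TaskFunction:

```c++
void run_TF1_cpu(DataPacket_TF1 const& pkt, Milhoja::TeamHandle team) {
  // One OpenMP thread per team member. Every thread calls compute_TF1_cpu, so the
  // work split is assumed to happen inside it (e.g., orphaned worksharing loops).
  #pragma omp parallel num_threads(team.size())
  {
    compute_TF1_cpu(pkt.density, pkt.energy, /*…*/);
  }
}
```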
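The dependency-respecting execution on per-device thread teams can be modeled in a few lines. The sketch below is a toy, not Milhoja: Python thread pools stand in for thread teams, and `run_dag`, `team_sizes`, and the task names are hypothetical. It shows only the scheduling rule that a TaskFunction is submitted to its team once all of its predecessors have finished, which lets independent CPU and GPU tasks run concurrently.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def run_dag(deps, device, task_fn, team_sizes):
    """Run every task on the team of its device, respecting the DAG in `deps`.

    deps:       task -> set of predecessor tasks
    device:     task -> device name
    team_sizes: device name -> number of worker threads in that team
    Returns the tasks in the order in which they were observed to finish.
    """
    teams = {dev: ThreadPoolExecutor(max_workers=n) for dev, n in team_sizes.items()}
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    pending = {}          # future -> task name
    finished_order = []
    try:
        while sorter.is_active():
            # Submit everything whose predecessors are all done.
            for task in sorter.get_ready():
                future = teams[device[task]].submit(task_fn, task)
                pending[future] = task
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                future.result()               # re-raise task errors, if any
                task = pending.pop(future)
                finished_order.append(task)
                sorter.done(task)             # may unlock successors
    finally:
        for team in teams.values():
            team.shutdown()
    return finished_order


# Hydro and Burn are independent; EOS must wait for Hydro.
deps = {"Hydro": set(), "Burn": set(), "EOS": {"Hydro"}}
device = {"Hydro": "GPU", "EOS": "GPU", "Burn": "CPU"}
order = run_dag(deps, device, lambda name: name, {"CPU": 2, "GPU": 1})
print(order)
```

The relative position of `Burn` in the printed order can vary from run to run, but `Hydro` always finishes before `EOS`.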

6. Empirical Evaluation and Configuration Space

Application of ORCHA in Flash-X demonstrates extensive configurability with minimal source-level disruption—distinct device mappings are realized by varying only the orchestration recipe, not underlying kernels or data structures.

Configuration Table:

Configuration	Device Mapping	Recipe Fragment Example
GPU-centric ("GPU-only")	All Hydro+EOS on GPU (4 GPUs)	add_work("Hydro", after=begin, map_to="GPU"); add_work("EOS", after="Hydro", map_to="GPU")
CPU/GPU-balanced	60 blocks CPU, 50 blocks GPU per cycle	add_work("Hydro", after=begin, map_to="GPU", blocks=50); add_work("Hydro", after=begin, map_to="CPU", blocks=60); add_work("EOS", after=[...], map_to="GPU")
CPU/GPU-concurrent	Hydro on GPU, Burn network on CPU in parallel	add_work("Hydro", map_to="GPU"); add_work("Burn", map_to="CPU", after=begin)

Empirical performance observations:

In the Sedov 3D benchmark, the GPU-only mapping with ORCHA achieved $T_{\mathrm{ORCHA}} = 8.5\,\mathrm{s}$, closely matching the CPU baseline and far outperforming GPU-only OpenACC without packetization, provided the DataPackets are sufficiently large.
For the more complex Cellular 2D scenario, the ORCHA GPU-only mapping achieved a 33.8% speedup over the CPU run, and the CPU/GPU-concurrent configuration reduced the time further (38.0% speedup). The CPU/GPU-balanced setting degraded performance (9.3% worse), which suggests that some mapping regimes are suboptimal.

Runtime plots present runtime vs. number of blocks per DataPacket, with ORCHA showing improved scaling as packet size increases, flattening once device utilization saturates.

ORCHA’s abstraction, layering, and macro-based synthesis collectively enable node-level mapping flexibility, robust device specialization, and efficient use of heterogeneous parallelism for complex HPC applications, all while preserving the maintainability of ever-evolving scientific codes (Lee et al., 12 Jul 2025).


References

Lee et al. (12 Jul 2025). ORCHA -- A Performance Portability System for Post-Exascale Systems.
