Heimdall Benchmark Suite

Updated 30 December 2025
  • Heimdall Benchmark Suite is an open-source, preemption-enabled FPGA evaluation framework featuring 27 real-world workloads.
  • It integrates context-saving and restoration hooks to enable reproducible and fair assessment of scheduling strategies in multi-tenant environments.
  • The suite supports flexible resource management via partitioned FPGA fabric and host-side scheduling for dynamic workload preemption.

Heimdall Benchmark Suite is an open-source, preemption-enabled suite designed for the evaluation of preemption strategies and scheduling policies in multi-tenant FPGA environments. It features 27 real-world workloads across cryptography, AI/ML, compute-intensive processing, communication systems, and multimedia domains. Each benchmark is equipped with integrated context-saving and restoration hooks to facilitate consistent and reproducible research in FPGA resource management and operating system design. Heimdall provides a standardized, extensible framework for methodical benchmarking, addressing the limitations of previous proprietary or synthetic approaches and enabling fair cross-comparison of scheduling algorithms (Malik et al., 10 Nov 2025).

1. Design Motivations and Objectives

The emergence of FPGAs as first-class cloud accelerators (e.g., AWS F1, Azure NP) has driven interest in multi-tenant operation, necessitating advanced support for preemptive multitasking. Vendor-native support for preemption is lacking, despite the growing demand for dynamic workload management. Preemption—defined as the sequence of pause, state-save, and resume operations—enables fine-grained time-multiplexing, superior utilization, and fairness in multi-tenant scenarios.

Prior research on FPGA preemption and scheduling has suffered from methodological fragmentation, primarily due to the reliance on ad-hoc or proprietary benchmarks (such as STFS, Coyote, or ReconOS). This lack of a standardized, domain-spanning suite with context-save/restore support has impeded reproducibility and comparability. Heimdall’s core objectives are to:

  • Establish the first open-source, preemption-enabled FPGA benchmark suite.
  • Cover a broad spectrum of 27 workloads representative of cryptography, AI/ML, computational kernels, communication, and multimedia domains.
  • Integrate context-saving and restoration hooks into all designs for transparent preemption.
  • Support systematic, reproducible evaluation of resource management and scheduling strategies.

2. System Architecture and Orchestration

2.1 Partitioned FPGA Fabric

Heimdall relies on Partial Reconfiguration (PR) to divide the programmable logic (PL) of an FPGA into distinct “slots,” each of which hosts a single accelerator. Slots can be sized heterogeneously according to the required resources (LUTs, flip-flops, BRAM, DSPs), enabling flexible mapping of different kernels.

2.2 Host-Side Scheduling Infrastructure

A lightweight OS-like controller implemented on the ARM processing system (PS) of a Xilinx Zynq device leverages the Processor Configuration Access Port (PCAP) interface to:

  • Trigger accelerator suspension via clock gating and quiescence.
  • Read or write configuration frames to perform context capture and restoration.
  • Manage DRAM buffers that temporarily hold off-chip state during preemption events.

The controller also enacts scheduling, including policy-based preemption (e.g., round-robin, priority queue). Its scheduling module decides which slots are suspended, saved, restored, and resumed according to configurable policies.
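To make the round-robin policy concrete, the following is a minimal simulation sketch, not Heimdall's scheduler. It assumes a single slot, a fixed quantum, and a lump-sum cost `t_preempt` charged each time a running job is suspended. The function name `round_robin` and its parameters are hypothetical and chosen only for illustration.

```python
from collections import deque


def round_robin(jobs, quantum, t_preempt):
    """Simulate round-robin time-sharing of ONE slot (illustrative model).

    jobs      -- dict mapping job name to required execution time
    quantum   -- time slice granted before a job may be preempted
    t_preempt -- assumed lump-sum save+restore cost per preemption
    Returns (finish_times, preempt_counts).
    """
    queue = deque(jobs.items())
    clock = 0.0
    finish = {}
    preempts = {name: 0 for name in jobs}
    while queue:
        name, remaining = queue.popleft()
        # If nobody else is waiting, let the job run to completion.
        run = remaining if not queue else min(quantum, remaining)
        clock += run
        remaining -= run
        if remaining > 0:
            # Quantum expired with work left: pay the preemption cost, requeue.
            clock += t_preempt
            preempts[name] += 1
            queue.append((name, remaining))
        else:
            finish[name] = clock
    return finish, preempts


finish, preempts = round_robin({"A": 3, "B": 1}, quantum=2, t_preempt=1)
print(finish, preempts)
```

In this toy run, job A is preempted once so that B can execute. A priority-queue policy would change only how the next job is selected.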

2.3 Context Save and Restore Implementation

Context is preserved at the granularity of configuration frames, which are read from and written to the PL via the PCAP. For each accelerator slot:

  • On save: $N_{\text{frames}}$ frames are sequentially read into a contiguous region in DRAM.
  • On restore: The saved $N_{\text{frames}}$ frames are written back before resuming the design.
  • Hooks for context operations are automatically inserted into the RTL or processor workload; manual logic modifications are not needed.
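The save and restore steps above can be modelled in software as below. This is only a sketch under stated assumptions: `FakeConfigPort` is a hypothetical in-memory stand-in for PCAP frame access, not the Xilinx driver API, and the frame size of 101 32-bit words is the XC7Z020 figure this article quotes.

```python
WORDS_PER_FRAME = 101                  # XC7Z020 frame size quoted in this article
BYTES_PER_FRAME = WORDS_PER_FRAME * 4  # 32-bit words


class FakeConfigPort:
    """Hypothetical in-memory model of one slot's configuration frames."""

    def __init__(self, n_frames):
        self.frames = [bytearray(BYTES_PER_FRAME) for _ in range(n_frames)]

    def read_frame(self, index):
        return bytes(self.frames[index])

    def write_frame(self, index, data):
        if len(data) != BYTES_PER_FRAME:
            raise ValueError("frame has the wrong size")
        self.frames[index][:] = data


def save_context(port, n_frames):
    """Read n_frames sequentially into one contiguous buffer (the 'DRAM')."""
    dram = bytearray(n_frames * BYTES_PER_FRAME)
    for i in range(n_frames):
        start = i * BYTES_PER_FRAME
        dram[start:start + BYTES_PER_FRAME] = port.read_frame(i)
    return dram


def restore_context(port, dram, n_frames):
    """Write the saved frames back in order before the design resumes."""
    for i in range(n_frames):
        start = i * BYTES_PER_FRAME
        port.write_frame(i, dram[start:start + BYTES_PER_FRAME])
```

A round trip through `save_context` and `restore_context` should leave every frame byte-identical. That is the same property the suite's pause and resume functional tests check on real hardware.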

3. Benchmark Suite Composition

The 27 workloads in Heimdall are selected for real-world relevance, diversity of compute patterns, and variation in state size and latency.

PL-Based Hardware Accelerators (15)

  • AI/ML: Convolution Layer (neural-network inference), Scalable Matrix Multiplication (dense GEMM)
  • Multimedia/Image Processing: PNG Decoder, JPEG Decoder, H.264 Video Encoder, Generic Image Processor (filter/color convert)
  • Cryptography: ML-KEM Server (NIST post-quantum key encapsulation), ML-KEM Client
  • Signal & Math: FFT Accelerator (radix-2), IIR Filter, FIR Filter, Trigonometry Core (CORDIC/polynomial)
  • Communication/Interconnect: Viterbi Decoder, Open NoC (network-on-chip router fabric)
  • General Compute: MIPS Soft Processor (bare-metal)

RISC-V Processor-Based (12)

  • Cryptographic Primitives: AES-128 encryption, SHA-256 hashing, FALCON (lattice-based, with KeyGen, SignGen, SignVerify)
  • MachSuite Data-Parallel Kernels: BFS, Sort (quicksort), Needleman-Wunsch alignment, KMP search, dense GEMM
  • CPU Performance: Dhrystone, CoreMark

| Category | Benchmarks (examples) | Characteristics |
|---|---|---|
| PL-based | Conv Layer, FFT, H.264 Encoder | RTL/HLS, varied state sizes |
| RISC-V processor | AES-128, BFS, CoreMark | SW-centric, data-parallel, CPUs |

4. Context-Saving and Restoration Mechanisms

Context-management cost is quantifiable and tunable per benchmark slot. Let $N_{\text{frames}}$ denote the number of configuration frames, $F_{\text{size}}$ the frame size in bits, and $B_{\text{bus}}$ the PCAP bandwidth, with $B_{\text{save}}$ and $B_{\text{restore}}$ its effective values for frame readback and write-back.

  • State size: $S_{\text{state}} = N_{\text{frames}} \times F_{\text{size}}$ (bits)
  • Save time: $T_{\text{save}} = S_{\text{state}} / B_{\text{save}}$
  • Restore time: $T_{\text{restore}} = S_{\text{state}} / B_{\text{restore}}$
  • Total preemption overhead: $T_{\text{preempt}} = T_{\text{save}} + T_{\text{restore}}$
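The formulas above translate directly into a small calculator. The sketch below uses input values that mirror the XC7Z020 figures reported in this section (3,232-bit frames, about 3.2 Gb/s, about 30 frames). The function name `preemption_overhead` is hypothetical, and setting the restore bandwidth equal to the save bandwidth is a simplifying assumption; the article reports a slightly slower restore.

```python
def preemption_overhead(n_frames, frame_bits, b_save, b_restore):
    """Return (S_state in bits, T_save, T_restore, T_preempt in seconds)."""
    s_state = n_frames * frame_bits   # S_state = N_frames * F_size
    t_save = s_state / b_save         # T_save = S_state / B_save
    t_restore = s_state / b_restore   # T_restore = S_state / B_restore
    return s_state, t_save, t_restore, t_save + t_restore


FRAME_BITS = 101 * 32   # 101 words of 32 bits per frame
PCAP_BPS = 3.2e9        # approx. PCAP bandwidth at 100 MHz, as quoted

# Assumption: restore bandwidth set equal to save bandwidth for simplicity.
s, t_save, t_restore, t_preempt = preemption_overhead(
    30, FRAME_BITS, PCAP_BPS, PCAP_BPS
)
print(f"S_state = {s} bits")
print(f"T_save = {t_save * 1e6:.1f} us, T_preempt = {t_preempt * 1e6:.1f} us")
```

Under these inputs the save time comes out near 30 µs, in line with the figure quoted for the device.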

For the Xilinx XC7Z020:

  • $F_{\text{size}} = 101$ words $\times\ 32$ bits $= 3{,}232$ bits/frame
  • $B_{\text{save}} \approx 3.2$ Gb/s (PCAP @ 100 MHz)
  • $T_{\text{save, per-frame}} \approx 1.01\ \mu\text{s}$
  • For $N_{\text{frames}} \approx 30$: $T_{\text{save}} \approx 30\ \mu\text{s}$, $T_{\text{restore}} \approx 32\ \mu\text{s}$, $T_{\text{preempt}} \approx 62\ \mu\text{s}$

This strict methodology for state management facilitates reproducible measurement of preemption costs across domains.

5. Scheduling and Evaluation Metrics

Heimdall standardizes the measurement of preemption and scheduling through canonical metrics:

  • Fairness Index (Jain's): $F = \frac{(\sum_i x_i)^2}{n \sum_i x_i^2}$, where $x_i$ is the throughput or job count for tenant $i$ and $n$ is the number of tenants.
  • Resource Utilization: $U = \frac{\sum_t \text{active\_time}_t}{T_{\text{total}}}$, averaged over all $t$.
  • Scheduling Latency: $L_{\text{sched}}$ is the mean interval from preemption request to quiesce/restore.
  • Turnaround Time: $T_{\text{turn}} = \text{waiting time} + T_{\text{execution}} + (\text{number of preempts} \times T_{\text{preempt}})$

All metrics are evaluated over a scheduling epoch of duration $T_{\text{total}}$. This consistent framework permits apples-to-apples comparison of varied scheduling (e.g., fairness, utilization, latency) and preemption protocols.

6. Benchmarking Setup and Workflow

Hardware Platform

  • Xilinx Zynq-7000 SoC (XC7Z020, ARM Cortex-A9 with PL fabric)
  • Vivado 2018.2 toolchain; PCAP interface at 100 MHz

Software Stack

  • C-based PS-side orchestration using Xilinx SDK 2018.2
  • PCAP driver for configuration access
  • Scheduling daemon with configurable quantum and policies

Experimental Procedure

  1. Synthesize each benchmark with preemption hooks.
  2. Partition PL into the desired number of slots; map accelerators via PR.
  3. For each experimental run:
      • Trigger state save; record timestamps for $T_{\text{save}}$.
      • Trigger state restore; record for $T_{\text{restore}}$.
      • Log throughput, latency, and energy measurements (if available).
  4. Post-process data to compute $F$, $U$, $L_{\text{sched}}$, $T_{\text{preempt}}$, and $T_{\text{turn}}$ for each policy.

7. Extensibility and Guidelines for Adding New Benchmarks

Researchers can extend Heimdall as follows:

  1. Define a new PR region in Vivado (area, interface, clock).
  2. Adapt the RTL/HLS kernel with pause/resume hooks for all stateful logic (FFs, BRAMs, DSP regs) and implement frame-read/write loops per PS control.
  3. Specify $N_{\text{frames}}$, $F_{\text{size}}$, and DRAM address for $S_{\text{state}}$ in the scheduler.
  4. Develop a functional test that exercises pause–resume at random intervals and checks for bit-identical output.
  5. Contribute the following to the repository:
      • Source code (RTL/HLS), block design, preemption wrapper
      • PS control code snippet for save/restore invocation
      • Documentation (README) detailing resource usage, latency, and $S_{\text{state}}$

This rigorous process ensures all contributed benchmarks adhere to Heimdall’s reproducibility and preemption standards, supporting robust, comparative research in multi-tenant FPGA management (Malik et al., 10 Nov 2025).
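For post-processing logged runs, the evaluation metrics defined in Section 5 can be computed with a few helper functions. This is a sketch, not code from the Heimdall repository, and the function names are hypothetical. Utilization is interpreted here as the per-slot active fraction averaged over slots, which is one reading of the definition given above.

```python
def jain_fairness(x):
    """Jain's index F = (sum x_i)^2 / (n * sum x_i^2), with 1/n <= F <= 1."""
    n = len(x)
    sum_sq = sum(v * v for v in x)
    if n == 0 or sum_sq == 0:
        raise ValueError("need at least one non-zero allocation")
    return sum(x) ** 2 / (n * sum_sq)


def utilization(active_times, t_total):
    """Mean fraction of the epoch t_total that each slot spent active.

    Assumption: 'averaged over all t' is read as an average over slots.
    """
    return sum(a / t_total for a in active_times) / len(active_times)


def turnaround(waiting, execution, n_preempts, t_preempt):
    """T_turn = waiting time + T_execution + n_preempts * T_preempt."""
    return waiting + execution + n_preempts * t_preempt


print(jain_fairness([1, 1, 1, 1]))   # equal shares give the maximum index
print(jain_fairness([1, 0, 0, 0]))   # one tenant takes everything: 1/n
print(utilization([50, 100], 100))
print(turnaround(10, 100, 2, 62))
```

Equal per-tenant throughput yields an index of 1, while a single tenant monopolising the fabric drives it down to $1/n$.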
