SDFG-Smith in DaCe Framework
- SDFG-Smith is a Library Node in the DaCe framework that implements the Smith-Waterman algorithm with explicit data dependencies for FPGA optimization.
- The methodology leverages hierarchical abstraction levels to systematically lower high-level dynamic programming recurrences for efficient, portable code generation on heterogeneous FPGA platforms.
- Performance models and analytic resource estimation guide the autotuning of tiling, streaming, and pipelining optimizations to maximize throughput and reduce latency.
The SDFG-Smith node in the DaCe (Data-Centric parallel programming) framework implements the Smith-Waterman sequence alignment algorithm using the Stateful DataFlow multiGraph (SDFG) representation. This model exposes all data movements and computational dependencies, enabling programmatic optimization and efficient code generation across heterogeneous FPGA platforms through an automated, multi-level abstraction workflow. SDFG-Smith exemplifies the use of DaCe's Library Nodes, which embody domain-specific computations as extensible, parameterizable subgraphs, facilitating high-performance, portable hardware implementations (Licht et al., 2022).
1. DaCe Framework and SDFG Fundamentals
At the core of DaCe is the SDFG, a formal six-tuple graph whose nodes include states (control-flow points), data containers (memory interfaces), computation tasklets (fine-grained operations), and map entries/exits (parameterized parallelism scopes). Edges, termed memlets, are directed and carry data descriptors such as array subsets and transfer volumes. States encapsulate purely data-driven subgraphs; tasklets may access only data explicitly passed over memlets, which exposes all dependencies for analysis and transformation. Maps introduce parametric parallelism, while inter-state control flow enables structured loops, conditionals, and nesting. This explicit exposure of data movement and control distinguishes the SDFG from traditional imperative IRs and is central to DaCe's optimization capabilities (Licht et al., 2022).
2. Multi-Level Library Node Abstraction
The Library Node concept introduces modular, abstract computation units with named input/output connectors and high-level semantics, such as “Smith-Waterman,” “matrix multiply,” or “stencil update.” A Library Node is not executable until it is “lowered” through a series of hierarchical abstraction levels:
- Logical (domain model): Specifies the abstract recurrence, such as the Smith-Waterman DP formulation or a GEMM signature.
- Dataflow (SDFG): Instantiates control and data dependencies using SDFG primitives: maps for parallelism, tasklets for computing, and memlets for transfers. Tiling, streaming, and fusion are explicitly represented.
- Storage: Maps high-level data containers onto concrete hardware resources such as DRAM, BRAM, or shift registers.
- Platform: Emits vendor-specific code structures and annotations, including pragmas, data types, and host interface protocols (e.g., Vivado HLS C++ vs. Intel OpenCL).
At each level, domain-specific and general transformations (tiling, buffer mapping, loop pipelining) may be systematically introduced, supporting retargeting and optimization across applications and platforms (Licht et al., 2022).
3. SDFG-Smith Node: Formalization and Lowering Process
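The SDFG-Smith node encodes the Smith-Waterman alignment for sequences $A$ and $B$, parameterized by a similarity function $s$ and gap penalty $g$, and produces an alignment matrix $H$ and score. The recurrence is
$$H_{i,j} = \max\{\,0,\; H_{i-1,j-1} + s(A_i, B_j),\; H_{i-1,j} - g,\; H_{i,j-1} - g\,\}$$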
Formally, LibraryNode("Smith", connectors=[A,B,s,g], outputs=[H]). The canonical lowering proceeds via:
- init state: Zero-initializes the first row/column of $H$.
- compute state: Nested maps over $i$ and $j$ iterate over rows and columns, enclosing a tasklet "UpdateCell" that reads the necessary neighboring cells and writes $H_{i,j}$.
- finalize state: Applies reduction to extract the best alignment score (or backtrace).
- memlets: Annotate each data transfer, specifying subset mappings such as $H[i-1, j-1]$ to $H[i, j]$ with explicit element-wise dependencies.
This hierarchical formulation directly exposes algorithmic parallelism (e.g., wavefront parallelism via tiling/fusion) and dependencies suitable for hardware pipelining (Licht et al., 2022).
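The three states above can be mirrored in a minimal pure-Python sketch. It does not use the DaCe API and is not the node's actual lowering: the function names are hypothetical, a linear gap penalty `g` is assumed, and plain loops stand in for the nested maps. The second function visits cells along anti-diagonals to illustrate why wavefront parallelism is legal: every cell on a diagonal depends only on the two previous diagonals.

```python
def update_cell(H, i, j, a, b, s, g):
    """Stand-in for the "UpdateCell" tasklet: reads three neighbors, returns H[i][j]."""
    return max(0,
               H[i - 1][j - 1] + s(a[i - 1], b[j - 1]),  # match or mismatch
               H[i - 1][j] - g,                          # gap in b
               H[i][j - 1] - g)                          # gap in a


def smith_waterman(a, b, s, g):
    """Row-major reference with init, compute, and finalize stages."""
    m, n = len(a), len(b)
    # init: first row and first column of H are zero
    H = [[0] * (n + 1) for _ in range(m + 1)]
    # compute: loops over i and j stand in for the nested maps
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            H[i][j] = update_cell(H, i, j, a, b, s, g)
    # finalize: max-reduction over H gives the best local alignment score
    return H, max(max(row) for row in H)


def smith_waterman_wavefront(a, b, s, g):
    """Same recurrence, visiting cells by anti-diagonal d = i + j."""
    m, n = len(a), len(b)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    for d in range(2, m + n + 1):
        # All cells on one anti-diagonal are mutually independent,
        # so this inner loop could run in parallel or be pipelined.
        for i in range(max(1, d - n), min(m, d - 1) + 1):
            j = d - i
            H[i][j] = update_cell(H, i, j, a, b, s, g)
    return H, max(max(row) for row in H)


def simple_score(x, y, match=3, mismatch=-3):
    """Hypothetical similarity function: fixed match/mismatch scores."""
    return match if x == y else mismatch


if __name__ == "__main__":
    H_rows, best_rows = smith_waterman("TGTTACGG", "GGTTGACTA", simple_score, 2)
    H_wave, best_wave = smith_waterman_wavefront("TGTTACGG", "GGTTGACTA", simple_score, 2)
    print(best_rows, best_wave, H_rows == H_wave)
```

Both traversal orders fill the same matrix; only the schedule differs, which is the property a tiling or wavefront transformation relies on.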
4. Performance Models and Resource Estimation
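For a fully pipelined dimension, hardware loops can achieve a pipeline initiation interval of $II = 1$ (one cell per cycle). For sequences of lengths $m$ and $n$, the latency per row of $n$ cells is then approximately $n \cdot II$ cycles plus the pipeline depth; over all $m$ rows, it is approximately $m \cdot n \cdot II$ cycles. Resource use is analytically parameterizable: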
- DSP units: , with (each DP cell: one multiply–add for substitution, two comparisons).
- BRAM blocks: , where is word width in bits, accounts for packing, BRAM_DEPTH is typically 1024 or 2048 (for two wavefront buffers).
These models underpin autotuning and design-space exploration at the SDFG and code generation stages (Licht et al., 2022).
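As a rough illustration of the latency side of such a model, the sketch below uses the standard cycle count for a pipelined loop: the pipeline fills once, then one iteration issues every `ii` cycles. The function names and the `depth` parameter (pipeline depth in cycles) are assumptions for illustration and are not taken from the source; the DSP and BRAM estimates are not modeled here.

```python
def pipelined_loop_cycles(trip_count, ii=1, depth=1):
    """Cycles for a pipelined loop: `depth` cycles for the first result,
    then one further iteration completes every `ii` cycles."""
    if trip_count <= 0:
        return 0
    return depth + ii * (trip_count - 1)


def alignment_cycles(m, n, ii=1, depth=1, flattened=True):
    """Cycle estimate for an m-by-n DP matrix.

    flattened=True  : the row and column loops are merged into one pipeline,
                      so the fill cost is paid once.
    flattened=False : the pipeline drains and refills at the end of every row.
    """
    if flattened:
        return pipelined_loop_cycles(m * n, ii, depth)
    return m * pipelined_loop_cycles(n, ii, depth)


if __name__ == "__main__":
    # Hypothetical depth of 10 cycles, chosen only to show the difference.
    print(alignment_cycles(1024, 1024, ii=1, depth=10, flattened=True))
    print(alignment_cycles(1024, 1024, ii=1, depth=10, flattened=False))
```

The gap between the two estimates grows with the pipeline depth, which is one reason loop flattening appears among the tunable optimizations.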
5. Platform-Specific FPGA Code Generation
The fully lowered SDFG is partitioned into "kernels" (subgraphs with on-chip memory residency), and a platform-specific backend then generates the code:
- Xilinx (Vivado HLS): Emits C++ functions with AXI interfaces and #pragma HLS annotations, such as DATAFLOW, pipeline initiation, and UNROLL factors for inner loops. hlslib::Stream constructs mediate inter-PE communication.
- Intel (OpenCL): Emits separate __kernel functions for weakly connected PEs, linked by cl::channels, with platform-specific unrolling and dependency annotations.
- Optimizations: Code generation injects tiling along Map scopes, stream depths, array reshaping/partitioning, and loop flattening/merging pragmas to match the intended execution pipeline and resource usage.
Minor tunable parameters (e.g., tile sizes, pragma depths) may be adjusted per backend, but the SDFG and LibraryNode definitions are invariant, ensuring design portability (Licht et al., 2022).
6. Performance and Portability Characteristics
The analytical throughput model predicts a per-cell throughput of $f / II$; at $f = 300$ MHz and $II = 1$, this is $300$ million cells/s. For a $1024 \times 1024$ alignment, the predicted runtime is approximately $3.5$ ms, plus system and I/O overhead. Empirical measurements are:
| Platform | Throughput | Resource Usage |
|---|---|---|
| Xilinx Alveo U250 (Vivado) | 200 M cells/s | 128 DSPs, 64 BRAMs |
| Intel Stratix 10 (OpenCL) | 230 M cells/s | 160 DSPs, 80 M20Ks |
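
The arithmetic behind these figures can be checked with a short sketch. It assumes a 300 MHz clock with an initiation interval of 1 (which reproduces the 300 million cells/s prediction) and a 1024 × 1024 problem (which reproduces the roughly 3.5 ms runtime). Applying that same problem size to the measured throughputs is an assumption made here for comparison only; the function names are hypothetical.

```python
def predicted_throughput(f_hz, ii=1):
    """Cells per second for a pipeline issuing one cell every `ii` cycles."""
    return f_hz / ii


def runtime_seconds(m, n, cells_per_second):
    """Time to fill an m-by-n DP matrix, ignoring system and I/O overhead."""
    return (m * n) / cells_per_second


# Assumed model parameters: 300 MHz clock, II = 1.
predicted = predicted_throughput(300e6, ii=1)

# Measured throughputs from the table above, in cells per second.
measured = {
    "Xilinx Alveo U250 (Vivado)": 200e6,
    "Intel Stratix 10 (OpenCL)": 230e6,
}

if __name__ == "__main__":
    model_ms = runtime_seconds(1024, 1024, predicted) * 1e3
    print(f"model: {model_ms:.2f} ms")
    for name, tput in measured.items():
        ms = runtime_seconds(1024, 1024, tput) * 1e3
        print(f"{name}: {ms:.2f} ms, {tput / predicted:.0%} of the model")
```

Under these assumptions the measured designs reach roughly two thirds to three quarters of the modeled peak.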
The fundamental portability stems from the invariant high-level SDFG and LibraryNode; only backend code emission adapts to the vendor toolchain. Minor retargeting (e.g., tile or stream depths) optimizes for hardware, but does not alter the algorithmic abstraction (Licht et al., 2022).
7. Broader Implications and Methodological Significance
The SDFG-Smith node demonstrates that the DaCe/Library Node paradigm can encode complex dynamic programming recurrences with explicit, analyzable dataflow, supporting both generic and domain-specific transformations. This approach enables systematic multi-level optimization, performance modeling, and automated retargeting, and so yields hardware-efficient, portable implementations with significantly less manual intervention. A plausible implication is that the methodology extends to a broader class of DP algorithms and scientific kernels, promoting reproducibility, maintainability, and cross-platform scalability in FPGA-accelerated computation (Licht et al., 2022).