SDFG-Smith in DaCe Framework

Updated 16 March 2026
  • SDFG-Smith is a Library Node in the DaCe framework that implements the Smith-Waterman algorithm with explicit data dependencies for FPGA optimization.
  • The methodology leverages hierarchical abstraction levels to systematically lower high-level dynamic programming recurrences for efficient, portable code generation on heterogeneous FPGA platforms.
  • Performance models and analytic resource estimation guide the autotuning of tiling, streaming, and pipelining optimizations to maximize throughput and reduce latency.

The SDFG-Smith node in the DaCe (Data-Centric parallel programming) framework implements the Smith-Waterman sequence alignment algorithm using the Stateful DataFlow multiGraph (SDFG) representation. This model exposes all data movements and computational dependencies, enabling programmatic optimization and efficient code generation across heterogeneous FPGA platforms through an automated, multi-level abstraction workflow. SDFG-Smith exemplifies the use of DaCe's Library Nodes, which embody domain-specific computations as extensible, parameterizable subgraphs, facilitating high-performance, portable hardware implementations (Licht et al., 2022).

1. DaCe Framework and SDFG Fundamentals

At the core of DaCe is the SDFG, a formal six-tuple graph $G = (V, E, \mathit{src}, \mathit{dst}, \mathit{attr}_V, \mathit{attr}_E)$, where the nodes $V$ include states (control-flow points), data containers (memory interfaces), computation tasklets (fine-grained operations), and map entries/exits (parameterized parallelism scopes). Edges, termed memlets, are directed and carry data descriptors, such as array subsets and transfer volumes. States encapsulate purely data-driven subgraphs; tasklets may access only data explicitly transmitted over memlets, which exposes all dependencies for analysis and transformation. Maps introduce parametric parallelism, while inter-state control flow enables structured loops, conditionals, and nesting. This explicit representation of data movement and control distinguishes the SDFG from traditional imperative IRs and is central to DaCe's optimization capabilities (Licht et al., 2022).
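To make the node and memlet vocabulary concrete, the following is a minimal sketch in plain Python of a graph whose edges carry subset and volume annotations. It does not use the DaCe API: the class names `ToyGraph`, `Node`, and `Memlet`, and the example node names, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    name: str
    kind: str  # e.g. "data", "tasklet", "map_entry", "map_exit"


@dataclass(frozen=True)
class Memlet:
    src: str     # name of the producing node
    dst: str     # name of the consuming node
    subset: str  # which part of the container moves, e.g. "H[i-1, j]"
    volume: int  # number of elements moved per execution


@dataclass
class ToyGraph:
    nodes: dict = field(default_factory=dict)
    memlets: list = field(default_factory=list)

    def add_node(self, name, kind):
        self.nodes[name] = Node(name, kind)

    def add_memlet(self, src, dst, subset, volume=1):
        # Every edge must connect two known nodes.
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("memlet endpoints must be existing nodes")
        self.memlets.append(Memlet(src, dst, subset, volume))

    def reads(self, name):
        # Subsets flowing into a node: its complete set of inputs.
        return [m.subset for m in self.memlets if m.dst == name]

    def writes(self, name):
        # Subsets flowing out of a node: its complete set of outputs.
        return [m.subset for m in self.memlets if m.src == name]


# A single Smith-Waterman cell update (illustrative node names).
g = ToyGraph()
g.add_node("H_in", "data")
g.add_node("UpdateCell", "tasklet")
g.add_node("H_out", "data")
for subset in ("H[i-1, j-1]", "H[i-1, j]", "H[i, j-1]"):
    g.add_memlet("H_in", "UpdateCell", subset)
g.add_memlet("UpdateCell", "H_out", "H[i, j]")
```

Because the tasklet can only see what arrives over its memlets, `g.reads("UpdateCell")` enumerates its full dependency set, which is the property that dependency analysis in a data-centric representation relies on.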

2. Multi-Level Library Node Abstraction

The Library Node concept introduces modular, abstract computation units endowed with named connectors (for input/output) and high-level semantics, such as “Smith-Waterman,” “matrix multiply,” or “stencil update.” Each Library Node is not executable until it is “lowered” through a series of hierarchical abstraction levels:

  1. Logical (domain model): Specifies the abstract recurrence, such as the Smith-Waterman DP formulation or a GEMM signature.
  2. Dataflow (SDFG): Instantiates control and data dependencies using SDFG primitives: maps for parallelism, tasklets for computing, and memlets for transfers. Tiling, streaming, and fusion are explicitly represented.
  3. Storage: Maps high-level data containers onto concrete hardware resources such as DRAM, BRAM, or shift registers.
  4. Platform: Emits vendor-specific code structures and annotations, including pragmas, data types, and host interface protocols (e.g., Vivado HLS C++ vs. Intel OpenCL).

At each level, domain-specific and general transformations (tiling, buffer mapping, loop pipelining) may be systematically introduced, supporting retargeting and optimization across applications and platforms (Licht et al., 2022).

3. SDFG-Smith Node: Formalization and Lowering Process

The SDFG-Smith node encodes the Smith-Waterman alignment for sequences $A[0 \ldots M]$ and $B[0 \ldots N]$, parameterized by a similarity function $s(i,j)$ and gap penalty $g$, and produces an alignment matrix $H$ of size $(M+1)\times(N+1)$ and a score. The recurrence is

$$H_{i,j} = \max\bigl\{0, \; H_{i-1,j-1} + s(i,j), \; H_{i-1,j} - g, \; H_{i, j-1} - g\bigr\}, \quad 1 \leq i \leq M, \; 1 \leq j \leq N.$$
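As a software reference point, the recurrence can be transcribed directly into a few lines of plain Python. This is a sketch under stated assumptions, not the node's implementation: the match/mismatch scores standing in for $s(i,j)$ and the default gap penalty are illustrative values, and the sequences are 0-indexed Python strings of lengths $M$ and $N$.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=1):
    """Fill the (M+1) x (N+1) matrix H and return it with the best score.

    The scoring values are assumptions chosen for illustration.
    """
    m, n = len(a), len(b)
    # Row 0 and column 0 stay zero (boundary condition).
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # s(i, j): similarity of the i-th symbol of a and j-th of b.
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,
                H[i - 1][j - 1] + s,  # diagonal: align the two symbols
                H[i - 1][j] - gap,    # gap in b
                H[i][j - 1] - gap,    # gap in a
            )
            best = max(best, H[i][j])
    return H, best


H, score = smith_waterman("ACGT", "CG")
```

Each cell depends only on its upper, left, and upper-left neighbors, which is the dependency pattern the node has to expose to the optimizer.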

Formally, LibraryNode("Smith", connectors=[A,B,s,g], outputs=[H]). The canonical lowering proceeds via:

  • init state: Zero-initializes the first row/column of $H$.
  • compute state: Nested maps $Map_i$, $Map_j$ iterate over rows and columns, enclosing a tasklet "UpdateCell" that reads the necessary neighboring cells and writes $H(i,j)$.
  • finalize state: Applies reduction to extract the best alignment score (or backtrace).
  • memlets: Annotate each data transfer, specifying subset mappings such as $H(i-1,j)$ to $H(i,j)$ with explicit element-wise dependencies.

This hierarchical formulation directly exposes algorithmic parallelism (e.g., wavefront parallelism via tiling/fusion) and dependencies suitable for hardware pipelining (Licht et al., 2022).
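The wavefront parallelism mentioned above can be illustrated by traversing the matrix along anti-diagonals, on which all cells are mutually independent. The sketch below is plain Python rather than an SDFG, mirrors the init/compute/finalize structure, and uses assumed match, mismatch, and gap values; the function name is hypothetical.

```python
def wavefront_smith_waterman(a, b, match=2, mismatch=-1, gap=1):
    m, n = len(a), len(b)

    # init: first row and column (here, the whole matrix) set to zero.
    H = [[0] * (n + 1) for _ in range(m + 1)]

    # compute: sweep anti-diagonals d = i + j. Cells on diagonal d read
    # only diagonals d-1 and d-2, so the inner loop carries no dependency
    # and could be executed in parallel or pipelined in hardware.
    for d in range(2, m + n + 1):
        lo = max(1, d - n)   # keeps j = d - i <= n
        hi = min(m, d - 1)   # keeps j = d - i >= 1
        for i in range(lo, hi + 1):
            j = d - i
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,
                H[i - 1][j - 1] + s,
                H[i - 1][j] - gap,
                H[i][j - 1] - gap,
            )

    # finalize: reduce the matrix to the best local alignment score.
    best = max(max(row) for row in H)
    return H, best


H_wave, score_wave = wavefront_smith_waterman("ACGT", "CG")
```

Reordering the iteration space this way changes only the schedule, not the values computed, so the result matches a row-by-row evaluation of the same recurrence.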

4. Performance Models and Resource Estimation

For a fully pipelined $j$ dimension, hardware loops can achieve a pipeline initiation interval of $II=1$ (one cell per cycle). The total latency per row is $II \cdot N + T_{\text{fill}} \approx N + \delta$; over all rows, $L_{\text{total}} \approx M \cdot N + O(M+N)$. Resource use is analytically parameterizable:

  • DSP units: $\approx \alpha \cdot T_i \cdot T_j$, with $\alpha \approx 1.5$ (each DP cell: one multiply–add for substitution, two comparisons).
  • BRAM blocks: $\approx \beta \left(\lceil T_j W / \text{BRAM\_DEPTH}\rceil + \lceil T_i W / \text{BRAM\_DEPTH}\rceil\right)$, with one term per wavefront buffer, where $W$ is the word width in bits, $\beta$ accounts for packing, and BRAM_DEPTH is typically 1024 or 2048.

These models underpin autotuning and design-space exploration at the SDFG and code generation stages (Licht et al., 2022).
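The formulas above are simple enough to evaluate in a few lines, which is essentially what a design-space exploration loop would do for each candidate configuration. The following sketch assumes that $T_i$ and $T_j$ are tile sizes and uses illustrative defaults for $W$, $\beta$, and the fill latency; the function name and return format are hypothetical.

```python
import math


def estimate(M, N, Ti, Tj, W=16, alpha=1.5, beta=1.0,
             bram_depth=1024, ii=1, fill=0):
    """Evaluate the latency and resource formulas for one configuration.

    All default values are assumptions chosen for illustration.
    """
    row_cycles = ii * N + fill      # II * N + T_fill
    total_cycles = M * row_cycles   # M * N plus lower-order terms
    dsp = alpha * Ti * Tj
    bram = beta * (math.ceil(Tj * W / bram_depth)
                   + math.ceil(Ti * W / bram_depth))
    return {"cycles": total_cycles, "dsp": dsp, "bram": bram}


# Compare two hypothetical tilings for a 1024 x 1024 alignment.
small = estimate(1024, 1024, Ti=8, Tj=16)
large = estimate(1024, 1024, Ti=16, Tj=128)
```

In this model the cycle count is independent of the tile sizes, so an autotuner would weigh the resource estimates against device limits rather than against latency alone.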

5. Platform-Specific FPGA Code Generation

The fully-lowered SDFG is partitioned into "kernels" (subgraphs with on-chip memory residency), then code generation is performed by a platform-specific backend:

  • Xilinx (Vivado HLS): Emits C++ functions with AXI interfaces and #pragma HLS annotations, such as DATAFLOW, PIPELINE with a target initiation interval, and UNROLL factors for inner loops. hlslib::Stream constructs mediate inter-PE communication.
  • Intel (OpenCL): Emits separate __kernel functions for weakly connected PEs, linked by Intel OpenCL channels, with platform-specific unrolling and dependency annotations.
  • Optimizations: Code generation injects tiling along Map scopes, stream depths, array reshaping/partitioning, and loop flattening/merging pragmas to match the intended execution pipeline and resource usage.

Minor tunable parameters (e.g., tile sizes, pragma depths) may be adjusted per backend, but the SDFG and LibraryNode definitions are invariant, ensuring design portability (Licht et al., 2022).
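The idea that one loop description yields different vendor annotations can be illustrated with a toy emitter. This is not DaCe's code generator and does not reproduce its output: the function `emit_inner_loop` and the callee `update_cell` are hypothetical, and only the pragma spellings (`#pragma HLS PIPELINE`, `#pragma HLS UNROLL` for Vivado HLS; `#pragma ii`, `#pragma unroll` for Intel's OpenCL compiler) are taken from the vendor toolchains.

```python
def emit_inner_loop(backend, bound="N", unroll=1):
    """Return source text for the inner j-loop, annotated per backend."""
    header = f"for (int j = 1; j <= {bound}; ++j) {{"
    body = "  update_cell(i, j);  // hypothetical cell-update function"

    if backend == "xilinx":
        # Vivado HLS pragmas are placed inside the loop body.
        pragmas = ["  #pragma HLS PIPELINE II=1"]
        if unroll > 1:
            pragmas.append(f"  #pragma HLS UNROLL factor={unroll}")
        lines = [header, *pragmas, body, "}"]
    elif backend == "intel":
        # Intel OpenCL pragmas precede the loop they annotate.
        pragmas = ["#pragma ii 1"]
        if unroll > 1:
            pragmas.append(f"#pragma unroll {unroll}")
        lines = [*pragmas, header, body, "}"]
    else:
        raise ValueError(f"unknown backend: {backend}")
    return "\n".join(lines)


xilinx_src = emit_inner_loop("xilinx", unroll=4)
intel_src = emit_inner_loop("intel", unroll=4)
```

Both strings describe the same loop; only the annotation syntax and placement differ, which is the sense in which backend parameters vary while the higher-level description stays fixed.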

6. Performance and Portability Characteristics

The analytical throughput model predicts a per-cell throughput of $f_{\text{clk}} / II$; at $f_{\text{clk}}=300$ MHz and $II=1$, this is $300$ million cells/s. For a $1024\times1024$ alignment, this corresponds to a compute time of approximately $3.5$ ms, plus system and I/O overhead. Empirical measurements are:

| Platform | Throughput | Resource Usage |
|---|---|---|
| Xilinx Alveo U250 (Vivado) | $\approx 200$ M cells/s | $\sim 128$ DSPs, $\sim 64$ BRAMs |
| Intel Stratix 10 (OpenCL) | $\approx 230$ M cells/s | $\sim 160$ DSPs, $\sim 80$ M20Ks |

The fundamental portability stems from the invariant high-level SDFG and LibraryNode; only backend code emission adapts to the vendor toolchain. Minor retargeting (e.g., tile or stream depths) optimizes for hardware, but does not alter the algorithmic abstraction (Licht et al., 2022).
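The throughput arithmetic in this section can be reproduced with a short calculation. The sketch below assumes that the quoted throughput is sustained over the whole $1024\times1024$ matrix and ignores host and I/O overhead; the function names are illustrative.

```python
def cells_per_second(f_clk_hz, ii=1):
    # One cell is completed every `ii` clock cycles.
    return f_clk_hz / ii


def runtime_ms(m, n, throughput):
    # Time to update all m * n cells at the given throughput.
    return m * n / throughput * 1e3


ideal = runtime_ms(1024, 1024, cells_per_second(300e6))  # model
alveo = runtime_ms(1024, 1024, 200e6)    # measured-throughput figure
stratix = runtime_ms(1024, 1024, 230e6)  # measured-throughput figure
```

Under these assumptions the model gives about 3.5 ms, while the measured throughputs correspond to roughly 5.2 ms and 4.6 ms, that is, about two thirds and three quarters of the modeled peak.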

7. Broader Implications and Methodological Significance

The SDFG-Smith node demonstrates the ability of the DaCe/Library Node paradigm to encode complex dynamic programming recurrences with explicit, analyzable dataflow, supporting both generic and domain-specific transformations. This approach enables systematic multi-level optimization, performance modeling, and automated retargeting, thereby facilitating hardware-efficient, portable implementations with significantly reduced manual intervention. A plausible implication is that the methodology extends to a broader class of DP algorithms and scientific kernels, promoting reproducibility, maintainability, and cross-platform scalability in FPGA-accelerated computation (Licht et al., 2022).

References (1)
