HLS Dialect for Synthesizable Hardware
- HLS Dialect is a specialized programming subset with explicit syntactic restrictions and pragmas to ensure hardware synthesizability.
- It employs intermediate representations like region-based, multi-level, and SSA-based IRs to enable optimizations such as loop unrolling, pipelining, and array partitioning.
- Recent methodologies leverage LLM-assisted code generation and structured feedback to improve synthesis efficiency and performance metrics.
High-Level Synthesis (HLS) dialect refers to a family of programming language subsets, intermediate representations, and syntactic conventions devised to enable the automatic transformation of software-level code into hardware circuit designs, especially for FPGAs and custom accelerators. These dialects are characterized by restricted and extended versions of C/C++ (sometimes functional or domain-specific languages), explicit optimization hints, transformation-friendly grammars, and systematic mapping rules to guarantee synthesizability and hardware efficiency. This article presents the technical foundations, formal grammars, intermediate representations, optimization mechanisms, and evaluation patterns underlying state-of-the-art HLS dialects, including those leveraged for dynamic scheduling, LLM-assisted code generation, and multi-level IR frameworks.
1. Formal Syntax and Semantics
The canonical HLS dialect adopts a subset of C++ with explicit extensions and restrictions to ensure hardware compatibility. The permitted syntax encompasses static arrays, arbitrary-precision types (ap_int<W>, ap_fixed<TOT,INT>), streaming FIFO types (hls::stream<T>), and vendor-defined pragmas, while disallowing dynamic memory allocation, recursion, unrestricted pointer arithmetic, and other non-static constructs (Zou et al., 6 Jul 2025, Khan et al., 5 Aug 2025, Gai et al., 19 Feb 2025). Minimal grammars are given in BNF form; one formalization is:
```bnf
<program>        ::= { <global-decl> }
<global-decl>    ::= <function-def> | <var-decl> | <pragma>
<pragma>         ::= "#pragma" "HLS" <pragma-args>
<pragma-args>    ::= "PIPELINE" ["II=" <int>] | "UNROLL" ["factor=" <int>]
                   | "ARRAY_PARTITION" ["variable=" <ident>] ["type=" <partition-type>]
                   | "RESOURCE" ["variable=" <ident>] "core=" <ident>
                   | "INLINE" | "DATAFLOW" | "LOOP_MERGE"
                   | "DEPENDENCE" ["variable=" <ident>] "inter" <bool>
                   | "STREAM" ["variable=" <ident>] "depth=" <int>
<partition-type> ::= "complete" | "cyclic" | "block"
<type-spec>      ::= "int" | "float" | "ap_int" "<" <int> ">" | ...
<function-def>   ::= <type-spec> <ident> "(" <param-list> ")" "{" <stmt-list> "}"
```
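For concreteness, the following is a minimal kernel sketch conforming to this subset, assuming the vendor headers `ap_int.h` and `hls_stream.h`; the function and identifiers are illustrative rather than drawn from any cited benchmark:

```cpp
#include <ap_int.h>
#include <hls_stream.h>

// Illustrative kernel: static arrays, arbitrary-precision types, a
// streaming FIFO input, and pragmas; no heap, recursion, or raw pointers.
void accumulate(hls::stream<ap_int<16> > &in, ap_int<32> out[64]) {
    ap_int<32> acc[64];
#pragma HLS ARRAY_PARTITION variable=acc type=complete

    for (int i = 0; i < 64; i++) {
#pragma HLS PIPELINE II=1
        acc[i] = in.read();  // blocking FIFO read
    }
    for (int i = 0; i < 64; i++) {
#pragma HLS UNROLL factor=4
        out[i] = acc[i];
    }
}
```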
2. Intermediate Representations and Compilation Pipelines
HLS dialects are mapped to intermediate representations (IRs) that support transformation, analysis, and translation to RTL. Three dominant IR paradigms are currently in use:
- Region-based IR (R-HLS): Formalized as a tuple $G = (N, \kappa, E_{\mathrm{data}}, E_{\mathrm{state}})$, where $\kappa$ classifies nodes as data/control/memory/region, $E_{\mathrm{data}}$ denotes data-flow edges, $E_{\mathrm{state}}$ captures state-edge (handshake token) dependencies, and region nodes recursively encode control-flow constructs as multiplexed subgraphs (Metz et al., 16 Aug 2024).
- Multi-level IR (ScaleHLS): MLIR-based layers progress from graph-level (ONNX/tensor) representations, through SCF/affine loop bands, to the directive-enriched `hlscpp` dialect. Attributes annotate functions, loops, and arrays (memref affine maps) for downstream codegen (Ye et al., 2021).
- SSA-based IR (AnyHLS): Based on statically-typed Impala source, partial evaluation flattens control and lifts hardware generators (unroll, pipeline, etc.) into a vendor-agnostic SSA IR, which maps directly to residual synthesizable C++/OpenCL code (Özkan et al., 2020).
These IRs mediate transformations such as loop unrolling, pipelining, dataflow decomposition, array partitioning, and memory disambiguation, supporting automated design-space exploration and performance/resource estimation.
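As a conceptual illustration of one such transformation, the source-level effect of a factor-4 unroll can be sketched as below; the actual rewrite operates on the IR, and the function names here are illustrative:

```cpp
// Rolled dot product over fixed-size arrays, as written in the dialect.
int dot_rolled(const int a[8], const int b[8]) {
    int acc = 0;
    for (int i = 0; i < 8; i++) {
        acc += a[i] * b[i];
    }
    return acc;
}

// Conceptual result of UNROLL factor=4: four multiply-accumulates are
// instantiated per iteration, trading DSP/LUT area for fewer iterations.
int dot_unrolled(const int a[8], const int b[8]) {
    int acc = 0;
    for (int i = 0; i < 8; i += 4) {
        acc += a[i]     * b[i];
        acc += a[i + 1] * b[i + 1];
        acc += a[i + 2] * b[i + 2];
        acc += a[i + 3] * b[i + 3];
    }
    return acc;
}
```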
3. Pragmas and Optimization Directives
Optimization pragmas are encoded in HLS dialects to control hardware performance, resource use, and parallelism. Major directives and their semantics are summarized as:
| Pragma | Semantics/Impact |
|---|---|
| PIPELINE II=N | Loop/function can initiate every N cycles; lowers II, increases reg/LUT usage |
| UNROLL factor=K | Duplicate loop K-way; reduces loop latency, increases DSP/LUT/FF cost |
| ARRAY_PARTITION | Splits array for parallel access; increases mem banks, removes conflicts |
| RESOURCE core=X | Forces mapping of buffer to specific primitive (e.g., RAM_2P) |
| DATAFLOW | Enables concurrent execution of functions/loops; instantiates FIFOs |
| STREAM depth=D | Instantiates hls::stream FIFO of given depth |
| DEPENDENCE inter=false | Declares absence of loop-carried dependence; permits aggressive transformation |
| LOOP_MERGE | Fuses loops to expose further pipeline parallelism |
| INLINE | Inlines small functions; removes call/return overhead |
These directives collectively control initiation intervals, concurrency, resource mapping, data dependencies, and streaming behavior as required by the target architecture (Zou et al., 6 Jul 2025, Gai et al., 19 Feb 2025).
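The interaction of DATAFLOW, STREAM, and PIPELINE is easiest to see in a producer/consumer pair; the following is an illustrative sketch (function names and sizes are not taken from the cited works):

```cpp
#include <hls_stream.h>

// Producer and consumer communicate through a depth-8 FIFO; under
// DATAFLOW both stages execute concurrently as a task pipeline.
static void produce(const int in[128], hls::stream<int> &s) {
    for (int i = 0; i < 128; i++) {
#pragma HLS PIPELINE II=1
        s.write(in[i] * 2);
    }
}

static void consume(hls::stream<int> &s, int out[128]) {
    for (int i = 0; i < 128; i++) {
#pragma HLS PIPELINE II=1
        out[i] = s.read() + 1;
    }
}

void top(const int in[128], int out[128]) {
#pragma HLS DATAFLOW
    hls::stream<int> s;
#pragma HLS STREAM variable=s depth=8
    produce(in, s);
    consume(s, out);
}
```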
4. Transformation Taxonomy and Performance Modeling
The HLS dialect supports five transformation classes: code restructuring (tiling, fusion), directive/pragma insertion, data-type adaptation (to bit-accurate types), function intrinsics (replacing math calls with hls::sqrt, hls::exp), and code repair (removing recursion/dynamic allocation) (Zou et al., 6 Jul 2025). Formal models guide design-space analysis:
- Latency: For a pipelined loop with trip count $N$, initiation interval $II$, and pipeline depth $D$: $\text{Latency} = (N - 1) \cdot II + D$ (see the estimator sketch after this list).
- Resource Usage: Modeled additively as $R = \sum_k n_k \cdot r_k$, where $n_k$ counts instances of operator/storage type $k$ and $r_k$ is its per-instance DSP/LUT/FF/BRAM cost.
- Array Partitioning: Encoded as affine maps; e.g., cyclic partitioning with factor $P$ sends index $i$ to bank $i \bmod P$ at offset $\lfloor i / P \rfloor$.
Design-space exploration (DSE) leverages Pareto-front search and iterative neighbor updates under latency/area constraints (Ye et al., 2021).
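A small C++ sketch of these first-order estimators (helper names are illustrative, not any framework's API):

```cpp
// Latency of a pipelined loop: (N - 1) initiations separated by II
// cycles, plus the pipeline depth D to drain the final iteration.
unsigned pipelined_loop_latency(unsigned trip_count, unsigned ii, unsigned depth) {
    return (trip_count - 1) * ii + depth;
}

// Additive resource model: R = sum_k n_k * r_k over operator/storage kinds.
unsigned resource_estimate(const unsigned counts[], const unsigned unit_cost[],
                           unsigned kinds) {
    unsigned total = 0;
    for (unsigned k = 0; k < kinds; k++)
        total += counts[k] * unit_cost[k];
    return total;
}
// Example: trip_count = 64, II = 1, depth = 5 gives 68 cycles.
```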
5. Advanced IR Concepts: Regions and State Edges (R-HLS)
R-HLS (Metz et al., 16 Aug 2024) formally reifies control and memory ordering as region nodes and state edges within a single global data-flow graph:
- Region Node: Encapsulates disjoint subgraphs $G_1, \dots, G_k$ representing conditionals, switches, or loop bodies, multiplexed at the region boundary.
- State Edge: An edge $(u, v) \in E_{\mathrm{state}}$ enforces that $v$ may not fire before $u$; memory state chains model per-loop memory traffic ordering.
- Distributed Memory Disambiguation: Per-store/load address queues (ADDR-Qs) replace centralized load-store queues (LSQ), scaling resource use with loop nesting/locality. Correctness is enforced by address comparison: a younger access may fire ahead of an older store only once their addresses are known to differ.
Conversion to a parallel elastic circuit leverages handshake bundle generation (data and state edges mapped to ready/valid channels), topological scheduling, and strategic buffer insertion.
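A structural sketch of such a graph in C++ (all type and field names here are hypothetical, not the R-HLS implementation):

```cpp
#include <memory>
#include <vector>

// Node classification mirroring the data/control/memory/region split.
enum class NodeKind { Data, Control, Memory, Region };

struct Node {
    NodeKind kind;
    std::vector<Node*> data_succs;   // data-flow edges (value dependencies)
    std::vector<Node*> state_succs;  // state edges (handshake-token ordering)
    // A region node owns disjoint subgraphs, e.g. the then/else bodies of
    // a conditional; a predicate selects which subgraph's results are
    // multiplexed out at the region boundary.
    std::vector<std::vector<std::unique_ptr<Node>>> subregions;
};
```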
6. LLM-Driven HLS Generation and Verification Workflows
Recent advances employ LLMs fine-tuned on HLS dialects for automated code generation. Training corpora (HLStrans (Zou et al., 6 Jul 2025), SAGE-HLS (Khan et al., 5 Aug 2025), and the Code Llama-based dataset of (Gai et al., 19 Feb 2025)) consist of tens of thousands of C-to-HLS transformations, with annotated pragmas and performance labels. Syntactic and semantic correctness are improved with structured prompting (chain-of-thought), iterative feedback loops (syntax + functional), and AST-guided instruction prompting (serialized AST context) (Gai et al., 19 Feb 2025, Khan et al., 5 Aug 2025).
Metrics demonstrate that chain-of-thought and feedback loops raise pass@3 rates by 6–13% absolute; SAGE-HLS achieves near-100% synthesizability and 75% functional correctness at pass@10. Token efficiency is a key driver, with HLS code requiring substantially fewer tokens than equivalent Verilog (Gai et al., 19 Feb 2025).
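The two-step feedback loop described above can be sketched as follows; every type and function here is a hypothetical stand-in (an LLM query, an HLS synthesis check, and C/RTL co-simulation), not a real tool's API:

```cpp
#include <string>

struct Report { bool ok; std::string log; };

// Stand-in declarations: wire these to an actual LLM endpoint and HLS
// toolchain; they are placeholders, not library functions.
std::string llm_generate(const std::string &spec, const std::string &feedback);
Report run_synthesis(const std::string &code);  // syntax + synthesizability
Report run_cosim(const std::string &code);      // functional correctness

// Repair syntax first, then semantics, feeding tool logs back as prompts.
std::string refine(const std::string &spec, int max_iters) {
    std::string code = llm_generate(spec, "");
    for (int i = 0; i < max_iters; i++) {
        Report syn = run_synthesis(code);
        if (!syn.ok) { code = llm_generate(spec, syn.log); continue; }
        Report fun = run_cosim(code);
        if (fun.ok) break;  // synthesizable and functionally correct
        code = llm_generate(spec, fun.log);
    }
    return code;
}
```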
7. Comparative Evaluation, Portability, and Design Guidelines
Dialects and frameworks differ in IR expressivity, portability, and performance:
- AnyHLS (Özkan et al., 2020): Uses higher-order functional abstractions (Impala), enabling modular generator composition and vendor-agnostic output; partial evaluation specializes abstractions, yielding synthesizable HLS code for Intel/Xilinx.
- ScaleHLS (Ye et al., 2021): Leverages multi-level IR (onnx, scf, affine, hlscpp), automating DSE via loop/array/metadata transformations and analytical cost modeling. Attributes drive downstream HLS code emission with improved quality-of-results, reporting significant speedups on neural-network models.
- R-HLS (Metz et al., 16 Aug 2024): Outperforms state-of-the-art dynamic HLS (StoQ) by exposing inter-block parallelism and decoupling memory ordering, with benchmarks showing speedups alongside LUT and FF reductions on irregular kernels.
Practical guidelines emphasize prompt engineering (explicit reasoning steps), modular code patterns (≤200 tokens/function), custom macros for pragmas, and two-step feedback loops (Gai et al., 19 Feb 2025). Vendor-specific portability is optimally achieved by abstracting away pragmas at the source level (AnyHLS) or encoding them as attributes in the IR (ScaleHLS).
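One common way to realize such custom pragma macros is via the standard `_Pragma` operator (a sketch; the macro names are illustrative):

```cpp
// _Pragma requires a string literal, so the directive text is built
// with the preprocessor's stringizing operator.
#define DO_PRAGMA(x)     _Pragma(#x)
#define HLS_PIPELINE(ii) DO_PRAGMA(HLS PIPELINE II=ii)
#define HLS_UNROLL(f)    DO_PRAGMA(HLS UNROLL factor=f)

void scale(int v[64], int s) {
    for (int i = 0; i < 64; i++) {
        HLS_PIPELINE(1)  // expands to #pragma HLS PIPELINE II=1
        v[i] *= s;
    }
}
```
Centralizing pragmas behind macros keeps the directive surface small and makes it easy to retarget or disable vendor-specific annotations at a single site.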
In summary, HLS dialects constitute formally restricted, extension-rich programming conventions and IRs that enable systematic, transformable, and efficient translation from software-level code to synthesizable hardware. Ongoing research, as demonstrated in R-HLS, AnyHLS, ScaleHLS, HLStrans, and SAGE-HLS, is empirically advancing expressivity, correctness, portability, and automation within this technically demanding domain.