Automated HLS Code Generation
- The paper introduces automated HLS code generation as a process that converts high-level behavioral and algorithmic specifications into synthesizable HDL, streamlining the hardware design workflow.
- It leverages advanced techniques such as LLMs, multi-agent systems, and retrieval-augmented generation to handle parsing, refactoring, pragma insertion, and system-level verification.
- The approach reduces human intervention while improving design quality, achieving significant speedup and resource optimizations as validated by comprehensive benchmarks.
Automated High-Level Synthesis (HLS) Code Generation
Automated high-level synthesis (HLS) code generation refers to the fully or semi-automated conversion of behavioral or algorithmic specifications—typically provided as C or C++ code, natural-language instructions, or formal specifications—into hardware-ready HLS code that is synthesizable into hardware description languages (HDL) such as Verilog or VHDL by commercial or open-source HLS tools. This process leverages LLMs, multi-agent systems, retrieval-augmented generation (RAG), template engines, and integrated design space exploration (DSE) to streamline or replace manual, error-prone stages of the hardware design workflow. The goal is to produce functionally correct, synthesizable, and high-performance hardware designs with reduced human intervention, increased productivity, and robust quality-of-results (QoR) (Abi-Karam et al., 16 Apr 2025, Khan et al., 16 Jan 2026).
1. HLS Code Generation Fundamentals
Automated HLS code generation targets the synthesis of hardware from high-level descriptions by transforming and optimizing C/C++ code into a form compatible with vendor HLS tools, or by generating such code directly from abstract specifications or natural language. The process encompasses:
- Parsing and Refactoring: Translation of sequential or software-style C/C++ into HLS-synthesizable code, entailing the elimination of unsupported constructs (e.g., dynamic memory, recursion, pointers), conversion of floating-point to fixed-point types, and rewriting for streaming or dataflow architectures (Collini et al., 2024, Xu et al., 2024, Zou et al., 6 Jul 2025).
- Algorithmic Decomposition: Automated partitioning of high-level functionality into smaller synthesizable modules. Systems such as SynthAI formalize this as a decision graph, decomposing the specification into a module graph for stepwise synthesis (Sheikholeslam et al., 2024).
- Pragma Insertion and Optimization: Automated or feedback-driven selection and insertion of HLS optimization directives (pragmas) such as `#pragma HLS pipeline`, `unroll`, `array_partition`, and `dataflow`, which control parallelism, pipelining, and memory hierarchy (Abi-Karam et al., 16 Apr 2025, Li et al., 1 Jul 2025).
- Evaluation and Verification: Integration with automated compilation, C-simulation, synthesis, and post-synthesis validation for area, power, timing, and functional correctness, as well as support for design space exploration and resource/latency trade-off analysis (Basalama et al., 15 Jan 2025, Wanna et al., 21 Apr 2025, Pouget et al., 2024).
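As a minimal illustration of automated pragma insertion (a deliberately simple, text-level sketch, not any cited framework's implementation), the following Python snippet inserts a `#pragma HLS pipeline` directive as the first statement of each `for`-loop body in a C kernel:

```python
import re

def insert_pipeline_pragmas(c_source: str) -> str:
    """Insert '#pragma HLS pipeline' at the top of each for-loop body.

    A text-level sketch only: production frameworks operate on an AST and
    decide per loop whether pipelining is legal and profitable (typically
    targeting innermost loops).
    """
    out_lines = []
    for line in c_source.splitlines():
        out_lines.append(line)
        if re.match(r"\s*for\s*\(", line):
            indent = re.match(r"\s*", line).group(0)
            out_lines.append(indent + "    #pragma HLS pipeline II=1")
    return "\n".join(out_lines)

kernel = """\
void fir(const int *x, int *y, const int *h, int n) {
    for (int i = 0; i < n; i++) {
        int acc = 0;
        for (int j = 0; j < 8; j++) {
            acc += h[j] * x[i + j];
        }
        y[i] = acc;
    }
}"""

print(insert_pipeline_pragmas(kernel))
```

Because the directive is emitted immediately after the loop header, it lands as the first line of the loop body, which is where HLS tools expect loop-level pragmas.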
2. Architectures and Methodologies
LLM-Based and Multi-Agent Pipelines
State-of-the-art automated HLS workflows are increasingly based on LLMs and/or multi-agent systems:
- Single-LLM with Feedback: LLMs ingest structured prompts comprising code, design context, and examples, iteratively yielding HLS code. Feedback loops incorporate compiler/simulator/synthesis errors to guide refinement (Xu et al., 2024, Collini et al., 2024, Gai et al., 19 Feb 2025).
- Multi-Agent Architectures: Frameworks such as SynthAI and Spec2RTL-Agent leverage chains of specialized agents that handle decomposition, progressive refinement (often spanning pseudocode, Python prototype, and final C++), verification, prompt optimization, and reflection/error analysis. These are frequently orchestrated as decision graphs or propose-and-verify loops (Sheikholeslam et al., 2024, Yu et al., 16 Jun 2025).
- Retrieval-Augmented Generation (RAG): Many frameworks use RAG to ground LLM outputs with examples or knowledge from vendor manuals, datasets of code+pragmas, or prior designs to reduce hallucinations, match device-specific constraints, and improve synthesis reliability (Mashnoor et al., 23 Jul 2025, Zou et al., 6 Jul 2025, Abi-Karam et al., 16 Apr 2025, Xu et al., 2024).
- Hybrid with Symbolic/Optimization Engines: Some flows integrate analytical models, integer programming, or mixed-integer nonlinear programming (MINLP) solvers for resource-driven pragma selection, tiling, unrolling, and array partitioning (Basalama et al., 15 Jan 2025, Pouget et al., 2024).
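The single-LLM feedback loop described above can be sketched as a propose-and-verify iteration. Here `generate_candidate` and `run_hls_tool` are hypothetical stand-ins for an LLM call and an HLS compiler invocation, shown with mocks so the control flow is concrete:

```python
def refine_with_tool_feedback(spec, generate_candidate, run_hls_tool, max_iters=5):
    """Propose-and-verify loop: regenerate until the HLS tool accepts the code.

    generate_candidate(spec, feedback) -> code string (e.g., an LLM call);
    run_hls_tool(code) -> (ok, log) (e.g., C-simulation plus synthesis).
    Both callables are hypothetical stand-ins, not a real framework API.
    """
    feedback = None
    for _ in range(max_iters):
        code = generate_candidate(spec, feedback)
        ok, log = run_hls_tool(code)
        if ok:
            return code      # synthesizable candidate found
        feedback = log       # feed the tool's error log back into the prompt
    return None              # no acceptable candidate within the budget

# Mock run: the first attempt fails; the error-informed retry passes.
def mock_llm(spec, feedback):
    return "fixed_kernel" if feedback else "buggy_kernel"

def mock_tool(code):
    return (code == "fixed_kernel", "ERROR: unsupported construct 'malloc'")

print(refine_with_tool_feedback("streaming FIR filter", mock_llm, mock_tool))
# prints "fixed_kernel"
```

Multi-agent variants replace the single `generate_candidate` with specialized agents (decomposer, refiner, verifier) around the same verify-and-retry skeleton.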
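The RAG step can likewise be sketched as nearest-example retrieval over a small library. Real systems embed vendor-manual text and curated code+pragma corpora; this toy version scores by token overlap, and the example library contents are illustrative:

```python
def retrieve_examples(query: str, library, top_k=2):
    """Rank library snippets by token overlap with the query (toy RAG).

    `library` is a list of (description, code) pairs; real frameworks use
    dense embeddings over vendor manuals and prior designs instead.
    """
    q_tokens = set(query.lower().split())

    def score(entry):
        desc, _ = entry
        return len(q_tokens & set(desc.lower().split()))

    return sorted(library, key=score, reverse=True)[:top_k]

# Hypothetical example library (descriptions paired with code templates).
library = [
    ("fixed point fir filter with pipeline pragma", "...code A..."),
    ("matrix multiply with array_partition", "...code B..."),
    ("aes round function dataflow", "...code C..."),
]
best = retrieve_examples("pipeline a fir filter", library)
print(best[0][0])
```

The retrieved snippets are then spliced into the generation prompt so the LLM imitates known-legal transformations rather than hallucinating pragma syntax.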
Example System Architecture Table
| Framework | Code Generation | Optimization/Feedback | Evaluation/Integration |
|---|---|---|---|
| HLS-Eval | LLM+Prompt Templates | Multi-stage tool feedback | parse→compile→simulate→synth |
| SynthAI | Multi-agent CoT+RAG | Decision graph, CoT, RAG | module/graph-level pass/fail |
| ChatHLS | Multi-agent LLMs | Error analyzers, meta-voting | Synthesis, auto repair |
| TimelyHLS | LLM+RAG | Structured device KB, feedback | HLS/RTL-level iteration, timing |
| HLSPilot | LLM+Profiling | DSE for pragmas, bottleneck detection | Host integration, end-to-end perf |
| C2HLSC | LLM+Divide/Conquer | Compile/synth feedback loop | Pragma optimization, per-fn/unit |
| Spec2RTL-Agent | Multi-agent, progressive | Reflection module, error tracing | Hierarchy→C++→RTL, min. interventions |
| SAGE-HLS | LLM fine-tuned on AST | AST-guided, VerilogEval | Synthesizability, func. correctness |
3. Benchmarks, Datasets, and Evaluation Metrics
Evaluating automated HLS code generation relies on rigorously constructed benchmarks and reproducible metrics:
- Benchmarks: Cover a wide spectrum, from small kernels (FIR, dot-product), through mid-scale accelerators (2D FFT, Sobel), to full ML blocks (Conv, GEMM, AES, neural nets). Notable open-source suites include HLS-Eval (94 designs), Bench4HLS (170 curated cases), HLStrans (137 kernels, 23K+ variants), and ForgeBench (6K+ ML designs) (Abi-Karam et al., 16 Apr 2025, Khan et al., 16 Jan 2026, Zou et al., 6 Jul 2025, Wanna et al., 21 Apr 2025).
- Evaluation Metrics:
- Parseability: Whether code fragments can be extracted and parsed.
- Compilability: HLS tool acceptance without syntax errors.
- Runnability: Passing simulation against reference or testbench.
- Synthesizability: Success in HLS-to-RTL translation.
- Functional Correctness: Output matches golden reference over test suites.
- Hardware PPA: Post-synthesis area (LUTs, FFs, DSPs), power, latency, frequency.
- Pass@k: Fraction of tasks for which at least one of the top-k LLM outputs passes one or multiple evaluation stages, computed with the unbiased estimator
$$\text{pass@}k = \mathbb{E}_{\text{tasks}}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right]$$
where $n$ is the number of samples and $c$ the number of correct samples (Abi-Karam et al., 16 Apr 2025).
- Design Space Exploration: Automated DSE leverages analytical or simulation-based models to optimize pragma choices subject to multi-objective criteria (latency, resource use), with some frameworks formulating global MINLPs incorporating permutation, tiling, and array partitioning decision variables (Basalama et al., 15 Jan 2025, Pouget et al., 2024).
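The pass@k metric above is typically computed per task with the standard unbiased estimator; a minimal Python version:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total samples generated for the task
    c: samples that passed the evaluation stage
    k: sample budget considered
    Returns the probability that at least one of k samples drawn from the
    n generated ones is correct: 1 - C(n-c, k) / C(n, k).
    """
    if n - c < k:
        return 1.0  # fewer failures than draws: success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 3, 1))  # 3 of 10 samples pass -> 0.3 at k=1
```

The same estimator applies at every evaluation stage (compilability, runnability, synthesizability), which is why benchmark suites report pass@k per stage.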
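Model-based DSE can be sketched in miniature: enumerate candidate pragma settings under a made-up analytical latency/resource model (illustrative placeholders, not any cited framework's cost model), prune points over the resource budget, and keep the Pareto-optimal remainder:

```python
def explore_unroll_factors(trip_count, dsp_per_lane, dsp_budget,
                           factors=(1, 2, 4, 8, 16)):
    """Enumerate unroll factors under a toy analytical model.

    latency  ~ ceil(trip_count / f)  for f parallel lanes
    dsp_cost ~ f * dsp_per_lane
    Both models are illustrative stand-ins for real QoR estimators.
    """
    feasible = []
    for f in factors:
        dsp = f * dsp_per_lane
        if dsp > dsp_budget:
            continue  # prune: exceeds the resource budget
        latency = -(-trip_count // f)  # ceiling division
        feasible.append({"unroll": f, "latency": latency, "dsp": dsp})
    # Pareto filter: keep points not dominated in (latency, dsp).
    pareto = [p for p in feasible
              if not any(q["latency"] <= p["latency"] and q["dsp"] <= p["dsp"]
                         and q != p for q in feasible)]
    return sorted(pareto, key=lambda p: p["unroll"])

print(explore_unroll_factors(trip_count=1024, dsp_per_lane=3, dsp_budget=30))
```

Frameworks that formulate this as a global MINLP solve the same trade-off jointly over permutation, tiling, and partitioning variables instead of enumerating one knob at a time.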
4. Failure Modes, Optimization Strategies, and Best Practices
Dominant Failure Modes
- Omitted or misplaced pragmas, leading to loss of pipelining or resource conflicts.
- Unsupported C/C++ constructs (dynamic allocation, recursion, malloc, arbitrary pointers).
- Loop-bound mishandling, off-by-one or dynamic-range errors.
- Data-type mismatches (e.g., signed vs. unsigned, missing fixed-point conversion).
- Testbench or interface mismatches (Abi-Karam et al., 16 Apr 2025, Khan et al., 16 Jan 2026, Xu et al., 2024).
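A lightweight pre-check for the unsupported-construct failure mode can be sketched as a text-level linter. Real flows would use a Clang AST pass rather than regexes, and the pattern list here is illustrative, not exhaustive:

```python
import re

# Illustrative (non-exhaustive) patterns for constructs most HLS tools reject.
UNSUPPORTED = {
    "dynamic allocation": re.compile(r"\b(malloc|calloc|realloc|new)\b"),
    "free/delete": re.compile(r"\b(free|delete)\b"),
    "function pointers": re.compile(r"\(\s*\*\s*\w+\s*\)\s*\("),
}

def lint_hls_source(c_source: str):
    """Return (line_number, category) pairs for suspect constructs."""
    hits = []
    for lineno, line in enumerate(c_source.splitlines(), start=1):
        for category, pattern in UNSUPPORTED.items():
            if pattern.search(line):
                hits.append((lineno, category))
    return hits

code = """\
int *buf = (int *)malloc(64 * sizeof(int));
int sum(int *a, int n);
void (*cb)(int) = 0;
"""
print(lint_hls_source(code))
```

Running such a linter before invoking the LLM lets the pipeline attach targeted rewrite instructions (e.g., "replace malloc with a static array") to the prompt instead of discovering the rejection at synthesis time.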
Optimization and Prompt Engineering Strategies
- Chain-of-Thought Prompting: Structuring prompts to force LLMs to reason stepwise—planning types, loop structures, and pragmas before emitting code (Abi-Karam et al., 16 Apr 2025, Gai et al., 19 Feb 2025, Clemente et al., 21 May 2025).
- Tool-Feedback Loops: Feeding simulation/compilation/synthesis errors into prompts or agent pipelines for iterative correction (Xu et al., 2024, Collini et al., 2024, Khan et al., 16 Jan 2026).
- RAG with Example Libraries: Grounding via retrieval of matching code/pragma/explanation templates from curated databases to limit hallucination and encourage legal transformations (Mashnoor et al., 23 Jul 2025, Zou et al., 6 Jul 2025).
- Hierarchical/Divide-and-Conquer: Breaking code into units (functions, sub-modules) for independent translation, as in the bottom-up refactoring of C2HLSC, keeping each prompt within the LLM's context limits (Collini et al., 2024).
- Modular Prompting and Function-Calling: For complex or multi-stage edits, splitting prompts or leveraging structured outputs to separate code from commentary (Abi-Karam et al., 16 Apr 2025).
Recommended prompt structure:
- Explicitly state required pragmas and function signatures.
- Provide clean, minimal headers and testbenches to minimize LLM confusion.
- Request JSON/function-call delineation if supported (Abi-Karam et al., 16 Apr 2025, Gai et al., 19 Feb 2025).
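The recommended prompt structure can be assembled programmatically. The section wording and JSON output schema below are illustrative choices, not a prescribed template from any cited framework:

```python
def build_hls_prompt(kernel_src, signature, pragmas, examples=()):
    """Assemble a structured HLS code-generation prompt.

    All section wording and the JSON schema are illustrative; real
    frameworks use their own templates, often filling `examples` with
    RAG-retrieved code/pragma snippets.
    """
    sections = [
        "You are an HLS expert. Rewrite the kernel below into "
        "synthesizable HLS C++.",
        "Required top-level signature: " + signature,
        "Required pragmas: " + ", ".join(pragmas),
        "Kernel:\n" + kernel_src,
        # Ask for machine-parseable output to separate code from commentary.
        'Respond as JSON: {"code": "<hls c++ source>", "notes": "<commentary>"}',
    ]
    if examples:
        sections.insert(1, "Reference examples:\n" + "\n---\n".join(examples))
    return "\n\n".join(sections)

prompt = build_hls_prompt(
    kernel_src="void fir(const int *x, int *y, int n) { /* ... */ }",
    signature="void fir(const int x[N], int y[N])",
    pragmas=["#pragma HLS pipeline", "#pragma HLS array_partition"],
)
print(prompt)
```

Pinning the signature and pragmas in the prompt, and requesting JSON-delineated output, addresses two of the dominant failure modes directly: interface mismatches and code/commentary entanglement.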
5. Comparative Performance and Empirical Results
Systematic evaluations reveal that automated HLS code generation is increasingly matching or surpassing prior hand-tuned or DSL-based flows:
- Success Rates (Bench4HLS, Pass@1): GPT-5: 50%, Llama 70B: 44%, Qwen 32B: 24% for full synthesis; functional correctness is lower, improved via pass@k sampling (Khan et al., 16 Jan 2026).
- Performance Gains:
- Stream-HLS achieved up to 79.43× geometric mean speedup on canonical benchmarks over prior automation, with DSE running ~176× faster than previous iterative solutions (Basalama et al., 15 Jan 2025).
- ChatHLS reports 4.9× geometric mean speedup over state-of-the-art DSLs on resource-constrained accelerators (Li et al., 1 Jul 2025).
- TimelyHLS obtained up to 3.85× latency speedup and over 50% area reduction by automatically inserting and tuning architecture-specific pragmas (Mashnoor et al., 23 Jul 2025).
- Synthesis and Functional Pass Rates: Methods using fine-tuned LLMs and RAG (SAGE-HLS, QWEN-HLS) reach >92% synthesizability and up to 75.6% functional correctness for complex benchmarks under pass@10 evaluation (Khan et al., 5 Aug 2025).
- Intervention Reduction: Multi-agent frameworks such as Spec2RTL-Agent reduce human intervention by up to 75% compared to previous LLM- or script-based flows (Yu et al., 16 Jun 2025).
- Bitwidth and PPA Optimization: Bitwidth inference and automated pragma insertion following LLM- or RAG-guided repair can yield up to ~37% area and ~33% power savings (static bit-width optimization), with an additional ~14% area and ~18% latency reduction from targeted pragma tuning (Xu et al., 2024).
6. Architectural Specialization and Domain Adaptation
Recent systems have advanced from generic code generation to architecture-specific and domain-adapted tools:
- Timing and Architecture Awareness: TimelyHLS demonstrates the use of structured knowledge bases to inform LLM-driven generation with device constraints, empirical pragma efficacy, and latency/resource cost models, iteratively refining for timing closure (Mashnoor et al., 23 Jul 2025).
- ML/Accelerator Focus: ForgeBench provides generation pipelines parameterized by ML operator templates and JSON-based design intent, supporting thousands of ML-specific HLS designs (Wanna et al., 21 Apr 2025).
- Task-Parallelism: Frameworks like TAPA support code generation for task- and channel-parallel programs with custom C++ extensions, coroutine-based simulation, and hierarchical code generation (Chi et al., 2020).
- Modular and Multi-Objective Flows: SynthAI formalizes design decomposition and module integration as decision graphs, allowing modular generation while targeting performance under resource constraints (Sheikholeslam et al., 2024).
- Dataset-Driven Fine-Tuning: SAGE-HLS constructs and leverages large-scale synthetic corpora by porting verified Verilog to HLS-C, enabling fine-tuned instruction+AST-aware LLMs to achieve high synthesizability and improved functional correctness (Khan et al., 5 Aug 2025).
7. Limitations and Future Directions
Current automated HLS frameworks exhibit the following limitations and open directions:
- Irregular Control and Memory: Most frameworks are restricted to static-control, regular loop-based code; irregular, data-dependent, or recursive designs still pose challenges (Pouget et al., 2024, Collini et al., 2024).
- Testbench and Coverage Generation: Automated, coverage-driven testbench creation remains immature, often requiring manual reference outputs or assertion harnesses (Khan et al., 16 Jan 2026).
- Vendor/Platform Support: Many datasets and frameworks target Xilinx/AMD tools and lack coverage for Intel HLS or Catapult; cross-vendor generalization is an active area (Zou et al., 6 Jul 2025, Wanna et al., 21 Apr 2025).
- Physical Validation: Most evaluation uses synthesis estimates; extension to post-route and on-board evaluation for throughput, power, and real performance is ongoing (Zou et al., 6 Jul 2025).
- Prompt and Retrieval Library Maintenance: RAG and template libraries must be curated and maintained with the evolution of HLS tool error messages and supported pragmas (Xu et al., 2024).
- Integration of Hardware Performance Metrics into LLM Objectives: Some RL-based frameworks (e.g., Proof2Silicon) and multi-agent loops plan to incorporate hardware metrics (latency, resource, timing) as part of the reward structure for code generation (Jha et al., 7 Sep 2025).
- Automated Module Extraction and Architecture-Oriented Synthesis: Next-generation flows aim to use e-graphs and structure mining to extract reusable modules for scalable, architecture-oriented generation (Wanna et al., 21 Apr 2025).
Automated HLS code generation thus represents a converging line of research—bringing together LLMs, agent systems, code datasets, analytical and symbolic optimization, and hardware design automation—enabling systematic, scalable, and high-quality translation from software-level intent to robust, FPGA-ready hardware implementations (Abi-Karam et al., 16 Apr 2025, Sheikholeslam et al., 2024, Basalama et al., 15 Jan 2025, Khan et al., 16 Jan 2026, Wanna et al., 21 Apr 2025, Collini et al., 2024, Khan et al., 5 Aug 2025).