
Parameterizable Accelerator Core

Updated 3 February 2026
  • Parameterizable accelerator cores are hardware microarchitectures with tunable parameters that enable systematic design-space exploration across performance, area, energy, and flexibility trade-offs.
  • They utilize analytical, simulation-driven, and machine learning-based methodologies to optimize microarchitectural parameters for achieving Pareto-optimal configurations in ASICs, FPGAs, and programmable arrays.
  • These cores support workload-specific optimizations for applications like DNN inference, transformers, matrix multiplication, and in-memory computing, enhancing overall efficiency.

A parameterizable accelerator core is a hardware microarchitecture whose critical structural, microarchitectural, and software-hardware partitioning choices are exposed as tunable parameters or “knobs,” allowing explicit navigation between competing objectives such as performance, area, throughput, energy, and flexibility. Unlike fixed-function cores, these parameterizable templates admit systematic design-space exploration, workload-driven adaptation, and domain-specific optimizations tuned to application classes such as DNN inference, transformers, matrix multiplication, in-memory computing, or even homomorphic encryption. Empirically, this approach yields accelerators that occupy the Pareto frontier for energy/area/performance across diverse workloads, spanning ASIC, FPGA, and programmable array designs (Maleki et al., 2022, Zhang et al., 2024, Castañeda et al., 2019, Prasad, 21 Nov 2025, Prajapati et al., 2017, Müller et al., 2024, Gong et al., 2021, Yi et al., 2024, Esmaeilzadeh et al., 2023, Nigam et al., 2020, Häusler et al., 27 Oct 2025).

1. Microarchitectural Parameter Space and Tuning Knobs

Parameterizable accelerator cores expose explicit architectural decisions as top-level parameters. Canonical "knobs" recurring across the designs surveyed below include:

  • PE array dimensions and the number of processing elements,
  • buffer sizes and their partitioning across the memory hierarchy,
  • operand precision and data-path width,
  • pipeline depth and loop-unrolling factors,
  • interconnect style and parallel operating modes.

Modifying these parameters systematically exposes the trade-off surface between throughput, area, latency, and energy efficiency (Maleki et al., 2022, Zhang et al., 2024, Prasad, 21 Nov 2025, Gong et al., 2021).

2. Analytical Modeling and Performance Formulation

Precise analytic models quantify how parameters impact latency, energy, and resource utilization:

  • Energy models: $E_{\mathrm{total}} = E_{\mathrm{compute}} + E_{\mathrm{mem\_hierarchy}}$, with $E_{\mathrm{compute}} = \#\mathrm{MAC} \times E_{\mathrm{MAC}}$ and $E_{\mathrm{mem\_hierarchy}} = \sum_{i \to j} N_{\mathrm{access}}(i \to j)\,(E_{\mathrm{rd},i} + E_{\mathrm{wr},j})$ (Maleki et al., 2022). For in-memory cores, energy per operation is nearly input-size invariant; larger arrays increase aggregate power but amortize per-op cost (Castañeda et al., 2019).
  • Latency models: additive or dominated by the critical path, e.g., for convolutions a layer completes when the last partial sum is written back to DRAM (Maleki et al., 2022), or $L_{\mathrm{comp}} = N_{\mathrm{ops}} / T_{\mathrm{eff}}$ for CGRA-style arrays (Prasad, 21 Nov 2025).
  • Peak throughput: $T_{\mathrm{MAC,peak}} = N_{\mathrm{PE}} \times U_{\mathrm{MAC}} \times f_{\mathrm{clk}}$ (Prasad, 21 Nov 2025); for 3D-unrolled arrays, throughput scales as $f_{\mathrm{clk}} \times M_u \times N_u \times K_u \times 2$ (Yi et al., 2024).
  • Utilization: overall utilization factors in spatial occupation (active PE fraction) and temporal scheduling (pipeline fill/drain) (Yi et al., 2024, Prasad, 21 Nov 2025, Zhang et al., 2024).
  • Area models: aggregate cell area is linear in array size, PE complexity, and buffer dimensions, plus fixed and amortized logic for controllers and interconnect (Prajapati et al., 2017, Yi et al., 2024, Gong et al., 2021).

For design-space exploration or reinforcement-learning-based architecture search, these models serve as cost/constraint functions (Esmaeilzadeh et al., 2023, Gong et al., 2021).
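The analytical models above can be combined into a single cost function. The following sketch implements the energy, latency, and peak-throughput formulas from this section; all numeric constants in the usage example are illustrative placeholders, not measured values from any cited design.

```python
# Sketch of the analytical cost models above, usable as a DSE cost function.

def energy_total(n_mac, e_mac, accesses):
    """E_total = E_compute + E_mem_hierarchy.

    accesses: list of (n_access, e_rd, e_wr) tuples, one per
    memory-hierarchy transfer i -> j.
    """
    e_compute = n_mac * e_mac
    e_mem = sum(n * (e_rd + e_wr) for n, e_rd, e_wr in accesses)
    return e_compute + e_mem

def peak_mac_throughput(n_pe, macs_per_pe, f_clk):
    """T_MAC,peak = N_PE * U_MAC * f_clk (MACs per second)."""
    return n_pe * macs_per_pe * f_clk

def compute_latency(n_ops, n_pe, macs_per_pe, f_clk, utilization=1.0):
    """L_comp = N_ops / T_eff, with T_eff = peak throughput * utilization."""
    t_eff = peak_mac_throughput(n_pe, macs_per_pe, f_clk) * utilization
    return n_ops / t_eff

# Example: 16x16 PE array, 1 MAC/PE/cycle, 1 GHz, 70% utilization,
# running a 1-GMAC layer with pJ-scale (placeholder) energy constants.
e = energy_total(n_mac=1e9, e_mac=0.5e-12,
                 accesses=[(1e8, 2e-12, 3e-12), (1e7, 50e-12, 60e-12)])
lat = compute_latency(n_ops=1e9, n_pe=256, macs_per_pe=1,
                      f_clk=1e9, utilization=0.7)
```

Because the models are closed-form, they evaluate in microseconds per configuration, which is what makes exhaustive or learned search over the parameter grid tractable.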

3. Methodologies for Design Space Exploration

Multiple methodologies are employed to identify near-optimal parameter settings:

  • Simulation-driven sweep: Exhaustively or selectively evaluate network × parameter grid to locate optimal energy-delay (EDP) configurations (Maleki et al., 2022, Yi et al., 2024).
  • Analytical rule-based pruning: Use closed-form constraints to eliminate infeasible parameter combinations (e.g., bank/unroll alignment in Dahlia (Nigam et al., 2020), buffer-fit limits in CAT (Zhang et al., 2024)).
  • Machine learning-based prediction: Employ surrogate models (regression forests, GCNs) to predict PPA (power, performance, area) from architectural and backend parameters, reducing RTL/SP&R evaluation cost by orders of magnitude (Esmaeilzadeh et al., 2023).
  • Bayesian/heuristic optimization: Multi-objective Bayesian optimizers (e.g., MOTPE) search parameter space for Pareto frontiers, using learned models as fitness functions (Esmaeilzadeh et al., 2023).
  • Reinforcement learning (RL): Sequential selection of hardware, quantization, and workload split ratios—using deep RL—enables joint tuning for heterogeneous architectures (Gong et al., 2021).
  • Time-sensitive type checking: Compile-time affine-type systems (e.g., in Dahlia) formally prune parameter spaces to legal and predictable regions, dramatically shrinking DSE cardinality while preserving Pareto points (Nigam et al., 2020).
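The simulation-driven sweep can be sketched end to end: evaluate a small parameter grid with an analytical model and keep the Pareto frontier over (energy, delay) plus the minimum-EDP point. The cost model and grid values below are illustrative assumptions, not taken from any cited design.

```python
import itertools

def evaluate(n_pe, buf_kb, bitwidth):
    # Toy cost model: delay falls with PE count; energy grows with array
    # size, buffer capacity, and (roughly quadratically) with bit-width.
    delay = 1e9 / (n_pe * 1e9)                                   # s for 1 GMAC
    energy = n_pe * 0.1 + buf_kb * 0.05 + 0.01 * bitwidth ** 2   # arbitrary units
    return energy, delay

def pareto(points):
    """Keep configurations not dominated in both energy and delay."""
    front = []
    for cfg, (e, d) in points:
        dominated = any(e2 <= e and d2 <= d and (e2, d2) != (e, d)
                        for _, (e2, d2) in points)
        if not dominated:
            front.append((cfg, (e, d)))
    return front

grid = itertools.product([64, 256, 1024],   # PE count
                         [32, 128],         # buffer KiB
                         [4, 8, 16])        # data width (bits)
points = [((n, b, w), evaluate(n, b, w)) for n, b, w in grid]
front = pareto(points)
best_edp = min(points, key=lambda p: p[1][0] * p[1][1])  # energy-delay product
```

With this toy model, the frontier retains one configuration per array size (the minimal buffer and bit-width), illustrating how pruning rules like those in Dahlia or CAT shrink the grid before any RTL is generated.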

4. Representative Parameterizable Accelerator Cores

Array-based DNN Accelerators

Expose PE array dimensions, register depths, buffer partitions, and memory energy/latency as knobs. Multiple “core types” (size/buffer/tuning) can be instantiated to match layer shapes, with a global controller routing layers for near-optimal EDP per network (Maleki et al., 2022).

In-memory MVP Accelerators (PPAC)

Parameterized by array size (R, C), operand precision (p), bank/subrow partitioning (B, B_s), pipeline depth, and threshold modes. The design is high-throughput, fully digital, robust to technology scaling, and supports diverse logic (CAM, GF(2), bit-serial int/uint MVP) (Castañeda et al., 2019).

Transformer/GeMM Accelerators (CAT, OpenGeMM, NX-CGRA)

Expose number and size of matrix-multiply processing units, parallel mode, buffer sizing, data-path widths, and interconnect style. Control interfaces (APB, RISC-V CSR) provide run-time programmability of loop bounds, address increments, and tiling. Hardware utilization is routinely >80–99% across benchmarks (Zhang et al., 2024, Yi et al., 2024, Prasad, 21 Nov 2025).
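The run-time programmability above amounts to deriving loop bounds and tile counts from a workload shape and the unroll factors $M_u, N_u, K_u$ of Section 2. The hypothetical helper below sketches that derivation for an $M \times N \times K$ GEMM; real cores would write these values into CSR/APB configuration registers.

```python
import math

def gemm_tiling(M, N, K, Mu, Nu, Ku, f_clk):
    """Derive loop bounds (tile counts) for an M x N x K GEMM mapped onto a
    Mu x Nu x Ku spatially unrolled array, plus spatial utilization.
    Hypothetical helper, not an API of any cited core."""
    tiles = (math.ceil(M / Mu), math.ceil(N / Nu), math.ceil(K / Ku))
    # Spatial utilization: useful MACs over MACs issued across all tiles
    # (edge tiles issue padded, wasted MAC slots).
    issued = tiles[0] * tiles[1] * tiles[2] * (Mu * Nu * Ku)
    util = (M * N * K) / issued
    peak_ops = f_clk * Mu * Nu * Ku * 2   # ops/s, per the Section 2 formula
    return {"loop_bounds": tiles, "spatial_util": util, "peak_ops": peak_ops}

# M = 100 does not divide the 8-wide unroll, so one tile row is padded.
cfg = gemm_tiling(M=100, N=64, K=96, Mu=8, Nu=8, Ku=8, f_clk=1e9)
```

The example shows why the >80–99% utilization figures depend on shape alignment: only the M dimension misaligns here, so spatial utilization drops to 100/104 rather than further.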

Heterogeneous FPGA Accelerators (N3H-Core)

Parameterize both DSP and LUT-based GEMM cores, buffer depths, array sizes, and per-layer quantization, with workload splits optimized by RL to exploit resource asymmetry and achieve balanced latency (Gong et al., 2021).

Fully Homomorphic Encryption (TFHE) Accelerators

Expose degree of NTT pipeline parallelism, buffer allocation, lane count, and decomposition parameters, with a functionally complete instruction set (PBS, KeySwitch, MADD). Doubling the NTT lane count yields near-linear throughput scaling up to congestion (Häusler et al., 27 Oct 2025).

5. Case Studies: Design Trade-offs and Pareto Fronts

Empirical exploration reveals discontinuities and sharp trade-offs:

| Parameter | Efficiency impact | Maximal gains observed |
| --- | --- | --- |
| Array size | Peaks match layer parallelism; over-provisioning starves PE utilization; under-provisioning increases DRAM traffic | 16–30% EDP reduction vs. monolithic core (Maleki et al., 2022) |
| Buffer partition | ≤5% deviation from the optimal buffer split can raise energy | 10–30% energy penalty (Maleki et al., 2022) |
| MM-PU size/mode | Fully pipelined vs. serial operation | 20× observed transformer throughput (Zhang et al., 2024) |
| MAC/PE utilization | Buffer/tiling misalignment can halve utilization | (Yi et al., 2024) |
| Data width | EDP increases quadratically with bit-width; lower precision yields higher TOPS/W | (Prasad, 21 Nov 2025, Yi et al., 2024) |
| Accelerator heterogeneity | Layer assignment to the best core flavor outperforms single-core solutions | 16–30% EDP reduction (Maleki et al., 2022) |
| Functional coverage | Support for in-memory PLA, GF(2), or cryptography allows tailoring without area/energy loss | (Castañeda et al., 2019) |

In practice, combined top-down (model-driven) and bottom-up (empirical profiling) flows deliver robust assignment of core configurations to workload slice (Maleki et al., 2022, Gong et al., 2021).
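The top-down half of such a flow reduces to routing each layer to the core configuration with the lowest energy-delay product. The sketch below assumes a toy two-flavor cost model (core names, layer shapes, and all constants are illustrative placeholders).

```python
def edp(layer, core, cost):
    energy, delay = cost(layer, core)
    return energy * delay

def assign_layers(layers, cores, cost):
    """Route each layer to its min-EDP core flavor (global-controller view)."""
    return {layer["name"]: min(cores, key=lambda c: edp(layer, c, cost))
            for layer in layers}

# Toy cost model: a big array wins on highly parallel layers but pays a
# static-overhead penalty; a small array wins on narrow layers.
def cost(layer, core):
    n_pe = {"small": 64, "big": 1024}[core]
    delay = layer["ops"] / min(n_pe, layer["parallelism"])  # starved if over-provisioned
    energy = layer["ops"] * 1.0 + n_pe * 0.5                # per-op cost + array overhead
    return energy, delay

layers = [{"name": "conv1", "ops": 1e6, "parallelism": 32},
          {"name": "conv2", "ops": 1e8, "parallelism": 4096}]
plan = assign_layers(layers, ["small", "big"], cost)
```

Even this toy model reproduces the qualitative table entries above: the narrow layer lands on the small core (the big one is starved and only adds overhead), while the wide layer lands on the big core.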

6. Software, Modeling, and Verification Infrastructure

  • Code generators and parameterizable RTL: Canonically, Chisel, Verilog generators, or high-level DSLs (ACADL, Dahlia) expose configuration via Python front-ends or algebraic parameters, allowing modular block diagram composition (Müller et al., 2024, Nigam et al., 2020, Yi et al., 2024).
  • Cycle-accurate simulation and timing: Formal semantics (e.g., ACADL’s per-stage (t,ready)(t, ready) update machinery) yield cycle-accurate utilization, pipeline occupancy, and bottleneck prediction (Müller et al., 2024).
  • Type-theoretic correctness: Predictable accelerator design statically eliminates contention/hazard cases, guaranteeing area/latency monotonicity in the pruned parameter space and eliminating “counterintuitive” slowdowns seen with traditional HLS (Nigam et al., 2020).
  • Automated DSE flows: ML-based estimation and RL/DSE scripts execute orders-of-magnitude more rapidly than full hardware SP&R, supporting rapid iteration and Pareto frontier extraction for novel workloads (Esmaeilzadeh et al., 2023, Gong et al., 2021).
  • Instruction set interfaces: Many parameterizable cores export programmable instruction sequences for functional completeness (e.g., TFHE task dequeues, unified ISA for heterogeneous GEMM blocks) (Häusler et al., 27 Oct 2025, Gong et al., 2021).
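The generator-based flow in the first bullet can be illustrated with a minimal Python front-end that stamps out a parameterized SystemVerilog stub from top-level knobs. This is a toy sketch, not any cited framework's API; real flows use Chisel or DSLs like Dahlia and ACADL, and the module below elides the datapath entirely.

```python
def emit_mac_array(name, rows, cols, width):
    """Emit a SystemVerilog stub for a rows x cols MAC array with
    width-bit operands. Hypothetical generator for illustration only."""
    return "\n".join([
        f"module {name} #(",
        f"  parameter ROWS  = {rows},",
        f"  parameter COLS  = {cols},",
        f"  parameter WIDTH = {width}",
        ") (",
        "  input  logic clk,",
        "  input  logic [WIDTH-1:0] a [ROWS],",
        "  input  logic [WIDTH-1:0] b [COLS],",
        "  output logic [2*WIDTH-1:0] acc [ROWS][COLS]",
        ");",
        "  // MAC datapath elided; each acc[i][j] would accumulate a[i]*b[j].",
        "endmodule",
    ])

rtl = emit_mac_array("pe_array", rows=16, cols=16, width=8)
```

The point of the pattern is that the same DSE scripts that sweep the analytical models can call the generator directly, so every evaluated configuration corresponds to synthesizable RTL.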

7. Significance, Best Practices, and Limitations

Parameterizable accelerator cores are now the dominant design template for scientific, edge, server, and security-focused inference and compute workloads, owing to:

  • systematic, model-driven design-space exploration toward Pareto-optimal configurations,
  • workload- and layer-specific adaptation without full hardware redesign,
  • reuse of a single verified template across ASIC, FPGA, and programmable-array targets.

Limitations include reliance on accurate analytical or ML models (necessitating golden-label calibration on new technology nodes) and, in some cases, a ceiling on efficiency for highly irregular workloads whose dynamic phase behavior parameterization cannot fully match. Nevertheless, parameterizable cores have become standard in state-of-the-art ML and domain-specific hardware systems (Maleki et al., 2022, Yi et al., 2024, Häusler et al., 27 Oct 2025).
