Piecewise Polynomial LUT Networks (PolyLUT)

Updated 5 February 2026
  • Piecewise Polynomial LUTNs (PolyLUT) are neural architectures that compute multivariate polynomial maps over quantized inputs using LUTs, enabling efficient FPGA implementations.
  • They employ degree-D polynomial expansions to transform inputs into lookup table evaluations, yielding ultra-low latency and enhanced area efficiency compared to traditional DNNs.
  • PolyLUT-Add extends this concept by partitioning inputs into groups and summing sub-neuron outputs, reducing exponential LUT size to near-linear scaling through efficient adder networks.

Piecewise Polynomial LUT Networks (PolyLUT) and their combinatorial extension PolyLUT-Add represent a class of neural architectures in which each neuron computes a multivariate polynomial over quantized inputs, with the entire mapping efficiently implemented as lookup tables (LUTs) on digital hardware such as FPGAs. This approach exploits the functional expressiveness and computational regularity of polynomials to achieve ultra-low latency and high area efficiency, overcoming core limitations of conventional linear or ReLU-based DNNs deployed via LUTs.

1. Mathematical Foundations and Neuronal Mapping

A PolyLUT neuron maps $F$ quantized inputs to an output by evaluating a degree-$D$ polynomial expansion. Each input $x_f$ is quantized to $\beta$ bits, yielding $x = (x_0, \dots, x_{F-1}) \in \{0, 1, \dots, 2^{\beta}-1\}^F$. The full monomial basis up to degree $D$ spans

$$M = \binom{F+D}{D}$$

possible terms $m_i(x)$, comprising both pure and mixed powers up to the total degree constraint. The neuron output is

$$y = \sigma\left( \sum_{i=0}^{M-1} w_i m_i(x) + b \right)$$

where the $w_i$ and $b$ are learned parameters, and $\sigma(\cdot)$ is a quantized nonlinearity (commonly 1–2 bits). For hardware realization, the full post-activation polynomial mapping is tabulated as a function of its $\beta F$-bit input word, forming a truth table of size $2^{\beta F}$. This formulation yields an intrinsically piecewise-polynomial input–output behavior, with partitioning induced by quantized domains and activation boundaries (Lou et al., 2024, Andronic et al., 2023, Andronic et al., 14 Jan 2025).
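As a concrete illustration, the neuron above can be sketched in Python. The monomial enumeration order, the random weights, and the 1-bit sign activation are illustrative choices only, not the trained parameters or exact conventions of any published PolyLUT model.

```python
from itertools import combinations_with_replacement
from math import comb, prod
import random

def monomials(x, D):
    """All monomials of input vector x up to total degree D,
    including the constant term (degree 0)."""
    terms = [1.0]
    for d in range(1, D + 1):
        for idx in combinations_with_replacement(range(len(x)), d):
            terms.append(prod(x[i] for i in idx))
    return terms

def polylut_neuron(x, w, b, D):
    """Degree-D polynomial neuron with a 1-bit sign activation
    (a stand-in for the quantized nonlinearity sigma)."""
    z = sum(wi * mi for wi, mi in zip(w, monomials(x, D))) + b
    return 1 if z >= 0 else 0

# Example: F=2 inputs at beta=2 bits, degree D=2.
F, beta, D = 2, 2, 2
M = comb(F + D, D)                      # binom(4,2) = 6 monomials
rng = random.Random(0)
w = [rng.gauss(0.0, 1.0) for _ in range(M)]
b = 0.1

# Tabulate the whole mapping: a truth table with 2**(beta*F) entries,
# indexed by the concatenated beta*F-bit input word.
table = [polylut_neuron(divmod(v, 2 ** beta), w, b, D)
         for v in range(2 ** (beta * F))]
```

In hardware, only `table` survives: the polynomial itself is folded away at synthesis time, which is what makes the degree essentially free at inference.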

2. PolyLUT-Add: Compositional Fan-in Scaling

PolyLUT-Add extends PolyLUT by dramatically increasing effective fan-in without prohibitive resource scaling. Given $A \cdot F$ total inputs, the input set is partitioned into $A$ groups $S_1, \dots, S_A$, each with $F$ elements. Each group is processed by an independent sub-neuron $g_a$, each realizing a standard PolyLUT mapping over its subset:

$$g_a(x_{S_a}) = \langle w_{a,0}, \dots, w_{a,M-1}, b_a \rangle \cdot m(x_{S_a})$$

Each $g_a$ emits a $(\beta+1)$-bit output. The overall neuron output is the sum of all sub-neuron outputs, optionally batch-normalized and quantized:

$$f(x) = \sigma\left( \sum_{a=1}^{A} g_a(x_{S_a}) \right)$$

This structure replaces a single exponential-size LUT of $2^{\beta A F}$ entries (impractical for large $A$) with $A$ PolyLUT sub-tables of size $2^{\beta F}$ each and a small adder-LUT of $2^{A(\beta+1)}$ entries, reducing the LUT resource requirement from exponential to near-linear in $A$ for fixed $F, \beta$ (Lou et al., 2024).
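A minimal sketch of the PolyLUT-Add evaluation path, assuming the sub-neuron polynomials have already been trained and tabulated (the tables here hold random placeholder values) and using a simple threshold as the quantized activation:

```python
import random

beta, F, A = 2, 3, 2          # 2-bit inputs, fan-in 3 per group, 2 groups
rng = random.Random(0)

# Pretend each sub-neuron's polynomial has already been tabulated:
# one (beta+1)-bit entry per beta*F-bit input word.
sub_tables = [[rng.randrange(2 ** (beta + 1)) for _ in range(2 ** (beta * F))]
              for _ in range(A)]

def polylut_add(x_words):
    """x_words: one beta*F-bit word per group. Sum the A sub-neuron
    outputs, then apply an (arbitrary) 1-bit threshold activation."""
    s = sum(sub_tables[a][x_words[a]] for a in range(A))
    return 1 if s >= A * 2 ** beta else 0

y = polylut_add([5, 11])

# LUT entries: monolithic table over all A*F inputs vs. the
# PolyLUT-Add decomposition (A sub-tables plus an adder table).
monolithic = 2 ** (beta * A * F)
decomposed = A * 2 ** (beta * F) + 2 ** (A * (beta + 1))
assert decomposed < monolithic
```

Even at this toy scale the decomposition shrinks the table count from 4096 entries to 192, and the gap widens exponentially with $A$ and $F$.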

3. Hardware Architecture and Dataflow

PolyLUT and PolyLUT-Add are mapped to FPGA architectures as layered networks of logical LUTs ("L-LUTs"), where each neuron or sub-neuron is a small LUT cluster evaluated in parallel. The principal components are:

  • Poly-layer: $A$ clusters, each a PolyLUT sub-LUT operating in parallel, outputting independent partial results every clock cycle.
  • Adder-layer: An $A$-input summing network (either a small LUT or an associative adder tree) accumulates the partial results. The output is subsequently batch-normalized and quantized.
  • Pipelining: Dual-stage or single-stage register placement is chosen according to clock frequency and throughput targets.

PolyLUT-Add's compositional structure and resource-conscious partitioning enable scaling to higher total fan-in without incurring exponential table growth. Decomposition to 6-LUT primitives with logic minimization (e.g., Vivado) is standard to fit within modern FPGA fabrics (Lou et al., 2024, Andronic et al., 14 Jan 2025).
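To give a feel for the 6-LUT decomposition step, the following toy upper bound comes from recursive Shannon expansion; actual synthesis with logic minimization (e.g., in Vivado) typically needs far fewer primitives, so this is only a pessimistic envelope, not the tool's algorithm.

```python
def six_lut_upper_bound(k):
    """Worst-case number of 6-input LUTs needed for an arbitrary
    k-input Boolean function via recursive Shannon decomposition
    (the cofactor mux is itself absorbed into a LUT)."""
    if k <= 6:
        return 1
    # f(x) = x_k ? f1(x') : f0(x'), each cofactor over k-1 inputs.
    return 2 * six_lut_upper_bound(k - 1) + 1

# One output bit of a sub-neuron with beta=2, F=3 has a beta*F = 6-bit
# input word, so it fits in a single 6-LUT per output bit; each extra
# input bit roughly doubles the worst-case count.
assert six_lut_upper_bound(6) == 1
assert six_lut_upper_bound(8) == 7
```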

4. Resource Scaling and Optimization Trade-offs

The central resource constraint arises from the exponential LUT size scaling with fan-in. For a PolyLUT neuron with fan-in $F$ and input quantization $\beta$, the required LUT entries number $2^{\beta F}$. PolyLUT-Add shifts this to $L_2 = A \cdot 2^{\beta F} + 2^{A(\beta+1)}$. This is substantially more tractable for moderate $A$ (e.g., $A=2$ yields a $>1{,}000\times$ reduction compared to a single large LUT of equivalent total fan-in). Selecting $(F, \beta)$ to keep sub-table sizes practical (typically a few thousand entries), then growing $A$ to increase accuracy, is the prevailing configuration strategy.
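The scaling claim can be checked numerically. The operating point $\beta = 2$, $F = 6$ below is an illustrative choice that keeps each sub-table at $2^{12} = 4096$ entries, in the "few thousand entries" regime described above.

```python
def lut_entries_single(beta, F, A):
    """Entries in one monolithic table over all A*F inputs."""
    return 2 ** (beta * A * F)

def lut_entries_add(beta, F, A):
    """A sub-tables plus the adder table over A (beta+1)-bit operands."""
    return A * 2 ** (beta * F) + 2 ** (A * (beta + 1))

beta, F = 2, 6
for A in (1, 2, 3):
    single = lut_entries_single(beta, F, A)
    added = lut_entries_add(beta, F, A)
    print(f"A={A}: monolithic={single:>12,}  add={added:>8,}  "
          f"ratio={single / added:,.0f}x")
```

At $A=2$ the monolithic table has $2^{24} \approx 16.8$M entries versus $8{,}256$ for the decomposition, a roughly $2{,}000\times$ reduction, consistent with the $>1{,}000\times$ figure quoted above.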

Degree $D$ controls the monomial count per sub-neuron as $M = \binom{F+D}{D}$. Smaller values of $F$ and $D$ reduce both parameter and LUT cost. After training, it is standard to retune $D$ or $F$ downward, leveraging the higher effective connectivity via $A$ for accuracy recovery (Lou et al., 2024, Andronic et al., 14 Jan 2025).

Hyperparameter guidelines emphasize choosing $(\beta, F)$ such that $\beta F \lesssim 12$–$14$, and $D \leq 3$ for the best accuracy-to-complexity trade-off. Structured sparsity regularizers are imposed during training to ensure that the learned polynomial fan-in matches hardware resource budgets (Andronic et al., 14 Jan 2025).

5. Empirical Benchmarks and Comparative Analysis

Comprehensive experimental validation has demonstrated the performance of PolyLUT and PolyLUT-Add in terms of inference accuracy, area (FPGA LUT count), and latency. Key results include:

| Dataset     | Model (PolyLUT-Add / PolyLUT) | Acc (%) | LUTs             | Latency (ns) | Reduction vs PolyLUT    |
|-------------|-------------------------------|---------|------------------|--------------|-------------------------|
| MNIST       | Add2, D=3 / D=4               | 96.0    | 15,272 / 70,673  | 7 / 16       | 4.6× LUT, 2.3× latency  |
| JetSub XL   | Add2, D=3 / D=4               | 75.0    | 47,639 / 236,541 | 13 / 21      | 5.0× LUT, 1.6× latency  |
| JetSub Lite | Add2, D=3 / D=6               | 72.0    | 1,618 / 12,436   | 4 / 5        | 7.7× LUT, 1.2× latency  |
| UNSW-NB15   | Add2, D=1 / D=4               | 92.0    | 2,591 / 3,336    | 8 / 9        | 1.3× LUT, 1.2× latency  |

These reductions are achieved without compromising target accuracy. On jet classification, for example, PolyLUT-Add reduces LUT area by $5$–$7.7\times$ and latency by $1.2$–$1.6\times$ compared to conventional PolyLUT. Across all benchmarks, PolyLUT networks outperform classic linear/LUT designs in both area and speed; e.g., on MNIST, PolyLUT-Add achieves $96\%$ accuracy with $4.6\times$ fewer LUTs and $2.3\times$ lower latency. These results are consistent across multiple architectures and datasets (Lou et al., 2024, Andronic et al., 2023, Andronic et al., 14 Jan 2025).

6. Piecewise Polynomial Networks: Sparsity and Function Approximation

The use of piecewise polynomials for learning has theoretical and practical advantages. Each mapping segment (sub-LUT) implements a compact polynomial on a quantized hypercube subset of inputs, providing locality and functional sparsity. Only the active subregion for a given input is evaluated, preserving computational regularity and minimizing logic switching. Experiments have demonstrated that moving from piecewise-linear ($D=1$) to quadratic ($D=2$) or higher-order mappings yields substantial accuracy improvements, with diminishing returns beyond $D \sim 4$ (Loverich, 2015).
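The degree-versus-error effect is easy to reproduce on a toy target function. Here $\sin$ on $[0, \pi]$ with per-segment least-squares fits stands in for a trained piecewise mapping, so the exact error values are illustrative only.

```python
import numpy as np

def piecewise_max_err(f, a, b, N, d):
    """Max absolute error of a degree-d least-squares fit on each
    of N uniform segments of [a, b], sampled at 50 points apiece."""
    err = 0.0
    for k in range(N):
        lo = a + (b - a) * k / N
        hi = a + (b - a) * (k + 1) / N
        xs = np.linspace(lo, hi, 50)
        p = np.polyfit(xs, f(xs), d)
        err = max(err, float(np.max(np.abs(np.polyval(p, xs) - f(xs)))))
    return err

# Same segment count, increasing degree: error falls sharply from
# piecewise-linear to quadratic, then the gains taper off.
errs = [piecewise_max_err(np.sin, 0.0, np.pi, 4, d) for d in (1, 2, 3, 4)]
for d, e in zip((1, 2, 3, 4), errs):
    print(f"D={d}: max error {e:.2e}")
```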

Structured hardware-aware regularization—such as group-sparsity penalties and pruning protocols—enables the practical realization of complex polynomial mappings in FPGA LUT fabrics. This co-design of function class, quantization, and LUT mapping facilitates closed-form guarantees on resource utilization and evaluation latency (Andronic et al., 14 Jan 2025, Andronic et al., 2023, Orloski et al., 2022).

7. Design Space Exploration and Hardware Retargeting

The design space of PolyLUT-based approximations is systematically enumerated by balancing polynomial degree ($d$) and number of quantization regions ($N$), subject to a uniform error bound $\varepsilon$. Established methods provide enumeration algorithms for all feasible $(d, N)$ that can achieve $\|f - p\|_\infty \leq \varepsilon$, given memory and logic constraints. Closed-form Taylor or minimax bounds relate the maximum segment width $\Delta$ and the required number of regions as a function of $d$. The selection of $(d, N)$ for resource-efficient hardware is handled via linear cost models that are easily retargeted to FPGA, ASIC, or emerging compute substrates by modifying per-unit memory and logic cost coefficients (Orloski et al., 2022).

A simple two-variable optimization yields Pareto-optimal designs for the desired error metric and area/latency characteristics, ensuring efficient mapping across various hardware platforms.
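A sketch of this enumeration, assuming least-squares fits over uniform power-of-two segment counts and placeholder cost weights; the methods cited above use minimax/Taylor bounds and calibrated per-unit memory and logic costs instead.

```python
import numpy as np

def segments_needed(f, a, b, d, eps, n_max=4096):
    """Smallest power-of-two segment count N such that a degree-d
    least-squares fit on every uniform segment of [a, b] keeps the
    sampled max error within eps. Returns None if n_max is exceeded."""
    N = 1
    while N <= n_max:
        worst = 0.0
        for k in range(N):
            xs = np.linspace(a + (b - a) * k / N,
                             a + (b - a) * (k + 1) / N, 40)
            p = np.polyfit(xs, f(xs), d)
            worst = max(worst, float(np.max(np.abs(np.polyval(p, xs) - f(xs)))))
        if worst <= eps:
            return N
        N *= 2
    return None

# Enumerate feasible (d, N) pairs for exp on [0, 1] at eps = 1e-4 and
# score them with a toy linear cost: memory ~ N*(d+1) coefficients,
# logic ~ d multiply-adds (the weights here are placeholders).
designs = []
for d in (1, 2, 3):
    N = segments_needed(np.exp, 0.0, 1.0, d, 1e-4)
    designs.append((d, N, N * (d + 1) + 4 * d))
print(designs)
```

Sweeping the two cost weights traces out the Pareto frontier described above: low degrees demand many small segments (memory-heavy), while higher degrees trade memory for arithmetic logic.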


References:

(Lou et al., 2024, Andronic et al., 14 Jan 2025, Andronic et al., 2023, Loverich, 2015, Orloski et al., 2022)
