Piecewise Polynomial LUT Networks (PolyLUT)

Updated 5 February 2026
  • Piecewise Polynomial LUTNs (PolyLUT) are neural architectures that compute multivariate polynomial maps over quantized inputs using LUTs, enabling efficient FPGA implementations.
  • They employ degree-D polynomial expansions to transform inputs into lookup table evaluations, yielding ultra-low latency and enhanced area efficiency compared to traditional DNNs.
  • PolyLUT-Add extends this concept by partitioning inputs into groups and summing sub-neuron outputs, reducing exponential LUT size to near-linear scaling through efficient adder networks.

Piecewise Polynomial LUT Networks (PolyLUT) and their combinatorial extension PolyLUT-Add represent a class of neural architectures in which each neuron computes a multivariate polynomial over quantized inputs, with the entire mapping efficiently implemented as lookup tables (LUTs) on digital hardware such as FPGAs. This approach exploits the functional expressiveness and computational regularity of polynomials to achieve ultra-low latency and high area efficiency, overcoming core limitations of conventional linear or ReLU-based DNNs deployed via LUTs.

1. Mathematical Foundations and Neuronal Mapping

A PolyLUT neuron maps $F$ quantized inputs to an output by evaluating a degree-$D$ polynomial expansion. Each input $x_f$ is quantized to $\beta$ bits, yielding $x = (x_0, \dots, x_{F-1}) \in \{0, 1, \dots, 2^{\beta}-1\}^F$. The full monomial basis up to degree $D$ spans

$$M = \binom{F+D}{D}$$

possible terms $m_i(x)$, comprising both pure and mixed powers up to the total degree constraint. The neuron output is

$$y = \sigma\left( \sum_{i=0}^{M-1} w_i m_i(x) + b \right)$$

where the $w_i$ and $b$ are learned parameters, and $\sigma(\cdot)$ is a quantized nonlinearity (commonly 1–2 bits). For hardware realization, the full post-activation polynomial mapping is tabulated as a function of its $\beta F$-bit input word, forming a truth table of size $2^{\beta F}$. This formulation yields an intrinsically piecewise-polynomial input–output behavior, with partitioning induced by quantized domains and activation boundaries (Lou et al., 2024, Andronic et al., 2023, Andronic et al., 14 Jan 2025).
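As a concrete illustration, the neuron above can be sketched in Python. The monomial enumeration order, the random weights, and the 1-bit sign activation are illustrative choices only, not the trained parameters or exact conventions of any published PolyLUT model.

```python
from itertools import combinations_with_replacement
from math import comb, prod
import random

def monomials(x, D):
    """All monomials of input vector x up to total degree D,
    including the constant term (degree 0)."""
    terms = [1.0]
    for d in range(1, D + 1):
        for idx in combinations_with_replacement(range(len(x)), d):
            terms.append(prod(x[i] for i in idx))
    return terms

def polylut_neuron(x, w, b, D):
    """Degree-D polynomial neuron with a 1-bit sign activation
    (a stand-in for the quantized nonlinearity sigma)."""
    z = sum(wi * mi for wi, mi in zip(w, monomials(x, D))) + b
    return 1 if z >= 0 else 0

# Example: F=2 inputs at beta=2 bits, degree D=2.
F, beta, D = 2, 2, 2
M = comb(F + D, D)                      # binom(4,2) = 6 monomials
rng = random.Random(0)
w = [rng.gauss(0.0, 1.0) for _ in range(M)]
b = 0.1

# Tabulate the whole mapping: a truth table with 2**(beta*F) entries,
# indexed by the concatenated beta*F-bit input word.
table = [polylut_neuron(divmod(v, 2 ** beta), w, b, D)
         for v in range(2 ** (beta * F))]
```

In hardware, only `table` survives: the polynomial itself is folded away at synthesis time, which is what makes the degree essentially free at inference.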

2. PolyLUT-Add: Compositional Fan-in Scaling

PolyLUT-Add extends PolyLUT by dramatically increasing effective fan-in without prohibitive resource scaling. Given $A \cdot F$ total inputs, the input set is partitioned into $A$ groups $S_1, \dots, S_A$, each with $F$ elements. Each group is processed by an independent sub-neuron $g_a$, each realizing a standard PolyLUT mapping over its subset:

$$g_a(x_{S_a}) = \langle w_{a,0}, \dots, w_{a,M-1}, b_a \rangle \cdot m(x_{S_a})$$

Each $g_a$ emits a $(\beta+1)$-bit output. The overall neuron output is the sum of all sub-neuron outputs, optionally batch-normalized and quantized:

$$f(x) = \sigma\left( \sum_{a=1}^{A} g_a(x_{S_a}) \right)$$

This structure replaces a single exponential-size LUT of $2^{\beta A F}$ entries (impractical for large $A$) with $A$ PolyLUT sub-tables of size $2^{\beta F}$ each and a small adder-LUT of $2^{A(\beta+1)}$ entries, reducing the LUT resource requirement from exponential to near-linear in $A$ for fixed $F, \beta$ (Lou et al., 2024).
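A minimal sketch of the PolyLUT-Add evaluation path, assuming the sub-neuron polynomials have already been trained and tabulated (the tables here hold random placeholder values) and using a simple threshold as the quantized activation:

```python
import random

beta, F, A = 2, 3, 2          # 2-bit inputs, fan-in 3 per group, 2 groups
rng = random.Random(0)

# Pretend each sub-neuron's polynomial has already been tabulated:
# one (beta+1)-bit entry per beta*F-bit input word.
sub_tables = [[rng.randrange(2 ** (beta + 1)) for _ in range(2 ** (beta * F))]
              for _ in range(A)]

def polylut_add(x_words):
    """x_words: one beta*F-bit word per group. Sum the A sub-neuron
    outputs, then apply an (arbitrary) 1-bit threshold activation."""
    s = sum(sub_tables[a][x_words[a]] for a in range(A))
    return 1 if s >= A * 2 ** beta else 0

y = polylut_add([5, 11])

# LUT entries: monolithic table over all A*F inputs vs. the
# PolyLUT-Add decomposition (A sub-tables plus an adder table).
monolithic = 2 ** (beta * A * F)
decomposed = A * 2 ** (beta * F) + 2 ** (A * (beta + 1))
assert decomposed < monolithic
```

Even at this toy scale the decomposition shrinks the table count from 4096 entries to 192, and the gap widens exponentially with $A$ and $F$.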

3. Hardware Architecture and Dataflow

PolyLUT and PolyLUT-Add are mapped to FPGA architectures as layered networks of logical LUTs ("L-LUTs"), where each neuron or sub-neuron is a small LUT cluster evaluated in parallel. The principal components are:

  • Poly-layer: $A$ clusters, each a PolyLUT sub-LUT operating in parallel, outputting independent partial results every clock cycle.
  • Adder-layer: An $A$-input summing network (either a small LUT or an associative adder tree) accumulates the partial results. The output is subsequently batch-normalized and quantized.
  • Pipelining: Dual-stage or single-stage register placement is chosen according to clock frequency and throughput targets.

PolyLUT-Add's compositional structure and resource-conscious partitioning enable scaling to higher total fan-in without incurring exponential table growth. Decomposition to 6-LUT primitives with logic minimization (e.g., Vivado) is standard to fit within modern FPGA fabrics (Lou et al., 2024, Andronic et al., 14 Jan 2025).
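To give a feel for the 6-LUT decomposition step, the following toy upper bound comes from recursive Shannon expansion; actual synthesis with logic minimization (e.g., in Vivado) typically needs far fewer primitives, so this is only a pessimistic envelope, not the tool's algorithm.

```python
def six_lut_upper_bound(k):
    """Worst-case number of 6-input LUTs needed for an arbitrary
    k-input Boolean function via recursive Shannon decomposition
    (the cofactor mux is itself absorbed into a LUT)."""
    if k <= 6:
        return 1
    # f(x) = x_k ? f1(x') : f0(x'), each cofactor over k-1 inputs.
    return 2 * six_lut_upper_bound(k - 1) + 1

# One output bit of a sub-neuron with beta=2, F=3 has a beta*F = 6-bit
# input word, so it fits in a single 6-LUT per output bit; each extra
# input bit roughly doubles the worst-case count.
assert six_lut_upper_bound(6) == 1
assert six_lut_upper_bound(8) == 7
```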

4. Resource Scaling and Optimization Trade-offs

The central resource constraint arises from the exponential LUT size scaling with fan-in. For a PolyLUT neuron with fan-in $F$ and input quantization $\beta$, the required LUT entries number $2^{\beta F}$. PolyLUT-Add shifts this to $L_2 = A \cdot 2^{\beta F} + 2^{A(\beta+1)}$. This is substantially more tractable for moderate $A$ (e.g., $A=2$ yields a $>1{,}000\times$ reduction compared to a single large LUT of equivalent total fan-in). Selecting $(F, \beta)$ to keep sub-table sizes practical (typically a few thousand entries), then growing $A$ to increase accuracy, is the prevailing configuration strategy.
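The scaling claim can be checked numerically. The operating point $\beta = 2$, $F = 6$ below is an illustrative choice that keeps each sub-table at $2^{12} = 4096$ entries, in the "few thousand entries" regime described above.

```python
def lut_entries_single(beta, F, A):
    """Entries in one monolithic table over all A*F inputs."""
    return 2 ** (beta * A * F)

def lut_entries_add(beta, F, A):
    """A sub-tables plus the adder table over A (beta+1)-bit operands."""
    return A * 2 ** (beta * F) + 2 ** (A * (beta + 1))

beta, F = 2, 6
for A in (1, 2, 3):
    single = lut_entries_single(beta, F, A)
    added = lut_entries_add(beta, F, A)
    print(f"A={A}: monolithic={single:>12,}  add={added:>8,}  "
          f"ratio={single / added:,.0f}x")
```

At $A=2$ the monolithic table has $2^{24} \approx 16.8$M entries versus $8{,}256$ for the decomposition, a roughly $2{,}000\times$ reduction, consistent with the $>1{,}000\times$ figure quoted above.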

Degree $D$ controls the monomial count per sub-neuron as $M = \binom{F+D}{D}$. Smaller values of $F$ and $D$ reduce both parameter and LUT cost. After training, it is standard to retune $D$ or $F$ downward, leveraging the higher effective connectivity via $A$ for accuracy recovery (Lou et al., 2024, Andronic et al., 14 Jan 2025).

Hyperparameter guidelines emphasize choosing $(\beta, F)$ such that $\beta F \lesssim 12$–$14$, and $D \leq 3$ for the best accuracy-to-complexity trade-off. Structured sparsity regularizers are imposed during training to ensure that the learned polynomial fan-in matches hardware resource budgets (Andronic et al., 14 Jan 2025).

5. Empirical Benchmarks and Comparative Analysis

Comprehensive experimental validation has demonstrated the performance of PolyLUT and PolyLUT-Add in terms of inference accuracy, area (FPGA LUT count), and latency. Key results include:

| Dataset     | Model (PolyLUT-Add / PolyLUT) | Acc (%) | LUTs             | Latency (ns) | Reduction vs PolyLUT    |
|-------------|-------------------------------|---------|------------------|--------------|-------------------------|
| MNIST       | Add2, D=3 / D=4               | 96.0    | 15,272 / 70,673  | 7 / 16       | 4.6× LUT, 2.3× latency  |
| JetSub XL   | Add2, D=3 / D=4               | 75.0    | 47,639 / 236,541 | 13 / 21      | 5.0× LUT, 1.6× latency  |
| JetSub Lite | Add2, D=3 / D=6               | 72.0    | 1,618 / 12,436   | 4 / 5        | 7.7× LUT, 1.2× latency  |
| UNSW-NB15   | Add2, D=1 / D=4               | 92.0    | 2,591 / 3,336    | 8 / 9        | 1.3× LUT, 1.2× latency  |

These reductions are achieved without compromising target accuracy. On jet classification, for example, PolyLUT-Add reduces LUT area by $5$–$7.7\times$ and latency by $1.2$–$1.6\times$ compared to conventional PolyLUT. Across all benchmarks, PolyLUT networks outperform classic linear/LUT designs in both area and speed; e.g., on MNIST, PolyLUT-Add achieves $96\%$ accuracy with $4.6\times$ fewer LUTs and $2.3\times$ lower latency. These results are consistent across multiple architectures and datasets (Lou et al., 2024, Andronic et al., 2023, Andronic et al., 14 Jan 2025).

6. Piecewise Polynomial Networks: Sparsity and Function Approximation

The use of piecewise polynomials for learning has theoretical and practical advantages. Each mapping segment (sub-LUT) implements a compact polynomial on a quantized hypercube subset of inputs, providing locality and functional sparsity. Only the active subregion for a given input is evaluated, preserving computational regularity and minimizing logic switching. Experiments have demonstrated that moving from piecewise-linear ($D=1$) to quadratic ($D=2$) or higher-order mappings yields substantial accuracy improvements, with diminishing returns beyond $D \sim 4$ (Loverich, 2015).
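The degree-versus-error effect is easy to reproduce on a toy target function. Here $\sin$ on $[0, \pi]$ with per-segment least-squares fits stands in for a trained piecewise mapping, so the exact error values are illustrative only.

```python
import numpy as np

def piecewise_max_err(f, a, b, N, d):
    """Max absolute error of a degree-d least-squares fit on each
    of N uniform segments of [a, b], sampled at 50 points apiece."""
    err = 0.0
    for k in range(N):
        lo = a + (b - a) * k / N
        hi = a + (b - a) * (k + 1) / N
        xs = np.linspace(lo, hi, 50)
        p = np.polyfit(xs, f(xs), d)
        err = max(err, float(np.max(np.abs(np.polyval(p, xs) - f(xs)))))
    return err

# Same segment count, increasing degree: error falls sharply from
# piecewise-linear to quadratic, then the gains taper off.
errs = [piecewise_max_err(np.sin, 0.0, np.pi, 4, d) for d in (1, 2, 3, 4)]
for d, e in zip((1, 2, 3, 4), errs):
    print(f"D={d}: max error {e:.2e}")
```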

Structured hardware-aware regularization—such as group-sparsity penalties and pruning protocols—enables the practical realization of complex polynomial mappings in FPGA LUT fabrics. This co-design of function class, quantization, and LUT mapping facilitates closed-form guarantees on resource utilization and evaluation latency (Andronic et al., 14 Jan 2025, Andronic et al., 2023, Orloski et al., 2022).

7. Design Space Exploration and Hardware Retargeting

The design space of PolyLUT-based approximations is systematically enumerated by balancing polynomial degree ($d$) and number of quantization regions ($N$), subject to a uniform error bound $\varepsilon$. Established methods provide enumeration algorithms for all feasible $(d, N)$ that can achieve $\|f - p\|_\infty \leq \varepsilon$, given memory and logic constraints. Closed-form Taylor or minimax bounds relate the maximum segment width $\Delta$ and the required number of regions as a function of $d$. The selection of $(d, N)$ for resource-efficient hardware is handled via linear cost models that are easily retargeted to FPGA, ASIC, or emerging compute substrates by modifying per-unit memory and logic cost coefficients (Orloski et al., 2022).

A simple two-variable optimization yields Pareto-optimal designs for the desired error metric and area/latency characteristics, ensuring efficient mapping across various hardware platforms.
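A sketch of this enumeration, assuming least-squares fits over uniform power-of-two segment counts and placeholder cost weights; the methods cited above use minimax/Taylor bounds and calibrated per-unit memory and logic costs instead.

```python
import numpy as np

def segments_needed(f, a, b, d, eps, n_max=4096):
    """Smallest power-of-two segment count N such that a degree-d
    least-squares fit on every uniform segment of [a, b] keeps the
    sampled max error within eps. Returns None if n_max is exceeded."""
    N = 1
    while N <= n_max:
        worst = 0.0
        for k in range(N):
            xs = np.linspace(a + (b - a) * k / N,
                             a + (b - a) * (k + 1) / N, 40)
            p = np.polyfit(xs, f(xs), d)
            worst = max(worst, float(np.max(np.abs(np.polyval(p, xs) - f(xs)))))
        if worst <= eps:
            return N
        N *= 2
    return None

# Enumerate feasible (d, N) pairs for exp on [0, 1] at eps = 1e-4 and
# score them with a toy linear cost: memory ~ N*(d+1) coefficients,
# logic ~ d multiply-adds (the weights here are placeholders).
designs = []
for d in (1, 2, 3):
    N = segments_needed(np.exp, 0.0, 1.0, d, 1e-4)
    designs.append((d, N, N * (d + 1) + 4 * d))
print(designs)
```

Sweeping the two cost weights traces out the Pareto frontier described above: low degrees demand many small segments (memory-heavy), while higher degrees trade memory for arithmetic logic.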


References:

(Lou et al., 2024, Andronic et al., 14 Jan 2025, Andronic et al., 2023, Loverich, 2015, Orloski et al., 2022)
