Arithmetic Circuit Optimization Frameworks
- Arithmetic circuit optimization frameworks are systematic strategies that combine algorithmic, formal, and machine learning techniques to refine circuit designs under strict resource and performance constraints.
- They employ methods such as score-driven statistical model optimization, e-graph based rewriting, and ILP-guided synthesis to achieve efficient, exact inference and scalable circuit performance.
- These frameworks also integrate quantum, tensor network, and evolutionary approaches, enabling robust trade-offs among design objectives like area, delay, and power consumption.
Arithmetic circuit optimization frameworks comprise algorithmic and computational strategies for designing, refining, and synthesizing arithmetic circuits—such as adders, multipliers, and multiply-accumulators—under constraints of inference cost, area, delay, power consumption, and practical correctness. These frameworks encapsulate approaches from incremental compilation and learning-based scoring (as in statistical graphical models), through advanced rewriting systems and formal methods, to generator-based tools and integration with machine learning, as well as quantum circuit optimization. Modern frameworks offer programmatic, scalable means to explicitly balance tradeoffs among key design objectives, leveraging both the mathematical properties of arithmetic computations and engineering constraints.
1. Score-Driven Statistical Model Optimization
Optimization frameworks for arithmetic circuits in statistical graphical model inference use explicit score functions blending accuracy with direct surrogate measures for inference cost. In "Learning Arithmetic Circuits" (Lowd et al., 2012), circuits are optimized according to a score of the form

$$\mathrm{score}(C, T) \;=\; \log P(T \mid C) \;-\; \lambda_e\, n_e(C) \;-\; \lambda_p\, n_p(C),$$

where $\log P(T \mid C)$ is the data log-likelihood, $n_e(C)$ the circuit edge count, and $n_p(C)$ the parameter count. The user-weighted penalties ($\lambda_e$, $\lambda_p$) charge for model complexity (which directly affects inference time), embedding inference tractability directly into the learning objective.
The LearnAC algorithm incrementally grows a context-specific Bayesian network circuit by greedily selecting splits that preserve or increase the overall score (balancing data fit and complexity). Efficiency is achieved by incremental compilation techniques, such as the SplitAC subroutine, which locally duplicates only the affected portions of the arithmetic circuit. Heuristics rapidly prune candidate splits with poor cost–likelihood tradeoff, enabling tractable learning of arithmetic circuits whose underlying Bayesian networks may have unbounded treewidth but maintain linear-cost exact inference.
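A minimal sketch of this score-driven greedy loop is given below. The callables for log-likelihood evaluation, candidate split enumeration, and incremental split application are hypothetical stand-ins for the paper's components, not its actual API.

```python
# Illustrative sketch of score-driven greedy structure search in the spirit of
# LearnAC (Lowd et al., 2012). The callables passed in (log-likelihood, candidate
# split enumeration, incremental split application) are hypothetical stand-ins.

def score(circuit, data, log_likelihood, lambda_e=0.1, lambda_p=0.1):
    """Data log-likelihood minus user-weighted penalties on edges and parameters."""
    return (log_likelihood(circuit, data)
            - lambda_e * circuit.num_edges
            - lambda_p * circuit.num_parameters)

def learn_arithmetic_circuit(circuit, data, log_likelihood, candidate_splits, apply_split):
    best = score(circuit, data, log_likelihood)
    while True:
        # Greedily pick the split that most improves the score (data fit vs. complexity),
        # relying on incremental compilation to duplicate only the affected sub-circuit.
        scored = [(score(apply_split(circuit, s), data, log_likelihood), s)
                  for s in candidate_splits(circuit)]
        if not scored:
            return circuit
        top, split = max(scored, key=lambda pair: pair[0])
        if top <= best:                      # stop once no split improves the score
            return circuit
        circuit, best = apply_split(circuit, split), top
```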
Key experimental outcomes validate that, despite potentially higher structure learning cost, the resulting circuits can deliver exact inference in milliseconds, in contrast to seconds for approximate inference in context-specific Bayesian networks, and yield more accurate predictions with far less inference-time overhead (Lowd et al., 2012).
2. Circuit Lower Bounds and Complexity Constraints
Optimization viability is tightly constrained by circuit lower bounds. "Arithmetic Circuit Lower Bounds via MaxRank" (Kumar et al., 2013) introduces the polynomial coefficient matrix $M_f$ and the max-rank complexity measure

$$\mathrm{maxrank}(f) \;=\; \max_{\sigma}\ \operatorname{rank}\!\big(M_{f|_{\sigma}}\big),$$

where $\sigma$ ranges over all variable assignments. This measure generalizes partial derivative matrix arguments to the non-multilinear case, and possesses additive/multiplicative properties under arithmetic circuit composition. Rigorous max-rank analysis yields superpolynomial and exponential lower bounds for non-multilinear depth-3 circuits, product-sparse formulas, and partitioned arithmetic branching programs. For instance, the measure yields a size lower bound for homogeneous depth-3 circuits computing iterated matrix multiplication (IMM) that surpasses prior bounds when the number of multiplied matrices is large. These lower bounds imply that frameworks imposing product dimension, sparsity, or partitioning constraints cannot compress high-rank explicit polynomials, so optimization must strategically balance expressivity with complexity limitations (Kumar et al., 2013).
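As a toy illustration of the measure (constructed ad hoc, not taken from the cited paper), the sketch below builds the coefficient matrix of a small polynomial with rows indexed by X-monomials and columns by Y-monomials, substitutes every value of a remaining variable z, and reports the maximum rank.

```python
# Toy max-rank computation: for f(x1, x2, y1, y2, z) = x1*y1 + z*x1*y2
#                                                      + (1 - z)*x2*y1 + z*x2*y2,
# rows of the coefficient matrix are indexed by {x1, x2}, columns by {y1, y2},
# and z is substituted with each value in a small domain. Illustrative only.
import numpy as np

def coefficient_matrix(z):
    return np.array([[1.0,     z],
                     [1.0 - z, z]])

def maxrank(assignments=(0, 1)):
    return max(np.linalg.matrix_rank(coefficient_matrix(z)) for z in assignments)

print(maxrank())  # 2: the z = 1 substitution attains full rank, the z = 0 one does not
```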
Further, lower bounds for functional computation (Forbes et al., 2016)—where only input–output agreement is required on a Boolean hypercube—demonstrate that circuit simplifications preserving only output value (not polynomial form) are often limited by "shifted evaluation dimension" and cannot bypass complexity bottlenecks. Bridging these results to boolean circuit complexity establishes that strong functional lower bounds for arithmetic circuits can yield major complexity class separations (#P vs ACC⁰), indicating deep connections between arithmetic and boolean optimization (Forbes et al., 2016).
3. Automated Rewriting Systems and E-Graphs
Large-scale practical optimization leverages automated rewriting systems based on e-graph data structures (Coward et al., 2022, Wanna et al., 2023, Coward et al., 2023, Coward et al., 18 Apr 2024). An e-graph compactly represents a vast equivalence class of functionally identical circuit expressions, enabling the application of local, conditional rewrite rules at both the arithmetic and gate level.
Rewriting frameworks express optimizations as localized equivalence-preserving transformations (e.g., distributing multiplication over addition, merging or splitting compressor cells, applying carry-save or fused-multiply-add representation). Bitwidth is explicitly modeled, ensuring validity and cost-awareness through side-conditions on rewrites. Costs are generally measured in terms of required two-input gates, and rewrites only proceed when cost is non-increasing under given constraints.
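A minimal sketch of such a cost-gated, bitwidth-aware local rewrite follows. The tuple-based expression encoding and the crude gate-cost model are illustrative assumptions, not the representation or cost functions of the cited tools.

```python
# Sketch of a conditional local rewrite with a bitwidth side-condition and a
# non-increasing-cost check (representation and cost model are illustrative).

def width(expr, leaf_widths):
    if isinstance(expr, str):
        return leaf_widths[expr]
    op, a, b = expr
    wa, wb = width(a, leaf_widths), width(b, leaf_widths)
    return max(wa, wb) + 1 if op == "add" else wa + wb       # add grows 1 bit, mul sums widths

def gate_cost(expr, leaf_widths):
    """Crude two-input-gate estimate: ~5w for a w-bit add, ~5*wa*wb for a multiply."""
    if isinstance(expr, str):
        return 0
    op, a, b = expr
    sub = gate_cost(a, leaf_widths) + gate_cost(b, leaf_widths)
    wa, wb = width(a, leaf_widths), width(b, leaf_widths)
    return sub + (5 * max(wa, wb) if op == "add" else 5 * wa * wb)

def distribute(expr, leaf_widths, max_width=32):
    """a*(b+c) -> a*b + a*c, applied only if widths stay legal and cost does not grow."""
    if isinstance(expr, tuple) and expr[0] == "mul" \
            and isinstance(expr[2], tuple) and expr[2][0] == "add":
        a, (_, b, c) = expr[1], expr[2]
        rewritten = ("add", ("mul", a, b), ("mul", a, c))
        if (width(rewritten, leaf_widths) <= max_width                      # side condition
                and gate_cost(rewritten, leaf_widths) <= gate_cost(expr, leaf_widths)):
            return rewritten
    return expr

leaf_widths = {"a": 4, "b": 4, "c": 4}
# The distributed form costs more under this model, so the rewrite is rejected
# and the original expression is returned unchanged.
print(distribute(("mul", "a", ("add", "b", "c")), leaf_widths))
```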
A global extraction step—often formulated as an integer linear programming problem—selects the optimal circuit structure from the e-graph, sharing common subexpressions where possible. In multiplier synthesis, e-graph based tools (e.g., OptiMult (Wanna et al., 2023)) allow local rewrites between the AND array, sum-of-rows, and various compressor trees, dividing optimization into phases (e.g., compressor placement, Boolean synthesis), and can deliver up to 46% latency reduction in squarers and 9% in standard multipliers compared to synthesized baseline components.
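The extraction step can be sketched as choosing one representative e-node per equivalence class while charging each shared class only once. Production tools solve this globally with ILP; the toy version below (with a hand-built, hypothetical e-graph layout) uses a simple greedy chooser instead.

```python
# Toy extraction over a hand-built e-graph (data layout is illustrative).
OP_COST = {"leaf": 0, "add": 1, "mul": 8, "shift": 1}

egraph = {
    "x":    [("leaf", ())],
    "two":  [("leaf", ())],
    "t1":   [("mul", ("x", "two")), ("shift", ("x",))],   # x*2 is equivalent to x<<1
    "root": [("add", ("t1", "t1"))],                       # (x*2) + (x*2); t1 is shared
}

def choose(cls, memo):
    """Pick the cheapest e-node for a class, recursing through child classes."""
    if cls in memo:
        return memo[cls]
    best = min(egraph[cls],
               key=lambda node: OP_COST[node[0]] + sum(choose(c, memo)[1] for c in node[1]))
    cost = OP_COST[best[0]] + sum(choose(c, memo)[1] for c in best[1])
    memo[cls] = (best, cost)
    return memo[cls]

def total_cost(root):
    """Cost with sharing: each selected class contributes its op cost exactly once."""
    memo, seen, total, stack = {}, set(), 0, [root]
    while stack:
        cls = stack.pop()
        if cls in seen:
            continue
        seen.add(cls)
        node, _ = choose(cls, memo)
        total += OP_COST[node[0]]
        stack.extend(node[1])
    return total

print(total_cost("root"))  # 2: the shift is chosen over mul, and the shared t1 is counted once
```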
E-graphs are further extended to support domain-aware rewrites and constraint-driven specialization (e.g., floating point architectures with near/far paths) via integration with abstract interpretation theory (Coward et al., 2023). ROVER (Coward et al., 18 Apr 2024) advances the state of the art by weaving arithmetic and workload-informed power optimization into a unified e-graph reformulation, incorporating data gating and clock gating rewrites that yield up to 33.9% reduction in dynamic power for industrial circuit workloads.
4. Generator-Based and Evolutionary Frameworks
Arithmetic circuit generators such as ArithsGen (Klhufek et al., 2022) provide modular Python-based meta-languages to enable rapid generation and optimization of a range of arithmetic architectures, including flat and hierarchical adders, multipliers, and MAC units. These generators expose internal structure and permit fine-grained customization (e.g., adder style selection within multipliers, hierarchical flattening). The output can be directly leveraged by formal verification, simulation (C/C++), and approximation optimization engines, including Cartesian Genetic Programming (CGP).
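The generator idea can be sketched as a parameterized Python function that emits a flat, netlist-like description whose internal structure stays visible to downstream verification, simulation, and approximation passes. The helper below is hypothetical and is not ArithsGen's actual API.

```python
# Hypothetical generator sketch (not ArithsGen's API): emit a flat gate-level
# description of an n-bit ripple-carry adder so downstream tools can inspect
# and modify its internal structure.
def ripple_carry_adder(n, name="rca"):
    gates, carry = [], "cin"
    for i in range(n):
        a, b, s = f"a{i}", f"b{i}", f"s{i}"
        p, g, c_next = f"{name}_p{i}", f"{name}_g{i}", f"{name}_c{i+1}"
        gates += [
            ("xor", p, a, b),                 # propagate
            ("and", g, a, b),                 # generate
            ("xor", s, p, carry),             # sum bit
            ("and", f"{name}_t{i}", p, carry),
            ("or",  c_next, g, f"{name}_t{i}"),
        ]
        carry = c_next
    return gates, carry                        # netlist plus carry-out net

netlist, cout = ripple_carry_adder(4)
print(len(netlist), "gates, carry-out on", cout)   # 20 gates, carry-out on rca_c4
```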
Integrating formal verification and adaptive resource allocation, evolutionary optimization frameworks (e.g., ADAC (Ceska et al., 2020)) combine candidate circuit generation (via CGP) with SAT-based miter evaluation for worst-case error. An adaptive search strategy dynamically adjusts formal verification resource limits to maximize search throughput, discarding designs that are slow to verify or non-promising, and focusing effort when progress stalls. This results in dense, scalable Pareto fronts in error–power tradeoff, and enables the synthesis of provably correct approximate variants of multipliers, dividers, and MACs, making them suitable for energy-aware applications with guaranteed error bounds.
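A schematic of the adaptive-resource search loop is sketched below. The mutation operator, power model, and time-limited worst-case-error check are hypothetical placeholders standing in for CGP mutation and the SAT-based miter evaluation.

```python
# Schematic of an ADAC-style loop (Ceska et al., 2020): evolve candidates, verify
# worst-case error with a time-limited check, and adapt the verification budget.
# `mutate`, `power`, and `verify_worst_case_error` are hypothetical placeholders;
# the verifier returns True/False, or None on timeout.
def approximate_search(seed, error_bound, mutate, power, verify_worst_case_error,
                       generations=1000, budget_s=1.0):
    best, best_power, stall = seed, power(seed), 0
    for _ in range(generations):
        candidate = mutate(best)
        ok = verify_worst_case_error(candidate, error_bound, timeout=budget_s)
        if ok is None:                            # proof timed out: discard, shrink budget
            budget_s = max(0.1, budget_s * 0.9)
            continue
        if ok and power(candidate) < best_power:  # provably within error bound and cheaper
            best, best_power, stall = candidate, power(candidate), 0
        else:
            stall += 1
        if stall > 50:                            # progress stalled: spend more time per proof
            budget_s, stall = budget_s * 2.0, 0
    return best
```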
BDD-based evaluation algorithms (Mrazek, 2022) accelerate error metric computation for approximation loops, offering up to 30× speedup over standard BDD approaches by carefully eliminating absolute value calculations or splitting into positive and negative branches. This further enables rapid convergence in evolutionary and heuristic optimization settings, albeit with the caveat that BDD-based scalability is a challenge for large or structurally complex circuits.
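The sign-splitting idea can be illustrated with a brute-force reference implementation: the exact-vs-approximate difference is accumulated in separate positive and negative branches rather than through an explicit absolute-value computation (the BDD method replaces this enumeration; the approximate multiplier here is only a hypothetical stand-in).

```python
# Brute-force illustration of error-metric evaluation with sign splitting.
def approx_mul(a, b):
    return (a | 1) * (b & ~1)      # hypothetical approximate multiplier (over- and under-estimates)

def error_metrics(width=4):
    pos_sum = neg_sum = worst = 0
    for a in range(1 << width):
        for b in range(1 << width):
            d = a * b - approx_mul(a, b)
            if d >= 0:
                pos_sum += d        # positive branch
            else:
                neg_sum -= d        # negative branch handled separately, no abs() needed
            worst = max(worst, d if d >= 0 else -d)
    mae = (pos_sum + neg_sum) / (1 << (2 * width))
    return mae, worst

print(error_metrics())   # mean absolute error and worst-case error over all input pairs
```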
5. Advanced Synthesis via Integer Linear Programming
Integer linear programming (ILP) formulations underpin several recent high-performance multiplier and MAC optimization frameworks. UFO-MAC (Zuo et al., 13 Aug 2024) synthesizes Pareto-optimal compressor trees and exploits the non-uniform arrival-time profile of carry-propagate adder stages. A first ILP assigns optimal counts and positions of 3:2 and 2:2 compressors per bit column. A second ILP allocation exploits critical-path timing by mapping individual input arrival times to compressor port assignments, using "big-M" constraints of the standard form

$$t_{\text{out}} \;\ge\; a_i + d_p - M\,(1 - x_{i,p}) \quad \text{for every input } i \text{ and candidate port } p,$$

(where $x_{i,p}$ is a binary assignment variable, $a_i$ an arrival time, $d_p$ a port delay, and $M$ a sufficiently large constant) to linearize the max-delay calculation.
CPA segments are then tailored: prefix graphs implement high-speed addition in critical regions, while increment or simpler structures serve the ends. Fused MAC architectures integrate accumulation directly into the partial product reduction phase, further minimizing area and delay by bypassing the need for a standalone CPA stage. UFO-MAC outperforms commercial IP on area and delay Pareto frontiers and demonstrates significant system-level benefits when instantiated in FIR filters and systolic arrays (Zuo et al., 13 Aug 2024).
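A minimal model of the big-M linearization for a single output column is sketched below using PuLP. The arrival times, port delays, and overall structure are toy assumptions for illustration, not the UFO-MAC formulation itself.

```python
# Toy big-M model: assign column inputs to compressor ports so the latest
# arrival plus port delay (the column output time) is minimized. Requires PuLP.
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary, value

arrival = {"pp0": 0.0, "pp1": 0.2, "carry": 0.5}      # input arrival times (ns)
delay   = {"A": 0.30, "B": 0.20, "C": 0.10}            # per-port propagation delay (ns)
M = 10.0                                               # big-M constant

prob = LpProblem("compressor_port_assignment", LpMinimize)
x = {(i, p): LpVariable(f"x_{i}_{p}", cat=LpBinary) for i in arrival for p in delay}
t_out = LpVariable("t_out", lowBound=0)
prob += t_out                                          # objective: column output arrival time

for i in arrival:                                      # every input uses exactly one port
    prob += lpSum(x[i, p] for p in delay) == 1
for p in delay:                                        # every port takes at most one input
    prob += lpSum(x[i, p] for i in arrival) <= 1
for i in arrival:
    for p in delay:                                    # big-M linearization of the max delay
        prob += t_out >= arrival[i] + delay[p] - M * (1 - x[i, p])

prob.solve()
print(value(t_out))   # 0.6: the latest-arriving input is routed to the fastest port
```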
6. Machine Learning Approaches: Diffusion Models for Circuit Synthesis
AC-Refiner (Xue et al., 3 Jul 2025) recasts arithmetic circuit synthesis as a conditional image generation problem, employing diffusion models (with U-Net architectures) to generate design candidates conditioned on quality-of-results (QoRs), such as area and delay. The circuit’s topology is encoded as a binary tensor; diffusion proceeds by iterative denoising, with sampling guided by a neural cost predictor evaluating the predicted QoR against a target.
Gradient-guided corrections in the denoising process help steer sampling toward desired locations in the design space (e.g., near the Pareto frontier). Subsequently, a legalization module ensures that generated structures adhere to correctness constraints. A self-refining loop incorporates new high-performing designs into the training corpus, incrementally improving the generator’s focus on high-potential regions of the design space. Empirical evidence demonstrates up to 15% delay and 10% area reduction over RL- and ILP-based baselines, with successful deployment in systolic array MACs and large VLSI flows (Xue et al., 3 Jul 2025).
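A schematic of cost-predictor-guided denoising is given below in PyTorch. The denoiser, cost predictor, guidance scale, and final thresholding are hypothetical stand-ins used only to convey the control flow, not AC-Refiner's actual components.

```python
# Schematic of gradient-guided reverse diffusion toward a target QoR.
import torch

def guided_sample(denoiser, cost_predictor, target_qor, steps=50,
                  shape=(1, 1, 32, 32), guidance_scale=1.0):
    x = torch.randn(shape)                              # start from pure noise
    for t in reversed(range(steps)):
        t_batch = torch.full((shape[0],), t)
        x = denoiser(x, t_batch)                        # one reverse-diffusion (denoising) step
        x = x.detach().requires_grad_(True)
        qor = cost_predictor(x)                         # predicted area/delay of the design
        loss = ((qor - target_qor) ** 2).sum()          # distance to the desired QoR
        loss.backward()
        x = (x - guidance_scale * x.grad).detach()      # steer sampling toward the target
    return (x > 0).float()                              # threshold to a binary topology tensor
```

The returned tensor would still need the legalization step described above before it represents a correct circuit.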
7. Quantum and Tensor-Network-Based Circuit Optimization
Quantum arithmetic circuit optimization frameworks extend classical techniques to account for resource-limited, error-prone quantum computation. Quantum-specific strategies include mapping arithmetic function evaluation to reversible logic (with pebble game scheduling for space-time tradeoffs) (Häner et al., 2018), employing ZX-calculus graphical rewriting (spider fusion, phase cancellation) to minimize T-gate and ancilla usage (Joshi et al., 2023), and scalable QFT-based architectures for multi-input addition with optimal gate counts on qubit and ququart devices (Kurt et al., 31 Oct 2024). Across these approaches, resource-estimation formulas for Toffoli-gate counts and circuit width guide datapath implementation choices.
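As a rough illustration of how such resource estimates steer implementation choices, the sketch below compares approximate textbook-style counts for a Draper-style QFT adder (no ancillas, many two-qubit rotations) against a ripple-carry adder (Toffoli gates plus an ancilla). The formulas are approximations used only as an example, not the detailed models of the cited works.

```python
# Rough, illustrative resource estimates (approximate counts, not the papers' models).
def qft_adder_estimate(n):
    cphase = n * (n + 1) // 2            # controlled rotations in the addition stage
    qft = 2 * (n * (n - 1) // 2 + n)     # QFT and inverse QFT (rotations plus Hadamards)
    return {"two_qubit_gates": cphase + qft, "ancillas": 0, "width": 2 * n}

def ripple_carry_estimate(n):
    return {"toffoli": 2 * n, "ancillas": 1, "width": 2 * n + 2}   # approximate

for n in (8, 32):
    print(n, qft_adder_estimate(n), ripple_carry_estimate(n))
```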
Tensor network frameworks for arithmetic circuits (Peng et al., 2022) recast the structure of arithmetic computations as tensor contractions, enabling high-dimensional function integration and manipulation to avoid the exponential cost ("curse of dimensionality") found in naive grid-based or black-box methods. Here, the tensor network—constructed directly from the circuit's computational DAG—exposes compressible structures for efficient integration and paves the way for crossing application boundaries (statistical inference, quantum simulation) where arithmetic circuit structure is crucial.
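A small ad hoc illustration of the idea (not the framework of Peng et al., 2022): integrating f(x, y) = x·y + x over a tensor-product quadrature grid by contracting per-variable evaluation vectors along the circuit DAG, so the full 2-D grid of function values is never materialized.

```python
# Circuit-structured contraction vs. naive grid evaluation for integrating
# f(x, y) = x*y + x over [0, 1]^2 with a crude tensor-product quadrature rule.
import numpy as np

n = 200
xs = np.linspace(0.0, 1.0, n); wx = np.full(n, 1.0 / n)   # quadrature nodes/weights in x
ys = np.linspace(0.0, 1.0, n); wy = np.full(n, 1.0 / n)   # quadrature nodes/weights in y

# Circuit view: a mul node over the x- and y-leaves feeds an add node with the x-leaf.
# Each node's contribution factorizes across variables, so only 1-D sums are needed.
structured = (wx @ xs) * (wy @ ys) + (wx @ xs) * wy.sum()

# Naive reference: materializes the full n-by-n grid of function values.
grid = np.outer(xs, ys) + xs[:, None]
naive = np.einsum("i,j,ij->", wx, wy, grid)

print(structured, naive)   # both are approximately 0.75, without the grid in the first case
```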
Arithmetic circuit optimization frameworks thus encompass a spectrum of techniques, from score-driven structure learning and rigorous complexity theory to automated programmatic rewriting, generator-based tools, ILP-guided architecture synthesis, diffusion model-aided generative design, and quantum circuit adaptation. The rapid evolution and interaction of these methods reflect the growing need to efficiently traverse large and complex arithmetic design spaces while guaranteeing correctness, tractability, and high application-level performance.