
Hamming Weight Phasing in NESTA

Updated 16 November 2025
  • Hamming Weight Phasing is a hardware technique that reformulates 3×3 convolution arithmetic using hierarchical Hamming weight compressors and phased carry deferral.
  • It minimizes logic depth by replacing deep binary adder trees with shallow popcount circuits, yielding an 8–10% reduced critical path and 40–50% energy savings per MAC batch.
  • The approach buffers and reinjects residual carries across cycles to enable a streamed computation model that improves throughput by about 20–30% in deep-channel workloads.

The Hamming Weight Phasing Approach is a hardware computation technique implemented in the NESTA neural processing engine to accelerate convolutional deep neural network layers by reformulating the arithmetic of 3×3 convolutions through a hierarchy of Hamming weight compressors, temporary carry deferral, and judicious phasing of addition operations. This approach leverages bit-level parallelism and carry buffering to minimize the cycle-critical path and improve energy efficiency, throughput, and timing over conventional adder-based MAC units in deep convolution processing (Mirzaeian et al., 2019).

1. Hierarchical Hamming-Weight Compressors for 3×3 Convolutions

NESTA processes each 3×3 input patch, which yields 9 multiply-accumulate (MAC) pairs, by generating partial products that are bit-aligned to form vectors at each bit position. For word-width W, each bit column i (0 ≤ i < W) thus contains up to 9 aligned bits x_i^0, ..., x_i^8. Instead of using a 9-input binary adder tree (e.g., Brent-Kung or Kogge-Stone, each requiring about log2(9) ≈ 4 full carry-propagate levels), NESTA employs a three-layer “Compression-and-Expansion Logic” (CEL) network of Hamming weight compressors.
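This bit-alignment step can be illustrated with a short behavioral sketch (a Python model, not the NESTA hardware; the word width and the nine sample values are illustrative), which forms the per-position bit columns and confirms that the column popcounts reproduce the batch sum:

```python
W = 16  # word width, matching the paper's 16-bit MAC experiments

def bit_columns(products, width=W):
    """Return, for each bit position i, the list of up to 9 aligned bits."""
    cols = [[] for _ in range(width)]
    for p in products:
        for i in range(width):
            cols[i].append((p >> i) & 1)
    return cols

# Nine illustrative partial products of a 3x3 patch (values are arbitrary)
prods = [3, 5, 7, 9, 11, 13, 2, 4, 6]
cols = bit_columns(prods)

# Column i holds 9 bits; its popcount times 2**i contributes to the sum:
total = sum(sum(c) << i for i, c in enumerate(cols))
assert total == sum(prods)
```

The assertion holds because summing popcounts with their positional weights is arithmetically identical to adding the nine words directly; the CEL network exploits exactly this identity.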

The compressor hierarchy operates as follows:

  • CEL-1: Each set of 9 bits at bit position i is fed to a C_HW(9:4) compressor, producing a 4-bit binary Hamming weight which is then re-aligned, distributing each output bit to its corresponding bit position in the sum.
  • CEL-2: Each bit-column, now with up to 4 or 7 bits post-realignment, is compressed using C_HW(4:3) or, in the improved design, C_HW(7:3), which more fully exploits compression for up to 7 input bits.
  • CEL-3: The resulting columns, each with at most 2 bits, are finalized using a standard 2-input adder or the “generate” (GEN) half of a carry-propagate adder (CPA).

Mathematically, denoting the partial products at position i and cycle c as x_i^{(c),0}, ..., x_i^{(c),8}:

  • y_i^0..y_i^3 = HWC_{9→4}(x_i^0..x_i^8) for CEL-1
  • z_i^0..z_i^2 = HWC_{4→3}(y_i^0..y_i^3) for CEL-2
  • Output (g_i, p_i) = GEN(z_i^0, z_i^1, z_i^2) in CEL-3

This compression strategy reduces physical logic depth, replaces multiple layers of binary adders with popcount circuits, and eliminates deep carry propagation except in the final computation stage.
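The layered compression can be sketched behaviorally (a value-level Python model, not gate-level RTL; the toy word width and input columns are illustrative assumptions) to show that each compress-and-expand layer preserves the weighted sum while shrinking column heights:

```python
def hwc(bits, out_width):
    """Hamming-weight compressor: popcount of `bits` as an out_width-bit value."""
    w = sum(bits)
    assert w < (1 << out_width)
    return [(w >> k) & 1 for k in range(out_width)]

def cel_layer(cols, in_cap, out_width, width):
    """Compress each bit column, re-aligning weight bit k to position i+k."""
    nxt = [[] for _ in range(width + out_width)]
    for i, col in enumerate(cols):
        assert len(col) <= in_cap
        for k, b in enumerate(hwc(col, out_width)):
            if b:
                nxt[i + k].append(1)
    return nxt

W = 8
cols = [[1] * 9 if i < 4 else [] for i in range(W)]   # toy input: value 9 * 0b1111 = 135
v0 = sum(sum(c) << i for i, c in enumerate(cols))
after1 = cel_layer(cols, 9, 4, W)                     # CEL-1: C_HW(9:4)
after2 = cel_layer(after1, 7, 3, W + 4)               # CEL-2: C_HW(7:3)
v2 = sum(sum(c) << i for i, c in enumerate(after2))
assert v0 == v2 == 135                                # compression preserves the value
assert max(len(c) for c in after2) <= 2               # ready for the GEN stage (CEL-3)
```

Each layer replaces tall columns with a few re-aligned weight bits, so only shallow popcount logic, never a wide carry chain, sits on the per-cycle path.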

2. Approximate Partial Sums and Residual Carry Representation

After the final layer of the CEL network, each bit position i outputs two binary signals per cycle:

  • g_i (“generate”): the low bit of the current column sum
  • p_i (“propagate”): the carry toward position i+1

Over word width W, collect these to form two W-bit vectors per convolution cycle c:

  • S~[c] = Σ_{i=0}^{W-1} g_i[c] · 2^i (approximate partial sum)
  • R[c] = Σ_{i=0}^{W-1} p_i[c] · 2^i (residual or deferred carry)

The correct sum for the 9-input MAC batch is S[c] = S~[c] + R[c]; however, NESTA defers the addition of R[c] and instead buffers it for future cycles, effectively skipping the final carry-propagate-adder (PCPA) stage for all but the last cycle.
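A minimal sketch of this decomposition (behavioral Python; `gen_split` is a hypothetical name, and the sketch assumes the deferred-carry vector is stored already aligned to its target bit position, so that S~ + R recovers the exact sum):

```python
def gen_split(cols, width):
    """Each column holds <= 2 bits; emit sum-bit vector S~ and carry vector R."""
    s_approx, residual = 0, 0
    for i in range(width):
        bits = cols[i] if i < len(cols) else []
        a = bits[0] if len(bits) > 0 else 0
        b = bits[1] if len(bits) > 1 else 0
        s_approx |= (a ^ b) << i        # g_i: low bit of the column sum
        residual |= (a & b) << (i + 1)  # carry toward position i+1
    return s_approx, residual

cols = [[1, 1], [1], [0, 1], [1, 1]]    # toy CEL-3 output columns
exact = sum(sum(c) << i for i, c in enumerate(cols))
s, r = gen_split(cols, 8)
assert s + r == exact                   # the deferred PCPA add restores the sum
```

Only the cheap XOR/AND pair runs per cycle; the width-W addition of s and r is exactly the PCPA work NESTA postpones.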

3. Carry-Deferral via Phasing Mechanism

The central element of Hamming weight phasing is the delay, or “deferral,” of full carry propagation. Instead of immediately routing each residual (carry) bit p_i[c] through a width-W CPA chain (which would extend the cycle’s critical path), these carry bits are saved into a cycle-indexed Carry-Buffer Unit (CBU), CB_i[c].

In the subsequent cycle (c+1), when a new group of 9 partial-product bits x_i^{(c+1),0..8} arrives at bit position i, NESTA injects the deferred carry CB_i[c] and the previously generated g_i[c] into the CEL-1 compressor as additional input bits. Thus, each HWC layer “consumes” the buffered residuals from the prior stage by phasing them into the current summation, eliminating the need for any wide, high-fanout carry chain during regular cycles.
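The reinjection loop can be sketched end to end (behavioral Python under stated simplifications: the CEL network is modelled as an exact per-column popcount, the carry buffer and the g vector are each held as a single integer reinjected as extra input words, and `mac9_phased` is a hypothetical name):

```python
def mac9_phased(batches, width=20):
    g = 0    # last cycle's approximate-sum bits, reinjected each cycle
    cb = 0   # carry-buffer contents, reinjected each cycle
    for batch in batches:              # each batch: 9 partial-product words
        cols = [0] * width
        for word in (*batch, g, cb):   # phase buffered g and CB bits back in
            for i in range(width):
                cols[i] += (word >> i) & 1
        g, cb = 0, 0
        for i in range(width):         # compress each column to (g_i, carry)
            g |= (cols[i] & 1) << i
            cb += (cols[i] >> 1) << (i + 1)
    return g + cb                      # final cycle: re-enable the PCPA

batches = [(3, 1, 4, 1, 5, 9, 2, 6, 5), (3, 5, 8, 9, 7, 9, 3, 2, 3)]
assert mac9_phased(batches) == sum(sum(b) for b in batches)
```

The invariant is that g + cb always equals the running total, so the single wide addition at the end yields the exact result no matter how many cycles of carries were deferred.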

Timing for each compute cycle is therefore:

T_cycle = T_DRU + T_CEL-1 + T_CEL-2 + T_CEL-3 + T_GEN

Compared to a conventional multiply-accumulate cycle:

T_MAC = T_mult + T_GEN + T_PCPA

The absence of T_PCPA (the full carry-propagate adder) in all but the final cycle reduces the critical path by roughly the depth of the carry-chain logic, yielding a measured 10–20% cycle-time reduction in practical deployment.

4. Residual Termination and Accurate Output Synthesis

In the final computation cycle (C_end), i.e., after processing the last batch of the last channel, NESTA re-enables the PCPA (the full CPA). At this point, it combines the previously stored approximate sum S~[C_end] and the buffered residual R[C_end] to produce the exact output:

S_final = S~[C_end] + R[C_end]

This operation is performed over two cycles (marked as a multi-cycle path), and its single-use overhead is amortized over the large number of upstream cycles.

5. Performance and Quantitative Benchmarks

Empirical data from a 32 nm post-layout implementation (9-input, 16-bit MAC operation) demonstrate the following measured results:

  • Critical path is reduced by 8–10% over the fastest MAC9 implementation (e.g., Kogge-Stone tree).
  • Energy per 9-input MAC batch decreases by 40–50%, due to replacement of binary adders with shallow Hamming weight compressors.
  • Throughput increases by 20–30% in deep-channel workloads, since almost all cycles avoid the full carry-propagate stage.

For comparison, a conventional approach using nine 16×16 multipliers and a 16-bit CPA per batch has a one-cycle T_mult and a carry-propagate delay of approximately 3.1 ns. In contrast, in NESTA:

  • Multiplication is absorbed into the DRU + HWC network.
  • The full carry-propagate T_PCPA (≈0.5 ns) is required only in the final cycle.
  • The popcount HWC layers incur ≈2.4 ns of total delay.
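These figures support a back-of-envelope amortization estimate (the per-output batch count of 64 is an illustrative assumption, not a number from the paper):

```python
# Delay figures quoted above, in nanoseconds
t_hwc, t_pcpa, t_conv = 2.4, 0.5, 3.1

n = 64                       # assumed number of streamed 9-input batches per output
nesta = n * t_hwc + t_pcpa   # PCPA delay paid once, at the final cycle
conv = n * t_conv            # full carry-propagate paid on every batch

print(f"amortized NESTA delay per batch: {nesta / n:.3f} ns vs {t_conv} ns conventional")
```

As n grows, the one-time PCPA cost vanishes into the stream, so the per-batch delay approaches the 2.4 ns popcount path alone.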

These metrics support a roughly 2× improvement in power–delay product compared to conventional 9-input MAC units.

6. Architectural and Computational Significance

The Hamming Weight Phasing Approach decomposes all but the final cycle of large-bitwidth MAC computation into a sequence of compressed cycles, each formed from three shallow popcount layers and the light-weight GEN stage. By continuously “phasing” residual carry bits forward and reinjecting them as inputs, the algorithm ensures that no long carry chain affects the cycle timing until the very end of the operation.

This mechanism aligns with the architectural goal of maximizing throughput and energy efficiency in deep neural network accelerators, where channel depth and convolutional batch size permit amortization of rare, long-latency operations. The use of Hamming weight compressors (versus traditional binary adder trees) inherently reduces both logic depth and switching activity. A plausible implication is improved scalability to wider datapaths and higher MAC parallelism due to the bounded critical path and reduced per-cycle logic depth.

In summary, Hamming Weight Phasing as implemented in NESTA enables a streamed computation model for 3×3 MACs, in which only the GEN half of traditional CPA logic and shallow Hamming weight compressors run per cycle, while full carry-propagate is deferred and amortized, yielding lower energy per batch, increased throughput, and improved timing relative to state-of-the-art adder trees and CPA-based MAC blocks (Mirzaeian et al., 2019).
