Hamming Weight Phasing in NESTA
- Hamming Weight Phasing is a hardware technique that reformulates 3×3 convolution arithmetic using hierarchical Hamming weight compressors and phased carry deferral.
- It minimizes logic depth by replacing deep binary adder trees with shallow popcount circuits, yielding an 8–10% shorter critical path and 40–50% energy savings per MAC batch.
- The approach buffers and reinjects residual carries across cycles to enable a streamed computation model that improves throughput by about 20–30% in deep-channel workloads.
The Hamming Weight Phasing Approach is a hardware computation technique implemented in the NESTA neural processing engine to accelerate convolutional deep neural network layers by reformulating the arithmetic of 3×3 convolutions through a hierarchy of Hamming weight compressors, temporary carry deferral, and judicious phasing of addition operations. This approach leverages bit-level parallelism and carry buffering to minimize the cycle-critical path and improve energy efficiency, throughput, and timing over conventional adder-based MAC units in deep convolution processing (Mirzaeian et al., 2019).
1. Hierarchical Hamming-Weight Compressors for 3×3 Convolutions
NESTA processes each input patch, which yields 9 multiply-accumulate (MAC) pairs, by generating partial products that are bit-aligned into column vectors at each bit position. For word width $N$, each bit column $i$ ($0 \le i < N$) thus contains up to 9 aligned bits $p_{i,1}, \dots, p_{i,9}$. Instead of using a 9-input binary adder tree (e.g., Brent-Kung or Kogge-Stone, requiring on the order of $\lceil \log_2 9 \rceil = 4$ full carry-propagate levels), NESTA employs a three-layer “Compression-and-Expansion Logic” (CEL) network of Hamming weight compressors.
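To make the first compression step concrete, the snippet below (an illustrative Python model, not NESTA's gate-level design; the name `c_hw_9_4` is hypothetical) shows that a C_HW(9:4) compressor is simply a 9-input popcount whose 4-bit result is re-aligned across the next-higher bit positions:

```python
# Illustrative C_HW(9:4) compressor: the Hamming weight of 9 bits always
# fits in 4 bits (0..9 <= 15), so a single popcount replaces a 9-input
# binary adder tree at each bit column.
def c_hw_9_4(bits):
    """Compress 9 column bits into their 4-bit Hamming weight (LSB-first)."""
    assert len(bits) == 9 and all(b in (0, 1) for b in bits)
    w = sum(bits)                              # popcount, 0..9
    return [(w >> j) & 1 for j in range(4)]    # weight bits, LSB first

# A column of nine 1-bits at position i contributes weight 9 = 0b1001;
# its output bits re-align to positions i, i+1, i+2, i+3.
print(c_hw_9_4([1] * 9))  # -> [1, 0, 0, 1]
```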
The compressor hierarchy operates as follows:
- CEL-1: Each set of 9 bits at bit position $i$ is fed to a C_HW(9:4) compressor, producing a 4-bit binary Hamming weight; output bit $j$ of the weight is then re-aligned to bit position $i+j$ in the sum.
- CEL-2: Each bit-column, now with up to 4 or 7 bits post-realignment, is compressed using C_HW(4:3)—or, in the improved design, CC_HW(7:3), which more fully exploits compression for up to 7 bits.
- CEL-3: Resulting columns, each with at most 2 bits, are finalized using a standard 2-input adder or the “generate” (GEN) half of a carry-propagate adder (CPA).
Mathematically, denoting the partial-product bits at position $i$ in cycle $t$ as $p_{i,1}^{(t)}, \dots, p_{i,9}^{(t)}$:
- $w_i^{(1)} = \mathrm{HWC}_{9:4}\big(p_{i,1}^{(t)}, \dots, p_{i,9}^{(t)}\big)$ for CEL-1
- $w_i^{(2)} = \mathrm{HWC}_{7:3}\big(\text{re-aligned bits of } w^{(1)} \text{ landing in column } i\big)$ for CEL-2
- Output $\big(G_i^{(t)}, P_i^{(t)}\big)$ in CEL-3
This compression strategy reduces physical logic depth, replaces multiple layers of binary adders with popcount circuits, and eliminates deep carry propagation except in the final computation stage.
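Putting the three layers together, the following behavioral sketch (illustrative Python under the layer widths described above; `hamming_weight_bits` and `cel_compress` are hypothetical names, not NESTA RTL) compresses one batch of nine bit-aligned operands down to per-column generate/propagate pairs using only shallow popcounts:

```python
def hamming_weight_bits(bits, width):
    """Model of a C_HW compressor: popcount of the column, LSB-first."""
    w = sum(bits)
    assert w < (1 << width)
    return [(w >> j) & 1 for j in range(width)]

def cel_compress(products, nbits):
    """Reduce nine nbits-wide operands to per-column (G_i, P_i) pairs
    using only shallow popcounts -- no full carry propagation."""
    # Bit columns: column i holds bit i of each of the nine operands.
    cols = [[(p >> i) & 1 for p in products] for i in range(nbits)]

    # CEL-1: C_HW(9:4) per column; weight bit j re-aligns to column i+j.
    mid = [[] for _ in range(nbits + 4)]
    for i, col in enumerate(cols):
        for j, b in enumerate(hamming_weight_bits(col, 4)):
            mid[i + j].append(b)

    # CEL-2: each column now holds at most 4 bits (up to 7 once the
    # deferred carries of Section 3 are reinjected); compress again.
    out = [[] for _ in range(nbits + 8)]
    for i, col in enumerate(mid):
        for j, b in enumerate(hamming_weight_bits(col, 3)):
            out[i + j].append(b)

    # CEL-3 / GEN: a half or full adder per short column yields
    # generate/propagate bits, since sum(col) = G_i + 2 * P_i here.
    gp = []
    for col in out:
        w = sum(col)
        assert w <= 3
        gp.append((w & 1, w >> 1))
    return gp
```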
2. Approximate Partial Sums and Residual Carry Representation
After the final layer of the CEL network, each bit position $i$ outputs two binary signals per cycle:
- $G_i$ (“generate”): the current sum’s low bit at position $i$
- $P_i$ (“propagate”): the carry toward position $i+1$
Over word width $N$, collect these to form two $N$-bit vectors per convolution cycle $t$:
- $G^{(t)} = (G_{N-1}, \dots, G_0)$ (approximate partial sum)
- $P^{(t)} = (P_{N-1}, \dots, P_0)$ (residual or deferred carry)
The correct sum for the 9-input (3×3) MAC batch is $S^{(t)} = G^{(t)} + \big(P^{(t)} \ll 1\big)$; however, NESTA defers the addition of $P^{(t)}$ and instead buffers it for future cycles, effectively skipping the final carry-propagate adder (PCPA) stage for all but the last cycle.
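Continuing the behavioral sketch from Section 1, the identity $S = G + (P \ll 1)$ can be checked directly on random operands (a sanity check of the model, not a hardware result):

```python
import random

random.seed(1)
products = [random.randrange(1 << 16) for _ in range(9)]
gp = cel_compress(products, 16)                  # per-column (G_i, P_i)

G = sum(g << i for i, (g, _) in enumerate(gp))   # approximate partial sum
P = sum(p << i for i, (_, p) in enumerate(gp))   # deferred carry vector

assert G + (P << 1) == sum(products)             # exact once carries resolve
```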
3. Carry-Deferral via Phasing Mechanism
The central element of Hamming weight phasing is the delay, or “deferral,” of full carry propagation. Instead of immediately routing each residual (carry) bit through a width-$N$ CPA chain (which would extend the cycle’s critical path), these carry bits are saved into a cycle-indexed Carry-Buffer Unit (CBU).
In the subsequent cycle ($t+1$), when a new group of 9 partial-product bits arrives at bit position $i$, NESTA injects the deferred carry $P_{i-1}^{(t)}$ and the previously generated $G_i^{(t)}$ into the CEL-1 compressor as additional input bits. Thus, each HWC layer “consumes” the buffered residuals from the prior cycle by phasing them into the current summation, eliminating the need for any wide, high-fanout carry chain during regular cycles.
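A minimal sketch of the phasing loop follows, again in behavioral Python rather than NESTA's netlist; the generalized `reduce_columns` helper stands in for the fixed CEL layers, and all widths are computed dynamically for clarity:

```python
def reduce_columns(cols):
    """Stand-in for the CEL layers: repeated popcount-and-realign until
    every column holds at most 3 bits, then a GEN-style G/P split."""
    while max((len(c) for c in cols), default=0) > 3:
        width = max(len(c) for c in cols).bit_length()   # popcount width
        nxt = [[] for _ in range(len(cols) + width)]
        for i, col in enumerate(cols):
            w = sum(col)
            for j in range(width):
                nxt[i + j].append((w >> j) & 1)
        cols = nxt
    g = p = 0
    for i, col in enumerate(cols):
        w = sum(col)                 # 0..3: absorbed by a full adder
        g |= (w & 1) << i            # generate bit stays at column i
        p |= (w >> 1) << i           # propagate bit targets column i+1
    return g, p                      # exact value: g + (p << 1)

def nesta_stream(batches):
    """Hamming weight phasing: each cycle reinjects the previous G/P
    residuals as extra CEL-1 inputs; the wide add happens once at the end."""
    g = p = 0
    for batch in batches:            # one 9-operand (3x3) MAC batch per cycle
        assert len(batch) == 9
        inputs = [*batch, g, p << 1]               # 9 products + residuals
        span = max(x.bit_length() for x in inputs) + 8
        cols = [[] for _ in range(span)]
        for value in inputs:
            for i in range(span):
                cols[i].append((value >> i) & 1)
        g, p = reduce_columns(cols)  # shallow popcount layers only
    return g + (p << 1)              # final PCPA: the only carry-propagate add
```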
Timing for each compute cycle is therefore:
$$T_{\text{cycle}} \approx T_{\text{CEL-1}} + T_{\text{CEL-2}} + T_{\text{CEL-3/GEN}}$$
Compared to a conventional multiply-accumulate cycle:
$$T_{\text{MAC}} \approx T_{\text{multiply}} + T_{\text{adder-tree}} + T_{\text{CPA}}$$
The absence of the $T_{\text{CPA}}$ term (the full carry-propagate adder) in all but the final cycle reduces the critical path by roughly the depth of the carry-chain logic, yielding a measured 10–20% cycle time reduction in practical deployment.
4. Residual Termination and Accurate Output Synthesis
In the final computation cycle ($t = T$), i.e., after processing the last batch of the last channel, NESTA re-enables the PCPA (the full CPA). At this point, it combines the previously stored approximate sum and the buffered residual to produce the exact output:
$$S = G^{(T)} + \big(P^{(T)} \ll 1\big)$$
This operation is performed over two cycles (marked as a multi-cycle path), and its single-use overhead is amortized over the large number of upstream cycles.
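As a usage check for the streaming sketch above (hypothetical inputs, not a benchmark), the deferred-carry pipeline must agree with plain accumulation once the final resolution runs:

```python
import random

random.seed(0)
# 64 cycles of nine 32-bit products each (e.g., 16x16 multiplier outputs).
batches = [[random.randrange(1 << 32) for _ in range(9)] for _ in range(64)]

assert nesta_stream(batches) == sum(sum(b) for b in batches)
```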
5. Performance and Quantitative Benchmarks
Empirical data from a 32 nm post-layout implementation (for a 9-input, 16-bit MAC operation) demonstrate the following measured results:
- Critical path is reduced by 8–10% over the fastest MAC9 implementation (e.g., Kogge-Stone tree).
- Energy per 9-input MAC batch decreases by 40–50%, due to replacement of binary adders with shallow Hamming weight compressors.
- Throughput increases by 20–30% in deep-channel workloads, since almost all cycles avoid the full carry-propagate stage.
For comparison, a conventional approach using nine 16×16 multipliers and a 16-bit CPA per batch completes one MAC batch per cycle, with a critical path (multiply plus carry-propagate) of approximately 3.1 ns. In contrast, in NESTA:
- Multiplication is absorbed into the DRU + HWC network
- The full carry-propagate step (≈0.5 ns) is required only in the final cycle
- The popcount HWC layers incur ≈2.4 ns total delay
These metrics support a roughly 2× improvement in power–delay product compared to conventional 9-input MAC units.
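The amortization behind these figures can be made explicit. Treating the one-time carry-propagate resolution as overhead spread over $T$ compressed cycles gives, as a back-of-the-envelope reading of the numbers above (not a reported measurement):
$$\bar{T}_{\text{cycle}} \;\approx\; T_{\text{HWC}} + \frac{T_{\text{PCPA}}}{T} \;\approx\; 2.4\,\text{ns} + \frac{0.5\,\text{ns}}{T} \;\xrightarrow{\;T \to \infty\;}\; 2.4\,\text{ns}$$
For large $T$ this approaches the 2.4 ns popcount path, and the gap to the 3.1 ns conventional cycle is consistent with the reported critical-path and throughput gains.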
6. Architectural and Computational Significance
The Hamming Weight Phasing Approach decomposes a large-bitwidth MAC computation into a sequence of compressed cycles, each formed from three shallow popcount layers and the lightweight GEN stage, with full carry resolution reserved for the final cycle. By continuously “phasing” residual carry bits forward and reinjecting them as inputs, the algorithm ensures that no long carry chain affects cycle timing until the very end of the operation.
This mechanism aligns with the architectural goal of maximizing throughput and energy efficiency in deep neural network accelerators, where channel depth and convolutional batch size permit amortization of rare, long-latency operations. The use of Hamming weight compressors (versus traditional binary adder trees) inherently reduces both logic depth and switching activity. A plausible implication is improved scalability to wider datapaths and higher MAC parallelism due to the bounded critical path and reduced per-cycle logic depth.
In summary, Hamming Weight Phasing as implemented in NESTA enables a streamed computation model for 3×3 MACs, in which only the GEN half of traditional CPA logic and shallow Hamming weight compressors run per cycle, while full carry-propagate is deferred and amortized, yielding lower energy per batch, increased throughput, and improved timing relative to state-of-the-art adder trees and CPA-based MAC blocks (Mirzaeian et al., 2019).