
M4BRAM: In-BRAM Mixed-Precision DNN Acceleration

Updated 16 December 2025
  • M4BRAM is a compute-in-BRAM architecture that integrates bit-serial multiply-accumulate units within BRAM to support flexible, mixed-precision DNN inference.
  • It employs an in-memory data duplication and shuffling scheme to optimize weight and activation reuse across processing elements for various DNN layer requirements.
  • M4BRAM achieves over 2× speedup in DNN workloads with minimal accuracy loss, significantly enhancing FPGA accelerator efficiency while maintaining full memory functionality.

M4BRAM is a compute-in-block RAM (BRAM) architecture designed to accelerate mixed-precision matrix-matrix multiplication operations in field-programmable gate arrays (FPGAs), specifically targeting deep neural network (DNN) inference workloads. M4BRAM extends conventional BRAM tiles to deliver bit-serial multiply-accumulate (MAC) computing directly within the memory array, supporting flexible weight and activation quantization, and efficient data reuse schemes. The architecture facilitates high hardware utilization for mixed-precision neural network models and maintains full memory functionality, allowing simultaneous memory access and in-memory computation, thereby complementing existing digital signal processing (DSP) resources for DNN inference acceleration (Chen et al., 2023).

1. Microarchitecture and Mixed-Precision Support

M4BRAM augments each standard 20 Kb M20K (Intel) or 36 Kb RAMB36E1 (Xilinx) BRAM tile with four “dummy” BRAM arrays—one per in-BRAM processing element (BPE)—a lightweight microsequencer (eFSM), and a “duplication shuffler” capable of selectively replicating and shuffling 32-bit weight data across the four BPEs. Port A serves both as a conventional memory write port and, in compute mode, as the instruction port feeding operands into the eFSM and BPEs. Port B remains available as a read port for simultaneous access by other compute resources during M4BRAM operations.

M4BRAM supports weights ($P_w$) at 2, 4, or 8 bits and activations ($P_i$) at 2 to 8 bits of precision. Weight precision is fixed at configuration time, while activation precision is dynamically selectable on each compute-in-memory (CIM) instruction via a 4-bit byte-enable field in the data word. Each BPE implements a bit-serial multiplier-accumulator, running for $P_i + 2$ cycles in synchronous (single-clock) mode or $\frac{P_i}{2} + 2$ cycles if double-pumped at twice the BRAM clock.
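
As a rough illustration of the bit-serial arithmetic (a Python sketch of the math, not the BPE hardware; the function name is illustrative), one activation bit is consumed per cycle via shift-and-add, and the extra two cycles model instruction issue:

```python
def bit_serial_mac(weight: int, activation: int, p_i: int) -> tuple[int, int]:
    """Multiply a weight by a p_i-bit unsigned activation one bit per
    cycle, as a bit-serial PE would; returns (product, cycles), where
    cycles = p_i + 2 models M4BRAM's 2-cycle instruction overhead."""
    acc = 0
    for bit in range(p_i):              # one activation bit per cycle
        if (activation >> bit) & 1:
            acc += weight << bit        # shift-and-add partial product
    return acc, p_i + 2

product, cycles = bit_serial_mac(5, 6, p_i=4)
assert product == 30 and cycles == 6   # 5*6 over 4+2 cycles
```

Halving the serial cycle count via double pumping corresponds to processing two activation bits per BRAM clock, which this scalar sketch does not model.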

This configuration enables true mixed-precision operation, permitting per-layer or even per-inference flexibility in quantization, responding to DNN requirements without idle resource cycling (Chen et al., 2023).

2. In-BRAM Data Duplication and Parallelism

M4BRAM employs a configurable in-BRAM data duplication scheme to exploit both weight-sharing ($N_W$) and activation-sharing ($N_I$) per processing element, tailored to diverse DNN layer reuse patterns. For a given $P_w$, each 32-bit BRAM word packs $L = 32/P_w$ weights, sliced and assigned to the four BPEs via the duplication shuffler. By choosing a duplication factor $D \in \{1, 2, 4\}$, any slice may be broadcast to 1, 2, or all 4 BPEs, setting $N_I$. For $P_w = 8$, the selectable configurations are $(N_W, N_I) \in \{(4,1), (2,2), (1,4)\}$ (doubled in the "large-dummy" variant).

The total number of multiply-accumulate operations per 32-bit load is:

$$R_{\text{total}} = \#\text{BPEs} \times (N_W \cdot N_I) = 4 \times L = 4 \times \frac{32}{P_w}.$$

This duplication scheme enables the architecture to flexibly match DNN layers characterized by either high weight or high activation reuse, optimizing utilization and supporting both convolutional and transformer models.
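
Under these definitions, the sharing configurations and MACs per load can be enumerated directly (a sketch assuming $D$ must divide $L$; the function name is illustrative):

```python
def sharing_configs(p_w: int, n_bpe: int = 4, word_bits: int = 32):
    """Enumerate (N_W, N_I, R_total) options for one BRAM word:
    L = word_bits/p_w packed weights; duplication factor D in {1,2,4}
    broadcasts each slice to D BPEs (N_I = D), leaving N_W = L/D
    distinct weights per BPE; R_total = n_bpe * N_W * N_I = 4L."""
    L = word_bits // p_w
    return [(L // d, d, n_bpe * L) for d in (1, 2, 4) if L % d == 0]

# P_w = 8 reproduces (N_W, N_I) in {(4,1), (2,2), (1,4)}, 16 MACs/load.
assert sharing_configs(8) == [(4, 1, 16), (2, 2, 16), (1, 4, 16)]
```

Note that $R_{\text{total}}$ is independent of $D$; the duplication factor only trades weight reuse against activation reuse.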

3. Accelerator Dataflow and System-Level Integration

In a tiled DNN accelerator employing M4BRAM, the dominant dataflow is weight-stationary with activation broadcast, mirroring typical FPGA deep learning accelerator (DLA) designs. Tiled DNN computation proceeds as follows:

  1. A 32-bit weight vector is loaded from DRAM into main BRAM.
  2. Four $P_i$-bit activations are sent in a 32-bit Port A write, cueing the eFSM to initiate a CIM instruction.
  3. The eFSM latches instruction data, configures duplication, and streams activations to BPEs.
  4. Over $P_i + 2$ cycles, each BPE performs bit-serial MAC operations, accumulating partial sums.
  5. Port B remains available during this operation so other fabric (e.g., DSPs) can fetch from BRAM concurrently.
  6. On completion, Port B reads retrieve the computed results for buffer or further reduction.
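
The steps above can be mimicked in a few lines of Python (a functional toy with illustrative names; the real interface is Port A writes and Port B reads, and the per-activation multiply is bit-serial as described in Section 1):

```python
def weight_stationary_dot(weight_tile, activation_stream, p_i=8):
    """Weight-stationary, activation-broadcast sketch: each of the
    4 BPEs holds one weight lane; every activation vector entry is
    broadcast and multiplied into per-BPE accumulators, standing in
    for one CIM instruction per activation write."""
    acc = [0] * 4                            # per-BPE partial sums
    mask = (1 << p_i) - 1                    # clamp to p_i bits
    for w_lanes, a in zip(weight_tile, activation_stream):
        a &= mask
        for bpe in range(4):                 # activation broadcast
            acc[bpe] += w_lanes[bpe] * a     # one MAC per BPE
    return acc                               # read back over Port B

assert weight_stationary_dot([[1, 2, 3, 4], [5, 6, 7, 8]],
                             [1, 2]) == [11, 14, 17, 20]
```

Because the weights stay resident, only activations and results cross the ports during steady-state operation, which is what frees Port B for concurrent fabric reads.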

Workload partitioning between M4BRAM (bit-serial MAC) and conventional DSP (bit-parallel) engines follows the $Q_{\text{VEC}}$ output dimension, enabling heterogeneous parallel execution and output merging.

M4BRAM occupies only Port A during compute (two instruction cycles plus $P_i + 2$ compute cycles); Port B access is decoupled and unaffected, ensuring full BRAM utility and supporting double-buffered dataflows, unlike previous compute-in-BRAM designs.

4. Performance Modeling and Hardware Efficiency

Theoretical throughput for a single M4BRAM tile operating at BRAM frequency $f$ is:

$$\mathrm{Throughput}_{\mathrm{MAC}} = \frac{f \times 4\left(\frac{32}{P_w}\right)}{P_i + 2} \quad \text{[MACs/s]}$$

For double-pumped BPEs:

$$\mathrm{Throughput}_{\mathrm{MAC}}^{\mathrm{DP}} = \frac{2f \times 4\left(\frac{32}{P_w}\right)}{P_i/2 + 2}$$

Scaling to $T$ tiles yields total throughput $T \cdot \mathrm{Throughput}_{\mathrm{MAC}}$, contingent on off-chip bandwidth $B_{\text{DRAM}} \geq T \cdot (f \times 32\,\text{bits})$ plus output reads. M4BRAM supports concurrent double-buffering and ping-pong arbitration, as Port A is only briefly occupied during eFSM operation.
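
These closed-form expressions translate directly into a small model (a sketch; the function name is illustrative):

```python
def m4bram_tile_throughput(f_hz: float, p_w: int, p_i: int,
                           tiles: int = 1,
                           double_pumped: bool = False) -> float:
    """Peak MAC/s for T tiles: f * 4*(32/p_w) / (p_i + 2), or
    2f * 4*(32/p_w) / (p_i/2 + 2) when double-pumped."""
    macs_per_issue = 4 * (32 // p_w)        # R_total per 32-bit load
    if double_pumped:
        return tiles * 2 * f_hz * macs_per_issue / (p_i / 2 + 2)
    return tiles * f_hz * macs_per_issue / (p_i + 2)

# One tile, 300 MHz, 8-bit weights/activations: 300e6*16/10 = 480 MMAC/s
assert m4bram_tile_throughput(300e6, 8, 8) == 480e6
# Double pumping the same tile: 2*300e6*16/6 = 1.6 GMAC/s
assert m4bram_tile_throughput(300e6, 8, 8, double_pumped=True) == 1.6e9
```

The model assumes compute-bound operation; in a real deployment the DRAM bandwidth condition above caps the usable tile count $T$.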

When compared with DSP-only and prior CIM designs (BRAMAC, CCB, CoMeFa), M4BRAM-S introduces a +19.6% BRAM area overhead per tile, but only Port A is reserved for CIM. Full-chip replacement increases core area by approximately 5.6% on an FPGA whose core is 28% BRAM, while more than doubling mixed-precision MAC throughput. BRAMAC supports only uniform-precision operation and occupies both ports, restricting flexibility and throughput ($<1.7\times$ the DSP baseline), whereas M4BRAM yields $>2.1\times$.

5. Experimental Evaluation

Across standard ImageNet classification networks (e.g., AlexNet, VGG-16, ResNet-18, and ViT-Base self-attention), including M4BRAM in a tiled DNN accelerator delivers an average 2.16× speedup relative to an 8-bit DSP-based DLA, with top-1 accuracy degradation below 0.5% for $P_i \geq 6$. Versus the prior BRAMAC-1DA, M4BRAM provides 1.43× greater performance.

Table: Representative Performance Metrics

| Model | BRAM Type | Logic LUTs | DSPs Used | BRAMs | $f_{\max}$ | Tile Latency (μs) | Top-1 Loss | Energy Eff. vs. DSP |
|---|---|---|---|---|---|---|---|---|
| AlexNet | DLA (8/8) | 55K | 648 | 2×1K | 300 MHz | 1.2 | 0.00% | – |
| | M4BRAM-S | 60K | 400 | 2×1K | 280 MHz | 0.55 | 0.28% | 2.3× |
| ResNet-18 | DLA (8/8) | 63K | 648 | 2×1K | 300 MHz | 2.4 | 0.00% | – |
| | M4BRAM-L | 68K | 500 | 2×1K | 275 MHz | 1.1 | 0.45% | 2.2× |
| ViT-Base (self-attn) | DLA (8/8) | 70K | 648 | 2×1K | 300 MHz | 3.1 | 0.00% | – |
| | M4BRAM-S (DP) | 75K | 300 | 2×1K | 260 MHz | 1.4 | 0.32% | 2.5× |

The key observation is a greater-than-2× latency reduction at minimal accuracy cost (<0.5% top-1).

6. Design Trade-offs and Implications

Reducing activation bit-width $P_i$ yields near-linear gains in bit-serial M4BRAM throughput, since compute depth is $P_i + 2$ cycles; for $P_i \geq 6$ the serial cycles dominate the fixed 2-cycle overhead, so the scaling is closest to linear there. Lowering weight width $P_w$ increases packing $L = 32/P_w$, and thus multiply-accumulate parallelism ($R_{\text{total}}$), which DSP architectures cannot readily exploit for sub-8-bit weights.

Per-tile area overhead for an M20K → M4BRAM-S conversion is +19.6%, but for FPGAs with 28% of core area comprising M20K blocks, the net core impact is ~5.6% for a 2×–2.3× MAC throughput benefit. A plausible implication is that M4BRAM can efficiently address DNN workloads with dynamic, heterogeneous quantization requirements at minimal resource premium.
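
The trade-off is easy to tabulate from the cycle model alone; frequency cancels when normalizing against the 8-bit/8-bit configuration (a sketch using the throughput formula from the performance model above):

```python
# Relative bit-serial throughput vs. the (P_w, P_i) = (8, 8) baseline,
# from Throughput proportional to (32/P_w) / (P_i + 2). The fixed +2
# overhead is why gains from shrinking P_i flatten at low precision,
# while shrinking P_w scales throughput exactly linearly via packing.
base = (32 / 8) / (8 + 2)
for p_w in (8, 4, 2):
    row = {p_i: round(((32 / p_w) / (p_i + 2)) / base, 2)
           for p_i in (8, 6, 4, 2)}
    print(f"P_w={p_w}: {row}")
```

For example, moving from (8, 8) to (4, 4) in this model yields roughly a 3.3× throughput gain before any accuracy considerations.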

7. Architectural Innovations and Future Prospects

M4BRAM is the first compute-in-BRAM design to support true mixed-precision operation with runtime precision switching: $P_w \in \{2, 4, 8\}$, $P_i \in [2, 8]$. The duplication shuffler enables efficient weight- and activation-sharing, matching $(N_W, N_I)$ to the prevailing reuse pattern in DNN layers for maximized utilization. The split-port architecture makes M4BRAM the only CIM approach to leave Port B uninterrupted, solving long-standing data-sharing and double-buffering issues in FPGA DNN engines.

This suggests M4BRAM forms an ideal building block for next-generation FPGA AI accelerators requiring fine-grained, layer-wise quantization control and robust hardware resource sharing. Its integration transforms passive on-chip memory to dynamically configurable compute arrays, delivering 2–3× DNN performance boost over conventional designs with modest area increases (Chen et al., 2023).
