M4BRAM: In-BRAM Mixed-Precision DNN Acceleration
- M4BRAM is a compute-in-BRAM architecture that integrates bit-serial multiply-accumulate units within BRAM to support flexible, mixed-precision DNN inference.
- It employs an in-memory data duplication and shuffling scheme to optimize weight and activation reuse across processing elements for various DNN layer requirements.
- M4BRAM achieves over 2× speedup in DNN workloads with minimal accuracy loss, significantly enhancing FPGA accelerator efficiency while maintaining full memory functionality.
M4BRAM is a compute-in-block RAM (BRAM) architecture designed to accelerate mixed-precision matrix-matrix multiplication operations in field-programmable gate arrays (FPGAs), specifically targeting deep neural network (DNN) inference workloads. M4BRAM extends conventional BRAM tiles to deliver bit-serial multiply-accumulate (MAC) computing directly within the memory array, supporting flexible weight and activation quantization, and efficient data reuse schemes. The architecture facilitates high hardware utilization for mixed-precision neural network models and maintains full memory functionality, allowing simultaneous memory access and in-memory computation, thereby complementing existing digital signal processing (DSP) resources for DNN inference acceleration (Chen et al., 2023).
1. Microarchitecture and Mixed-Precision Support
M4BRAM augments each standard 20 Kb M20K (Intel) or 36 Kb RAMB36E1 (Xilinx) BRAM tile with four “dummy” BRAM arrays—one per in-BRAM processing element (BPE)—a lightweight microsequencer (eFSM), and a “duplication shuffler” capable of selectively replicating and shuffling 32-bit weight data across the four BPEs. Port A serves both as a conventional memory write port and, in compute mode, as the instruction port feeding operands into the eFSM and BPEs. Port B remains available as a read port for simultaneous access by other compute resources during M4BRAM operations.
M4BRAM supports weights at b_w ∈ {2, 4, 8} bits and activations at b_a ∈ {2, …, 8} bits of precision. Weight precision is fixed at configuration time, while activation precision is dynamically selectable on each compute-in-memory (CIM) instruction via a 4-bit byte-enable field in the data word. Each BPE implements a bit-serial multiplier-accumulator, running for b_a cycles in synchronous (single-clock) mode or ⌈b_a/2⌉ BRAM cycles if double-pumped at twice the BRAM clock.
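The bit-serial MAC scheme can be sketched as a short behavioral model in Python: the multiply is decomposed into one shift-and-add pass per activation bit-plane, one pass per clock cycle. The function name and loop structure are illustrative, not the hardware RTL.

```python
def bitserial_mac(weights, acts, b_a):
    """Behavioral model of one BPE: accumulate sum(w * a) by iterating
    over the b_a activation bit-planes, one BRAM cycle per bit."""
    acc = 0
    for bit in range(b_a):
        # Partial sum for this bit-plane: weights gated by activation bit.
        plane = sum(w * ((a >> bit) & 1) for w, a in zip(weights, acts))
        acc += plane << bit  # shift-and-add into the accumulator
    return acc
```

With 3-bit activations, `bitserial_mac([3, 2], [5, 7], 3)` takes three passes and matches the direct dot product 3·5 + 2·7 = 29.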
This configuration enables true mixed-precision operation, permitting per-layer or even per-inference flexibility in quantization, responding to DNN requirements without idle resource cycling (Chen et al., 2023).
2. In-BRAM Data Duplication and Parallelism
M4BRAM employs a configurable in-BRAM data duplication scheme to exploit both weight sharing and activation sharing per processing element, tailored to diverse DNN layer reuse patterns. For a given weight precision b_w, each 32-bit BRAM word packs 32/b_w weights, sliced and assigned to the four BPEs via the duplication shuffler. A duplication factor D ∈ {1, 2, 4} determines whether each slice is broadcast to 1, 2, or all 4 BPEs; the selectable range of D depends on b_w and is doubled in the "large-dummy" M4BRAM-L variant.
The total number of multiply-accumulate operations per 32-bit load is N_MAC = (32 / b_w) × D; for example, b_w = 8 with D = 4 yields 16 MACs per loaded word.
This duplication scheme enables the architecture to flexibly match DNN layers characterized by either high weight or high activation reuse, optimizing utilization and supporting both convolutional and transformer models.
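The shuffler's broadcast behavior can be modeled with a minimal sketch, assuming a simple round-robin slice-to-BPE mapping (the real shuffler's mapping is configurable; the function name and mapping choice here are illustrative assumptions):

```python
def shuffle_duplicate(word_weights, D):
    """Distribute weight slices from one BRAM word across 4 BPEs,
    broadcasting each slice to D of them (D in {1, 2, 4}).
    Illustrative round-robin mapping, not the hardware's exact wiring."""
    assert D in (1, 2, 4)
    bpes = [[] for _ in range(4)]
    for i, w in enumerate(word_weights):
        base = (i * D) % 4  # starting BPE for this slice
        for d in range(D):
            bpes[(base + d) % 4].append(w)  # broadcast to D BPEs
    return bpes
```

The total number of slice assignments equals the number of packed weights times D, which is exactly the per-load MAC parallelism described above: four 8-bit weights with D = 2 produce eight assignments, two per BPE.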
3. Accelerator Dataflow and System-Level Integration
In a tiled DNN accelerator employing M4BRAM, the dominant dataflow is weight-stationary, activation-broadcast—mirroring typical FPGA deep learning accelerator (DLA) dataflows. Tiled DNN computation proceeds as follows:
- A 32-bit weight vector is loaded from DRAM into main BRAM.
- Four b_a-bit activations (b_a ≤ 8) are sent in a 32-bit Port A write, cueing the eFSM to initiate a CIM instruction.
- The eFSM latches instruction data, configures duplication, and streams activations to BPEs.
- Over b_a cycles, each BPE performs bit-serial MAC operations, accumulating partial sums.
- Port B remains available during this operation so other fabric (e.g., DSPs) can fetch from BRAM concurrently.
- On completion, Port B reads retrieve the computed results for buffer or further reduction.
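The steps above can be modeled end to end in pure Python. The little-endian unpacking layout and the per-BPE activation pairing are assumptions made for illustration; the helper name is hypothetical:

```python
def tile_cim(word, b_w, acts, D):
    """End-to-end sketch of one CIM instruction on a tile:
    unpack 32/b_w weights from a 32-bit word, duplicate each slice
    to D of the 4 BPEs, then have each BPE accumulate its
    weight-activation products (bit-serial timing abstracted away)."""
    n = 32 // b_w
    mask = (1 << b_w) - 1
    # Unpack packed weights, least-significant slice first (assumed layout).
    weights = [(word >> (i * b_w)) & mask for i in range(n)]
    # Duplication shuffler: broadcast each slice to D BPEs (round-robin).
    bpes = [[] for _ in range(4)]
    for i, w in enumerate(weights):
        for d in range(D):
            bpes[(i * D + d) % 4].append(w)
    # Each BPE accumulates its partial sum against the streamed activations.
    return [sum(w * a for w, a in zip(ws, acts)) for ws in bpes]
```

For example, a word packing the 8-bit weights [1, 2, 3, 4] with D = 1 and one activation of 10 per BPE returns the four partial sums [10, 20, 30, 40], which Port B reads would then retrieve.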
Workload partitioning between M4BRAM (bit-serial MAC) and conventional DSP (bit-parallel) engines follows the output dimension, enabling heterogeneous parallel execution and output merging.
M4BRAM occupies only Port A during compute (two instruction cycles plus b_a compute cycles); Port B access is decoupled and unaffected, ensuring full BRAM utility and supporting double-buffered dataflows, unlike previous compute-in-BRAM (CIM) approaches.
4. Performance Modeling and Hardware Efficiency
Theoretical peak throughput for a single M4BRAM tile operating at BRAM frequency f is T = (32 / b_w) × D × f / b_a MAC/s.
For double-pumped BPEs, the serial cycle depth halves: T_dp = (32 / b_w) × D × f / ⌈b_a / 2⌉ MAC/s.
Scaling to N tiles multiplies throughput by N, contingent on off-chip bandwidth and output-read overhead. M4BRAM supports concurrent double-buffering and ping-pong arbitration, as Port A is only briefly occupied during eFSM operation.
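The per-tile throughput model reduces to a one-line calculator. This is an interpretation of the description above (per-load MAC count times frequency, divided by serial cycle depth), not a vendor-published formula:

```python
import math

def tile_throughput(b_w, b_a, D, f_mhz, double_pumped=False):
    """Peak MAC/s for one M4BRAM tile under the reconstructed model:
    (32/b_w)*D MACs per load, b_a (or ceil(b_a/2)) cycles per result."""
    macs_per_load = (32 // b_w) * D
    cycles = math.ceil(b_a / 2) if double_pumped else b_a
    return macs_per_load * f_mhz * 1e6 / cycles
```

At b_w = b_a = 8, D = 4, and 300 MHz, a tile delivers 600 MMAC/s single-clock and 1.2 GMAC/s double-pumped; N tiles scale this linearly.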
When compared with DSP-only and prior CIM designs (BRAMAC, CCB, CoMeFa), M4BRAM-S introduces a +19.6% BRAM area overhead per tile, but only Port A is reserved for CIM. Full-chip replacement increases core area by approximately 5.6% on a 28%-BRAM-heavy FPGA, more than doubling mixed-precision MAC throughput. BRAMAC supports only uniform-precision operation and occupies both ports, restricting flexibility and throughput (<1.7× the DSP baseline), whereas M4BRAM achieves an average 2.16× speedup over that baseline.
5. Experimental Evaluation
Across standard ImageNet classification networks (e.g., AlexNet, VGG-16, ResNet-18, ViT-Base self-attention), inclusion of M4BRAM in a tiled DNN accelerator delivers an average 2.16× speedup relative to an 8-bit DSP-based DLA, while maintaining top-1 accuracy degradation below 0.5%. Versus the prior BRAMAC-1DA design, M4BRAM provides 1.43× greater performance.
Table: Representative Performance Metrics
| Model | Configuration | Logic LUTs | DSPs Used | BRAMs | Fmax | Tile Latency (μs) | Top-1 Loss | Energy Eff. vs. DSP |
|---|---|---|---|---|---|---|---|---|
| AlexNet | DLA (8/8) | 55K | 648 | 2×1K | 300 MHz | 1.2 | 0.00% | 1× |
| AlexNet | M4BRAM-S | 60K | 400 | 2×1K | 280 MHz | 0.55 | 0.28% | 2.3× |
| ResNet-18 | DLA (8/8) | 63K | 648 | 2×1K | 300 MHz | 2.4 | 0.00% | 1× |
| ResNet-18 | M4BRAM-L | 68K | 500 | 2×1K | 275 MHz | 1.1 | 0.45% | 2.2× |
| ViT-Base (self-attn) | DLA (8/8) | 70K | 648 | 2×1K | 300 MHz | 3.1 | 0.00% | 1× |
| ViT-Base (self-attn) | M4BRAM-S (DP) | 75K | 300 | 2×1K | 260 MHz | 1.4 | 0.32% | 2.5× |
Key observations include greater than 2× latency reduction for minimal accuracy drop (<0.5% top-1).
6. Design Trade-offs and Implications
Reducing activation bit-width b_a yields near-linear gains in bit-serial M4BRAM throughput, since the serial cycle depth scales with b_a and dominates the per-instruction overhead. Lowering weight width b_w increases per-word packing (32/b_w weights) and thus multiply-accumulate parallelism ((32/b_w) × D), which DSP architectures cannot readily exploit for sub-8-bit weights.
Per-tile area overhead for M20K M4BRAM-S is +19.6%, but for FPGAs with 28% of core area comprising M20K, net core impact is ~5.6% for 2×–2.3× MAC throughput benefit. A plausible implication is that M4BRAM can efficiently address DNN workloads with dynamic, heterogeneous quantization requirements at minimal resource premium.
7. Architectural Innovations and Future Prospects
M4BRAM is the first compute-in-BRAM design to support true mixed-precision operation with runtime precision switching: b_w ∈ {2, 4, 8} and b_a ∈ {2, …, 8}. The duplication shuffler mechanism enables efficient weight- and activation-sharing, matching the prevailing reuse pattern in each DNN layer for maximized utilization. The split-port architecture makes M4BRAM the only CIM approach to leave Port B uninterrupted, solving long-standing data-sharing and double-buffering issues in FPGA DNN engines.
This suggests M4BRAM forms an ideal building block for next-generation FPGA AI accelerators requiring fine-grained, layer-wise quantization control and robust hardware resource sharing. Its integration transforms passive on-chip memory into dynamically configurable compute arrays, delivering over 2× DNN performance improvement relative to conventional designs at a modest area cost (Chen et al., 2023).