BitROM Accelerator

Updated 2 April 2026

BitROM Accelerator is a compute-in-memory architecture that integrates bitwise and multiply-accumulate primitives within dense memory arrays using analog readout techniques.
It employs innovations like triple-row charge sharing and dual-contact cells for in-place logic, enabling operations such as AND, OR, NOT and ternary weight mapping for neural inference.
The design achieves significant performance gains with up to 25× throughput improvements and over 10× energy efficiency in large-scale applications, while incurring minimal area overhead.

A BitROM Accelerator is a physically optimized compute-in-memory or compute-in-Read-Only-Memory (CiROM) architecture designed to enable high-throughput, energy-efficient bitwise operations or dense weight storage and access directly within memory arrays. The BitROM family includes both DRAM-based in-memory logic designs (as exemplified by the Buddy architecture) and application-specific CiROM macros for ultra-compact neural inference, notably for billion-parameter LLMs using ultra-low-bit quantization. BitROM leverages analog phenomena within standard memory cells and custom readout circuits, including triple-row charge sharing for bitwise logic, sense-amplifier side access for in-place inversion, multi-mode accumulators for ternary compute, and near-memory or on-die buffer arrays (e.g., eDRAM) for fast working set management. Functionality, efficiency, and practical system integration are achieved through architectural, circuit, and workload co-design, tightly calibrated to the requirements of memory-centric high-density workloads (Seshadri et al., 2016, Zhang et al., 10 Sep 2025).

1. Core Principles: Bitwise and ROM-Compute Mechanisms

BitROM accelerators realize functionally complete bitwise and/or multiply-accumulate primitives inside dense memory arrays via analog or mixed-signal readout techniques. DRAM-based BitROM (“Buddy-RAM”) relies on simultaneous triple-row activation to implement the bitwise majority function, which reduces to AND or OR when the third row is preset to a constant. Specifically, let $V_{\rm rowA}$, $V_{\rm rowB}$, $V_{\rm rowC}\in\{0,V_{\rm DD}\}$ denote the cell voltages, $C_c$ the cell capacitance, and $C_b$ the bitline capacitance. With the bitline precharged to $V_{\rm DD}/2$, charge sharing shifts the bitline potential away from that precharge level, before sense amplification, by

$V_{SA} = \frac{2k-3}{6+2(C_b/C_c)}V_{\rm DD}$

where $k$ is the number of charged cells among the three activated rows. The expression is positive exactly when $k\geq 2$, so the sense amplifier latches logic 1 in that case (implementing the majority function $AB + BC + CA$) and logic 0 otherwise. Presetting the third row to a constant then yields AND ($C=0$) or OR ($C=1$) deterministically.
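As a minimal sketch of the charge-sharing argument above, the following Python model compares the closed-form expression with a direct charge-conservation calculation and derives AND/OR from the majority result. The function names, the unit cell capacitance, and the ratio $C_b/C_c = 4$ are illustrative assumptions, not values from the cited work.

```python
def bitline_deviation(k, cb_over_cc, vdd=1.0):
    """Closed-form shift of the bitline from VDD/2 with k of 3 cells charged."""
    return (2 * k - 3) / (6 + 2 * cb_over_cc) * vdd


def charge_share(cells, cb_over_cc, vdd=1.0):
    """Bitline voltage from charge conservation.

    Three cells (unit capacitance, assumed) share charge with a bitline
    of capacitance cb_over_cc that was precharged to VDD/2.
    """
    cc, cb = 1.0, cb_over_cc
    charge = sum(cc * (vdd if bit else 0.0) for bit in cells) + cb * vdd / 2
    return charge / (3 * cc + cb)


def triple_row_activate(a, b, c, cb_over_cc=4.0, vdd=1.0):
    """Idealized sense amplifier: resolves to 1 if the bitline ends above VDD/2."""
    return int(charge_share((a, b, c), cb_over_cc, vdd) > vdd / 2)


def bit_and(a, b):
    return triple_row_activate(a, b, 0)  # third row preset to 0


def bit_or(a, b):
    return triple_row_activate(a, b, 1)  # third row preset to 1
```

In this idealized model the outcome depends only on the sign of the shift, so any finite capacitance ratio gives the same logic result; a real design would additionally have to guarantee enough sensing margin.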

Bitwise NOT is realized by exploiting the cross-coupled inverter structure of the sense amplifier: a dual-contact cell (DCC) is used to selectively sample either the true or complement side during activation, and copy the result back to DRAM rows, with timing coordinated by reserved wordlines. These mechanisms permit all standard bitwise primitives (AND, OR, NOT, NAND, NOR, XOR, XNOR) to be mapped into in-memory operation sequences.
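One possible way to compose the remaining primitives from majority and NOT is sketched below, with each DRAM row modeled as a Python integer bit-vector. The 16-bit row width and the particular XOR decomposition are assumptions chosen for illustration; the command sequences actually used in the cited design are not given here.

```python
WIDTH = 16                 # assumed row width for the sketch
ONES = (1 << WIDTH) - 1    # a control row holding all ones


def maj(a, b, c):
    """Row-wide majority, the result of a triple-row activation."""
    return (a & b) | (b & c) | (c & a)


def row_and(a, b):
    return maj(a, b, 0)        # control row preset to all zeros


def row_or(a, b):
    return maj(a, b, ONES)     # control row preset to all ones


def row_not(a):
    return ~a & ONES           # complement side of the sense amplifier


def row_nand(a, b):
    return row_not(row_and(a, b))


def row_nor(a, b):
    return row_not(row_or(a, b))


def row_xor(a, b):
    # (a AND NOT b) OR (NOT a AND b)
    return row_or(row_and(a, row_not(b)), row_and(row_not(a), b))


def row_xnor(a, b):
    return row_not(row_xor(a, b))
```

Each helper corresponds to one or more in-memory steps, so derived operations such as XOR cost several activations rather than one.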

For ROM-based accelerators targeting LLMs, the BitROM design employs a Bidirectional ROM Array (BiROMA): each transistor cell is read bidirectionally to encode and retrieve two independent ternary weights via source/drain symmetry and controlled pre-biasing. Weights are encoded as $w\in\{-1,0,+1\}$ by setting source lines to distinct voltage levels, and the direction of current during reading selects which logical weight is observed (Zhang et al., 10 Sep 2025).

2. Microarchitectural Innovations and System Integration

BitROM architectures comprise custom subarray decode logic, local-periphery accumulators, and integrated buffer arrays to support diverse operation modes. In the DRAM domain, the array is split between standard data rows and reserved “bitwise-control” rows, with a small “B-group decoder” permitting up to three simultaneous wordline activations and specialized access to DCCs. Operation sequencing (e.g., triple-row activation, ACTIVATE-ACTIVATE-PRECHARGE) is encoded into the memory controller without adding DRAM command opcodes, leveraging RowClone for fast in-DRAM row copies alongside the in-place analog logic steps (Seshadri et al., 2016).

For CiROM LLM accelerators, the macro includes both a BiROMA and a Tri-Mode Local Accumulator (TriMLA). The TriMLA operates in one of three modes: zero-skip ($w=0$), positive ($w=+1$), or negative ($w=-1$), distributing add/subtract/skip control locally. Microarchitecturally, each group of columns shares one TriMLA, which accumulates the MAC outputs for its input slice and passes results to a global adder tree, trading a narrower global adder for a modest amount of local peripheral area. An on-die Decode-Refresh (DR) eDRAM holds the KV cache for the sequence; data are refreshed as a side effect of memory accesses, without explicit refresh scheduling, which reduces external DRAM accesses (Zhang et al., 10 Sep 2025).
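The three accumulator modes can be illustrated with a small behavioral sketch: each column group accumulates locally by skipping, adding, or subtracting the activation, and a final sum stands in for the global adder tree. The function names and the group size of 4 are hypothetical, and the model ignores bit-serial inputs and any analog effects.

```python
def trimla_accumulate(weights, activations):
    """Local accumulation for one column group with ternary weights."""
    acc = 0
    for w, x in zip(weights, activations):
        if w == 0:
            continue        # zero-skip mode: no accumulation
        elif w == 1:
            acc += x        # positive mode: add
        elif w == -1:
            acc -= x        # negative mode: subtract
        else:
            raise ValueError(f"weight {w!r} is not ternary")
    return acc


def macro_dot(weights, activations, group_size=4):
    """Split the input into column groups, accumulate locally, then sum.

    Returns the total and the per-group partial sums; the final sum()
    plays the role of the global adder tree.
    """
    if len(weights) != len(activations):
        raise ValueError("weights and activations must have equal length")
    partials = [
        trimla_accumulate(weights[i:i + group_size], activations[i:i + group_size])
        for i in range(0, len(weights), group_size)
    ]
    return sum(partials), partials
```

Because no multiplier is needed for weights in $\{-1,0,+1\}$, the per-group logic reduces to an adder/subtractor with a skip control, which is the property the local accumulator exploits.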

3. Area, Density, and Overhead Analysis

The fundamental advantage of BitROM is its extremely low area overhead for logic capability within the memory array. For DRAM-based designs, reserving 14 out of 1024 rows per subarray (including bitwise control, DCC, and pre-bias rows) incurs an area overhead of only about 1.4%:

$\text{Overhead} = \frac{14}{1024} \approx 1.37\%$

The DCC implementation adds the equivalent of four rows per subarray (two DCC rows × 2). The area penalty is dominated by reserved, non-data rows rather than by fundamental cell modifications.

In 65 nm CMOS, BitROM macros with BiROMA achieve a bit density of […] kb/mm$^2$, representing a […] improvement over prior digital CiROM ([…] kb/mm$^2$) and […] over hybrid analog CiROM ([…] kb/mm$^2$, normalized). This is attributed to bidirectional reading (two weights per transistor) and minimal logic periphery. For billion-parameter models (e.g., Falcon3-1B), macro area requirements are on the order of […] cm$^2$ (plus […] cm$^2$ of eDRAM) at a hypothetical 14 nm node, making edge LLM inference with multi-billion weights physically plausible (Zhang et al., 10 Sep 2025).

4. Performance Metrics and Evaluation Methodology

Performance and efficiency are characterized by throughput speedup and energy reduction relative to baseline memory-access-bound computation. These are quantified as

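$\text{Speedup} = \frac{\text{Throughput}_{\rm BitROM}}{\text{Throughput}_{\rm baseline}}, \qquad \text{Energy reduction} = \frac{E_{\rm baseline}}{E_{\rm BitROM}}$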

For in-DRAM BitROM bitwise primitives, raw throughput improvements range from […] (AND) to […] (NOT), with corresponding energy reductions of […] to […], across seven common operations (AND, OR, NOT, NAND, NOR, XOR, XNOR) (Seshadri et al., 2016). For CiROM LLM accelerators, macro-level energy efficiency reaches […] TOPS/W for 1.58 b/4 b inference, a […] advantage over previous digital implementations.

The impact of on-die DR eDRAM for the KV cache is modeled as a reduction in external DRAM accesses. For sequence length […] and an on-die buffer holding the […] earliest tokens, the reduction fraction is

[…]

which, for […] and […], yields a […] cut in DRAM reads during decoding.
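One simple way to reason about such a reduction is sketched below. It assumes, as a simplification that is not taken from the cited work, that decode step $t$ reads the cached keys and values of $t$ tokens and that reads of the earliest buffered tokens are served entirely on-die. The function names and the example sizes are illustrative.

```python
def external_kv_reads(seq_len, on_die_tokens=0):
    """Token-level KV reads that go off-chip over a decode of seq_len steps.

    Assumption: step t touches t cached tokens, of which the first
    on_die_tokens are served from the on-die buffer.
    """
    return sum(max(0, t - on_die_tokens) for t in range(1, seq_len + 1))


def reduction_fraction(seq_len, on_die_tokens):
    """Fraction of off-chip KV reads removed by the on-die buffer."""
    baseline = external_kv_reads(seq_len)
    if baseline == 0:
        return 0.0
    return 1 - external_kv_reads(seq_len, on_die_tokens) / baseline


def reduction_closed_form(seq_len, on_die_tokens):
    """Closed form of the same model, valid for 0 <= on_die_tokens <= seq_len."""
    rest = seq_len - on_die_tokens
    return 1 - (rest * (rest + 1)) / (seq_len * (seq_len + 1))
```

Under these assumptions, a 128-token decode with the 32 earliest tokens held on-die gives `reduction_fraction(128, 32)` of about 0.436. The benefit is larger than the buffered share of the sequence (25% here) because the earliest tokens are re-read at every later step.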

5. Real-World Workloads and System-Level Case Studies

In-DRAM BitROM function was validated on multiple data-intensive applications (Seshadri et al., 2016):

Bitmap indices: End-to-end query time reduced by […] relative to a SIMD-optimized CPU baseline.
BitWeaving-V column scan: Scan time reduced by […] on average ([…] to […] across […]- to […]-bit fields).
Bit-vector set operations versus RB-trees: For […] sets, bitset operation time is […] to […] of the RB-tree baseline, giving a […] speedup for […].
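The bitmap-index and set-operation workloads above reduce to bulk AND/OR over long bit-vectors, which is the operation an in-DRAM design would execute row-wide. A minimal software sketch of that reduction, using Python integers as bitmaps and hypothetical helper names and data, is:

```python
def build_bitmap(member_ids):
    """Encode a set of non-negative integer IDs as a bitmap."""
    bitmap = 0
    for i in member_ids:
        bitmap |= 1 << i
    return bitmap


def intersect_all(bitmaps):
    """Bulk AND across bitmaps (IDs present in every set)."""
    result = bitmaps[0]
    for b in bitmaps[1:]:
        result &= b
    return result


def union_all(bitmaps):
    """Bulk OR across bitmaps (IDs present in at least one set)."""
    result = 0
    for b in bitmaps:
        result |= b
    return result


def members(bitmap):
    """Decode a bitmap back into a sorted list of IDs."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical example: user IDs active in each of three weeks.
weeks = [build_bitmap(s) for s in ({0, 2, 5, 7}, {2, 3, 5}, {1, 2, 5, 7})]
active_every_week = members(intersect_all(weeks))
```

Here each `&=` or `|=` stands in for one row-wide in-memory operation; on a CPU the same work is limited by how fast the bitmaps can be streamed through the memory hierarchy.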

For CiROM BitROM accelerators (Zhang et al., 10 Sep 2025):

Billion-parameter LLM inference: Direct weight mapping at 1.58 b quantization enables edge inference within single-SoC area/power envelope.
Zero-shot/transfer learning: Hardware-integrated LoRA adapters (rank 16, […]-bit weights, […] overhead) allow flexible adaptation across tasks with minimal resource impact; observed EM/F1 improvement of up to […] points.
External access minimization: On-die KV-cache enables up to […] off-chip DRAM access reduction for typical conversational context lengths […], accelerating decode throughput and reducing energy for edge devices.
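The LoRA arrangement mentioned above can be sketched as a frozen ternary base matrix plus a low-rank correction, $y = Wx + B(Ax)$, which is the standard LoRA form. The code below is a plain-Python illustration with tiny, hypothetical matrices and rank 1 rather than the rank 16 cited above; it omits the usual LoRA scaling factor and says nothing about how the adapters are quantized or placed in hardware.

```python
def matvec(matrix, vector):
    """Dense matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(w_rom, lora_a, lora_b, x):
    """Frozen base output plus low-rank update: W x + B (A x).

    w_rom  : m x n ternary matrix, fixed (stands in for ROM contents)
    lora_a : r x n adapter matrix, writable
    lora_b : m x r adapter matrix, writable
    """
    base = matvec(w_rom, x)
    delta = matvec(lora_b, matvec(lora_a, x))
    return [b + d for b, d in zip(base, delta)]


# Hypothetical example values.
w_rom = [[1, 0, -1],
         [0, 1, 1]]
lora_a = [[1, 1, 0]]      # rank r = 1
lora_b = [[2], [-1]]
y = lora_forward(w_rom, lora_a, lora_b, [2, 3, 4])
```

Only the two small adapter matrices need writable storage, which is why task adaptation remains possible even though the base weights are fixed at fabrication time.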

6. Generalization, Adaptations, and Limitations

BitROM principles extend beyond DRAM to any passive storage array sharing a small number of amplifiers with multi-row charge sharing and limited logic periphery. Potential adaptations include:

3D-stacked DRAM with near-memory processing: Integrating the “bitwise-control” logic layer underneath the main array.
Non-volatile analog memories: Technologies such as ReRAM or PCM could exploit an analogous triple-cell majority operation for logic.
Wider multi-bit cells: Multi-row activation generalizes to multi-bit-per-row logic, subject to analog margin constraints.

Limitations:

Granularity: Operations occur row-wide (kilobytes), unsuited for fine-grained bitwise tasks.
Data-type inflexibility: CiROM macros are dedicated to ternary weights; higher-precision requires array/circuit redesign.
Reserved resource cost: Control rows reduce effective array utilization; dynamic allocation may alleviate this.
ECC, refresh, and temperature: Integration with error-correcting codes, refresh scheduling, and thermal variation remains an open circuit validation challenge.

7. Impact and Deployment Context

BitROM accelerators have established new computational paradigms for memory-centric workloads:

Enable up to […] bulk bitwise throughput and […] DRAM-side energy efficiency improvements with […] area overhead (Seshadri et al., 2016).
Make billion-parameter LLM inference feasible on compact (<1 cm³) edge devices by leveraging 1.58-b quantization, ternary compute, and near-memory caching (Zhang et al., 10 Sep 2025).
Provide hardware-based support for rapid, low-bandwidth transfer learning via LoRA, suitable for federated models and adaptive edge intelligence.
Serve as a template for in-memory acceleration in other dense data domains, motivating further research into analog-centric architectures with minimal periphery.

The BitROM Accelerator exemplifies the power of memory-logic co-design, leveraging physical effects for computational acceleration without compromising density or efficiency, and opening new application domains in edge inference, data analytics, and memory-constrained computing.

References

Seshadri et al. (2016). Buddy-RAM: Improving the Performance and Efficiency of Bulk Bitwise Operations Using DRAM.

Zhang et al. (10 Sep 2025). BitROM: Weight Reload-Free CiROM Architecture Towards Billion-Parameter 1.58-bit LLM Inference.