RowClone: In-DRAM Bulk Data Copy and Zeroing

Updated 29 January 2026

RowClone is a processing-in-memory primitive that performs bulk copy and zeroing operations entirely within DRAM, bypassing CPU intervention.
It employs Fast Parallel Mode and Pipelined Serial Mode to deliver 12×–100× speedup and up to 74× energy reduction for 4 KB transfers.
Minimal DRAM and controller changes enable RowClone to enhance OS tasks, database operations, and cache optimizations in modern systems.

RowClone is a processing-in-memory (PIM) primitive that performs bulk data copy and initialization entirely within commodity DRAM chips, bypassing the traditional CPU-mediated path. By exploiting the internal organization of DRAM subarrays and the analog behavior of sense amplifiers, RowClone uses two complementary mechanisms to substantially reduce the latency and energy of page-scale copy and zero operations. Experimental and simulation results across multiple platforms report 12×–100× speedup and up to 74× lower energy for 4 KB copies, and several research groups have detailed practical system integration techniques (Seshadri et al., 2016, Seshadri, 2016, Seshadri et al., 2018, Olgun et al., 2022, Mutlu et al., 2019).

1. Motivation and Background

Bulk data movement—particularly page-scale copy (e.g., copy-on-write, fork, checkpointing) and zero-initialization (secure deallocation, allocation path hygiene)—accounts for a dominant fraction of execution time and DRAM energy in modern computing platforms. In conventional systems, these operations are orchestrated by issuing per-cache-line reads and writes over the off-chip memory bus, incurring significant latency and energy cost due to repeated data transfers traversing the channel. For instance, copying a 4 KB page in DDR4-2133 can require ~500 ns and tens to hundreds of μJ channel energy (Seshadri et al., 2016). RowClone’s approach exploits the internal DRAM hierarchy (cell, sense-amp, subarray, bank) to isolate all data movement within the chip, eliminating round-trip inefficiency on the memory channel and offering orders-of-magnitude improvement in throughput and energy utilization (Seshadri et al., 2018).

2. DRAM Architectural Principles Underlying RowClone

Commodity DRAM is organized hierarchically. At the cell level, each bit is stored as charge on a capacitor accessed through a transistor. Each column connects to a cross-coupled inverter pair that forms a sense amplifier, and the sense amplifiers of a subarray together act as its row buffer. Multiple rows are grouped into a subarray that shares one row of sense amplifiers, and a bank aggregates subarrays behind a command decoder and shared timing constraints (parameters $t_{\mathrm{RCD}}$, $t_{\mathrm{RAS}}$, $t_{\mathrm{RP}}$) (Seshadri et al., 2016, Seshadri et al., 2018). The memory controller orchestrates data movement by issuing PRECHARGE, ACTIVATE, READ, and WRITE commands. Standard operations shuttle each cache line (64 B) to or from the off-chip bus on DDR modules, whereas RowClone exploits the physical locality and analog behavior of the sense amplifiers and operates only within the chip (Seshadri, 2016).

3. Operational Modes: Fast Parallel Mode and Pipelined Serial Mode

RowClone implements two modes to cover the dominant bulk-movement use cases:

Fast Parallel Mode (FPM): Operates across two rows in the same subarray, exploiting shared sense-amplifier resources. The copy sequence involves:

ACTIVATE the source row (loads source data into the sense amps);
ACTIVATE the destination row (without intervening PRECHARGE), connecting destination cells to the sense amps—this overwrites the destination with the source, since cells cannot override sense amps;
PRECHARGE closes the row buffer and finalizes the transfer (Seshadri et al., 2016, Olgun et al., 2022, Seshadri et al., 2018).

For a 4 KB copy, FPM achieves latencies of 50–90 ns ( $t_{\mathrm{RAS}} + t_{\mathrm{RP}}$ ). No data traverses the memory channel, vastly reducing energy (0.04–0.05 μJ per copy, ~74× less than conventional memcpy).
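The three-command FPM sequence can be illustrated with a small functional model. The sketch below is a toy Python simulation rather than a circuit-accurate one: the Subarray class, its method names, the 8-row geometry, and the choice of row 0 as a reserved zero row are all assumptions made for illustration. It captures only the rule stated above, that a row activated while the sense amplifiers already hold data is overwritten by that data.

```python
class Subarray:
    """Toy functional model of one DRAM subarray (hypothetical, for illustration)."""

    def __init__(self, num_rows, row_bytes):
        self.rows = [bytes(row_bytes) for _ in range(num_rows)]  # all-zero rows
        self.row_buffer = None  # None models the precharged state

    def activate(self, row):
        if self.row_buffer is None:
            # Precharged: the cells of this row drive the sense amplifiers.
            self.row_buffer = self.rows[row]
        else:
            # Sense amplifiers already latched: they overwrite the new row's cells.
            self.rows[row] = self.row_buffer

    def precharge(self):
        # Close the row buffer; the cell contents are retained.
        self.row_buffer = None


def fpm_copy(subarray, src, dst):
    """ACTIVATE src, ACTIVATE dst (no intervening PRECHARGE), then PRECHARGE."""
    subarray.activate(src)
    subarray.activate(dst)
    subarray.precharge()


def fpm_zero(subarray, zero_row, dst):
    """Zero a row by copying from a reserved all-zero row in the same subarray."""
    fpm_copy(subarray, zero_row, dst)


sa = Subarray(num_rows=8, row_bytes=4096)
sa.rows[2] = bytes([0xAB]) * 4096       # source data
fpm_copy(sa, src=2, dst=5)
copied = sa.rows[5]                     # now equal to row 2
fpm_zero(sa, zero_row=0, dst=5)         # row 0 is assumed to be kept all-zero
```

No per-cache-line loop appears in fpm_copy: in this model, as in the mechanism it mimics, the whole row moves in one step through the shared sense amplifiers.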

Pipelined Serial Mode (PSM): Applied to copies between different banks or subarrays. After both the source and destination rows are activated, a sequence of internal transfers moves one 64 B cache line at a time over the on-chip bus. Each transfer is a back-to-back read (from the source) and write (to the destination), repeated across all columns of the row. Latency is roughly halved relative to baseline memcpy (510–540 ns versus ~1020–1046 ns), with energy reductions of ~3× (Seshadri et al., 2016, Seshadri et al., 2018).
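The serial nature of PSM can be sketched in the same spirit. The following Python fragment is a functional illustration only: psm_copy and its arguments are hypothetical names, the two byte buffers stand in for the activated source and destination rows, and no timing is modeled.

```python
CACHE_LINE = 64  # bytes moved per internal transfer, as described above


def psm_copy(src_row, dst_row):
    """Copy src_row into dst_row one cache line at a time.

    Returns the number of internal transfers performed. The buffers stand in
    for the row buffers of two different banks (an assumption of this sketch).
    """
    if len(src_row) != len(dst_row) or len(src_row) % CACHE_LINE != 0:
        raise ValueError("rows must have equal, cache-line-aligned sizes")
    transfers = 0
    for offset in range(0, len(src_row), CACHE_LINE):
        # One back-to-back internal read (source) and write (destination).
        dst_row[offset:offset + CACHE_LINE] = src_row[offset:offset + CACHE_LINE]
        transfers += 1
    return transfers


source = bytes(range(256)) * 16          # 4 KB of sample data
destination = bytearray(len(source))
num_transfers = psm_copy(source, destination)
```

For a 4 KB row this loop performs 4096 / 64 = 64 transfers, which is why PSM is slower than FPM even though no data leaves the chip.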

Mode	Applicability	Latency (4 KB copy)	Energy (4 KB copy)
FPM	Same subarray	50–90 ns	0.04–0.05 μJ
PSM	Inter-bank/subarray	510–540 ns	1.1 μJ
Baseline memcpy	Off-chip transfers	~1020–1046 ns	3.6 μJ

4. Implementation Requirements and Hardware/Software Changes

RowClone requires only modest modifications to the DRAM chip and memory controller (Seshadri, 2016, Mutlu et al., 2019, Olgun et al., 2022):

DRAM chip: Permit consecutive ACTIVATEs within a bank if targeting rows in the same subarray (enabled by tagging each row decoder with subarray ID). For PSM, a minimal add-on crossbar/latch connects source bank sense amps onto the global I/O bus and directly into the destination bank’s sense amps, all inside the die.
Memory controller: Flush and invalidate relevant cache lines before issuing RowClone commands; ensure FPM and PSM sequences are properly scheduled and tracked. No new pin-out, and only ACTIVATE/READ/WRITE/PRECHARGE DRAM channel commands are used.
Operating system (OS) and hypervisor: Page allocators must become subarray-aware to maximize FPM usage—i.e., allocate destination pages in the same subarray as source. Reserve “zero rows” per subarray to enable fast zeroing.
Instruction set architecture (ISA): New memcopy(src, dst, len) and meminit(dst, len, value) instructions can opportunistically trigger RowClone operations for aligned, page-sized primitives (Seshadri et al., 2018).
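How a controller or runtime might choose between the two modes can be sketched as a small dispatch function. This is a minimal illustration under stated assumptions: RowAddr, Mode, and choose_mode are hypothetical names, the (bank, subarray, row) decomposition of a physical address is assumed to be available, and anything other than an aligned full-row request is sent to the ordinary CPU path for simplicity.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    FPM = "fast parallel mode"
    PSM = "pipelined serial mode"
    CPU = "conventional memcpy"


@dataclass(frozen=True)
class RowAddr:
    """Hypothetical decoded DRAM location of a buffer."""
    bank: int
    subarray: int
    row: int
    offset: int = 0  # byte offset within the row


def choose_mode(src, dst, length, row_bytes=4096):
    """Pick a copy mechanism for one request (a sketch, not a specification)."""
    full_row = src.offset == 0 and dst.offset == 0 and length == row_bytes
    if not full_row:
        return Mode.CPU   # partial or misaligned request: fall back
    if (src.bank, src.subarray) == (dst.bank, dst.subarray):
        return Mode.FPM   # rows share sense amplifiers
    return Mode.PSM       # different bank or subarray


same_subarray = choose_mode(RowAddr(0, 3, 10), RowAddr(0, 3, 42), 4096)
other_bank = choose_mode(RowAddr(0, 3, 10), RowAddr(1, 7, 42), 4096)
partial = choose_mode(RowAddr(0, 3, 10, offset=128), RowAddr(0, 3, 42), 512)
```

A real implementation would also have to perform the cache flush and invalidation noted under the memory controller item before issuing either in-DRAM mode.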

Hardware prototypes (e.g., the PiDRAM FPGA prototype built around Rocket Chip) show that RowClone can be implemented end to end with only 198 lines of Verilog and 565 lines of C++, indicating low integration complexity (Olgun et al., 2022).

5. Quantitative Performance and Energy Analysis

Performance modeling and cycle-accurate simulation results from several works provide concrete metrics. For a 4 KB copy:

Baseline memcpy: 1020–1046 ns, 3.6 μJ.
RowClone-FPM: 85–90 ns, 0.04–0.05 μJ (11.6×–12× speedup, 74×–74.4× energy reduction).
RowClone-PSM (inter-bank): 510–540 ns, 1.1 μJ (roughly 2× speedup, 3.2× energy reduction).
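The speedup figures follow directly from the latencies listed above. The short Python check below reproduces them; pairing the lower bound of each range with the lower bound of the baseline (and likewise for the upper bounds) is an assumption of this sketch. Energy ratios are omitted because the listed μJ values are rounded, so ratios recomputed from them would only approximate the quoted reductions.

```python
# Latency ranges in nanoseconds for a 4 KB copy, taken from the list above.
baseline_ns = (1020, 1046)
fpm_ns = (85, 90)
psm_ns = (510, 540)


def speedups(baseline, mode):
    """Speedup for each (baseline, mode) pair, rounded to one decimal place."""
    return [round(b / m, 1) for b, m in zip(baseline, mode)]


fpm_speedup = speedups(baseline_ns, fpm_ns)  # [12.0, 11.6]
psm_speedup = speedups(baseline_ns, psm_ns)  # [2.0, 1.9]
```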

End-to-end system improvements include:

2× instruction-per-cycle (IPC) uplift in OS fork-heavy workloads (S=64 MB, N=1 K pages).
DRAM energy reduction up to 80% for copy/zero-intensive phases.
Weighted throughput increases up to 27% in multicore workloads, with corresponding drops in memory-energy per instruction and bandwidth/instruction (−28%) (Seshadri, 2016, Seshadri et al., 2018, Seshadri et al., 2016).

End-to-end integration on the PiDRAM platform demonstrates up to 118.5× speedup in the ideal case (the source is already coherent in DRAM), and 14.6× and 12.6× improvement for copy and zeroing, respectively, in realistic software stacks (Olgun et al., 2022).

6. Application Domains and System-Level Impact

RowClone fundamentally benefits any workload dominated by bulk memory movement:

Operating system primitives: Fork (copy-on-write), secure page deallocation (zeroing), process migration, checkpointing, virtualization snapshotting, live migration—all see order-of-magnitude latency and energy reductions.
Database engines: Bulk bitvector operations (bitmap AND/OR, e.g., FastBit indexing) see ~30% query speedup when the accompanying IDAO (in-DRAM AND/OR) substrate is employed (Seshadri et al., 2016).
Page migration and NUMA traffic management: PSM enables rapid, intra-DRAM page shuffling to resolve bank conflicts.
GPU–CPU and accelerator–host data sharing: Buffers migrate without occupying channel bandwidth; in-place shared-memory moves proceed near the bandwidth limit of the internal DRAM pathways.
General library implementations: libc memcpy/calloc retargeting to RowClone for aligned, page-sized regions.
Cache & coherence optimizations: RowClone-ZI handles in-cache copy and clean-zero insertion, amortizing misses (Seshadri et al., 2018).

7. Limitations, Trade-offs, and Future Directions

RowClone’s FPM requires the source and destination rows to reside in the same subarray. OS page allocators and copy-on-write mechanisms must therefore provide subarray locality, which complicates memory management under fragmentation and heavy load (Seshadri et al., 2016, Mutlu et al., 2019, Olgun et al., 2022). Operations are restricted to atomic full-row (e.g., 4 KB) copies. Partial-row transfers or arbitrarily placed pages must fall back to PSM (at higher cost) or to a standard CPU memcpy.
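The allocation constraint can be made concrete with a toy subarray-aware allocator. The sketch below is an assumption-laden illustration, not an OS design: SubarrayAwareAllocator and alloc_near are hypothetical names, frames are plain integers, and the frame-to-subarray mapping is assumed to be known to the allocator.

```python
from collections import defaultdict


class SubarrayAwareAllocator:
    """Toy page-frame allocator that prefers the source frame's subarray."""

    def __init__(self, subarray_of, free_frames):
        self.subarray_of = dict(subarray_of)   # frame -> subarray id (assumed known)
        self.free = defaultdict(list)          # subarray id -> free frames
        for frame in free_frames:
            self.free[self.subarray_of[frame]].append(frame)

    def alloc_near(self, src_frame):
        """Return (frame, same_subarray) for a copy whose source is src_frame."""
        preferred = self.subarray_of[src_frame]
        if self.free[preferred]:
            return self.free[preferred].pop(), True    # FPM is possible
        for frames in self.free.values():
            if frames:
                return frames.pop(), False             # copy must use PSM
        raise MemoryError("no free frames")


# Frames 0 and 1 live in subarray 0, frames 2 and 3 in subarray 1.
allocator = SubarrayAwareAllocator({0: 0, 1: 0, 2: 1, 3: 1}, free_frames=[1, 2, 3])
first, first_same = allocator.alloc_near(0)    # a free frame exists in subarray 0
second, second_same = allocator.alloc_near(0)  # subarray 0 is now exhausted
```

The second allocation shows the fragmentation problem in miniature: once the preferred subarray has no free frames, the copy loses FPM eligibility.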

RowClone relies on relaxing DRAM timing rules (back-to-back ACTIVATEs without an intervening PRECHARGE), which makes its reliability sensitive to process, voltage, and temperature variation, so comprehensive empirical validation is required. Memory controller schedulers must also preserve cache coherence by flushing dirty source lines and invalidating cached destination lines, and the subarray-aware page allocation constraints add OS complexity.
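The coherence requirement amounts to two actions before an in-DRAM copy is issued: write back dirty cached lines of the source, and invalidate cached lines of the destination. The Python sketch below models this with dictionaries; prepare_for_rowclone, the cache layout (line address mapped to a data and dirty-bit pair), and the dram dictionary are hypothetical simplifications for illustration.

```python
LINE = 64  # cache line size in bytes


def prepare_for_rowclone(cache, dram, src, dst, length):
    """Make DRAM hold the latest source data and drop stale destination lines.

    cache maps a line address to a (data, dirty) pair; dram maps a line
    address to data. Both are toy stand-ins for real hardware structures.
    """
    for offset in range(0, length, LINE):
        src_line = src + offset
        if src_line in cache and cache[src_line][1]:
            data, _ = cache[src_line]
            dram[src_line] = data              # write back the dirty source line
            cache[src_line] = (data, False)    # the line is now clean
        cache.pop(dst + offset, None)          # invalidate the destination line


cache = {0: (b"new", True), 4096: (b"stale", False)}
dram = {0: b"old"}
prepare_for_rowclone(cache, dram, src=0, dst=4096, length=4096)
```

After this step the in-DRAM copy sees the up-to-date source, and later reads of the destination miss in the cache and fetch the copied data from DRAM.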

A plausible implication is that future DRAM designs should offer adaptive mechanisms for finer-grained timing control, subarray-to-subarray communication, and broader in-DRAM primitive support (e.g., bitwise operations, scatter/gather) (Olgun et al., 2022, Seshadri et al., 2016). Mechanisms extending RowClone beyond DRAM (3D-stacked memory, logic-in-memory) remain under exploration (Mutlu et al., 2019).

RowClone is a minimal yet highly effective architectural extension to conventional DRAM. It builds in-memory data movement primitives from existing physical structures and accelerates system software and application workloads by moving data where it already resides.