FIGCache: In-DRAM Caching via FIGARO

Updated 18 March 2026

FIGCache is a fully in-DRAM caching mechanism that uses FIGARO to relocate data within a DRAM bank at fine granularity (one 64 B block per operation) and with latency independent of the distance between subarrays.
It incorporates a small fast region per bank and a fully-associative tag store to optimize cache hit rates, achieving a 16.3% speedup and 7.8% DRAM energy reduction in evaluated eight-core workloads.
The design incurs minimal area and power overhead while significantly increasing DRAM row-buffer hit rates, making it a practical solution for high-bandwidth, memory-intensive systems.

FIGCache is a fully in-DRAM caching mechanism implemented atop the FIGARO substrate, which enables fine-grained, distance-independent data relocation between subarrays within a DRAM bank. FIGCache caches only frequently accessed segments of DRAM rows, termed row-segments, improving cache utilization and DRAM row-buffer hit rates while minimizing relocation and area overheads. Evaluated on DDR4-800 systems, FIGCache demonstrates a 16.3% average weighted speedup and a 7.8% reduction in total DRAM energy for eight-core memory-intensive workloads compared to conventional DRAM systems without in-DRAM caching (Wang et al., 2020).

1. FIGARO: Enabling Subarray-Level Fine-Grained Relocation

FIGCache relies on FIGARO, a substrate that builds on the existing DRAM bank architecture to provide efficient intra-bank data movement. A bank is composed of multiple subarrays, each with a local row buffer (LRB) whose sense amplifiers each connect to 512–2048 cells. The subarrays within a bank share a global row buffer (GRB), usually 64 bits wide. FIGARO requires only two minor logic additions per subarray: a second latch on the local row decoder, which allows the source and destination subarrays to be activated together, and a 2-to-1 column decoder multiplexer, which selects a different column address for each.

With these additions, the memory controller uses a new command, RELOC(src col, dst subarray & col), to execute block relocations via the following steps:

Activate the source row (ACT_S) in subarray S.
Use the GRB to transfer a single 64 B column from S to D, enabling D’s sense amplifiers to capture the bits (RELOC).
Activate the destination row (ACT_D) to commit the data.
PRECHARGE the bank.

The RELOC operation’s latency is fixed at 1 ns, owing to the low parasitic capacitance of the GRB, irrespective of source-destination subarray distance. Total one-block relocation latency is

$t_{relocate} = t_{ACT\_S} + t_{RELOC} + t_{ACT\_D} + t_{PRE}$

with typical DDR4 parameters $t_{ACT} \approx 35\,\text{ns}$, $t_{PRE} \approx 13\,\text{ns}$, and $t_{RELOC} = 1\,\text{ns}$, yielding $t_{relocate} \approx 35 + 1 + 35 + 13 = 84\,\text{ns}$ worst-case. This contrasts with prior bulk-move substrates, where relocation latency scales linearly with distance (Wang et al., 2020).
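The latency sum above can be checked with a short Python sketch. The function name `relocate_latency_ns` is hypothetical, and the multi-block case rests on an assumption the text does not state: that blocks sharing one source row and one destination row each add only one extra RELOC, so the two activations and the precharge are paid once.

```python
# Approximate DDR4 timing values quoted in the text, in nanoseconds.
T_ACT_NS = 35.0    # row activation (used for both source and destination rows)
T_RELOC_NS = 1.0   # one RELOC transfer through the global row buffer
T_PRE_NS = 13.0    # bank precharge


def relocate_latency_ns(num_blocks: int = 1) -> float:
    """Latency of relocating `num_blocks` 64 B blocks in one sequence.

    Assumption (not from the text): all blocks share one source row and
    one destination row, so only the RELOC term scales with block count.
    """
    if num_blocks < 1:
        raise ValueError("num_blocks must be at least 1")
    return T_ACT_NS + num_blocks * T_RELOC_NS + T_ACT_NS + T_PRE_NS


if __name__ == "__main__":
    print(relocate_latency_ns(1))   # single block: 35 + 1 + 35 + 13
    print(relocate_latency_ns(16))  # sixteen blocks under the sharing assumption
```

Under that assumption the fixed cost of 83 ns dominates, which is why relocating several blocks in one sequence is far cheaper per block than relocating them one at a time.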

2. Architecture and Data Organization

FIGCache designates a small “fast” region in each bank, instantiated by either deploying short-bitline subarrays or reserving rows within an existing subarray. Each fast region accommodates $N$ cache-rows (default $N=64$ per bank), each logically divided into $M$ row-segments (default $M=8$, each segment holding 16 × 64 B = 1 KiB). Row-segments move between the slow and fast regions one 64 B block at a time using FIGARO RELOC operations, incurring no off-chip traffic.

Address decomposition for a CPU request occurs as:

$|\text{chan}|\,|\text{rank}|\,|\text{bank}|\,|\text{row}|\,|\text{segID}|\,|\text{offset}|$

where

$\text{segID} = \left\lfloor \frac{\text{byte\_offset} \bmod \text{RowSize}}{\text{SegSize}} \right\rfloor$
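As a minimal sketch of the segID formula, the following Python computes the segment index and the block position inside that segment. The 8 KiB row size is derived from the defaults in the text (8 segments of 1 KiB); the function names are hypothetical.

```python
SEG_SIZE = 1024            # 1 KiB per row-segment (16 x 64 B blocks)
SEGS_PER_ROW = 8           # default M
ROW_SIZE = SEGS_PER_ROW * SEG_SIZE  # 8 KiB, derived from the defaults above
BLOCK_SIZE = 64            # one 64 B block per RELOC


def seg_id(byte_offset: int) -> int:
    """segID = floor((byte_offset mod RowSize) / SegSize)."""
    return (byte_offset % ROW_SIZE) // SEG_SIZE


def block_in_segment(byte_offset: int) -> int:
    """Index (0..15) of the 64 B block within its row-segment."""
    return (byte_offset % SEG_SIZE) // BLOCK_SIZE


if __name__ == "__main__":
    offset = 3 * SEG_SIZE + 5 * BLOCK_SIZE + 7
    print(seg_id(offset), block_in_segment(offset))
```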

A fully-associative tag store (FTS), with 512 entries per bank in the memory controller, maintains metadata for each cached row-segment: valid bit (V), dirty bit (D), tag (row, segID, ≈19 bits), and a 5-bit benefit counter.
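The per-entry metadata can be tallied with a small Python sketch. The field widths come from the text (the tag width is approximate there); the class and function names are hypothetical, and the dataclass is only an illustration of the fields, not a hardware layout.

```python
from dataclasses import dataclass

VALID_BITS = 1
DIRTY_BITS = 1
TAG_BITS = 19        # (row, segID); the text gives this as approximately 19 bits
BENEFIT_BITS = 5
ENTRY_BITS = VALID_BITS + DIRTY_BITS + TAG_BITS + BENEFIT_BITS

ENTRIES_PER_BANK = 512  # 64 cache-rows x 8 row-segments


@dataclass
class FTSEntry:
    """Illustrative view of one tag-store entry."""
    valid: bool = False
    dirty: bool = False
    row: int = 0
    seg_id: int = 0
    benefit: int = 0


def fts_bytes_per_bank() -> int:
    """Storage for one bank's tag store, in bytes."""
    return ENTRIES_PER_BANK * ENTRY_BITS // 8


if __name__ == "__main__":
    print(ENTRY_BITS, fts_bytes_per_bank())
```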

3. Caching Policy and Replacement

Upon memory access, the controller checks FTS[bank] for a tag match with valid bit set:

On a hit, the request is routed to the fast region.
On a miss, the slow region services the request, then back-to-back RELOC operations (up to 16, one per 64 B block of the segment) move the requested row-segment into the fast region. FTS metadata is updated accordingly ($V=1$, $D=0$, benefit=1).

Data moves in 64 B blocks (one per RELOC) and is cached in 1 KiB row-segments. On every hit, $benefit[c] \leftarrow \min(benefit[c] + 1, 31)$, where 31 is the largest value the 5-bit counter can hold. Write operations set $D \leftarrow 1$. If the fast region is full, FIGCache applies a row-granularity "RowBenefit" eviction: it selects the fast-region row with the smallest sum of benefit[seg] values, then within that row evicts the segment with the lowest benefit. This grouping of temporally correlated row-segments within a cache row increases fast-region row buffer hit rate by approximately 20% compared to previous approaches.
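The hit update and the RowBenefit victim choice can be sketched in Python as follows. This is an illustration under stated assumptions, not the controller's actual logic: `Slot`, `on_hit`, and `pick_victim` are hypothetical names, the saturation value is taken as the maximum of the 5-bit counter described for the tag store, and ties are broken by lowest index, which the text does not specify.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

BENEFIT_BITS = 5
BENEFIT_MAX = (1 << BENEFIT_BITS) - 1  # saturation value of a 5-bit counter


@dataclass
class Slot:
    """One row-segment slot in the fast region (illustrative)."""
    valid: bool = False
    dirty: bool = False
    tag: Optional[Tuple[int, int]] = None  # (row, segID)
    benefit: int = 0


def on_hit(slot: Slot, is_write: bool) -> None:
    """Saturating benefit increment; writes mark the slot dirty."""
    slot.benefit = min(slot.benefit + 1, BENEFIT_MAX)
    if is_write:
        slot.dirty = True


def pick_victim(rows: List[List[Slot]]) -> Tuple[int, int]:
    """Return (row index, slot index) of the segment to evict.

    Assumes the fast region is full, i.e. every slot is valid.
    Ties go to the lowest index (an assumption, not from the text).
    """
    row_idx = min(range(len(rows)),
                  key=lambda r: sum(s.benefit for s in rows[r]))
    victim_row = rows[row_idx]
    slot_idx = min(range(len(victim_row)),
                   key=lambda i: victim_row[i].benefit)
    return row_idx, slot_idx


if __name__ == "__main__":
    rows = [
        [Slot(True, False, (10, 0), 5), Slot(True, False, (11, 3), 9)],
        [Slot(True, True, (12, 1), 3), Slot(True, False, (13, 7), 2)],
    ]
    print(pick_victim(rows))
```

If the chosen slot is dirty, its contents would need to be written back to the slow region before the slot is reused.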

4. Address Mapping, Access Path, and Operation

The access path in FIGCache proceeds as follows:

Parse $(\text{chan}, \text{rank}, \text{bank}, \text{row}, \text{segID}, \text{off})$ from the physical address.
For each $e$ in FTS[bank]:
- If $e.V$ and $e.\text{tag} == (\text{row}, \text{segID})$:
  - Compute $addr_{fast} = (\text{chan}, \text{rank}, \text{bank}, \text{fast\_base\_row} + e.\text{index}, \text{off})$
  - Issue DRAM read to $addr_{fast}$ (cache hit)
If no match:
- Compute $addr_{slow} = (\text{chan}, \text{rank}, \text{bank}, \text{row}, \text{off})$
- Perform DRAM read from $addr_{slow}$ (cache miss)
- Move row-segment from slow to fast region
- Update FTS entry accordingly

Evictions write back dirty data using RELOC in reverse.
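The hit and miss routing above can be sketched in Python. This is a simplified model under assumptions: a dictionary keyed by (row, segID) stands in for the fully-associative search of one bank's FTS and holds only valid entries, `FAST_BASE_ROW` is a hypothetical placeholder value, and the fill and eviction steps that follow a miss are left out.

```python
from typing import Dict, Tuple

FAST_BASE_ROW = 4096  # hypothetical first row of the fast region


def route(fts_bank: Dict[Tuple[int, int], int],
          row: int, seg_id: int, off: int) -> Tuple[str, int, int]:
    """Return (outcome, DRAM row to access, offset) for one request.

    `fts_bank` maps (row, segID) to the index of the matching valid entry.
    """
    index = fts_bank.get((row, seg_id))
    if index is not None:
        # Cache hit: redirect to the fast region.
        return ("hit", FAST_BASE_ROW + index, off)
    # Cache miss: serve from the original (slow) row.
    return ("miss", row, off)


if __name__ == "__main__":
    fts = {(7, 3): 5}
    print(route(fts, 7, 3, 64))
    print(route(fts, 7, 4, 64))
```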

5. Performance and Energy Impact

Empirical evaluation with DDR4-800 MT/s (1 channel, 16 banks, 64 subarrays/bank), using an eight-core system and 20 multiprogrammed workloads, yields:

Average weighted speedup: 16.3% over baseline (8-core, 100% memory-intensive workloads)
DRAM energy reduction: 7.8% (combined static + dynamic)
In-DRAM cache hit rate: approximately 75–80%
System row buffer hit rate: increased by approximately 18%

Average DRAM latency is governed by:

$L_{avg} = H_{fc} \cdot L_{fc} + (1 - H_{fc}) \cdot L_{slow}$

where $H_{fc}$ is the fast-cache hit rate, $L_{fc} \approx t_{RCD\_fast} + t_{CL\_fast}$, and $L_{slow} \approx t_{RCD\_slow} + t_{CL\_slow}$.

Energy savings are calculated as $E_{savings} = E_{base} - E_{FIGCache}$ (Wang et al., 2020).
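The average-latency expression can be evaluated with a few lines of Python. The function name is hypothetical, and the fast and slow latencies in the usage example are placeholder values chosen only to show the arithmetic. They are not measurements from the evaluation. The 0.75 hit rate is the lower end of the range reported above.

```python
def average_latency(hit_rate: float, l_fast: float, l_slow: float) -> float:
    """L_avg = H_fc * L_fc + (1 - H_fc) * L_slow."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * l_fast + (1.0 - hit_rate) * l_slow


if __name__ == "__main__":
    # Placeholder latencies in ns (assumed for illustration only).
    print(average_latency(0.75, l_fast=20.0, l_slow=30.0))
```

The expression is linear in the hit rate, so each additional point of hit rate recovers the same fraction of the gap between the slow and fast latencies.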

6. Area, Power Overheads, and Trade-offs

FIGARO logic per DRAM chip contributes under 0.3% area overhead; two fast subarrays add 0.7%. Alternatively, reserving rows in an existing subarray incurs 0.2% area cost.
Memory controller FTS: $512 \times 26$ bits per bank (1,664 B), or about 26 kB per channel across 16 banks, with a CACTI area of approximately 0.5 mm².
Controller FTS power: ≈0.2 mW; DRAM per-subarray overhead ≈10 μW.

Distance-independent, per-block relocation latency allows FIGCache to adapt rapidly to shifting hotspots, providing much of the performance benefit of low-latency DRAM at significantly lower area and power cost.

7. Context and Significance

FIGCache addresses two central inefficiencies in prior in-DRAM caches: coarse data relocation granularity (entire multi-kilobyte rows) and relocation latency scaling with inter-subarray distance. By introducing fine-grained (64 B) and distance-independent relocation via FIGARO, and by architecting a caching mechanism that selectively packs "hot" segments from diverse rows, FIGCache fundamentally increases effective DRAM cache utilization and hit rates. Its implementation requires modest DRAM and controller modifications, offering a practical path to realize low-latency, high-efficiency in-DRAM caching for modern high-bandwidth multicore systems (Wang et al., 2020).

References (1)

Wang et al. (2020). FIGARO: Improving System Performance via Fine-Grained In-DRAM Data Relocation and Caching.