FIGCache: In-DRAM Caching via FIGARO

Updated 18 March 2026
  • FIGCache is a fully in-DRAM caching mechanism that uses FIGARO to enable fine-grained, distance-independent relocation of 64B row-segments within DRAM banks.
  • It incorporates a small fast region per bank and a fully-associative tag store to optimize cache hit rates, achieving a 16.3% speedup and 7.8% DRAM energy reduction in evaluated eight-core workloads.
  • The design incurs minimal area and power overhead while significantly increasing DRAM row-buffer hit rates, making it a practical solution for high-bandwidth, memory-intensive systems.

FIGCache is a fully in-DRAM caching mechanism implemented atop the FIGARO substrate, which enables fine-grained, distance-independent data relocation between subarrays within a DRAM bank. Rather than caching whole DRAM rows, FIGCache caches only their frequently accessed portions, termed row-segments, which improves cache utilization and DRAM row-buffer hit rates while keeping relocation and area overheads low. Evaluated on DDR4-800 systems, FIGCache achieves a 16.3% average weighted speedup and a 7.8% reduction in total DRAM energy for eight-core memory-intensive workloads compared to conventional DRAM systems without in-DRAM caching (Wang et al., 2020).

1. FIGARO: Enabling Subarray-Level Fine-Grained Relocation

FIGCache relies on FIGARO, a substrate that leverages the existing DRAM bank architecture to provide efficient intra-bank data movement. A bank is composed of multiple subarrays, each of which connects 512–2048 cells to every sense amplifier of its local row buffer (LRB). The subarrays within a bank share a global row buffer (GRB), usually 64 bits wide. FIGARO requires only two minor logic additions per subarray: a second latch on the local row decoder, which allows the source and destination subarrays to be activated at the same time, and a 2-to-1 column decoder multiplexer, which lets each of the two subarrays select a different column address.

With these additions, the memory controller uses a new command, RELOC(src col, dst subarray & col), to execute block relocations via the following steps:

  1. Activate the source row (ACT_S) in subarray S.
  2. Use the GRB to transfer a single 64 B column from S to D, enabling D’s sense amplifiers to capture the bits (RELOC).
  3. Activate the destination row (ACT_D) to commit the data.
  4. PRECHARGE the bank.

The RELOC operation’s latency is fixed at 1 ns—due to low GRB parasitic capacitance—irrespective of source-destination subarray distance. Total one-block relocation latency is

$$t_{relocate} = t_{ACT\_S} + t_{RELOC} + t_{ACT\_D} + t_{PRE}$$

with typical DDR4 parameters $t_{ACT} \approx 35\,\text{ns}$, $t_{PRE} \approx 13\,\text{ns}$, and $t_{RELOC} = 1\,\text{ns}$, yielding $t_{relocate} \approx 84\,\text{ns}$ worst-case. This contrasts with prior bulk-move substrates, where relocation latency scales linearly with distance (Wang et al., 2020).
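The latency sum above can be checked with a few lines of Python. This is a minimal sketch using the approximate timing values quoted in the text; the function name `relocate_latency_ns` is hypothetical, and the multi-block case assumes that several back-to-back RELOC commands can share a single source activation, destination activation, and precharge.

```python
# Approximate DDR4 timings quoted in the text (nanoseconds).
T_ACT_NS = 35.0    # row activation, used for both ACT_S and ACT_D
T_PRE_NS = 13.0    # precharge
T_RELOC_NS = 1.0   # one RELOC of a single 64 B block


def relocate_latency_ns(num_blocks: int = 1) -> float:
    """Latency of relocating num_blocks 64 B blocks between two rows.

    Assumption: all blocks share one ACT_S, one ACT_D and one PRE,
    with the RELOC commands issued back to back in between.
    """
    if num_blocks < 1:
        raise ValueError("num_blocks must be at least 1")
    return T_ACT_NS + num_blocks * T_RELOC_NS + T_ACT_NS + T_PRE_NS


one_block = relocate_latency_ns(1)    # 35 + 1 + 35 + 13
sixteen_blocks = relocate_latency_ns(16)
print(one_block, sixteen_blocks)
```

Under these assumptions, the two activations dominate the total, so moving additional blocks in the same operation adds little latency.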

2. Architecture and Data Organization

FIGCache designates a small “fast” region in each bank, instantiated by either deploying short-bitline subarrays or reserving rows within an existing subarray. Each fast region accommodates $N$ cache-rows (default $N=64$ per bank), and each cache-row is logically divided into $M$ row-segments (default $M=8$, each holding 16 × 64 B = 1 KiB). Row-segments move between the slow and fast regions one 64 B block at a time using FIGARO RELOC operations, incurring no off-chip traffic.

Address decomposition for a CPU request occurs as:

$$|\text{chan}|\,|\text{rank}|\,|\text{bank}|\,|\text{row}|\,|\text{segID}|\,|\text{offset}|$$

where

$$\text{segID} = \left\lfloor \frac{\text{byte\_offset} \bmod \text{RowSize}}{\text{SegSize}} \right\rfloor$$
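The segID formula can be illustrated with a small Python sketch. The sizes below follow the defaults given in the text (1 KiB segments, 8 segments per row, hence an assumed 8 KiB row); the helper name `split_row_offset` is hypothetical, and the channel, rank, bank, and row fields are left out because their bit widths are not specified here.

```python
from typing import Tuple

SEG_SIZE = 16 * 64                   # 1 KiB row-segment (16 blocks of 64 B)
SEGS_PER_ROW = 8                     # default M from the text
ROW_SIZE = SEG_SIZE * SEGS_PER_ROW   # assumed 8 KiB DRAM row


def split_row_offset(byte_offset: int) -> Tuple[int, int]:
    """Return (segID, offset within the segment) for a byte offset."""
    in_row = byte_offset % ROW_SIZE
    seg_id = in_row // SEG_SIZE      # floor((byte_offset mod RowSize) / SegSize)
    offset = in_row % SEG_SIZE
    return seg_id, offset


print(split_row_offset(8192 + 3000))
```

For a byte offset of 8192 + 3000, the row-relative offset is 3000, which falls in segment 2 at offset 952.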

A fully-associative tag store (FTS), with 512 entries per bank in the memory controller, maintains metadata for each cached row-segment: valid bit (V), dirty bit (D), tag (row, segID, ≈19 bits), and a 5-bit benefit counter.
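The metadata fields listed above add up to 26 bits per entry (1 + 1 + 19 + 5), and 64 cache-rows of 8 segments give the 512 entries per bank. The sketch below packs one entry into an integer to make that arithmetic concrete. The bit layout and the names `pack_entry` and `unpack_entry` are assumptions for illustration only; the source does not specify how the fields are ordered.

```python
from typing import Tuple

TAG_BITS = 19        # (row, segID) tag, approximate width from the text
BENEFIT_BITS = 5     # benefit counter
ENTRY_BITS = 1 + 1 + TAG_BITS + BENEFIT_BITS      # valid + dirty + tag + benefit
ENTRIES_PER_BANK = 64 * 8                         # N cache-rows x M segments
FTS_BYTES_PER_BANK = ENTRIES_PER_BANK * ENTRY_BITS // 8


def pack_entry(valid: bool, dirty: bool, tag: int, benefit: int) -> int:
    """Pack one entry as [valid | dirty | tag | benefit] (assumed layout)."""
    if not 0 <= tag < (1 << TAG_BITS):
        raise ValueError("tag does not fit in TAG_BITS")
    if not 0 <= benefit < (1 << BENEFIT_BITS):
        raise ValueError("benefit does not fit in BENEFIT_BITS")
    return (int(valid) << 25) | (int(dirty) << 24) | (tag << BENEFIT_BITS) | benefit


def unpack_entry(word: int) -> Tuple[bool, bool, int, int]:
    """Inverse of pack_entry."""
    benefit = word & ((1 << BENEFIT_BITS) - 1)
    tag = (word >> BENEFIT_BITS) & ((1 << TAG_BITS) - 1)
    dirty = bool((word >> 24) & 1)
    valid = bool((word >> 25) & 1)
    return valid, dirty, tag, benefit


print(ENTRY_BITS, ENTRIES_PER_BANK, FTS_BYTES_PER_BANK)
```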

3. Caching Policy and Replacement

Upon memory access, the controller checks FTS[bank] for a tag match with valid bit set:

  • On a hit, the request is routed to the fast region.
  • On a miss, the slow region services the request, then up to 16 back-to-back RELOC operations (one per 64 B block) move the requested row-segment into the fast region. FTS metadata is updated accordingly ($V=1$, $D=0$, benefit $=1$).

Relocation operates at the granularity of 64 B blocks (one per RELOC), while caching is managed at the granularity of 1 KiB row-segments. On every hit to a cached segment $c$, $benefit[c] \leftarrow \min(benefit[c] + 1, 31)$, where 31 is the saturation point of the 5-bit counter. Write operations set $D \leftarrow 1$. If the fast region is full, FIGCache applies a row-granularity "RowBenefit" eviction: it selects the fast-region row with the smallest sum of benefit[seg] values, then within that row evicts the segment with the lowest benefit. This grouping of temporally correlated row-segments within a cache row increases the fast-region row buffer hit rate by approximately 20% compared to previous approaches.
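The victim selection described above can be sketched in Python as follows. This is an illustrative model, not the authors' implementation: the fast region is represented as a dictionary from cache-row index to a list of per-segment benefit values, `None` marks an empty slot, ties are broken by lowest index, and the counter is assumed to saturate at 31, the maximum of the 5-bit benefit counter held in the tag store. The names `on_hit` and `pick_victim` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

BENEFIT_MAX = 31  # assumed saturation value of a 5-bit counter


def on_hit(benefit: int) -> int:
    """Saturating increment applied on every fast-region hit."""
    return min(benefit + 1, BENEFIT_MAX)


def pick_victim(fast_rows: Dict[int, List[Optional[int]]]) -> Tuple[int, int]:
    """Return (cache_row, slot) of the row-segment to evict.

    Step 1: choose the cache row with the smallest summed benefit.
    Step 2: inside that row, choose the segment with the lowest benefit.
    """
    def row_benefit(row: int) -> int:
        return sum(b for b in fast_rows[row] if b is not None)

    occupied = [r for r, segs in fast_rows.items()
                if any(b is not None for b in segs)]
    if not occupied:
        raise ValueError("no cached row-segments to evict")
    victim_row = min(occupied, key=row_benefit)
    slots = [(b, i) for i, b in enumerate(fast_rows[victim_row])
             if b is not None]
    _, victim_slot = min(slots)
    return victim_row, victim_slot


rows = {0: [5, 3, 9], 1: [1, 2, 1], 2: [10, None, None]}
print(pick_victim(rows))
```

In the example, the row sums are 17, 4, and 10, so row 1 is chosen, and its first lowest-benefit segment (slot 0) is the victim.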

4. Address Mapping, Access Path, and Operation

The access path in FIGCache proceeds as follows:

  1. Parse $(\text{chan}, \text{rank}, \text{bank}, \text{row}, \text{segID}, \text{off})$ from the physical address.
  2. For each entry $e$ in FTS[bank]:
    • If $e.V$ is set and $e.\text{tag} = (\text{row}, \text{segID})$:
      • Compute $addr_{fast} = (\text{chan}, \text{rank}, \text{bank}, \text{fast\_base\_row} + e.\text{index}, \text{off})$
      • Issue DRAM read to $addr_{fast}$ (cache hit)
  3. If no match:
    • Compute $addr_{slow} = (\text{chan}, \text{rank}, \text{bank}, \text{row}, \text{off})$
    • Perform DRAM read from $addr_{slow}$ (cache miss)
    • Move row-segment from slow to fast region
    • Update FTS entry accordingly

Evictions write back dirty data using RELOC in reverse.
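The hit and miss paths above can be modeled with a toy per-bank tag store. The sketch below is a simplification under stated assumptions: a Python dictionary stands in for the fully-associative lookup, the class name `TagStore` and its fields are hypothetical, address translation is reduced to returning which region serves the request, and eviction is omitted, so a miss on a full store is served from the slow region without inserting anything.

```python
from typing import Dict, Tuple


class TagStore:
    """Toy model of one bank's FTS, keyed by (row, segID)."""

    def __init__(self, capacity: int = 512) -> None:
        self.capacity = capacity
        self.entries: Dict[Tuple[int, int], Dict[str, object]] = {}

    def access(self, row: int, seg_id: int, is_write: bool = False) -> str:
        """Return "fast" on a hit, "slow" on a miss, updating metadata."""
        key = (row, seg_id)
        entry = self.entries.get(key)
        if entry is not None:
            # Hit: bump the saturating benefit counter, mark dirty on writes.
            entry["benefit"] = min(int(entry["benefit"]) + 1, 31)
            entry["dirty"] = bool(entry["dirty"]) or is_write
            return "fast"
        # Miss: the slow region serves the request, then the segment is
        # relocated into the fast region and inserted clean with benefit 1.
        if len(self.entries) < self.capacity:
            self.entries[key] = {"dirty": False, "benefit": 1}
        return "slow"


fts = TagStore(capacity=2)
print(fts.access(10, 3), fts.access(10, 3), fts.access(10, 3, is_write=True))
```

The first access misses and installs the segment; the next two hit, and the write marks the cached copy dirty so that it would need a write-back on eviction.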

5. Performance and Energy Impact

Empirical evaluation with DDR4-800 MT/s (1 channel, 16 banks, 64 subarrays/bank), using an eight-core system and 20 multiprogrammed workloads, yields:

  • Average weighted speedup: 16.3% over baseline (8-core, 100% memory-intensive workloads)
  • DRAM energy reduction: 7.8% (combined static + dynamic)
  • In-DRAM cache hit rate: approximately 75–80%
  • System row buffer hit rate: increased by approximately 18%

Average DRAM latency is governed by:

$$L_{avg} = H_{fc} \cdot L_{fc} + (1 - H_{fc}) \cdot L_{slow}$$

where $H_{fc}$ is the fast-cache hit rate, $L_{fc} \approx t_{RCD\_fast} + t_{CL\_fast}$, and $L_{slow} \approx t_{RCD\_slow} + t_{CL\_slow}$.
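To see how the hit rate weights the two latencies, the formula can be evaluated directly. In the sketch below the function name `average_latency_ns` is hypothetical, and the latencies of 20 ns (fast region) and 28 ns (slow region) are made-up placeholders chosen only to make the arithmetic easy to follow; they are not measured values from the evaluation. The hit rate of 0.75 is the lower end of the range reported above.

```python
def average_latency_ns(hit_rate: float, l_fast_ns: float, l_slow_ns: float) -> float:
    """L_avg = H_fc * L_fc + (1 - H_fc) * L_slow."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must lie in [0, 1]")
    return hit_rate * l_fast_ns + (1.0 - hit_rate) * l_slow_ns


# Placeholder latencies (assumptions, not measured values).
L_FAST_NS = 20.0
L_SLOW_NS = 28.0

print(average_latency_ns(0.75, L_FAST_NS, L_SLOW_NS))  # 0.75*20 + 0.25*28
```

With these placeholder numbers the average is 22 ns, closer to the fast-region latency because three out of four accesses hit in the in-DRAM cache.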

Energy savings are calculated as $E_{savings} = E_{base} - E_{FIGCache}$ (Wang et al., 2020).

6. Area, Power Overheads, and Trade-offs

  • FIGARO logic per DRAM chip contributes under 0.3% area overhead; two fast subarrays add 0.7%. Alternatively, reserving rows in an existing subarray incurs 0.2% area cost.
  • Memory controller FTS: $512 \times 26$ bits per bank (1,664 B), or about 26 KiB for the 16 banks of the evaluated channel, with a CACTI area of approximately 0.5 mm².
  • Controller FTS power: ≈0.2 mW; DRAM per-subarray overhead ≈10 μW.

Distance-independent, per-block relocation latency allows FIGCache to adapt rapidly to shifting hotspots, providing much of the performance benefit of low-latency DRAM at significantly lower area and power cost.

7. Context and Significance

FIGCache addresses two central inefficiencies in prior in-DRAM caches: coarse data relocation granularity (entire multi-kilobyte rows) and relocation latency scaling with inter-subarray distance. By introducing fine-grained (64 B) and distance-independent relocation via FIGARO, and by architecting a caching mechanism that selectively packs "hot" segments from diverse rows, FIGCache fundamentally increases effective DRAM cache utilization and hit rates. Its implementation requires modest DRAM and controller modifications, offering a practical path to realize low-latency, high-efficiency in-DRAM caching for modern high-bandwidth multicore systems (Wang et al., 2020).

References (1)
