FIGCache: In-DRAM Caching via FIGARO
- FIGCache is a fully in-DRAM caching mechanism that uses FIGARO to enable fine-grained, distance-independent relocation of 64 B blocks within DRAM banks, caching data at the granularity of row-segments.
- It incorporates a small fast region per bank and a fully-associative tag store to optimize cache hit rates, achieving a 16.3% speedup and 7.8% DRAM energy reduction in evaluated eight-core workloads.
- The design incurs minimal area and power overhead while significantly increasing DRAM row-buffer hit rates, making it a practical solution for high-bandwidth, memory-intensive systems.
FIGCache is a fully in-DRAM caching mechanism implemented atop the FIGARO substrate, which enables fine-grained, distance-independent data relocation between subarrays within a DRAM bank. Rather than caching entire rows, FIGCache caches only the frequently accessed portions of DRAM rows, termed row-segments, which improves cache utilization and DRAM row-buffer hit rates while keeping relocation and area overheads low. Evaluated on DDR4-800 systems, FIGCache achieves a 16.3% average weighted speedup and a 7.8% reduction in total DRAM energy for eight-core memory-intensive workloads, compared to conventional DRAM systems without in-DRAM caching (Wang et al., 2020).
1. FIGARO: Enabling Subarray-Level Fine-Grained Relocation
FIGCache relies on FIGARO, a substrate that leverages the existing DRAM bank architecture to provide efficient intra-bank data movement. A bank is composed of multiple subarrays, each of which connects 512–2048 cells to every sense amplifier of its local row buffer (LRB). The subarrays within a bank share a global row buffer (GRB), usually 64 bits wide. FIGARO requires only two minor logic additions per subarray: a second latch on the local row decoder, which allows rows in the source and destination subarrays to be active together, and a 2-to-1 multiplexer on the column decoder, which lets each of the two subarrays select a different column address.
With these additions, the memory controller uses a new command, RELOC(src col, dst subarray & col), to execute block relocations via the following steps:
- Activate the source row (ACT_S) in subarray S.
- Use the GRB to transfer a single 64 B column from S to D, enabling D’s sense amplifiers to capture the bits (RELOC).
- Activate the destination row (ACT_D) to commit the data.
- PRECHARGE the bank.
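The RELOC operation's latency is fixed at 1 ns, owing to the low parasitic capacitance of the GRB, irrespective of the distance between the source and destination subarrays. The total latency to relocate one block is the sum of the four steps above:

$$t_{\text{block}} = t_{\text{ACT}_S} + t_{\text{RELOC}} + t_{\text{ACT}_D} + t_{\text{PRE}}, \qquad t_{\text{RELOC}} = 1\ \text{ns},$$

where the activation and precharge terms are set by standard DDR4 timing parameters, so the worst-case value is the same for every source-destination pair. This contrasts with prior bulk-move substrates, where relocation latency scales linearly with distance (Wang et al., 2020).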
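The four-step sequence and its distance-independent latency can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the names `StepTimings`, `reloc_sequence`, and `block_relocation_latency_ns` are hypothetical, charging one latency term to each step is an assumption, and the timing values in the usage lines are placeholders that are not taken from the source.

```python
from dataclasses import dataclass

T_RELOC_NS = 1.0  # fixed RELOC latency stated in the text


@dataclass(frozen=True)
class StepTimings:
    """Per-step latencies in ns (hypothetical field names, caller-supplied values)."""
    act_src: float    # ACT_S: activate the source row
    act_dst: float    # ACT_D: activate the destination row to commit the data
    precharge: float  # PRECHARGE the bank


def reloc_sequence(src_subarray: int, dst_subarray: int) -> list[str]:
    """Command sequence for relocating one 64 B block within a bank."""
    return [
        f"ACT_S(subarray={src_subarray})",
        f"RELOC(src={src_subarray}, dst={dst_subarray})",
        f"ACT_D(subarray={dst_subarray})",
        "PRECHARGE",
    ]


def block_relocation_latency_ns(t: StepTimings, src_subarray: int, dst_subarray: int) -> float:
    """Sum of the four step latencies.

    The subarray indices are accepted but deliberately unused: the latency
    does not depend on how far apart the two subarrays are.
    """
    return t.act_src + T_RELOC_NS + t.act_dst + t.precharge


# Placeholder timings for illustration only (NOT the DDR4 values from the source).
timings = StepTimings(act_src=10.0, act_dst=30.0, precharge=10.0)
near = block_relocation_latency_ns(timings, src_subarray=0, dst_subarray=1)
far = block_relocation_latency_ns(timings, src_subarray=0, dst_subarray=63)
print(near, far)  # 51.0 51.0
```

Keeping the subarray indices in the signature makes the contrast with distance-dependent substrates explicit: a model of those would have to use the indices, whereas this one ignores them.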
2. Architecture and Data Organization
FIGCache designates a small “fast” region in each bank, built either by deploying short-bitline subarrays or by reserving rows within an existing subarray. Each fast region holds a configurable number of cache rows per bank, and each cache row is logically divided into row-segments (by default 16 × 64 B = 1 KiB per segment). Relocations of 64 B blocks between the slow and fast regions use FIGARO RELOC operations, incurring no off-chip traffic.
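Address decomposition for a CPU request proceeds as:

$$\text{PA} \rightarrow (\text{bank},\ \text{row},\ \text{col}), \qquad \text{segID} = \lfloor \text{col} / 16 \rfloor, \qquad \text{off} = \text{col} \bmod 16,$$

where col indexes the 64 B blocks within a row, segID selects the 1 KiB row-segment, and off selects the block within that segment.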
A fully-associative tag store (FTS), with 512 entries per bank in the memory controller, maintains metadata for each cached row-segment: valid bit (V), dirty bit (D), tag (row, segID, ≈19 bits), and a 5-bit benefit counter.
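The tag-store metadata and the split of a column index into a segment ID and a block offset can be sketched as below. This is an illustrative sketch under stated assumptions: the column index is assumed to count 64 B blocks within a row, the 16-blocks-per-segment constant follows from the 1 KiB segment size, and the names `FTSEntry` and `split_column` are hypothetical.

```python
from dataclasses import dataclass

BLOCK_BYTES = 64          # one RELOC moves one 64 B block
BLOCKS_PER_SEGMENT = 16   # 16 x 64 B = 1 KiB row-segment
BENEFIT_BITS = 5
BENEFIT_MAX = (1 << BENEFIT_BITS) - 1  # largest value a 5-bit counter can hold


@dataclass
class FTSEntry:
    """Metadata kept per cached row-segment (hypothetical field names)."""
    valid: bool = False   # V
    dirty: bool = False   # D
    row: int = 0          # tag, part 1: source row in the slow region
    seg_id: int = 0       # tag, part 2: row-segment within that row
    benefit: int = 0      # 5-bit benefit counter


def split_column(block_col: int) -> tuple[int, int]:
    """Split a block-granularity column index into (seg_id, block offset)."""
    return divmod(block_col, BLOCKS_PER_SEGMENT)


print(split_column(37))  # (2, 5): block 37 is block 5 of segment 2
```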
3. Caching Policy and Replacement
Upon memory access, the controller checks FTS[bank] for a tag match with valid bit set:
- On a hit, the request is routed to the fast region.
- On a miss, the slow region services the request, then up to 16 back-to-back RELOC operations (one per 64 B block of the 1 KiB row-segment) move the requested row-segment into the fast region. FTS metadata is updated accordingly (V = 1, D = 0, benefit = 1).
Caching operates at two granularities: 64 B blocks (one per RELOC) and 1 KiB row-segments. On every hit, the segment's benefit counter is incremented, saturating at the maximum value of the 5-bit counter. Write operations set D = 1. If the fast region is full, FIGCache applies a row-granularity "RowBenefit" eviction: it selects the fast-region row with the smallest sum of benefit[seg] values, then within that row evicts the segment with the lowest benefit. Grouping temporally correlated row-segments within a cache row in this way increases the fast-region row-buffer hit rate by approximately 20% compared to previous approaches.
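The two-level victim selection described above can be sketched as follows. This is a minimal sketch, not the controller's actual logic: the function name `pick_victim` and the dictionary layout (cache-row index mapped to per-slot benefit values) are assumptions made for illustration, and ties are broken arbitrarily by insertion order.

```python
def pick_victim(fast_rows: dict[int, dict[int, int]]) -> tuple[int, int]:
    """Choose the (cache_row, slot) to evict.

    fast_rows maps each cache-row index to {slot index: benefit} for the
    valid row-segments it currently holds (hypothetical layout).
    """
    occupied = {row: segs for row, segs in fast_rows.items() if segs}
    if not occupied:
        raise ValueError("nothing to evict: fast region holds no valid segments")

    # Step 1: pick the cache row whose segments have the smallest total benefit.
    victim_row = min(occupied, key=lambda row: sum(occupied[row].values()))

    # Step 2: within that row, pick the segment with the lowest benefit.
    segs = occupied[victim_row]
    victim_slot = min(segs, key=lambda slot: segs[slot])
    return victim_row, victim_slot


fast_region = {
    0: {0: 5, 1: 9},        # row total 14
    1: {0: 1, 1: 2, 2: 3},  # row total 6  -> coldest row
    2: {0: 7},              # row total 7
}
print(pick_victim(fast_region))  # (1, 0)
```

Selecting the row first means a single very cold segment in an otherwise hot row is not evicted, which keeps segments that are accessed together in the same cache row.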
4. Address Mapping, Access Path, and Operation
The access path in FIGCache proceeds as follows:
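- Parse (bank, row, segID, off) from the physical address.
- For each entry e in FTS[bank]:
  - If e.V = 1 and e.tag = (row, segID):
    - Compute the fast-region address of the cached row-segment, plus off.
    - Issue the DRAM read to that address (cache hit).
    - Update the entry's metadata: increment benefit, and set D = 1 if the request is a write.
- If no match:
  - Compute the slow-region address from (row, col).
  - Perform the DRAM read from that address (cache miss).
  - Move the row-segment from the slow region to the fast region.
  - Update the FTS entry accordingly.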
Evicting a dirty row-segment writes it back to its original slow-region row using RELOC in the reverse direction (fast to slow).
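The lookup path can be sketched as a toy per-bank tag store. This is an illustration under assumptions rather than the real controller: `BankTagStore` and `Entry` are hypothetical names, a dictionary stands in for the fully-associative search, presence in the dictionary stands in for the valid bit, a newly cached segment is assumed to start clean with benefit 1, and eviction is left out.

```python
from dataclasses import dataclass

BLOCKS_PER_SEGMENT = 16  # 16 x 64 B blocks per 1 KiB row-segment
BENEFIT_MAX = 31         # saturation point of a 5-bit counter


@dataclass
class Entry:
    dirty: bool = False
    benefit: int = 1


class BankTagStore:
    """Toy tag store for one bank, keyed by the tag (row, seg_id)."""

    def __init__(self, capacity: int = 512):
        self.capacity = capacity
        self.entries: dict[tuple[int, int], Entry] = {}

    def access(self, row: int, block_col: int, is_write: bool) -> tuple[str, int]:
        """Return (region that services the request, upper bound on RELOCs issued)."""
        seg_id = block_col // BLOCKS_PER_SEGMENT
        entry = self.entries.get((row, seg_id))
        if entry is not None:
            # Hit: serve from the fast region and update metadata.
            entry.benefit = min(entry.benefit + 1, BENEFIT_MAX)
            entry.dirty = entry.dirty or is_write
            return "fast", 0
        # Miss: the slow region serves the request, then the segment is cached.
        if len(self.entries) >= self.capacity:
            raise RuntimeError("fast region full: eviction is not modelled here")
        self.entries[(row, seg_id)] = Entry(dirty=False, benefit=1)
        return "slow", BLOCKS_PER_SEGMENT


store = BankTagStore()
print(store.access(row=7, block_col=37, is_write=False))  # ('slow', 16)
print(store.access(row=7, block_col=40, is_write=True))   # ('fast', 0)
```

The second access hits because blocks 37 and 40 both fall in segment 2 of row 7, which shows how caching a whole row-segment on one miss serves later accesses to neighbouring blocks.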
5. Performance and Energy Impact
Empirical evaluation with DDR4-800 MT/s (1 channel, 16 banks, 64 subarrays/bank), using an eight-core system and 20 multiprogrammed workloads, yields:
- Average weighted speedup: 16.3% over baseline (8-core, 100% memory-intensive workloads)
- DRAM energy reduction: 7.8% (combined static + dynamic)
- In-DRAM cache hit rate: approximately 75–80%
- System row buffer hit rate: increased by approximately 18%
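Average DRAM access latency can be modeled as:

$$T_{\text{avg}} = h \cdot T_{\text{fast}} + (1 - h) \cdot T_{\text{slow}},$$

where $h$ is the fast-cache hit rate, and $T_{\text{fast}}$ and $T_{\text{slow}}$ are the access latencies of the fast and slow regions, respectively.
Energy savings are calculated as the relative reduction $(E_{\text{base}} - E_{\text{FIGCache}}) / E_{\text{base}}$ (Wang et al., 2020).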
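A small numeric sketch can make the relationship between hit rate and latency concrete. It assumes a simple hit-rate-weighted average of fast-region and slow-region latencies and a relative-reduction definition of savings. Both functions are hypothetical helpers, and the latency values are placeholders, not measurements from the source. Only the 75% hit rate echoes the range reported above.

```python
def average_latency_ns(hit_rate: float, t_fast_ns: float, t_slow_ns: float) -> float:
    """Hit-rate-weighted average of fast-region and slow-region latency (assumed model)."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must lie in [0, 1]")
    return hit_rate * t_fast_ns + (1.0 - hit_rate) * t_slow_ns


def relative_saving(baseline: float, with_cache: float) -> float:
    """Fractional reduction relative to the baseline."""
    return (baseline - with_cache) / baseline


# Placeholder latencies for illustration only.
print(average_latency_ns(0.75, t_fast_ns=20.0, t_slow_ns=40.0))  # 25.0
```

Under this model the benefit grows linearly with the hit rate, so any policy that raises the fraction of accesses served by the fast region lowers the average latency proportionally.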
6. Area, Power Overheads, and Trade-offs
- FIGARO logic per DRAM chip contributes under 0.3% area overhead; two fast subarrays add 0.7%. Alternatively, reserving rows in an existing subarray incurs 0.2% area cost.
- Memory controller FTS: 512 entries/bank, totaling 16 kB/channel, with a CACTI-estimated area of approximately 0.5 mm².
- Controller FTS power: ≈0.2 mW; DRAM per-subarray overhead ≈10 μW.
Distance-independent, per-block relocation latency allows FIGCache to adapt rapidly to shifting hotspots, providing much of the performance benefit of low-latency DRAM at significantly lower area and power cost.
7. Context and Significance
FIGCache addresses two central inefficiencies in prior in-DRAM caches: coarse data relocation granularity (entire multi-kilobyte rows) and relocation latency scaling with inter-subarray distance. By introducing fine-grained (64 B) and distance-independent relocation via FIGARO, and by architecting a caching mechanism that selectively packs "hot" segments from diverse rows, FIGCache fundamentally increases effective DRAM cache utilization and hit rates. Its implementation requires modest DRAM and controller modifications, offering a practical path to realize low-latency, high-efficiency in-DRAM caching for modern high-bandwidth multicore systems (Wang et al., 2020).