Pickle Prefetcher: Scalable LLC Accelerator

Updated 2 December 2025
  • Pickle Prefetcher is a programmable, scalable prefetching accelerator that leverages software kernels to dynamically adapt to irregular memory access patterns.
  • It integrates five key hardware modules—including a prefetch generator, MMU, and victim cache—to reduce demand load latency by up to 250 cycles.
  • Evaluations show standalone speedups of 1.32× and hybrid benefits up to 1.74× in graph analytics, with timely prefetch ratios reaching 50–80%.

The Pickle Prefetcher is a programmable, scalable prefetching accelerator designed to improve the performance of last-level cache (LLC) architectures, particularly for workloads exhibiting irregular memory access patterns. It departs from conventional hardware-centric prefetch prediction by shifting the responsibility for defining prefetch strategies to software, enabling robust handling of data access patterns prevalent in modern graph analytics and similar applications. Rather than relying on static heuristics or complex hardware prediction, Pickle empowers user-level code to inject prefetch “hints” and customize kernel routines, enhancing adaptability while maintaining a minimal area overhead (Nguyen et al., 25 Nov 2025).

1. System Architecture and Principal Components

The Pickle Prefetcher comprises five primary hardware modules, coherently integrated into the many-core mesh interconnect. Each component addresses a distinct role in the end-to-end prefetch process:

| Component | Role | Notable Features |
|---|---|---|
| PicklePG (Prefetch Generator) | Hardware function dispatcher; runs a software kernel per event | Clocked at 4 GHz; <0.4 mm² area |
| Pending Prefetch Hint Queue | Buffers UC-store hints from processor cores | FIFO or EDF priority; ~128–256 entries |
| Pending Prefetch Queue (PQ) | Orders prefetch VA requests post-kernel; triggers translation | Priority-ordered; ~128–256 entries |
| PickleMMU | Private MMU: TLB and PTW for VA→PA translation | 64-entry L1 TLB; 1024-entry L2 TLB |
| Pickle Cache | Victim-style cache for prefetched lines | 256 KiB, 16-way; 1024 outstanding requests |

Data-path operation begins with software issuing an uncacheable (UC) store to a reserved address, which PicklePG interprets as a hint event. The PicklePG then executes a loaded software kernel, emitting prefetches into the PQ. After MMU translation, prefetches are sent as loads to the Pickle Cache. On a miss, the line is fetched from DRAM or LLC and, upon eviction, is written back to the shared LLC, preserving cache locality for future demand accesses.
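
For orientation, the path of a single hint through these modules can be condensed into a stage sequence; the stage names below are descriptive labels for the flow just described, not identifiers from the paper:

// Descriptive model of one hint's journey through the Pickle datapath.
enum class PickleStage {
    HintReceived,   // UC store captured in the Pending Prefetch Hint Queue
    KernelRun,      // PicklePG executes the loaded software kernel
    Enqueued,       // emitted prefetch VAs wait in the Pending Prefetch Queue
    Translated,     // PickleMMU resolves VA to PA (L1/L2 TLB, then page-table walk)
    CacheLookup,    // the prefetch is issued as a load to the Pickle Cache
    Filled,         // on a miss, the line is fetched from the LLC or DRAM
    EvictedToLLC    // on eviction, the line is written back to the shared LLC
};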

2. Programmable Software Interface

Pickle Prefetcher exposes a memory-mapped programming model rather than extending the ISA. Application code maps a 4 KiB UC page and writes 64-bit hints using standard stores. The runtime supplies the device pointer via getPickleDevicePtr(). Each store includes discriminant bits allowing up to N kernel handlers to be selected by the hardware.
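
A minimal sketch of hint issuance from the application side, assuming getPickleDevicePtr() returns a pointer to the mapped UC page and assuming, purely for illustration, that the discriminant rides in the low bits of a cache-line-aligned hint word (the paper does not specify the bit layout):

#include <cstdint>

// Supplied by the Pickle runtime (per the paper): base of the mapped
// 4 KiB uncacheable (UC) hint page.
extern volatile uint64_t* getPickleDevicePtr();

// Hypothetical encoding: a cache-line-aligned target address leaves its low
// six bits free to carry the kernel-selector discriminant.
inline void issuePrefetchHint(uint64_t target_vaddr, uint64_t kernel_id) {
    volatile uint64_t* dev = getPickleDevicePtr();
    // An ordinary 64-bit store to the UC page reaches PicklePG as a hint
    // event instead of being absorbed by the cache hierarchy.
    *dev = target_vaddr | (kernel_id & 0x3F);
}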

A canonical example is breadth-first search (BFS) on graphs, where nodes enqueued for later expansion can issue “distance-d” prefetch hints:

for each node u popped from work_queue:
    if work_queue.size() > PF_DIST:
        UC_store(device_ptr, address_of work_queue[u + PF_DIST])

Software kernels, typically <30 lines of C-like code, run inside PicklePG on each event. The handler dispatches prefetches recursively, walking data structures such as adjacency lists for graphs and utilizing mailbox-style calls:

Proc PrefetchEventHandler(event):
    if event == PREFETCH_HINT:
        vaddr = hint_data
        Prefetch(vaddr)
    else if event == PREFETCH_RESPONSE:
        // Issue more prefetches for indirect structures as needed
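
In more concrete C-like form, one level of that recursion for a CSR graph might look as follows; the event structure, the pre-registered row_offsets_base, and the Prefetch() entry point are illustrative assumptions rather than the paper's exact kernel interface:

#include <cstdint>

enum PickleEventKind { PREFETCH_HINT, PREFETCH_RESPONSE };

struct PickleEvent {
    PickleEventKind kind;
    uint64_t vaddr;   // hint target, or the address whose fill just completed
    uint64_t data;    // for responses: the loaded value (here, a node id)
};

extern void Prefetch(uint64_t vaddr);   // enqueues a prefetch into the PQ
extern uint64_t row_offsets_base;       // CSR offsets array, registered ahead of time

// One level of indirection: a hint names a work-queue slot; the response to
// that prefetch carries the node id, which selects the next target.
void PrefetchEventHandler(const PickleEvent& ev) {
    if (ev.kind == PREFETCH_HINT) {
        Prefetch(ev.vaddr);                                    // fetch work_queue[i]
    } else {                                                   // PREFETCH_RESPONSE
        uint64_t node = ev.data;                               // node id just loaded
        Prefetch(row_offsets_base + node * sizeof(uint64_t));  // row_offsets[node]
    }
}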

3. Hardware–Software Tradeoffs

Traditional prefetchers embed hardware logic for pattern recognition (e.g., stride detection, finite automata), which often fails under irregular or data-dependent patterns. In contrast, Pickle reduces hardware to a minimal event-driven scheduler and dispatcher, letting software define and evolve prefetch strategies per application and dataset. Hardware thus focuses on scheduling, translation, and caching, minimizing area and verification burden.

A crucial trade-off is the shift in correctness and safety responsibilities to software: application-supplied kernels must be error-free to avoid system misfetches or crashes. This approach also enables rapid kernel iteration without hardware respin cycles, facilitating prefetch strategy optimization tailored to workload specifics.

4. Mathematical Models and Scheduling Policy

Pickle Prefetcher employs simple, effective queueing and scheduling mechanisms with explicit developer control. Key metrics and policies include:

  • Upside Captured: Measures efficiency relative to an idealized, infinite-L3 prefetcher.

$$\text{UpsideCaptured} = \frac{\text{Speedup}_{\text{pickle}} - 1}{\text{Speedup}_{\text{ideal}} - 1}$$

  • Timeliness Ratio: Prefetches completed before use.

$$\text{TimelyPrefetchRatio} = \frac{\#\text{ of completed prefetch tasks}}{\#\text{ of all prefetch tasks}}$$

  • Dropping Policy: To prevent resource congestion and late prefetches, only hints meeting

$$\text{current\_index} - u \geq \text{DropThreshold}$$

are activated, where u is the current work index and DropThreshold is a tunable parameter. The Pending Prefetch and Hint Queues are scheduled with earliest-deadline-first (EDF) ordering, giving priority to older prefetch hints.
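
As a worked illustration of UpsideCaptured with one assumed number: a Pickle speedup of 1.32× against a hypothetical ideal speedup of 1.53× gives (1.32 - 1)/(1.53 - 1) ≈ 0.60, roughly the 60% reported in Section 5. The dropping rule and EDF ordering can likewise be modeled in a few lines of C++; this sketch reads the rule as requiring a hint to target a slot at least DropThreshold ahead of the consumer, and approximates the EDF deadline by the target index, both interpretive assumptions:

#include <cstdint>
#include <queue>
#include <vector>

// Minimal model of the Pending Prefetch Hint Queue's policies (not the RTL).
struct Hint {
    uint64_t target_index;  // work-queue slot this hint wants prefetched
    uint64_t vaddr;         // virtual address to prefetch
};

// EDF approximated by target index: hints for nearer slots expire soonest.
struct EarliestDeadlineFirst {
    bool operator()(const Hint& a, const Hint& b) const {
        return a.target_index > b.target_index;  // min-heap on target_index
    }
};

constexpr uint64_t kDropThreshold = 16;  // the value used in the evaluation

// Activate a hint only if it still targets a slot far enough ahead of the
// consumer to complete in time; otherwise drop it.
bool shouldActivate(const Hint& h, uint64_t current_work_index) {
    return h.target_index >= current_work_index + kDropThreshold;
}

using HintQueue =
    std::priority_queue<Hint, std::vector<Hint>, EarliestDeadlineFirst>;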

5. Evaluation and Performance Results

The Pickle Prefetcher was evaluated using full-system gem5 simulation with an ARM ISA and CHI protocol. The baseline consists of eight 4 GHz out-of-order cores with private L1/L2 caches and a shared LLC, alongside multiple private prefetchers (Stride, AMPM, IMP).

Key quantitative findings include:

  • Standalone Speedup: Pickle alone achieves a geometric mean speedup of 1.32× over baseline (minimum 1.13×, maximum 1.47×).
  • Hybrid Speedup: With Stride, AMPM, or IMP prefetchers active in private caches and Pickle in LLC, speedups reach up to 1.74× on certain graph benchmarks.
  • Upside vs. Ideal: Pickle captures on average ~60% of the ideal LLC prefetch upside, with up to 70% on some graphs.
  • Timeliness: Without prefetch dropping, <30% of prefetches are timely, sometimes causing slowdowns. With DropThreshold = 16, the timely ratio rises to 50–80%, and all tested graphs achieve at least 1× speedup.
  • Latency: Average demand load-to-use latency is reduced by 150–250 cycles.
  • DRAM to Cache Shift: A substantial fraction of DRAM requests are converted into Pickle Cache or LLC hits (moving from ~330 cycle DRAM accesses to 40–60 cycle cache accesses).
  • NoC Utilization: LLC-link utilization increases from ~20% to 60% under Pickle.

Private cache prefetchers alone can incur slowdowns, but in combination with Pickle, they consistently show performance improvements.

6. Use Case: Irregular Workloads in Graph Analytics

Pickle’s capabilities are exemplified in the context of BFS on compressed sparse row (CSR) graphs. Irregular memory patterns arise from indirection across work queues, neighbor pointer arrays, neighbor lists, and visited bitmaps. Pickle’s programmable kernels can efficiently prefetch these multi-level pointer chains, significantly mitigating DRAM bottlenecks, as the sketch after the list below makes concrete:

  • For the LiveJournal graph (4.8M nodes, 69M edges), Pickle reduces DRAM accesses by approximately 60% and improves performance by 1.4×.
  • Prefetch distance and drop policies are tuned per graph to maximize timely, useful prefetches.
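
The depth of this indirection is visible in a plain CSR BFS expansion step; the array and function names below are conventional BFS/CSR identifiers chosen for illustration, not ones taken from the paper:

#include <cstdint>
#include <vector>

// One BFS expansion step over a CSR graph; comments mark the irregular
// access streams Pickle's kernels target: work queue -> neighbor pointer
// array (row offsets) -> neighbor list -> visited bitmap.
void expand(uint32_t u,
            const std::vector<uint64_t>& row_offsets,  // neighbor pointer array
            const std::vector<uint32_t>& col_indices,  // concatenated neighbor lists
            std::vector<uint8_t>& visited,             // visited flags, byte per node
            std::vector<uint32_t>& work_queue) {
    for (uint64_t e = row_offsets[u]; e < row_offsets[u + 1]; ++e) {  // data-dependent bounds
        uint32_t v = col_indices[e];  // indirect load of a neighbor id
        if (!visited[v]) {            // second indirect load, keyed by v
            visited[v] = 1;
            work_queue.push_back(v);  // enqueued nodes drive future hints
        }
    }
}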

This programmable approach is immediately applicable to other domains characterized by independent, irregular access sequences.

7. Limitations and Prospective Research

The Pickle Prefetcher, while effective for irregular load-dominant workloads, exhibits inherent limitations:

  • Prefetching is restricted to load operations; store prefetches are not supported, and prefetches that encounter page faults are dropped.
  • Correct operation relies on the correctness of the software kernel; errors may result in system crashes or incorrect prefetch behavior.
  • The current PickleMMU lacks isolation mechanisms, exposing potential security concerns if multiple untrusted applications concurrently utilize the prefetcher. Future research is required to enable kernel sandboxing.
  • Adaptive scheduling remains an open avenue. The fixed DropThreshold policy could be refined with dynamic, latency-aware scheduling or more sophisticated EDF strategies to approach the >80% upside observed in ideal prefetching scenarios.
  • Validation beyond graph analytics, such as in sparse-matrix computations and heterogeneous architectures, is pending.

In conclusion, by trading fixed-function hardware prediction for a lightweight, programmable interface, Pickle Prefetcher attains robust and scalable LLC prefetching for irregular memory patterns with minimal hardware resource consumption (Nguyen et al., 25 Nov 2025).
