Position-Independent Caching (PIC)
- Position-Independent Caching (PIC) is a paradigm that enables modular, position-agnostic reuse of cached entries, boosting efficiency in ML and communication systems.
- It leverages selective recomputation and modular linking to mitigate cross-boundary artifacts, thus reducing latency while preserving inference accuracy.
- PIC systems integrate dynamic recomputation with probabilistic caching, facilitating scalable context composition and effective wireless content delivery.
Position-Independent Caching (PIC) is a generalized caching paradigm in which cache entries (such as transformer Key-Value tensors or content objects) are designed to be reusable irrespective of the absolute or relative position they appear in a downstream request. Originally motivated by limitations in prefix-based context caching for LLM and Multimodal LLM (MLLM) serving, PIC has emerged as an essential technique to accelerate inference, enable modular composition of contexts, and facilitate efficient wireless content delivery. Its core principle is to allow recombination and reuse of cached representations across varying prompt or content positions, with accuracy maintained through strategic selective recomputation or probabilistic placement.
1. Formal Definition and Theoretical Foundations
Traditional context caching in transformers is restricted to prefix matches: a cache of key-value (KV) pairs can only be reused if the first tokens in the new prompt exactly match the cached sequence. PIC removes this restriction, decoupling cache reuse from fixed positions by structuring the cache into modular chunks (e.g., documents, few-shot exemplars, multimodal items) and enabling arbitrary concatenation or ordering at inference time.
Mathematically, for LLMs consider:
- A set of static token chunks $\{c_1, \dots, c_n\}$, each cached as precomputed key-value tensors $(K_i, V_i)$ for chunk $c_i$.
- A user request specifies a concatenation order $\pi$ over $m$ selected chunks and dynamic (uncached) tokens $d$.
- The prompt to the model is $P = c_{\pi(1)} \oplus c_{\pi(2)} \oplus \cdots \oplus c_{\pi(m)} \oplus d$, where $\oplus$ denotes concatenation.
The challenge is to compose a joined KV cache for $P$ in $O(N)$ time, where $N$ is the prompt length, such that cross-chunk boundary artifacts and position-dependent discrepancies are mitigated [$2410.15332$].
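The minimal Python sketch below illustrates this composition at the chunk level. The toy `encode_kv` stand-in, the chunk ids, and the tensor shapes are illustrative assumptions rather than details of the cited systems; real implementations cache transformer KV tensors and must additionally correct positional encodings and boundary effects, as discussed in the following sections.

```python
"""Toy sketch of position-independent KV composition (not the papers' code).

Each static chunk is preprocessed once into per-token key/value arrays and
stored under a chunk id. At request time, cached chunks are concatenated in
an arbitrary order together with freshly computed KV for dynamic tokens."""
import numpy as np

D_HEAD = 8                                   # toy head dimension
rng = np.random.default_rng(0)
W_K = rng.normal(size=(1, D_HEAD))           # stand-in "projection" weights
W_V = rng.normal(size=(1, D_HEAD))


def encode_kv(token_ids: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Stand-in for a transformer forward pass producing per-token K/V."""
    x = token_ids.astype(float).reshape(-1, 1)
    return x @ W_K, x @ W_V                  # shapes: (num_tokens, D_HEAD)


# Offline: cache KV for static chunks, keyed by chunk id (position-free).
chunk_store = {
    "doc_a": encode_kv(np.array([101, 102, 103])),
    "doc_b": encode_kv(np.array([201, 202])),
    "fewshot_1": encode_kv(np.array([301, 302, 303, 304])),
}


def compose_request(order: list[str], dynamic_tokens: np.ndarray):
    """Concatenate cached chunk KV in the requested order, then append KV
    for the uncached (dynamic) tokens. Cost is linear in prompt length."""
    ks = [chunk_store[cid][0] for cid in order]
    vs = [chunk_store[cid][1] for cid in order]
    dk, dv = encode_kv(dynamic_tokens)
    return np.concatenate(ks + [dk]), np.concatenate(vs + [dv])


K, V = compose_request(["fewshot_1", "doc_a"], np.array([7, 8, 9]))
print(K.shape, V.shape)                      # (10, 8) (10, 8)
```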
In stochastic wireless caching networks, position-independence is realized through probabilistic content placement: each helper node independently caches file $i$ with probability $p_i$, and users retrieve content from the helper that maximizes channel quality, decoupling file availability from the spatial positions of helpers or requests [$1606.00124$].
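As a rough illustration of position-independent placement, the Monte Carlo sketch below (our own toy construction, not the paper's model) compares a popularity-matched placement with a uniform one under a simple distance-based delivery criterion. The unit cache size per helper, the coverage radius, and the density values are assumptions made for the toy.

```python
"""Monte Carlo sketch of position-independent probabilistic placement
(a toy version of the setting in [1606.00124]; not the paper's code).

Assumptions: unit cache size per helper, helpers drawn from a Poisson point
process on a unit square, and a request succeeds if some helper caching the
requested file lies within a fixed radius (a crude proxy for channel quality)."""
import numpy as np

rng = np.random.default_rng(1)
N_FILES, ZIPF_EXP = 20, 0.8          # catalogue size and popularity skew
LAMBDA, AREA, RADIUS = 30, 1.0, 0.2  # helper density, region area, coverage
N_TRIALS = 5000

popularity = np.arange(1, N_FILES + 1, dtype=float) ** -ZIPF_EXP
popularity /= popularity.sum()


def hit_probability(placement: np.ndarray) -> float:
    """Estimate delivery probability for a caching distribution `placement`."""
    hits = 0
    for _ in range(N_TRIALS):
        n_helpers = rng.poisson(LAMBDA * AREA)
        if n_helpers == 0:
            continue
        pos = rng.uniform(0, 1, size=(n_helpers, 2))     # helper locations
        cached = rng.choice(N_FILES, size=n_helpers, p=placement)
        req = rng.choice(N_FILES, p=popularity)          # user at the center
        dist = np.linalg.norm(pos - 0.5, axis=1)
        hits += np.any((cached == req) & (dist < RADIUS))
    return hits / N_TRIALS


uniform = np.full(N_FILES, 1 / N_FILES)
print("popularity-matched:", hit_probability(popularity))
print("uniform placement :", hit_probability(uniform))
```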
2. Algorithmic Realizations of PIC
State-of-the-art PIC systems employ integrated reuse-and-selective-recompute mechanisms. In the context of MLLM serving, the MPIC system operates as follows [$2502.01960$]:
- Retrieve the precomputed KV tensors $(K_i, V_i)$ for each cached content module, possibly from disk, DRAM, or GPU RAM.
- Recompute KV entries for a small set of tokens (typically all prompt text, plus a prefix of $k$ tokens per image or chunk) that are sensitive to absolute or cross-context positions.
- Merge caches by patching:
$$\hat{K}[j] = \begin{cases} K^{\text{recomp}}[j], & j \in \mathcal{R} \\ K^{\text{cached}}[j], & \text{otherwise}, \end{cases}$$
and similarly for $\hat{V}$; $\mathcal{R}$ enumerates the tokens requiring recompute.
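A minimal sketch of this patching step follows, assuming a toy recompute set and tensor shapes; `K_fresh`/`V_fresh` stand in for a partial forward pass over only the recomputed tokens, and none of the names below are taken from the MPIC code.

```python
"""Sketch of cache patching with selective recompute (illustrative only)."""
import numpy as np

rng = np.random.default_rng(2)
N_TOKENS, D_HEAD = 12, 8

K_cached = rng.normal(size=(N_TOKENS, D_HEAD))   # loaded from disk/DRAM/GPU
V_cached = rng.normal(size=(N_TOKENS, D_HEAD))

# Hypothetical recompute set R: first 2 tokens of each of three 4-token chunks.
recompute_idx = np.array([0, 1, 4, 5, 8, 9])

# Stand-in for a partial forward pass over only the tokens in R.
K_fresh = rng.normal(size=(len(recompute_idx), D_HEAD))
V_fresh = rng.normal(size=(len(recompute_idx), D_HEAD))

# Patch: hat K[j] = K_fresh row if j in R, else the cached row (same for V).
K_patched, V_patched = K_cached.copy(), V_cached.copy()
K_patched[recompute_idx] = K_fresh
V_patched[recompute_idx] = V_fresh

assert np.allclose(K_patched[2], K_cached[2])    # reused row untouched
assert np.allclose(K_patched[0], K_fresh[0])     # recomputed row replaced
```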
The LegoLink ("AttnLink") algorithm in EPIC similarly selects a small number of tokens on both sides of each chunk boundary for recomputation, correcting the "attention sink" phenomenon by allowing tokens adjacent to boundaries to attend flexibly across chunks [$2410.15332$]. Because only $k \ll N$ boundary tokens per chunk are recomputed, the cost of linking grows linearly rather than quadratically with the prompt length $N$.
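For illustration, a small helper that enumerates the token indices within $k$ positions of each chunk boundary might look as follows; this is a sketch of the general idea, and the exact selection rule used by LegoLink may differ.

```python
"""Sketch of selecting boundary tokens for recomputation (illustrative;
not the LegoLink/AttnLink implementation)."""
from itertools import accumulate


def boundary_recompute_indices(chunk_lens: list[int], k: int) -> list[int]:
    """Return global token indices within k tokens of every interior chunk
    boundary (k tokens ending one chunk and k starting the next)."""
    starts = [0, *accumulate(chunk_lens)]            # chunk start offsets
    idx = set()
    for b in starts[1:-1]:                           # interior boundaries only
        idx.update(range(max(0, b - k), min(starts[-1], b + k)))
    return sorted(idx)


# Three chunks of lengths 5, 4, 6; recompute k=2 tokens on each side.
print(boundary_recompute_indices([5, 4, 6], k=2))
# -> [3, 4, 5, 6, 7, 8, 9, 10]
```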
In stochastic caching helper networks, the optimization problems for the caching probabilities $\{p_i\}$ are formulated subject to cache capacity constraints, where position-independence ensures each request distributes load based on probabilistic placement, yielding analytical solutions for the maximum average content delivery probability [$1606.00124$].
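The sketch below solves a simplified, noise-limited proxy of such a problem with SciPy, assuming (our assumption, for illustration) that a requested file $i$ is reachable with probability $1 - e^{-\lambda p_i}$; the paper's exact SINR-based formulation differs, but the structure of a concave objective under a cache-capacity constraint is analogous.

```python
"""Sketch: optimizing caching probabilities under a capacity constraint
(a simplified proxy for the problem studied in [1606.00124])."""
import numpy as np
from scipy.optimize import minimize

N_FILES, CACHE_SIZE, LAM, ZIPF_EXP = 20, 4, 3.0, 0.8
q = np.arange(1, N_FILES + 1, dtype=float) ** -ZIPF_EXP
q /= q.sum()                                    # Zipf popularity


def neg_delivery_prob(p: np.ndarray) -> float:
    """Negative of the (assumed) average delivery probability."""
    return -np.sum(q * (1.0 - np.exp(-LAM * p)))


res = minimize(
    neg_delivery_prob,
    x0=np.full(N_FILES, CACHE_SIZE / N_FILES),
    bounds=[(0.0, 1.0)] * N_FILES,
    constraints=[{"type": "ineq", "fun": lambda p: CACHE_SIZE - p.sum()}],
    method="SLSQP",
)
p_opt = res.x
print("optimal p (first 5 files):", np.round(p_opt[:5], 3))
print("delivery probability     :", -neg_delivery_prob(p_opt))
```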
3. System Architectures for Position-Independent Caching
Modern PIC-enabled systems consist of several modular components:
| Component | Description | Example Implementation |
|---|---|---|
| KVGen / Cache Creator | Preprocesses static modules to produce & store KV pairs | EPIC’s KVGen, MPIC’s Static Library |
| KVLink / Linker | Retrieves IDs and merges modules with selective recompute | EPIC’s KVLink (LegoLink), MPIC’s Linker |
| Dynamic Content | Handles uncached user queries or tokens | Prompt token recompute |
| Scheduler / Retriever | Determines module selection for each request | Dynamic Library & Retriever (MPIC) |
| Decoder | Executes fast per-step generation over PIC-stitched context | vLLM, LLaVA pipelines |
These architectures enable parallel I/O and compute overlapping: cached chunks are loaded while concurrent recomputation of selected tokens proceeds, hiding data transfer latency and optimizing overall serving time [$2502.01960$].
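One way to realize this overlap is to issue cache fetches asynchronously and recompute boundary tokens while they are in flight. The thread-pool sketch below is an illustrative scheduling pattern only (with `time.sleep` standing in for I/O and GPU work), not the scheduler used by MPIC or EPIC.

```python
"""Sketch of overlapping cache I/O with selective recomputation
(illustrative pattern; not the cited systems' implementation)."""
import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def load_chunk_kv(chunk_id: str) -> np.ndarray:
    """Simulated disk/DRAM fetch of a cached KV block."""
    time.sleep(0.05)                        # stand-in for I/O latency
    return np.zeros((16, 8))                # pretend KV rows for the chunk


def recompute_boundary_kv(n_tokens: int) -> np.ndarray:
    """Simulated partial forward pass for boundary/dynamic tokens."""
    time.sleep(0.05)                        # stand-in for GPU compute
    return np.ones((n_tokens, 8))


start = time.perf_counter()
with ThreadPoolExecutor() as pool:
    # Kick off I/O for all cached chunks, then recompute while they load.
    futures = [pool.submit(load_chunk_kv, cid) for cid in ("doc_a", "doc_b")]
    fresh = recompute_boundary_kv(n_tokens=4)
    cached_blocks = [f.result() for f in futures]

print(f"elapsed: {time.perf_counter() - start:.3f}s "
      f"(vs ~0.15s if fully serialized)")
```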
4. Empirical Performance and Evaluation
Extensive experiments demonstrate the superiority of PIC-based systems over traditional prefix or block-based caching.
- LLM and MLLM Serving (MPIC [$2502.01960$], EPIC [$2410.15332$]):
- MPIC-32 reduces Time-to-First-Token (TTFT) by up to 54.1% versus prefix caching while keeping answer quality within a 13.6% drop in the worst case (GPT Score evaluation), outperforming CacheBlend and scaling robustly with increasing image count in prompts.
- EPIC’s LegoLink achieves up to 8× improvements in TTFT and 7× throughput over existing context caching, with accuracy drops below 7% while recomputing only a small number of tokens per chunk boundary. Performance is robust across multiple LLMs and benchmarks.
- For both systems, TTFT grows only linearly with prompt length (rather than quadratically) and remains stable under multi-user loads.
- Wireless Networks [$1606.00124$]:
- Position-independent probabilistic placement yields higher content delivery success probability than most-popular and uniform caching across all parameter regimes (popularity skew, helper density, and cache size).
- Analytical results provide closed-form, convex-optimizable solutions for the caching probabilities $p_i$ in both noise- and interference-limited regimes.
5. Underlying Principles and Phenomena
PIC relies on fundamental observations about model attention and information flow:
- In transformers, attention maps are highly sparse: most mass is localized to the first few tokens of each static chunk or image, and only those tokens require frequent recomputation for position alignment [$2502.01960$] (see the measurement sketch after this list).
- The "attention sink" at chunk boundaries (tokens monopolizing attention due to being first in a segment) must be neutralized for effective cross-context learning; recomputing a small number of tokens around each boundary is sufficient to destroy this artifact [$2410.15332$].
- In wireless caching, position-independent placement enables spatial channel diversity and reduces interference by avoiding spatially correlated cache placements [$1606.00124$].
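As a simple diagnostic for the first observation, the sketch below measures the fraction of attention mass that lands on the first $k$ tokens of each chunk. The function name, the random matrix, and the chunk layout are illustrative assumptions; with a real model's attention maps, this fraction is typically far above the uniform baseline shown here.

```python
"""Sketch: measuring attention mass on the leading tokens of each chunk."""
import numpy as np


def leading_token_mass(attn: np.ndarray, chunk_lens: list[int], k: int) -> float:
    """attn: (n_queries, n_keys) row-stochastic attention weights.
    Returns the average fraction of attention placed on the first k keys
    of every chunk."""
    starts = np.cumsum([0, *chunk_lens[:-1]])
    lead_cols = np.concatenate([np.arange(s, s + k) for s in starts])
    return float(attn[:, lead_cols].sum(axis=1).mean())


# Toy row-stochastic attention over two 6-token chunks (random, for shape only).
rng = np.random.default_rng(3)
A = rng.random((12, 12))
A /= A.sum(axis=1, keepdims=True)
print(leading_token_mass(A, chunk_lens=[6, 6], k=2))   # ~4/12 for random attn
```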
6. Limitations and Future Directions
Explicitly recognized limitations across recent works include:
- Accuracy Trade-offs: An insufficiently large recompute budget $k$ can yield up to a 13.6% drop in generation quality; the optimal $k$ may be context- and layer-dependent [$2502.01960$].
- Static Chunking: Chunk boundaries that split semantic units can harm accuracy; semantic-aware splitting is favored for maximal benefit [$2410.15332$].
- Storage Overhead: KV cache for large or high-res static items (e.g., images, long docs) can be several gigabytes, necessitating disk-backed or compressed representations [$2502.01960$].
Important directions for further research include adaptive $k$-selection based on runtime attention maps, dynamic chunk splitting, cache compression and quantization, extension of PIC to richer modalities (video, audio, 3D), and advanced scheduling for the retrieval and linker modules.
7. Applications Beyond Model Inference
While PIC emerged from deep learning inference, its principles are broadly applicable:
- Retrieval-Augmented Generation (RAG) and few-shot learning, where context is dynamically recombined from a library of stored exemplars or document passages [$2410.15332$].
- Stochastic wireless caching networks, wherein probabilistic, position-independent placement maximizes average content delivery under varying network interference patterns [$1606.00124$].
- Prospective applicability includes real-time multi-modal streaming, federated edge-caching, and any scenario where modular, reusable context or content blocks are fundamental.
Position-Independent Caching thus constitutes a unifying abstraction that enables modular, compositional reuse with minimal redundant computation, unlocking scalable, low-latency serving across both machine learning and communications domains.