Compositional Caching (ComCa) Strategies
- Compositional Caching (ComCa) is a modular caching paradigm that decomposes complex computations into independent sub-units for selective reuse.
- It is applied across deep learning, IR pipelines, multiprocessor systems, and verification, delivering significant gains in memory efficiency and speed.
- Core algorithms enable cache merging, privatization, and template extraction to ensure both empirical performance improvements and theoretical consistency.
Compositional Caching (ComCa) is a paradigm that supports modular and reuse-oriented caching mechanisms across a wide variety of computational contexts, from deep learning inference and IR pipelines to multiprocessors, automated verification, computer vision, and web architectures. Across these domains, ComCa enables the decomposition of complex computation into independent or semi-independent sub-units whose intermediate results can be selectively cached, merged, or recomposed, thus improving memory efficiency, predictability, and computational reuse. The following sections survey the principal models, algorithms, analytical properties, and application domains for Compositional Caching, each referencing established technical frameworks and empirical results.
1. Fundamental Models and Definitions
Compositional Caching is instantiated as a set of structured caching strategies tailored to the specifics of each domain:
- KVCompose for LLMs: ComCa in KVCompose operates by aggregating attention-based token importance across layers and heads, constructing composite tokens via head-wise selection, and allocating a global retention budget to maximize informative token coverage within the constrained memory footprint for KV caches (Akulov et al., 5 Sep 2025).
- Information Retrieval Pipelines: In PyTerrier, ComCa combines automatic, implicit caching of common pipeline prefixes (longest common prefix analysis) with declarative explicit stage-wise caches, thereby reducing redundant computation across variations of retrieval experiments (MacAvaney et al., 14 Apr 2025).
- Parallel Shared-Memory Systems: CCache privatizes commutative data at the cache line level, allowing parallel threads to perform updates independently and later merge results via associative operators, enabling consistency without traditional coherence or locking overhead (Balaji et al., 2017).
- Compositional Verification (MDPs): Pareto caching in compositional value iteration maintains per-component under/over-approximations of achievable weights, enabling efficient and sound reuse of partial verification results in model-checking of composite MDPs (Watanabe et al., 2024).
- Open-vocabulary Attribute Detection: ComCa leverages LLM-guided attribute-object compatibility to construct compositional caches of image exemplars with soft attribute labels, refining VLM predictions in a model-agnostic, training-free framework (Garosi et al., 24 Mar 2025).
- Dynamic Web Content: Vcache decomposes dynamic scripts into templates (cacheable) and bindings (per-request, non-cacheable), using a plug-in protocol for hierarchical reconstruction of full HTML pages (Goyal et al., 2010).
- Multiprocessor Embedded Systems: ComCa enforces compositionality by statically partitioning the shared cache into task- and buffer-private regions, thus eliminating unpredictable contention and enabling analytical, integer-program-based cache allocation (0710.4658).
2. Core Algorithms and Mathematical Frameworks
Distinct ComCa mechanisms are formalized via algorithmic and mathematical models:
- Token Importance and Compression (KVCompose):
- Aggregates raw attention weights into importance scores using pooling and head-mean augmentation.
- Sorts token positions by importance per head, then constructs composite tokens that maximize aggregate retention under a fixed global compression ratio, ensuring uniform per-layer cache dimensionality and preserving hardware compatibility (Akulov et al., 5 Sep 2025).
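As a rough illustration of the head-wise selection step, the following sketch (hypothetical function and array names, not the KVCompose implementation) ranks token positions per head and retains a uniform budget derived from the global compression ratio:

```python
import numpy as np

def select_composite_tokens(attn, ratio):
    """Select token positions per head under a global retention budget.

    attn: array of shape (heads, tokens) holding aggregated attention-based
    importance scores (e.g., pooled over queries and augmented with the
    head mean). ratio: fraction of tokens to retain per head, so every
    layer keeps a uniform cache width.
    """
    heads, tokens = attn.shape
    keep = max(1, int(round(tokens * ratio)))  # uniform per-head budget
    composite = []
    for h in range(heads):
        # Sort positions by importance; retain the top-`keep` for this head.
        order = np.argsort(attn[h])[::-1][:keep]
        composite.append(np.sort(order))
    # Composite tokens: position i of the compressed cache takes, for each
    # head, that head's i-th retained position (head-wise selection).
    return np.stack(composite)

scores = np.array([[0.1, 0.9, 0.3, 0.7],
                   [0.8, 0.2, 0.6, 0.4]])
kept = select_composite_tokens(scores, 0.5)
```

Because every head keeps the same number of positions, the compressed cache remains a dense tensor with uniform per-layer dimensionality.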
- IR Pipeline Decomposition (PyTerrier):
- Computes the Longest Common Prefix (LCP) among a set of pipelines, executes prefix stages once, and caches results for subsequent pipeline divergences.
- Exposes explicit Transformer subclasses (e.g., KeyValueCache, ScorerCache) for targeted sub-stage caching, handled via standard API and serializable artifact infrastructure (MacAvaney et al., 14 Apr 2025).
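The prefix-sharing idea can be sketched independently of PyTerrier's actual API (the stage identifiers and the `execute_stage` callback below are hypothetical):

```python
def longest_common_prefix(pipelines):
    """Length of the longest stage prefix shared by all pipelines.

    Each pipeline is a sequence of hashable stage identifiers; the shared
    prefix is executed once and its output reused for every variant.
    """
    if not pipelines:
        return 0
    n = min(len(p) for p in pipelines)
    lcp = 0
    while lcp < n and len({p[lcp] for p in pipelines}) == 1:
        lcp += 1
    return lcp

def run_with_prefix_cache(pipelines, execute_stage, inp):
    """Execute pipelines, computing the shared prefix only once."""
    lcp = longest_common_prefix(pipelines)
    x = inp
    for stage in pipelines[0][:lcp]:        # shared prefix: run once
        x = execute_stage(stage, x)
    results = []
    for p in pipelines:
        y = x                               # resume from cached prefix
        for stage in p[lcp:]:
            y = execute_stage(stage, y)
        results.append(y)
    return results

calls = []
def execute(stage, x):
    calls.append(stage)
    return x + [stage]

out = run_with_prefix_cache(
    [["retrieve", "rerankA"], ["retrieve", "rerankB"]], execute, [])
```

Here the shared `retrieve` stage runs once even though two pipeline variants consume its output.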
- On-Demand Privatization and Merge (CCache):
- Introduces per-line CCache metadata, a per-core Source Buffer, and a merge function register.
- Formalizes the merge as an associative, commutative reduction of per-core partial values (v = v1 ⊕ v2 ⊕ … ⊕ vn), ensuring correctness independent of merge schedule (Balaji et al., 2017).
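A software analogue of the privatize-then-merge discipline (the hardware Source Buffer and per-line metadata are abstracted into a dictionary; all names are illustrative):

```python
from functools import reduce
from operator import add

class PrivatizedLine:
    """Software analogue of CCache privatization: each thread updates a
    private copy of a commutative value; copies merge in any order via an
    associative, commutative operator, so no schedule changes the result.
    """
    def __init__(self, identity, merge_op):
        self.identity = identity
        self.merge_op = merge_op
        self.private = {}          # thread id -> private partial value

    def update(self, tid, delta):
        # Threads accumulate locally; no coherence traffic or locking.
        self.private[tid] = self.merge_op(
            self.private.get(tid, self.identity), delta)

    def merge(self):
        # Merge order is irrelevant for a commutative operator.
        return reduce(self.merge_op, self.private.values(), self.identity)

line = PrivatizedLine(0, add)
for tid, delta in [(0, 3), (1, 4), (0, 2), (2, 1)]:
    line.update(tid, delta)
total = line.merge()
```

Swapping `add` for any other associative, commutative operator (max, bitwise-or, set union) preserves the schedule-independence argument.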
- Pareto Caching in Model Checking:
- Each component maintains Pareto cache entries recording under- and over-approximations of its achievable weight set.
- When a query is posed, computes the cached lower and upper bounds for the queried weight; triggers local value iteration only on a cache miss, updating the generators of the under- and over-approximations accordingly (Watanabe et al., 2024).
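In a deliberately scalar simplification (real entries are Pareto sets of weight vectors, not single numbers; the function names are hypothetical), the hit/miss logic looks like:

```python
def pareto_cached_query(cache, component, weight, local_vi, eps=1e-6):
    """Answer a weighted query against a component's Pareto cache.

    `cache` maps component -> (lower, upper), here simplified to scalar
    under/over-approximations of the achievable value. On a miss (bounds
    too loose), local value iteration refines the bounds and the cache is
    updated so later queries can be answered soundly without re-solving.
    """
    lower, upper = cache.get(component, (float("-inf"), float("inf")))
    if upper - lower <= eps:
        return lower, False                 # cache hit: sound reuse
    value = local_vi(component, weight)     # miss: refine locally
    cache[component] = (value, value)
    return value, True

vi_calls = []
def local_vi(component, weight):
    vi_calls.append(component)
    return 0.75

cache = {}
v1, miss1 = pareto_cached_query(cache, "M1", (0.5, 0.5), local_vi)
v2, miss2 = pareto_cached_query(cache, "M1", (0.5, 0.5), local_vi)
```

The second identical query is served from the cache, which is the source of the reduction in repeated local VI calls reported in the benchmarks.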
- Flexible Open-Vocabulary Attribute Detection:
- Constructs a compositional cache via an attribute-object compatibility function, sampled over attributes and objects using LLM and corpus evidence.
- Aggregates attribute predictions at inference using a soft-blend score over cache image similarities, regularized and combined with backbone VLM outputs (Garosi et al., 24 Mar 2025).
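A minimal sketch of the cache-blended scoring (the `alpha` and `temperature` parameters and all function names are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def comca_scores(query_feat, cache_feats, cache_soft_labels,
                 zero_shot_logits, alpha=0.5, temperature=0.07):
    """Blend cache evidence with backbone zero-shot predictions.

    cache_feats: (N, D) image-exemplar embeddings; cache_soft_labels:
    (N, A) soft attribute labels from LLM-guided compatibility. The query
    attends to cache entries by cosine similarity, aggregates their soft
    labels, and mixes the result with the VLM's zero-shot logits.
    """
    q = query_feat / np.linalg.norm(query_feat)
    c = cache_feats / np.linalg.norm(cache_feats, axis=1, keepdims=True)
    sim = softmax((c @ q) / temperature)        # attention over exemplars
    cache_pred = sim @ cache_soft_labels        # aggregated soft labels
    return alpha * cache_pred + (1 - alpha) * softmax(zero_shot_logits)

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))
labels = np.array([[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]])
scores = comca_scores(feats[0], feats, labels, np.array([0.0, 0.0]))
```

Because the blend is training-free and touches only the scoring routine, it slots behind any backbone that exposes image embeddings and zero-shot logits.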
- Dynamic Web Template Caching (Vcache):
- Employs a fragmentor to extract templates identified by gap and loop markers; bindings encapsulate run-specific content.
- Reassembly at the client is governed by Plug operators and recursive expansion over template hierarchies (Goyal et al., 2010).
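A toy reassembly sketch in the spirit of the Plug operators (the `{{gap:...}}`/`{{loop:...}}` marker syntax is invented for illustration; Vcache's actual protocol differs):

```python
import re

GAP = re.compile(r"\{\{gap:(\w+)\}\}")
LOOP = re.compile(r"\{\{loop:(\w+)\}\}(.*?)\{\{endloop\}\}", re.S)

def plug(template, bindings):
    """Plug per-request bindings into a cached template.

    Gaps take scalar bindings; loops expand their body once per item,
    recursively plugging each item's own bindings (hierarchical
    reconstruction of the page from cacheable and non-cacheable parts).
    """
    def expand_loop(m):
        name, body = m.group(1), m.group(2)
        return "".join(plug(body, item) for item in bindings.get(name, []))
    out = LOOP.sub(expand_loop, template)
    return GAP.sub(lambda m: str(bindings.get(m.group(1), "")), out)

template = ("<h1>{{gap:title}}</h1><ul>"
            "{{loop:items}}<li>{{gap:name}}</li>{{endloop}}</ul>")
page = plug(template, {"title": "Cart",
                       "items": [{"name": "pen"}, {"name": "ink"}]})
```

The template string is the cacheable artifact; only the (small) bindings dictionary travels per request.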
- Cache Partitioning for Predictable Multiprocessing:
- Models the per-entity cache allocation as a decision variable subject to global budget and exclusivity constraints, with off-chip misses minimized via integer linear programming (ILP) or greedy heuristics (0710.4658).
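The greedy alternative to the ILP can be sketched with hypothetical per-entity miss curves (entity names and curve values are invented for illustration):

```python
def greedy_partition(miss_curves, total_ways):
    """Greedy cache partitioning under a global way budget.

    miss_curves[e][w] = predicted off-chip misses for entity e with w
    private ways (non-increasing in w). Each step gives one more way to
    the entity with the largest marginal miss reduction; regions stay
    exclusive, so allocations never exceed the budget.
    """
    alloc = {e: 0 for e in miss_curves}
    for _ in range(total_ways):
        def gain(e):
            w = alloc[e]
            curve = miss_curves[e]
            if w + 1 >= len(curve):
                return 0
            return curve[w] - curve[w + 1]   # marginal miss reduction
        best = max(alloc, key=gain)
        if gain(best) <= 0:
            break                            # no entity benefits further
        alloc[best] += 1
    return alloc

curves = {"jpeg":  [100, 60, 40, 35, 34],
          "canny": [80, 70, 65, 64, 64],
          "buf":   [50, 20, 19, 19, 19]}
alloc = greedy_partition(curves, 4)
```

The greedy rule matches the ILP's objective locally and is attractive when miss curves must be re-estimated online.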
3. Analytical Guarantees and Performance Properties
ComCa yields both empirical and theoretical advances verified in targeted benchmarks:
- KVCompose achieves compression ratios up to 80.9% with AUC = 82.3, outperforming competing methods such as TOVA, SnapKV, and PyramidKV. Performance is maintained under ±20% error tolerance across architectures (Qwen2, Qwen3, LLaMA-3.1) (Akulov et al., 5 Sep 2025).
- In PyTerrier IR pipelines, implicit and explicit caching decrease runtime by up to 50% (hot cache, MSMARCO v1) compared to no caching (MacAvaney et al., 14 Apr 2025).
- CCache yields speedups up to 3.2× over fine-grained locking, reduces L3 misses by 2.5–3×, and achieves predictable memory overhead (<0.1% LLC area for Source Buffer) (Balaji et al., 2017).
- Pareto caching in MDP verification achieves up to 100× speedup, with up to 70% reduction in repeated local VI calls and cache hit rates between 50%–90% (Watanabe et al., 2024).
- In open-vocabulary attribute detection, ComCa (with CLIP ViT-B/32) yields mAP improvements of +10.4 (OVAD) and +8.1 (VAW) over zero-shot, outperforming all other cache-based approaches and competing with training-based baselines (Garosi et al., 24 Mar 2025).
- Multiprocessor cache partitioning reduces miss rates by 5–6.5× and improves CPI per core by 20% (JPEG/canny) and 4% (MPEG-2) compared to shared cache (0710.4658).
4. Architectural and Implementation Strategies
ComCa strategies are designed for minimal disruption and high compatibility:
- KVCompose maintains fixed tensor layouts, does not require custom CUDA kernels, and operates as a drop-in replacement for standard cache extension routines in Transformer inference stacks (Akulov et al., 5 Sep 2025).
- PyTerrier’s prefix precompute requires only a configuration flag and no pipeline rewriting. Explicit cache wrappers preserve pipeline compositionality and end-to-end declarativity (MacAvaney et al., 14 Apr 2025).
- CCache implements hardware-only privatization and merging with low overhead, requiring minimal changes to cache controller logic and no additional per-thread memory allocation (Balaji et al., 2017).
- MDP model checking retains compatibility with standard scheduler propagation and value iteration routines, with Pareto caches only extending per-component data stores (Watanabe et al., 2024).
- Attribute detection ComCa integrates with CLIP-, SigLIP-, and CoCa-based VLMs through API-level augmentation of the zero-shot scoring routine, imposing negligible (<1%) inference overhead (Garosi et al., 24 Mar 2025).
- Vcache requires a server-side script fragmentor and a lightweight client operator library or proxy; standard HTTP caching semantics handle template consistency (Goyal et al., 2010).
- ComCa in multiprocessors is implemented via OS-maintained task IDs, address-range buffering, and cache controller indexing remap tables, preserving process isolation with minimal runtime overhead (0710.4658).
5. Domain-Specific Limitations and Considerations
Several caveats and best practices are identified across domains:
- Prefix precompute in IR only optimizes a single global prefix, not arbitrary shared subgraphs, and explicit cache wrappers must be placed accurately by the practitioner (MacAvaney et al., 14 Apr 2025).
- Cache privatization in multiprocessors is bounded by associativity limits; context switches require coordination to ensure Source Buffer integrity (Balaji et al., 2017).
- Attribute detection ComCa may suffer from retrieval corpus gaps or LLM compatibility hallucinations; best results are obtained by batch prompting and cross-validation with multiple LLMs (Garosi et al., 24 Mar 2025).
- Vcache assumes print-style scripting and requires explicit server and client support; irregular control flow is less efficiently cached (Goyal et al., 2010).
- Static partitioning in embedded systems must be recalibrated if task or buffer footprints change, though re-optimization is tractable for moderate system sizes (0710.4658).
- ComCa in verification does not guarantee convergence in cyclic (loopy) diagrams and the efficiency of global stopping checks is instance-dependent (Watanabe et al., 2024).
6. Broader Significance and Applications
Compositional Caching underpins a class of scalable, reusable, and efficient computation frameworks:
- Enables efficient long-context autoregressive LLM deployment under memory constraints, with minimal engineering overhead (Akulov et al., 5 Sep 2025).
- Raises experiment throughput and modularity in large-scale IR and NLP evaluation environments (MacAvaney et al., 14 Apr 2025).
- Paves the path for high-throughput, compositional, and power-efficient embedded multiprocessor deployments (0710.4658).
- Facilitates scalable, sound model checking for modular or hierarchical MDP models (Watanabe et al., 2024).
- Supports practical, training-free attribute detection across evolving vocabularies and data domains (Garosi et al., 24 Mar 2025).
- Extends transparent caching to dynamic content generation for web systems (Goyal et al., 2010).
- Unifies physical cache and software abstraction layers for enterprise-level data analytics and graph computation (Balaji et al., 2017).
Across these domains, ComCa demonstrates that structured, context- or component-aware caching is a principled strategy for mitigating computational and memory bottlenecks while preserving or enhancing overall system performance and predictability.