Infinity Cache: Scalable Invalidation & Caching
- Infinity Cache denotes two distinct methodologies: generational invalidation for consistent, infinite-TTL query caching in database-backed systems, and scale-aware KV cache management in visual autoregressive models.
- It employs atomic, vectorized revision counters—updating up to 2^k counters per write—to ensure precise, concurrency-safe cache invalidation in database-backed systems.
- In VAR models, the ScaleKV strategy adaptively prunes cached keys/values using attention metrics, reducing memory usage by up to 10× while preserving inference quality.
Infinity Cache is a term encompassing two distinct high-performance caching methodologies: (a) a generational invalidation scheme for query results in relational database-backed web applications, which supports infinite time-to-live (TTL) and precise cache invalidation (Łopuszański, 2023); and (b) a cache management strategy for segmenting key-value (KV) cache in multi-scale visual autoregressive (VAR) models, addressing exponential memory growth through scale-aware, attention-adaptive pruning (Li et al., 26 May 2025). Both approaches are notable for their principled handling of invalidation and resource constraints in large-scale read-intensive systems.
1. Generational Invalidation for Query Caching
The Infinity Cache for database-backed web applications introduces a generational invalidation strategy, leveraging vectorized revision counters in a global cache. Each cache entry is tagged with a vector of "revision" integers, each associated with a specific query subspace. Upon each database write (insert/delete), the revision counters for all 2^k wildcardings of the write's k columns are incremented, yielding fine-grained, concurrency-safe invalidation.
A tripartite revision dependency graph formalizes the relations between write queries, pattern nodes (wildcard/query mask patterns), and read queries. Every pattern node is tagged with a revision integer. Invalidation after a write is performed by atomically incrementing the 2^k global counters corresponding to all relevant pattern variants. Query results are validated by comparing the persisted revision vector against the current values obtained in one batched fetch.
This scheme guarantees that no cache hit can ever return a result older than the latest possible write affecting the query, modulo a bounded delay ε required to perform invalidation (Łopuszański, 2023).
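To make the wildcarding concrete, the following minimal Python sketch enumerates the 2^k pattern variants of a point write and bumps a counter for each. The names (pattern_variants, bump_revisions) are illustrative, and a plain dict stands in for the shared atomic global cache:

```python
from itertools import product

WILDCARD = "*"

def pattern_variants(key):
    """All 2^k wildcardings of a k-column key, e.g. ('u1', 'post') ->
    ('u1','post'), ('u1','*'), ('*','post'), ('*','*')."""
    choices = [(v, WILDCARD) for v in key]
    return [tuple(p) for p in product(*choices)]

# Illustrative stand-in for the shared global cache of revision counters.
revisions = {}

def bump_revisions(write_key):
    """Generational invalidation: increment the counter of every pattern
    node touched by this write (2^k counters for k columns); a real
    deployment would use an atomic increment in the global cache."""
    for pattern in pattern_variants(write_key):
        revisions[pattern] = revisions.get(pattern, 0) + 1

bump_revisions(("u1", "post"))
assert revisions[("*", "post")] == 1 and revisions[("*", "*")] == 1
```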
2. Cache and Query Model Formalism
Let S = D_1 × … × D_k denote the product of the table's k column domains. A record r is a point r ∈ S, and a query q is a k-tuple with q[i] ∈ D_i ∪ {∗}, where ∗ is a wildcard. The subspace of a query, subspace(q), is the set of all v ∈ S such that for every i, either q[i] = v[i] or q[i] = ∗.
Operations:
- select(q): retrieves entries in subspace(q)
- insert(r): adds a point r
- delete(q): removes records in subspace(q)
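A minimal sketch of this query model, with illustrative names and an in-memory list standing in for the table:

```python
WILDCARD = "*"

def in_subspace(record, query):
    """True iff record ∈ subspace(query): every column matches the
    query's literal or the query holds a wildcard there."""
    return all(q == WILDCARD or q == r for r, q in zip(record, query))

table = [("u1", "post"), ("u1", "like"), ("u2", "post")]

def select(query):
    return [r for r in table if in_subspace(r, query)]

assert select(("u1", WILDCARD)) == [("u1", "post"), ("u1", "like")]
assert select((WILDCARD, "post")) == [("u1", "post"), ("u2", "post")]
```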
The system is typically organized around two cache levels:
- globalCache: shared, used for revision counters and optionally holding full results
- localCache: per-node, for rapid repeated lookups
The critical data structure, the per-pattern revision counters, grows in the worst case as O(2^k · |distinct parameter values|), but is typically much smaller in realistic workloads via optimizations such as dimension pruning and graph trimming.
3. Algorithmic and Correctness Properties
Infinity Cache ensures correctness through monotonicity and atomicity of revision counters:
- Each successful increment strictly increases the relevant revision.
- On select(q) at time t, the observed revision vector is componentwise at least the highest revision ever valid for q, so revisions are never underestimated.
- Any invalidate(q_w) whose subspace intersects that of a read query increments at least one pattern node watched by that query before any subsequent select completes.
Formally, if select(q) executes at time t, it returns the database's result as of some time t′ ≥ t − ε, where ε is bounded by the duration of the invalidate routine (Łopuszański, 2023).
For a k-column table, each invalidate operates on exactly 2^k revision counters, and each select performs at most 2^k global cache accesses (amortized into a single bulk call). Local cache accesses, when enabled, are O(1).
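Combining the pieces, here is a single-process sketch of validated reads under point writes. An in-memory dict again stands in for the atomic global cache, and the watched-pattern set is simplified to the query's own mask, which suffices for point writes; the paper's tripartite dependency graph generalizes the watched set to wildcarded deletes, omitted here:

```python
from itertools import product

WILDCARD = "*"
revisions = {}   # pattern node -> revision counter (global-cache stand-in)
results = {}     # query mask -> (revision seen at fill time, cached rows)

def invalidate(record):
    # A point write bumps all 2^k wildcardings of the record, i.e. the
    # mask of every read query whose subspace contains it.
    for pattern in product(*[(v, WILDCARD) for v in record]):
        revisions[pattern] = revisions.get(pattern, 0) + 1

def cached_select(query, compute):
    # For point writes, the query's own mask is the one pattern node
    # that every intersecting write bumps, so validation is one fetch.
    current = revisions.get(query, 0)
    hit = results.get(query)
    if hit is not None and hit[0] == current:
        return hit[1]                      # fresh cache hit, infinite TTL
    rows = compute(query)                  # miss or stale: recompute
    results[query] = (current, rows)       # revision read *before* recompute
    return rows

rows = cached_select(("u1", WILDCARD), lambda q: ["v1"])
invalidate(("u1", "post"))                 # intersects the cached query
assert cached_select(("u1", WILDCARD), lambda q: ["v2"]) == ["v2"]
```

Reading the revision counter before recomputing is the conservative ordering: a write racing with the recompute can only cause an extra recomputation later, never an undetected stale hit.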
4. Practical Optimizations and Deployment
Several optimizations are undertaken for domain-specific workloads:
- Dimension pruning: restrict revision tracking to columns actually appearing in predicates.
- Graph trimming: restrict tracking to observed access/write patterns, reducing unnecessary revision nodes.
- CRUD-only workloads: for pure point operations, only exact pattern nodes are used.
- Range queries: via binary decomposition into power-of-two segments merged at tree nodes, curbing superlinear counter growth (see the sketch after this list).
- Persistent storage: durable key/value stores (e.g., Redis with append-only log) are used for revision counters, with explicit pattern white-listing to avoid unbounded key growth.
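One common realization of such a binary decomposition, sketched here as an assumption rather than the paper's exact construction, covers an integer range with O(log n) aligned power-of-two blocks, so a range predicate touches logarithmically many tree-node counters instead of one per value:

```python
def dyadic_segments(lo, hi):
    """Cover the half-open integer range [lo, hi) with maximal aligned
    power-of-two blocks (start, length); at most ~2*log2(hi-lo) blocks."""
    segments = []
    while lo < hi:
        # Largest power-of-two block starting at lo (alignment limit),
        # shrunk until it fits inside the remaining range.
        size = lo & -lo or 1 << ((hi - lo).bit_length() - 1)
        while size > hi - lo:
            size >>= 1
        segments.append((lo, size))
        lo += size
    return segments

# One counter per block replaces one per individual value:
assert dyadic_segments(3, 17) == [(3, 1), (4, 4), (8, 8), (16, 1)]
```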
Empirically, the system, deployed in a PHP + MySQL + Memcached stack serving half a million users, exhibited zero correctness incidents and maintained high cache hit ratios (up to 97%) even under mixed read/write stress (Łopuszański, 2023). The worst-case stale hit was bounded by ε ≤ 0.4 s, with a median of ~0.03 s.
5. Infinity Cache in Visual Autoregressive Models
For high-dimensional VAR models (e.g., the Infinity text-to-image architecture), Infinity Cache refers to an efficient KV cache compression protocol named ScaleKV (Li et al., 26 May 2025). The Infinity model generates images via progressive refinement across scales (token maps of increasing resolution), retaining key-value pairs for all tokens at prior scales in a cache that otherwise grows exponentially due to the geometric scaling of token counts.
Through detailed attention-map analysis, transformer layers are assigned to either "drafters" (layers with global, cross-scale attention, needing a large cache) or "refiners" (layers focusing on the current scale's tokens, needing only a local cache). ScaleKV introduces the Attention Selectivity Index (ASI) to automate this classification. Layer-scale pairs with the lowest ASI (after Z-score normalization) are designated as drafters; roughly 30–40% of layers fall in this group after calibration on a small prompt set.
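The exact ASI formula is not reproduced here; the sketch below assumes a simple selectivity proxy (the share of attention mass falling on the current scale's tokens) and Z-scores it to split layer-scale pairs into drafters and refiners, purely for illustration:

```python
import numpy as np

def asi_proxy(attn, current_scale_slice):
    """Hypothetical selectivity proxy: share of attention mass that falls
    on the current scale's tokens (high -> local 'refiner' behavior).
    attn: (heads, queries, keys) softmax attention map for one layer/scale."""
    local = attn[..., current_scale_slice].sum(axis=-1)
    return local.mean()

def classify_layers(asi_scores, drafter_fraction=0.35):
    """Z-score per-(layer, scale) ASI values and mark the lowest
    (most global, cross-scale) fraction as drafters, the rest as refiners."""
    scores = np.asarray(asi_scores, dtype=np.float64)
    z = (scores - scores.mean()) / scores.std()
    cutoff = np.quantile(z, drafter_fraction)
    return np.where(z <= cutoff, "drafter", "refiner")

# Scores collected via asi_proxy over a small calibration prompt set:
print(classify_layers([0.9, 0.2, 0.8, 0.3, 0.95, 0.85]))
```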
Cache budgets (number of preserved KV tokens per layer per scale) are then adaptively allocated: drafter budgets are kept high, whereas refiner budgets decay linearly with scale. A token selection and pruning operator, combining observation-window centroids and softmax attention-based importance ranking, prunes cached keys/values per-layer to meet assigned budgets.
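A hedged sketch of the budgeting and pruning step, with illustrative function names and parameters (the paper additionally uses observation-window centroids, omitted here): drafter budgets stay high, refiner budgets decay linearly over scales, and cached tokens are retained by top-k attention mass from a recent query window:

```python
import torch

def cache_budget(role, scale, num_scales, full_len, drafter_keep=0.9,
                 refiner_start=0.5, refiner_end=0.05):
    """Number of KV tokens to preserve for one (layer, scale) pair."""
    if role == "drafter":
        return int(full_len * drafter_keep)
    # Refiners: budget decays linearly as scales grow.
    frac = refiner_start + (refiner_end - refiner_start) * scale / (num_scales - 1)
    return max(1, int(full_len * frac))

def prune_kv(keys, values, window_queries, budget):
    """Keep the `budget` cached positions receiving the most softmax
    attention from a recent observation window of queries."""
    # keys/values: (seq, dim); window_queries: (w, dim)
    scores = torch.softmax(window_queries @ keys.T / keys.shape[-1] ** 0.5, dim=-1)
    importance = scores.sum(dim=0)                 # total mass per cached token
    keep = importance.topk(budget).indices.sort().values  # preserve order
    return keys[keep], values[keep]

k, v = torch.randn(1024, 64), torch.randn(1024, 64)
q_win = torch.randn(16, 64)
budget = cache_budget("refiner", scale=5, num_scales=10, full_len=1024)
k2, v2 = prune_kv(k, v, q_win, budget)
print(k2.shape)  # torch.Size([256, 64])
```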
6. Empirical Results and Quantitative Analysis
On the Infinity-8B VAR model, ScaleKV reduces KV cache memory usage by 10× (from 84.3 GB to 8.53 GB at a 10% budget) with minimal impact on sample quality: at a 20% budget, FID = 1.45, LPIPS = 0.06, and PSNR = 25.60, close to the full-cache baseline. Even at a 10% budget, performance remains competitive (FID = 2.12, LPIPS = 0.09, PSNR = 23.25).
Speed improvements are significant: up to 1.25× inference speedup on a single NVIDIA H20 GPU, and batch-size-8 inference goes from out-of-memory (>100 GB required) to feasible (49 GB). Semantic quality, as measured by the GenEval and DPG benchmarks, remains virtually unchanged at moderate cache budgets (Li et al., 26 May 2025).
| Model | Full KV cache (GB) | ScaleKV at 10% budget (GB) | FID at 10% | PSNR at 10% (dB) |
|---|---|---|---|---|
| Infinity-8B | 84.3 | 8.53 | 2.12 | 23.25 |
| Infinity-2B | 38.6 | 3.90 | 2.12 | 23.25 |
7. Implications, Extensions, and Limitations
Infinity Cache, both as a query invalidation algorithm and as a memory-efficient mechanism for hierarchical autoregressive modeling, enables scalable operation in resource-constrained environments without compromising correctness or predictive fidelity. In database settings, it strictly dominates naive short-TTL or full-flush caching policies, supporting infinite-TTL with fine-grained, concurrent-safe invalidation (Łopuszański, 2023). In vision models, scale-aware cache pruning based on attention analysis and per-layer budget optimization offers a principled path to supporting high-resolution inference at practical memory consumption (Li et al., 26 May 2025).
Limitations in the VAR setting include the need for post-hoc budget tuning and the current restriction to token-level pruning. Possible future work includes integrated quantization, head-level pruning, and adaptive, runtime budget reallocation. For the database variant, space overhead remains a function of unique access patterns and parameter cardinality; white-listing and pattern analysis mitigate potential unbounded growth.