CaliDrop in Jet Physics & LLM Inference

Updated 3 July 2026
  • CaliDrop names two unrelated techniques: a jet-substructure observable in QCD that isolates soft emissions, and a calibrated KV cache compression method for transformer inference.
  • In jet substructure, CollinearDrop selectively removes collinear radiation to reveal non-perturbative hadronization effects and validate angular ordering.
  • For LLMs, CaliDrop calibrates token eviction using query similarity, keeping the memory savings of eviction while limiting end-task accuracy loss and throughput overhead.

CaliDrop is a term that refers to distinct innovations in two fields: jet substructure physics (“CollinearDrop”) and efficient inference with transformer-based LLMs (“KV Cache Compression with Calibration”). The term’s usage diverges in context but converges on the central purpose of retaining essential information while discarding redundant or less relevant content—whether in QCD jets or neural network memory management.

1. CaliDrop in Jet Substructure: The CollinearDrop Observable

In high-energy nuclear physics, “CollinearDrop” (abbreviated “CaliDrop” in some contexts) denotes a jet substructure observable designed to enhance sensitivity to soft phase-space by selectively removing collinear radiation from jets while retaining soft emissions. The jet mass after this grooming provides insight into non-perturbative hadronization phenomena.

The construction proceeds as follows:

  • Ungroomed Jet Mass:

$$M = \sqrt{E^2 - |\vec{p}|^2}$$

where $E$ and $\vec{p}$ are the energy and momentum summed over all jet constituents.

  • Grooming (SoftDrop Algorithm):
    • SoftDrop with parameters $(z_{\mathrm{cut}}, \beta) = (0.1, 0)$.
    • Iteratively declusters until

    $$z_g \equiv \frac{\min(p_{T,1}, p_{T,2})}{p_{T,1} + p_{T,2}} > z_{\mathrm{cut}} \left(\frac{R_g}{R_\mathrm{jet}}\right)^\beta$$

    where $p_{T,1}$ and $p_{T,2}$ are the transverse momenta of the two subjets, $R_g$ is the angular distance between them, and $R_\mathrm{jet}$ is the jet radius.

  • Definition of CaliDrop Observable:

$$\Delta M = M - M_g, \qquad \frac{\Delta M}{M} = \frac{M - M_g}{M}$$

and equivalently,

$$a = \frac{M^2 - M_g^2}{p_T^2}$$

with $M_g$ the SoftDrop-groomed jet mass.
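As a rough illustration of the definitions above, the following Python sketch computes the jet mass from constituent four-vectors, checks the SoftDrop condition for a single splitting, and forms the CollinearDrop observables. The `(E, px, py, pz)` array layout, the function names, and the toy numbers are assumptions made for illustration; the reclustering and iterative declustering that a real analysis needs are not shown.

```python
import numpy as np

def jet_mass(constituents):
    """Invariant mass of a jet from an (N, 4) array of (E, px, py, pz) rows."""
    E, px, py, pz = np.asarray(constituents, dtype=float).sum(axis=0)
    m2 = E**2 - (px**2 + py**2 + pz**2)
    return float(np.sqrt(max(m2, 0.0)))  # clip tiny negative rounding errors

def passes_softdrop(pt1, pt2, r_g, r_jet, z_cut=0.1, beta=0.0):
    """SoftDrop condition z_g > z_cut * (R_g / R_jet)**beta for one splitting."""
    z_g = min(pt1, pt2) / (pt1 + pt2)
    return z_g > z_cut * (r_g / r_jet) ** beta

def collinear_drop(m, m_g, pt):
    """Return (Delta M, Delta M / M, a) for ungroomed mass m and groomed mass m_g."""
    delta_m = m - m_g
    return delta_m, delta_m / m, (m**2 - m_g**2) / pt**2

# Toy jet (hypothetical values): two massless constituents.
toy_jet = [[5.0, 3.0, 4.0, 0.0], [5.0, -3.0, 4.0, 0.0]]
m = jet_mass(toy_jet)
# Pretend grooming left a mass of 4.0 on a jet with pT = 20.0.
delta_m, frac, a = collinear_drop(m, 4.0, 20.0)
```

With $\beta = 0$ the angular factor equals 1, so the condition reduces to $z_g > z_{\mathrm{cut}}$ regardless of $R_g$.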

CollinearDrop therefore measures the mass (or the $a$ variable) carried by the soft, wide-angle radiation that SoftDrop grooms away: subtracting the groomed mass cancels the collinear contribution, which gives enhanced access to wide-angle soft activity in jets (Song, 2023).
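The mass difference and the $a$ variable are linked by factoring a difference of squares, which may help when comparing results quoted in one variable or the other. This is plain algebra on the definitions of this section, not a result taken from the cited work:

```latex
a = \frac{M^2 - M_g^2}{p_T^2}
  = \frac{(M - M_g)(M + M_g)}{p_T^2}
  = \frac{\Delta M \,(2M - \Delta M)}{p_T^2}
```

For light grooming ($\Delta M \ll M$) this gives $a \approx 2 M \, \Delta M / p_T^2$, so the two variables then differ only by a jet-dependent scale factor.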

2. CaliDrop in LLM Inference: KV Cache Compression with Calibration

CaliDrop also denotes a strategy for reducing memory requirements during autoregressive decoding in LLMs via calibrated token eviction from the key-value (KV) cache. The KV cache stores the keys and values of past tokens, so each decoding step costs $O(n)$ attention over the cached sequence instead of recomputing those projections. However, its memory usage grows linearly with sequence length, batch size, and model width, creating a bottleneck at long contexts.
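The linear growth can be made concrete with a small size estimate. The sketch below is only an accounting exercise; the function name is hypothetical, and the model shape (32 layers, 8 KV heads, head dimension 128, 2-byte elements) is an assumed LLaMA-3-8B-like configuration rather than a figure from the source.

```python
def kv_cache_bytes(seq_len, batch_size, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer, head, and token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # 2 = K and V
    return per_token * seq_len * batch_size

# Assumed shape: 32 layers, 8 KV heads, head dimension 128, fp16 storage.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)
full = kv_cache_bytes(seq_len=8192, batch_size=1, **shape)
budget = kv_cache_bytes(seq_len=128, batch_size=1, **shape)
print(full / 2**30, "GiB for 8192 tokens;", budget / 2**20, "MiB for a 128-token budget")
```

Under these assumptions, an 8192-token context needs 1 GiB of cache per sequence, while a 128-token budget needs 16 MiB.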

Existing Approaches:

  • Quantization: Reduces KV precision (e.g., FlexGen, KIVI, QAQ).

  • Low-rank Projections: Compresses dimensionality (e.g., Palu).

  • Token Eviction: Discards less influential tokens (e.g., StreamingLLM, H2O, SnapKV).

Token eviction, the most direct of the three, suffers substantial end-task accuracy loss at high compression ratios. CaliDrop was developed to calibrate for the missing information and thereby recover lost performance (Su et al., 26 Jul 2025).
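To make the eviction idea concrete, here is a minimal sketch of a score-based selection rule: keep the most recent tokens plus the highest-scoring older ones. It is a generic illustration, not the actual selection rule of StreamingLLM, H2O, or SnapKV; the function name, the `scores` input, and the `recent` window are assumptions.

```python
import numpy as np

def select_tokens(scores, budget, recent=4):
    """Split token indices into (kept, evicted) under a fixed cache budget.

    scores: 1-D per-token importance (e.g. accumulated attention; assumed input).
    The last `recent` tokens are always kept; the rest of the budget goes
    to the highest-scoring older tokens.
    """
    scores = np.asarray(scores, dtype=float)
    n = len(scores)
    if budget >= n:
        return np.arange(n), np.array([], dtype=int)
    recent = min(recent, budget)
    old = np.arange(n - recent)
    top_old = old[np.argsort(scores[old])[::-1][: budget - recent]]
    kept = np.sort(np.concatenate([top_old, np.arange(n - recent, n)]))
    evicted = np.setdiff1d(np.arange(n), kept)
    return kept, evicted

kept, evicted = select_tokens([0.9, 0.1, 0.8, 0.2, 0.3, 0.1, 0.5, 0.4], budget=4, recent=2)
```

Whatever rule is used, the output is the same kind of split: a small retained set that stays in fast memory and an evicted set whose contribution is otherwise lost.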

3. Algorithmic Description of CaliDrop for KV Compression

CaliDrop wraps around any token eviction scheme and leverages the empirical similarity of consecutive transformer queries:

  • Empirical Finding:

For queries $Q_{t-1}, Q_t$ at nearby steps, cosine similarity is reported to exceed 0.85 (LLaMA-3-8B, LongBench). This suggests the attention output computed with the most recent available query can proxy that of subsequent queries over the evicted tokens.

  • Prefill Phase:

    • Calculate the keys and values for all tokens.
    • Select the tokens to be retained in the fast cache and the tokens to be evicted via the user's chosen eviction function.
    • Offload the evicted keys and values.
    • For the last prefill query, compute the attention output over the evicted tokens and store it as the calibration term.

    [equations missing in source]

  • Decode Phase:

    • Compute the current query.
    • Compute the cosine similarity between the current query and the most recent query for which calibration was computed.
    • Similarity below the reload threshold: reload and recompute calibration.
    • Similarity above the mixing threshold: merge the old calibration output.
    • otherwise: skip calibration.
    • Merge via:

    [equations missing in source]

Intuitively, calibration amounts to blending in a precomputed “memory” of the attention over evicted tokens, weighted by the degree of query similarity (Su et al., 26 Jul 2025).
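The control flow of the two phases can be sketched in a few lines of NumPy for a single head and a single query vector. Several things here are assumptions rather than the authors' method: the function names, the default thresholds, the choice to compare against the query of the latest calibration, and above all the merge rule. The merge below is the standard log-sum-exp combination of partial softmax attention, which is exact only when the stored query equals the current one; the source's own merge, which weights by query similarity, is not reproduced.

```python
import numpy as np

def attend(q, K, V):
    """Single-query softmax attention; returns (output, log-sum-exp of scores)."""
    scores = K @ q / np.sqrt(q.shape[0])
    m = scores.max()
    w = np.exp(scores - m)
    return (w @ V) / w.sum(), m + np.log(w.sum())

def merge(o_a, lse_a, o_b, lse_b):
    """Combine two partial attention outputs using their softmax normalizers."""
    m = max(lse_a, lse_b)
    w_a, w_b = np.exp(lse_a - m), np.exp(lse_b - m)
    return (w_a * o_a + w_b * o_b) / (w_a + w_b)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def prefill_calibration(q_last, K_evict, V_evict):
    """One extra attention pass over the evicted tokens with the given query."""
    o, lse = attend(q_last, K_evict, V_evict)
    return {"q": q_last, "o": o, "lse": lse}

def decode_step(q, K_keep, V_keep, K_evict, V_evict, calib,
                tau_reload=0.7, tau_mix=0.85):
    """Attention for one decode query with optional calibration (assumed rule)."""
    o_keep, lse_keep = attend(q, K_keep, V_keep)
    sim = cosine(q, calib["q"])
    if sim < tau_reload:
        # Query drifted: reload the offloaded KV and recompute the calibration.
        calib.update(prefill_calibration(q, K_evict, V_evict))
    elif sim < tau_mix:
        return o_keep  # in-between regime: skip calibration
    return merge(o_keep, lse_keep, calib["o"], calib["lse"])
```

In this sketch `K_evict` and `V_evict` stand in for the offloaded storage; they are touched only in the reload branch, which is where the extra transfer and compute cost would arise.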

4. Practical Integration and Overhead

  • Integration:

Requires only hooks in the prefill and decode logic; no retraining or model modification is needed.

  • Memory Savings:

Equal to the baseline token eviction scheme; only the selected subset of tokens is retained in the fast KV cache.

  • Additional Compute:

    • Prefill: one extra attention pass over evicted tokens.
    • Decode: periodic recomputation as dictated by the cosine threshold (on the order of 1/8 of steps at the quoted threshold setting), with vector mixing otherwise.
  • Throughput:

With SnapKV as the eviction baseline on LLaMA-3-8B (A100, 1024 input tokens, 128 output tokens, KV budget 128), SnapKV is about 1.4× faster than FullKV, and SnapKV+CaliDrop is only about 5% slower than SnapKV.
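Taking the two quoted figures at face value, the speedup of SnapKV+CaliDrop over FullKV can be estimated with one line of arithmetic. This is a back-of-the-envelope reading, not a number reported in the source, and it depends on whether "5% slower" refers to throughput or to latency:

```latex
\underbrace{1.4 \times 0.95}_{\text{5\% lower throughput}} = 1.33,
\qquad
\underbrace{1.4 / 1.05}_{\text{5\% higher latency}} \approx 1.33
```

Either reading leaves roughly a 1.3× speedup over the full cache, so most of the throughput benefit of eviction would be preserved.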

5. Empirical Results

Experimental validation encompasses several LLMs and long-context benchmarks:

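| Baseline | Model | KV Budget | Score | +CaliDrop | Gain |
|---|---|---|---|---|---|
| SnapKV | Mistral-7B | 64 | 33.79 | 37.90 | +4.1 |
| H2O | Mistral-7B | 64 | 33.81 | 37.81 | +4.0 |
| SLM | Mistral-7B | 64 | 28.64 | 33.62 | +5.0 |
| SnapKV | LLaMA-3-8B (RULER) | 64 | 14.95% | 23.32% | +8.4pp |
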
  • Needle-in-a-Haystack: At 8K/32K context with LLaMA-3-8B @ KV=128, SnapKV collapses for deep retrieval while CaliDrop nearly recovers full recall.
  • Performance Trends:

Benefits peak under highly compressed KV budgets (64–256 tokens). As the retained KV approaches full (≥1024 tokens), additional gains become marginal.

6. Parameterization and Guidelines

  • Ideal Usage:

Applied when token eviction is necessary to fit tight GPU memory budgets.

  • Calibration Thresholds:
    • Reload threshold: trades off throughput against accuracy; typical range 0.6–0.8.
    • Mixing threshold: governs calibration intensity; values ≳0.85 are effective.
  • Trade-offs:

Higher calibration rates (more frequent recomputation) yield better accuracy at a mild computational cost; shrinking the cache increases both the baseline degradation and the magnitude of the CaliDrop gain.
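One way to explore the throughput side of this trade-off before deployment is to replay a recorded sequence of decode queries and count how often a given reload threshold would fire. The sketch below is a hypothetical tuning aid, not part of the published method; it assumes similarity is measured against the query of the most recent calibration and that a reload resets that reference.

```python
import numpy as np

def reload_rate(queries, tau_reload):
    """Fraction of decode steps that trigger a reload for a given threshold.

    queries: (T, d) array; row 0 plays the role of the last prefill query.
    """
    queries = np.asarray(queries, dtype=float)
    ref, reloads = queries[0], 0
    for q in queries[1:]:
        sim = q @ ref / (np.linalg.norm(q) * np.linalg.norm(ref))
        if sim < tau_reload:
            reloads += 1
            ref = q  # calibration is recomputed with the current query
    return reloads / (len(queries) - 1)

# Synthetic slowly drifting queries (toy data, not measurements).
rng = np.random.default_rng(0)
trace = np.cumsum(rng.normal(size=(200, 64)), axis=0) + 10.0
for tau in (0.6, 0.7, 0.8):
    print(tau, reload_rate(trace, tau))
```

A threshold whose reload rate is too high for the available bandwidth between fast and offloaded memory can then be lowered, at the accuracy cost described above.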

7. Significance and Outlook

In jet substructure, CollinearDrop provides a powerful diagnostic of early soft radiation and non-perturbative activity, with direct experimental measurements at STAR showing anti-correlation between grooming and hard splitting scale, validating angular ordering predictions and providing tests for parton-shower modeling (Song, 2023). In deep learning, CaliDrop for KV cache compression effectively reclaims substantial end-task performance lost to aggressive token eviction, with trivial integration overhead and no changes to model training, as demonstrated across LLM architectures and long-context benchmarks (Su et al., 26 Jul 2025).

A plausible implication is that the CaliDrop concept—exploiting redundancy or similarity in sequential representations—may generalize to other memory bottlenecks in sequential inference, and its empirical success motivates further research into hybrid calibration and cache management schemes across domains.
