Frequency Estimation via Linear Sketches

Updated 14 November 2025
  • Frequency estimation linear sketches are compact, randomized data structures designed to estimate element frequencies in high-volume data streams with sublinear space complexity.
  • They utilize a fixed sensing matrix and randomization to enable mergeability and low-latency updates, supporting applications from data streaming to privacy analysis.
  • Recent advances, including unified Lévy-process frameworks and learning-augmented methods, improve accuracy, fairness, and robustness in distributed and adversarial settings.

Frequency Estimation Linear Sketches

Frequency estimation linear sketches are a class of compact randomized data structures for approximating, with strong guarantees, the frequency (i.e., the number of occurrences) of elements in high-volume data streams. They achieve sublinear space complexity and support low-latency updates and point queries. The theory of frequency estimation sketches connects data streaming, randomized linear algebra, probability theory (especially infinite divisibility), Bayesian statistics, and privacy-preserving data analysis.

1. Linear Sketching Model and Basic Frequency Estimation

A frequency estimation linear sketch typically maintains a state vector $S \in \mathbb{R}^m$, updated as $S \leftarrow S + A_{:,i}\,\Delta$ on arrival of an update $(i, \Delta)$ to an underlying frequency vector $x \in \mathbb{R}^n$. Here, $A \in \mathbb{R}^{m \times n}$ (or $\mathbb{Z}^{m \times n}$) denotes a fixed, typically randomized, "sensing" matrix. This "linear sketching" property ensures that the data structure is mergeable and compatible with the streaming (turnstile) model.
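The update rule and the mergeability it implies can be sketched in a few lines. This is a minimal illustration with an explicit dense ±1 sensing matrix (real sketches generate $A$ implicitly via hashing); the class name and seeding convention are ours, not from any cited paper.

```python
import random

class LinearSketch:
    """Minimal linear sketch S = A x with a fixed random +-1 sensing matrix
    A in {+-1}^{m x n}, derived from a shared seed so that independently
    built sketches use the same A. Illustrative only, not space-efficient."""

    def __init__(self, m, n, seed=0):
        rng = random.Random(seed)
        # A[r][i] is the sensing-matrix entry at row r, column i.
        self.A = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(m)]
        self.S = [0] * m  # state vector S = A x, with x initially 0

    def update(self, i, delta):
        # Turnstile update (i, delta): S <- S + A[:, i] * delta.
        for r in range(len(self.S)):
            self.S[r] += self.A[r][i] * delta

    def merge(self, other):
        # Linearity gives mergeability: sketch(x + y) = sketch(x) + sketch(y).
        merged = LinearSketch.__new__(LinearSketch)
        merged.A = self.A
        merged.S = [a + b for a, b in zip(self.S, other.S)]
        return merged
```

Because the state is a linear function of $x$, merging two sketches of disjoint streams is just coordinate-wise addition of their states.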

For point queries, the sketch is decoded (possibly using non-linear post-processing) to estimate xix_i, with error analyzed in worst-case, average-case, or weighted metrics. Popular primitives include:

  • Count-Min Sketch: Linear sketch with $m = d \times w$ counters, updated via $d$ hash functions; guarantees $\|\tilde{x} - x\|_\infty \leq \varepsilon \|x\|_1$ and never underestimates, though only for insert-only streams (Li, 2019, Aamand et al., 2023).
  • Count Sketch: Uses random sign hashes to allow two-sided $\ell_2$-error bounds: $\|\tilde{x} - x\|_\infty \leq \varepsilon \|x\|_2$ (Aamand et al., 2023, Indyk et al., 2021).
  • Generalized sketches: Extensions such as the trapezoidal sketch allow for asymmetric or data-profile-driven counter allocation to optimize space (Li, 2019).
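The first of these primitives is simple enough to show directly. Below is a minimal, illustrative Count-Min sketch for insert-only streams; the hashing scheme (salted BLAKE2) is our choice for determinism, not prescribed by the cited papers.

```python
import hashlib

class CountMinSketch:
    """Count-Min sketch: d rows of w counters. For insert-only streams a
    point query never underestimates, and overestimates by at most
    eps * ||x||_1 with probability 1 - delta when w ~ e/eps and
    d ~ ln(1/delta). Illustrative implementation."""

    def __init__(self, d, w):
        self.d, self.w = d, w
        self.table = [[0] * w for _ in range(d)]

    def _bucket(self, row, item):
        # One salted hash per row stands in for d independent hash functions.
        h = hashlib.blake2b(repr(item).encode(), salt=row.to_bytes(8, "little"))
        return int.from_bytes(h.digest()[:8], "little") % self.w

    def update(self, item, delta=1):
        for r in range(self.d):
            self.table[r][self._bucket(r, item)] += delta

    def query(self, item):
        # Min over rows: each row only ever over-counts (non-negative updates),
        # so the minimum is the tightest upper bound on the true frequency.
        return min(self.table[r][self._bucket(r, item)] for r in range(self.d))
```

Count Sketch differs in adding a random ±1 sign per (row, item) pair and taking a median rather than a minimum, which yields the two-sided $\ell_2$ guarantee above.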

Composite structures exploit a hash-based "heavy filter" or use machine learning-based recovery to improve bias, robustness, and fairness (Yuan et al., 4 Dec 2024, Shahbazi et al., 25 May 2025).

2. Space Complexity and Lower Bounds for Frequency Moment Sketching

Estimating frequency moments ($F_p(x) = \sum_{i=1}^n |x_i|^p$) using linear sketches is foundational for understanding the capabilities and limitations of sketching.

  • For $p \in [1,2]$, the classical lower bound due to Alon–Matias–Szegedy and its tightness were established; any $(1 \pm \varepsilon)$-approximation requires $m = \Omega(\varepsilon^{-2} \log(1/\delta))$ rows (Gribelyuk et al., 25 Mar 2025, Andoni et al., 2013).
  • For $p > 2$, Andoni et al. (Andoni et al., 2013) and follow-up work (Gribelyuk et al., 25 Mar 2025) proved that linear sketches must use $m = \Theta(n^{1-2/p} \log n)$ rows for constant-factor approximation, which is both necessary and sufficient.
  • The lifting technique of (Gribelyuk et al., 25 Mar 2025) uses lattice smoothing and a discrete-to-continuous argument to extend continuous lower bounds to integer-valued settings (with both sketch entries and input vectors bounded in $\{\pm \mathrm{poly}(n)\}$). This resolves longstanding open questions in the space complexity of discrete frequency-moment sketching, and shows that no adversarially robust linear sketch of dimension $o(n)$ exists, even for $p \in [1,2]$, in adaptive streaming models.

Space complexity for more advanced statistics (e.g., top-$k$ or trimmed $F_p$ moments as in (Lin et al., 9 Jun 2025)) is characterized by gap conditions on the input vector; for sufficiently separated heavy hitters, polylogarithmic-space sketches suffice.

3. Unified Lévy-Process Framework for Estimating General Frequency Functionals

A major recent advance is the characterization of which frequency functionals admit sublinear-space estimation via linear sketches through the Lévy–Khintchine representation (Pettie et al., 22 Oct 2024):

  • Central result: Any linear sketch whose register-sum converges (in law) to an infinitely divisible distribution can be interpreted as a sum over independent copies of a chosen multivariate Lévy process.
  • The set of $f$-moments $F = \sum_v f(x(v))$ that can be estimated by such schemes is precisely the set of characteristic exponents admissible under the Lévy–Khintchine theorem:

$$f_X(u) = i\,b \cdot u - \frac{1}{2} u^T A u + \int_{\mathbb{R}^d \setminus \{0\}} \left(e^{i u \cdot z} - 1 - i u \cdot z\, \mathbf{1}_{\|z\| < 1}\right) \nu(dz)$$

  • Classical sketches (AMS/$F_2$, Indyk's $L_\alpha$-sketch, HyperLogLog, etc.) are all recoverable as specific choices of the underlying Lévy process (e.g., Brownian motion, $\alpha$-stable, compound Poisson).
  • Nearly periodic and previously unclassified functions (e.g., $g_{np}(x) = 2^{-\tau(x)}$) are shown to be tractable by explicit Fourier analysis and decomposition into Lévy exponents.
  • Tractability: Every $f$ with such a representation can be approximated to $O(\varepsilon)$ accuracy in $O(\varepsilon^{-2} \log n)$ space; functions not admitting this representation are provably intractable without near-linear space (Pettie et al., 22 Oct 2024).

This complete characterization provides a mechanical recipe: given any function ff of interest, writing ff as a Lévy exponent (or difference of exponents via Fourier–Hahn decomposition) immediately yields a sketching algorithm.
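As a concrete instance of this recipe, the classical AMS/$F_2$ sketch corresponds to the Brownian-motion (2-stable) choice of Lévy process. The following minimal version, using ±1 sign registers over a materialized frequency vector, is an illustration of that special case only, not the paper's general construction.

```python
import random

def ams_f2_estimate(freqs, k=2000, seed=0):
    """Estimate F2 = sum_i x_i^2 of a frequency vector via k independent
    AMS registers. Each register holds z = <s, x> for a random sign vector
    s in {+-1}^n; since E[z^2] = F2, averaging squared registers estimates
    F2. In the Levy-process view this is the Brownian-motion (2-stable) case."""
    rng = random.Random(seed)
    n = len(freqs)
    total = 0.0
    for _ in range(k):
        signs = [rng.choice((-1, 1)) for _ in range(n)]
        z = sum(s * x for s, x in zip(signs, freqs))  # one linear register
        total += z * z
    return total / k
```

Swapping the sign distribution for an $\alpha$-stable law gives Indyk's $L_\alpha$-sketch; the framework above says exactly which other functionals admit an analogous register distribution.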

4. Fairness, Robustness, and Privacy in Frequency Sketching

Recent work addresses skew and group unfairness as well as adversarial and privacy threats.

Fairness

Standard Count-Min introduces additive error $\varepsilon \|f\|_1$ that disproportionately affects low-frequency elements (Shahbazi et al., 25 May 2025). The Fair-Count-Min sketch mitigates this by group-aware bucket allocation: partitioning the column space so that each group gets its own contiguous set of buckets and hash functions, sized to ensure equal expected multiplicative error across groups. This construction provides provable fairness without significant space or update overhead.

Adversarial robustness

Adaptive adversaries can efficiently attack any $o(n)$-dimensional linear sketch for $L_p$-estimation in the turnstile streaming model (Gribelyuk et al., 25 Mar 2025). This result, leveraging lifting from continuous to discrete Gaussians and exploiting conditional expectation attacks, shows that there exists no adversarially robust, dimension-reduced linear sketch for general $L_p$ tasks in this setting.

Privacy

PrivSketch (Li et al., 2023) provides a protocol for frequency estimation under local differential privacy using a Count-Min-style linear sketch, enhanced with per-user randomized response and an auxiliary ordering matrix. The protocol achieves unbiasedness, optimal error–privacy trade-offs, and strong empirical performance using only $O(KM)$ local memory and $O(1)$ communication per user.
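PrivSketch's full protocol (sketch encoding plus ordering matrix) is beyond the scope of this note, but its per-user building block, generalized randomized response with a debiased aggregate, can be sketched as follows. This is the standard GRR mechanism, not PrivSketch itself; function names are ours.

```python
import math
import random

def grr_perturb(value, domain, epsilon, rng):
    """Generalized randomized response: report the true value with
    probability p = e^eps / (e^eps + k - 1), otherwise a uniformly
    random *other* value. Satisfies eps-local differential privacy."""
    k = len(domain)
    p = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if rng.random() < p:
        return value
    return rng.choice([v for v in domain if v != value])

def grr_estimate(reports, domain, epsilon):
    """Debias the perturbed reports: with q = (1 - p)/(k - 1), the
    unbiased count estimate is n * (count(v)/n - q) / (p - q)."""
    k, n = len(domain), len(reports)
    p = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    q = (1 - p) / (k - 1)
    counts = {v: 0 for v in domain}
    for r in reports:
        counts[r] += 1
    return {v: n * (counts[v] / n - q) / (p - q) for v in domain}
```

Composing such a local randomizer with a Count-Min-style sketch is what lets the protocol keep per-user memory at $O(KM)$ rather than linear in the domain size.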

5. Learning-Augmented and Bayesian Frequency Estimation

There has been significant recent interest in blending sketches with learning-based or Bayesian inference for frequency estimation:

  • Learning-Augmented Sketches: UCL-sketch (Yuan et al., 4 Dec 2024) uses online training, without access to ground-truth frequencies, to learn a neural recovery map that minimizes a consistency + sparsity + equivariance loss over the observed sketch state. This approach achieves 6–20$\times$ lower error than Count-Min/Count, at comparable speed and memory.
  • Learning with Predictions: Heavy-hitter-aware algorithms (Aamand et al., 2023) and composable sketches with advice (Cohen et al., 2020) leverage predictions about heavy elements to allocate sketch resources more effectively, improving weighted error rates up to the information-theoretic optimum on Zipfian streams.
  • Bayesian Frequency Inference: Poisson-Kingman, Dirichlet, and Pitman–Yor process priors provide posterior distributions for individual frequencies and cardinalities (Beraha et al., 2023). Conformal approaches (Sesia et al., 2022) yield finite-sample, exact, distribution-free confidence intervals for frequencies using only the observed sketch and weak exchangeability assumptions.
  • Smoothed Bayesian Estimation: Smoothed-Bayesian and multi-view (multi-hash) estimators (Beraha et al., 2023) deliver (i) linear estimators optimal in conditional MSE, (ii) fully unbiased shrinkage estimators under DP and generalized gamma process smoothing, (iii) product-of-experts and min-aggregation schemes for multi-hash sketches, and (iv) practical procedures for hyperparameter learning and interval calibration.

6. Trimming, Robust, and Composable Frequency Functionals

Extending beyond point and moment estimation, linear sketches support robust aggregations:

  • Trimmed Statistics: Sublinear-space sketches for top-$k$ $F_p$ moments, $k$-trimmed sums, and sum-above-threshold queries are possible for $p \in [0,2]$ if the gap condition (sufficient separation between the top-$k$ frequencies and the tail) is met (Lin et al., 9 Jun 2025). The main technique is multilevel subsampling plus heavy-hitter detection with Count-Sketch.
  • Composable Sketches: Sampling-based sketches (priority sampling, bottom-$k$, $\ell_p$-sampling) support mergeability and, under realistic data distributions (Zipf, heavy-tail), can accurately estimate "hard" statistics (threshold aggregates, high-moment sums) with only moderate overhead (Cohen et al., 2020). This stands in contrast to worst-case lower bounds, demonstrating strong empirical performance of small sketches beyond their original design regimes.
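Of the sampling-based sketches above, bottom-$k$ is the easiest to illustrate: keep the $k$ smallest hash values seen, merge by union-then-truncate, and estimate distinct counts via the standard $(k-1)/v_k$ estimator. This toy version only shows mergeability and distinct counting; the composable sketches of Cohen et al. support much richer weighted aggregates.

```python
import hashlib

class BottomK:
    """Bottom-k sketch: retain the k items with the smallest hash values
    (hashes mapped to [0, 1)). Mergeable by union then re-truncation."""

    def __init__(self, k):
        self.k = k
        self.vals = {}  # item -> hash value in [0, 1)

    @staticmethod
    def _h(item):
        d = hashlib.blake2b(repr(item).encode(), digest_size=8).digest()
        return int.from_bytes(d, "little") / 2**64

    def add(self, item):
        self.vals[item] = self._h(item)
        if len(self.vals) > self.k:
            # Evict the item with the largest hash to keep only the bottom k.
            del self.vals[max(self.vals, key=self.vals.get)]

    def merge(self, other):
        out = BottomK(self.k)
        for item in (*self.vals, *other.vals):
            out.add(item)
        return out

    def distinct_estimate(self):
        # Standard estimator: (k - 1) / v_k, where v_k is the k-th smallest
        # hash; exact count if fewer than k distinct items were seen.
        if len(self.vals) < self.k:
            return len(self.vals)
        return (self.k - 1) / max(self.vals.values())
```

Because the same item hashes identically everywhere, duplicates across merged sketches collapse automatically, which is what makes the structure composable.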

7. Distributed, Recoverable, and System-Level Considerations

Real-world deployments require resilience and composability:

  • Recoverable Sketches: The recoverable sketch framework (Cohen et al., 7 Nov 2025) ensures that in a distributed network, any node’s sketch (e.g., CMS or Count-Sketch) is recoverable after a crash through modular API primitives (full and delta serialization) and incremental checkpointing, bounding recovery latency and communication overhead independently of stream volume.
  • Composable Mergeability: All mainstream linear sketches are trivially composable via addition, which is exploited in both distributed and federated learning/data analytics scenarios, as well as in decentralized privacy-preserving applications.

This landscape summarizes the state of the art in frequency estimation linear sketches as of 2025, covering theory, algorithms, adaptive and learning-augmented methods, fairness/privacy guarantees, robustness, and practical system design. The unifying perspective afforded by the Lévy–Khintchine theorem and the recent rigorous quantification of lower and upper bounds provide a comprehensive toolkit for both theoreticians and system architects.
