Weighted Reservoir Sampling
- Weighted Reservoir Sampling is a probabilistic algorithm that selects a fixed-size, weighted subset from unbounded data streams while ensuring unbiased and low-variance results.
- It employs methods like key-based priority sampling and the variance-optimal varopt_k algorithm, which efficiently manage streaming and distributed data.
- This approach is widely applied in data streaming, graph analytics, and model management to address storage constraints while preserving key statistical properties.
Weighted Reservoir Sampling (WRS) is a probabilistic algorithmic framework for maintaining a fixed-size, randomly selected subset (the reservoir) from a potentially unbounded or high-volume stream of weighted items. In WRS, inclusion probabilities or selection priorities reflect user-specified positive weights supplied per item. The method is critical in data streaming, scientific sampling, database summarization, graph analytics, and distributed systems, where storage and computational constraints preclude retaining all data even though unbiased, low-variance statistical inference is required.
1. Mathematical Foundations and Problem Definitions
Weighted Reservoir Sampling generalizes the classic reservoir sampling problem (uniform $k$-sampling over a stream of unweighted items), aiming to select $k$ items from a population with positive weights so that the probability of any particular sample (or subset, or permutation, depending on the variant) accurately reflects these weights according to prescribed selection semantics.
The field recognizes two principal problem formulations over data streams:
- WRS-N-P: No replacement, weights prescribing inclusion probabilities; specifically, item $i$ is included with probability $\pi_i = \min(1,\, k\, w_i / \sum_j w_j)$.
- WRS-N-W: No replacement, weights as sequential selection probabilities; in each round, the next sample is drawn from the unselected items with probability proportional to weight.
With-replacement variants (WRS-R) degenerate to the i.i.d. case—each sample slot is selected independently, and the difference between inclusion and sequential selection collapses.
2. Core Algorithms and Implementation Strategies
Two primary classes of algorithms are employed:
A. Key-Based Priority Sampling
This approach (Efraimidis–Spirakis, 2006) extends classic priority sampling by assigning a random key to each item. For each item $i$ with weight $w_i$, draw $u_i \sim \mathrm{Uniform}(0,1)$ and generate the key
$k_i = u_i^{1/w_i},$
or equivalently, the exponential key
$e_i = -\ln(u_i)/w_i.$
The $k$ items with the highest keys $k_i$ (or, equivalently, with the smallest keys $e_i$) are selected. This method naturally supports streaming (with a min-heap per reservoir), efficient updates ($O(\log k)$ per selected item), and parallelization (sorting key–value pairs).
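The key-based scheme above can be sketched in a few lines of Python; this is a minimal illustration of the A-ES method (the function name is illustrative), keeping the $k$ largest keys $u^{1/w}$ in a min-heap:

```python
import heapq
import random

def weighted_reservoir_es(stream, k, rng):
    """Efraimidis-Spirakis (A-ES) weighted sampling without replacement.

    Each (item, weight) pair gets the random key u**(1/weight); the k items
    with the largest keys are retained via a min-heap whose root is the
    smallest key currently kept.
    """
    heap = []  # entries are (key, item); heap[0] holds the smallest kept key
    for item, weight in stream:
        key = rng.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))  # O(log k) update
    return [item for _, item in heap]
```

Heavily weighted items dominate: with one sample slot, an item carrying almost all the weight is returned almost every run.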
B. Probability Proportional to Size Without Replacement ($\varopt_k$)
The $\varopt_k$ algorithm (0803.0473) provides a variance-optimal approach for subset sum estimation, maintaining for every sample size $k$ and stream length $n$:
- Unbiasedness: For any subset $S$ of the stream, the estimator $\hat{w}(S) = \sum_{i \in S \cap R} \hat{w}_i$ (the sum of adjusted weights over sampled items of $S$) is unbiased for $w(S) = \sum_{i \in S} w_i$.
- Variance-optimality: For every subset size $m$, $\varopt_k$ minimizes the average subset sum variance among all $k$-sample schemes.
Threshold-based inclusion probabilities are computed as $p_i = \min(1, w_i/\tau)$, where the threshold $\tau$ is set to ensure expected sample size $k$.
The recursion property:
$\varopt_k\left(\bigcup_{j=1}^m I_j\right) = \varopt_k\left(\bigcup_{j=1}^m \varopt_{k_j}(I_j)\right)$
enables distributed and compositional sampling.
For each stream insertion, the algorithm adds the new item to the reservoir, then recomputes the threshold and resamples down to $k$ items using appropriately derived probabilities. This ensures no positive covariances between adjusted weights and yields tight concentration bounds.
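The threshold step can be made concrete. The sketch below (a hypothetical helper, not the full $\varopt_k$ reservoir logic) computes $\tau$ so that $\sum_i \min(1, w_i/\tau) = k$ by trying each possible count $t$ of items capped at probability $1$:

```python
def varopt_threshold(weights, k):
    """Compute tau such that sum(min(1, w / tau)) == k.

    Sketch of the threshold-finding step used by varopt-style schemes:
    item i is then included with probability min(1, w_i / tau).
    """
    ws = sorted(weights, reverse=True)
    n = len(ws)
    assert 0 < k <= n and all(w > 0 for w in ws)
    suffix = sum(ws)  # total weight of the not-yet-capped items ws[t:]
    for t in range(k):
        tau = suffix / (k - t)  # candidate if exactly t items are capped at 1
        # valid when the t capped items are >= tau and the rest are <= tau
        if (t == 0 or ws[t - 1] >= tau) and ws[t] <= tau:
            return tau
        suffix -= ws[t]
    raise ValueError("no valid threshold found")
```

For weights $(10, 6, 2, 1, 1)$ and $k=3$, the two largest items are capped at probability $1$ and $\tau = 4$, giving inclusion probabilities $(1, 1, 0.5, 0.25, 0.25)$ that sum to $3$.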
C. Classical Algorithms for Streaming
The A-Chao algorithm gives a simple per-item approach for WRS-N-P (proportional inclusion), handling "overweight" items by capping inclusion probability at $1$. The A-ES method (priority/order sampling) supports WRS-N-W directly via the random key pipeline.
Skip/jump methods (e.g., (2403.20256)) further accelerate processing by sampling the position of the next update, reducing wasted effort in typical stream regimes.
3. Statistical Properties: Unbiasedness, Variance, and Concentration
Any sound WRS approach for subset sum queries must guarantee unbiasedness—by either direct estimation or Horvitz-Thompson reweighting—across arbitrary or query-specified subsets.
- Variance Optimality: $\varopt_k$ minimizes the average variance
$\bar{V}_m = \binom{n}{m}^{-1} \sum_{|S| = m} \operatorname{Var}[\hat{w}(S)]$
for all subset sizes $m \le n$, with $\bar{V}_n$ the variance of the total estimate.
- Tightness and Tail Bounds: Under weight-value monotonicity, Chernoff/Hoeffding-type upper-tail bounds for sums carry over from the with-replacement to the without-replacement setting (1603.06556). The results extend to sub-Gaussian concentration for general weights.
- Distributed Combination: Variance-optimal sample merges are possible by recursive application of $\varopt_k$, enabling reliable distributed reservoir design.
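Horvitz-Thompson reweighting, mentioned above as the route to unbiasedness, is simple to state in code. A small sketch (hypothetical data layout: each sampled item maps to its weight and inclusion probability):

```python
def ht_subset_sum(sample, subset):
    """Horvitz-Thompson estimate of a subset's total weight.

    `sample` maps sampled item -> (weight, inclusion probability).
    Dividing each sampled weight by its inclusion probability makes the
    estimator unbiased for the true subset sum over the full population.
    """
    return sum(w / p for item, (w, p) in sample.items() if item in subset)
```

For a sample containing `"a"` (weight 2, probability 0.5) and `"c"` (weight 5, probability 1.0), the estimate for the subset `{"a", "c"}` is $2/0.5 + 5/1 = 9$; averaged over the sampling randomness, such estimates recover the true subset sum.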
4. Distributed, Parallel, and Communication-Efficient WRS
In modern high-throughput systems, distributed streams and parallel processing are common; core research advances include:
- Parallel WRS: Batched, fully distributed algorithms generate per-item keys locally, select the $k$ globally smallest keys in a small number of coordination steps, and maintain load balance and communication efficiency (1903.00227, 1910.11069). Selection among global candidates uses parallel multisequence selection, keeps per-PE work low, and scales to thousands of nodes.
- Distributed SWOR: For message-optimal weighted sampling without replacement (SWOR), exponentially weighted keys and level-set throttling yield minimal message and space complexity in the distributed stream setting (1904.04126), satisfying tight lower bounds.
- Single-Pass Merging: Merge properties of core samplers (especially $\varopt_k$) enable hierarchical fusion in federated or edge settings (0803.0473, 2403.20256).
- Temporal/Decay Bias: Reservoir designs supporting explicit temporal bias (e.g., guaranteeing inclusion probabilities decay exponentially with age) balance sample freshness and realism (1801.09709).
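The mergeability that underlies these distributed designs can be illustrated with exponential bottom-$k$ keys: each site keeps its $k$ smallest keys, and the coordinator's bottom-$k$ of the union is distributed exactly as if the whole stream had been sampled centrally. A sketch under that assumption (function names illustrative):

```python
import heapq
import itertools
import math
import random

def local_bottom_k(stream, k, rng):
    """Per-site pass: item i gets the exponential key -ln(u)/w_i;
    keep the k smallest keyed pairs (a bottom-k weighted sample)."""
    keyed = [(-math.log(1.0 - rng.random()) / w, item)  # 1-u avoids log(0)
             for item, w in stream]
    return heapq.nsmallest(k, keyed)

def merge_bottom_k(summaries, k):
    """Coordinator: the k smallest keys across per-site summaries are
    exactly the bottom-k sample of the combined stream."""
    return heapq.nsmallest(k, itertools.chain.from_iterable(summaries))
```

Every globally bottom-$k$ key necessarily lies within its own site's bottom-$k$, so merging summaries loses nothing relative to sampling the union directly.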
5. Specialized WRS Applications: Graphs, Patterns, and Model Management
A. Graph Sampling and Counting
- Graph Priority Sampling (GPS): Adaptively weights edge samples for motif counting (triangles, wedges) in massive streams, minimizing estimator variance by assigning sampling weights reflecting motif completion likelihood (1703.02625).
- Triangle Counting: Hybrid schemes (e.g., Waiting Room Sampling (1709.03147)) combine "waiting rooms" capturing recent edges (with high triangle-formation probability) and standard reservoirs to optimize variance and bias in dynamic graphs.
B. Pattern Mining
Reservoir frameworks have been generalized for direct sampling of patterns (e.g., sequential and weighted itemsets), where normalization and candidate enumeration typically pose scalability bottlenecks. The RPS algorithm maintains probabilistically correct pattern reservoirs by combining batch-wise acceptance probabilities, binomial replacement counts (via the incomplete beta function), and efficient direct pattern sampling (2411.00074). Temporal damping, utility constraints, and norm criteria are supported.
C. Model and Ensemble Management
For online learning stability, particularly in passive-aggressive and sparse online algorithms, maintaining a WRS-based ensemble over intermediate solutions (scored by survival/persistence) can yield lower risk, improved consistency, and practical robustness compared to last-state or moving-average baselines (2410.23601).
6. Extensions, Robustness, and Theoretical Boundaries
- Variance-Aware and Bayesian Extensions: Recent research advocates allocating reservoir sample effort in proportion not just to group sizes or weights, but also to empirical standard deviations (following Neyman allocation), further minimizing estimator variance (2408.15454).
- With-Replacement and Efficient Skipping: WRS with replacement can be performed by running independent single-item samplers or via joint skip-based algorithms for higher compute and memory efficiency in the streaming setting (2403.20256).
- Adversarial Robustness: In adversarial streaming, WRS is only robust to adaptive attacks if the sample size scales with the logarithm of the set-system size rather than with its VC dimension; this constraint is inherent and affects reliability in high-dimensional or competitive environments (1906.11327).
- Sketched and Coordinated Sampling: For $\ell_p$-sampling and sketches (e.g., CountSketch), composable bottom-$k$ transforms allow practical implementation of WOR (without-replacement) sampling, including for signed/turnstile data streams (2007.06744).
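The with-replacement reduction mentioned above, running $k$ independent single-item weighted reservoirs, admits a very short sketch (illustrative function name; the skip-based variants avoid the inner loop):

```python
import random

def wrs_with_replacement(stream, k, rng):
    """WRS with replacement via k independent single-item reservoirs.

    Each slot keeps the incoming item with probability w / (running total);
    by the chain rule, each slot independently ends up holding item i with
    probability w_i / W, where W is the stream's total weight.
    """
    slots = [None] * k
    total = 0.0
    for item, w in stream:
        total += w
        for j in range(k):  # independent coin flip per slot
            if rng.random() < w / total:
                slots[j] = item
    return slots
```

Because slots are independent, the same item may occupy several of them, which is exactly the with-replacement semantics needed for, e.g., streaming bootstrap resamples.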
7. Comparative Analysis and Method Selection
Algorithm / Setting | Variance Optimal | Distributed Merge | Update Complexity | Decay/Temporal Bias | Application Domain
---|---|---|---|---|---
$\varopt_k$ (0803.0473) | Yes | Yes | $O(\log k)$ | No | DB, analytics, streaming
A-Chao | Yes (WRS-N-P) | Yes | $O(1)$ | No | Streaming
A-ES (Priority) | No (except $k=1$) | Yes | $O(\log k)$ | No | Weighted sampling, Top-$k$
Skip-based WRSWR (2403.20256) | N/A (WR) | Yes | $O(1)$ amortized | No | Bootstrapping, large $k$
R-TBS (1801.09709) | No | Yes | — | Yes (exponential) | ML, drift/retraining
GPS (1703.02625) | Motif-specific | Partial | — | No | Graph motif counting
RPS (2411.00074) | Pattern-utility | Yes | — | Yes | Pattern mining, streams
Method selection depends on application requirements, variance priorities, streaming/distributed environment, temporal recency needs, and the specifics of the weight/model assignment.
8. Practical Considerations and Limitations
- Weight imbalance and "overweight" items force inclusion probabilities to be capped at $1$; special handling is required to preserve correct inclusion semantics (as in A-Chao, (1012.0256)).
- Negative association is not guaranteed in all weighted schemes; variance analysis and confidence bounds can be nontrivial (1603.06556).
- Parameter tuning (e.g., waiting room size in temporal schemes, decay parameters) may be data-dependent and nontrivial.
- Adversarial environments necessitate robust sample sizing and careful monitoring of the weight distribution for resilience (1906.11327).
- Batch and distributed architectures require efficient synchronization and local thresholding for practical scalability (1903.00227, 1910.11069).
References
- Chao, M.-T. (1982). A general-purpose unequal probability sampling plan. Biometrika, 69(3), 653–656.
- Cohen, E., Duffield, N., Kaplan, H., Lund, C., & Thorup, M. (2009). Stream sampling for variance-optimal estimation of subset sums. (0803.0473)
- Efraimidis, P. S., & Spirakis, P. G. (2006). Weighted random sampling with a reservoir. Information Processing Letters, 97(5), 181–185.
- Ben-Hamou, A., Peres, Y., & Salez, J. (2016). Weighted sampling without replacement. (1603.06556)
- Ahmed, N. K., Duffield, N., Willke, T. L., & Rossi, R. A. (2017). On Sampling from Massive Graph Streams. (1703.02625)
- Beretta, F., & Tětek, J. (2021). Better Sum Estimation via Weighted Sampling. (2110.14948)
Weighted Reservoir Sampling thus constitutes a foundational, rigorously analyzed, and practically indispensable family of algorithms, adaptable to diverse modern data analysis and scientific computing scenarios.