
Probability Proportional to Size Without Replacement

Updated 4 May 2026
  • PPSWOR is a statistical sampling method that selects units without replacement, with selection probability proportional to user-defined weights.
  • It employs algorithms like rejective and successive sampling to ensure proportional inclusion probabilities and induce negative correlations among units.
  • The method is pivotal in survey design, randomized experiments, and streaming data applications for efficient variance reduction and scalable computation.

Probability Proportional to Size Without Replacement (PPSWOR) is a fundamental statistical sampling scheme wherein units are drawn sequentially from a finite population without replacement, and at each draw, the probability of selection is proportional to a user-defined “size” or weight associated with each unit. PPSWOR is central to a spectrum of applications, including complex survey design, randomized experiments, distributed data summarization, scalable learning algorithms, and theoretical analyses of dependent selection processes.

1. Formal Definitions, Probabilistic Structure, and Canonical Algorithms

Given a population $U = \{1, 2, \dots, N\}$ with non-negative weights $w_1, \dots, w_N$, PPSWOR refers to sampling a subset $S \subset U$ of fixed size $n$ such that, at each selection step, an element $i$ is chosen with probability $w_i$ divided by the sum of the weights of the remaining (unselected) elements. Formally, if $R_{k-1}$ denotes the set of as-yet-unsampled units at step $k$, the conditional probability of selecting $i \in R_{k-1}$ is:

$$\Pr\{j_k = i \mid R_{k-1}\} = \frac{w_i}{\sum_{h \in R_{k-1}} w_h}$$

Classic PPSWOR schemes include:

  • Hájek’s rejective (conditional Poisson) sampling: Equivalent to repeatedly drawing Bernoulli indicators with Poissonized marginal probabilities, then conditioning on the sample having fixed size; yields computationally tractable expressions for inclusion probabilities via elementary symmetric polynomials (Yu, 2010).
  • Successive (sequential) PPSWOR: Sampling with replacement, discarding repeats, until $n$ distinct items are drawn; admits an explicit, albeit complex, sum-of-products representation for inclusion probabilities.
  • Streaming and algorithmic variants: Schemes such as the EB-PPS algorithm provide exact PPSWOR (i.e., sticking to prescribed proportionality in marginal inclusion probabilities) while controlling sample size in a single pass over weighted streams (Hentschel et al., 2021).
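The successive scheme follows directly from the conditional selection probability defined above. The following is a minimal sketch, assuming non-negative weights held in memory; the function name `successive_ppswor` and its parameters are illustrative and not taken from any of the cited papers.

```python
import random

def successive_ppswor(weights, n, rng=None):
    """Draw n distinct indices; each draw picks one of the remaining units
    with probability proportional to its weight."""
    rng = rng or random.Random()
    # Units with zero weight can never be selected, so drop them up front.
    remaining = [i for i, w in enumerate(weights) if w > 0]
    if n > len(remaining):
        raise ValueError("n exceeds the number of units with positive weight")
    sample = []
    for _ in range(n):
        rem_w = [weights[i] for i in remaining]
        # random.choices picks position k with probability rem_w[k] / sum(rem_w)
        pick = rng.choices(range(len(remaining)), weights=rem_w, k=1)[0]
        sample.append(remaining.pop(pick))
    return sample

if __name__ == "__main__":
    print(successive_ppswor([5.0, 3.0, 2.0, 1.0], n=2, rng=random.Random(42)))
```

Each draw rebuilds the weight list, so this sketch costs on the order of $nN$ operations; it is meant to show the selection rule, not to be efficient.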

For any unit $i$, the first-order inclusion probability $\pi_i$ under PPSWOR with sample size $n$ and weights $p_1, \dots, p_N$ (normalized, so $\sum_{i=1}^{N} p_i = 1$) is:

  • Rejective: $\pi_i^{\mathrm{rej}} = p_i \, e_{n-1}(p_{-i}) / e_n(p)$, where $e_k$ is the $k$th elementary symmetric sum and $p_{-i}$ denotes $p$ with entry $i$ removed.
  • Successive: $\pi_i^{\mathrm{succ}} = \sum_{(j_1, \dots, j_n) \ni i} \prod_{k=1}^{n} \frac{p_{j_k}}{1 - p_{j_1} - \cdots - p_{j_{k-1}}}$, with the sum taken over ordered sequences of $n$ distinct units containing $i$; this is the intricate sum detailed in (Yu, 2010).
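The rejective inclusion probabilities can be computed with the standard recursion for elementary symmetric sums. The sketch below assumes the rejective design assigns each size-$n$ sample a probability proportional to the product of its drawing probabilities, so that $\pi_i = p_i \, e_{n-1}(p_{-i}) / e_n(p)$; the function names are illustrative.

```python
def elementary_symmetric(p, n):
    """Return [e_0, ..., e_n] for the values in p via the standard recursion."""
    e = [1.0] + [0.0] * n
    for x in p:
        # Go from high degree to low so each value contributes at most once.
        for k in range(n, 0, -1):
            e[k] += x * e[k - 1]
    return e

def rejective_inclusion(p, n):
    """Assumed form: pi_i = p_i * e_{n-1}(p without i) / e_n(p)."""
    p = list(p)
    e_n = elementary_symmetric(p, n)[n]
    pi = []
    for i, p_i in enumerate(p):
        rest = p[:i] + p[i + 1:]
        pi.append(p_i * elementary_symmetric(rest, n - 1)[n - 1] / e_n)
    return pi

if __name__ == "__main__":
    print(rejective_inclusion([0.5, 0.3, 0.2], n=2))
```

This naive version recomputes the recursion once per unit; it favors readability over speed. A quick sanity check is that the returned values sum to $n$.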

2. Key Theoretical Properties: Uniformity, Proportionality, and Majorization

PPSWOR methods, in contrast to with-replacement sampling, induce dependencies and negative correlations among the inclusion indicators of different units. The inclusion probabilities inherit important structural properties:

  • Majorization order: For both rejective and successive PPSWOR, as $n$ increases, the per-sample inclusion vector $\pi/n$ becomes more uniform, i.e., it is majorized by those at smaller $n$. For fixed $n$ and draw probabilities $p$, the inclusion probabilities produced by rejective sampling are always more uniform (in the majorization sense) than those from successive sampling, confirming the conjecture of Hájek (Yu, 2010).
  • Kullback-Leibler divergence: Inclusion probabilities from successive PPSWOR are always closer (in KL divergence) to the original drawing probabilities $p$ than those from rejective PPSWOR. That is, $\pi^{\mathrm{succ}}$ is more “proportional” to $p$, while $\pi^{\mathrm{rej}}$ is more “uniform” (Yu, 2010).
  • Negative correlation: In any PPSWOR scheme, inclusion indicators for distinct units exhibit negative dependence; $\Pr\{i \in S, j \in S\} \le \Pr\{i \in S\} \Pr\{j \in S\}$ for $i \neq j$ (Huang et al., 2024).

These facts have critical implications for variance reduction, estimator design, and the analysis of sampling-based algorithms.
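For a very small population these properties can be checked by brute-force enumeration. The sketch below is illustrative only: it assumes the rejective design gives each sample a probability proportional to the product of its drawing probabilities, and all function names are hypothetical.

```python
from itertools import combinations, permutations

def successive_design(p, n):
    """Exact sample probabilities under successive sampling, obtained by
    enumerating ordered draw sequences (feasible for tiny populations only)."""
    design = {}
    for order in permutations(range(len(p)), n):
        prob, remaining = 1.0, 1.0
        for i in order:
            prob *= p[i] / remaining   # p_i over the mass still unselected
            remaining -= p[i]
        key = frozenset(order)
        design[key] = design.get(key, 0.0) + prob
    return design

def rejective_design(p, n):
    """Assumed form: P(sample) proportional to the product of its p_i."""
    raw = {}
    for s in combinations(range(len(p)), n):
        prod = 1.0
        for i in s:
            prod *= p[i]
        raw[frozenset(s)] = prod
    total = sum(raw.values())
    return {s: v / total for s, v in raw.items()}

def inclusion(design, N):
    """First-order inclusion probabilities."""
    return [sum(pr for s, pr in design.items() if i in s) for i in range(N)]

def joint(design, i, j):
    """Second-order inclusion probability of units i and j."""
    return sum(pr for s, pr in design.items() if i in s and j in s)

def majorizes(x, y, tol=1e-12):
    """True if x majorizes y, i.e. x is the less uniform vector."""
    xs, ys = sorted(x, reverse=True), sorted(y, reverse=True)
    cx = cy = 0.0
    for a, b in zip(xs, ys):
        cx, cy = cx + a, cy + b
        if cx < cy - tol:
            return False
    return abs(cx - cy) <= 1e-9   # totals must agree

if __name__ == "__main__":
    p, n = [0.4, 0.3, 0.2, 0.1], 2
    pi_succ = inclusion(successive_design(p, n), len(p))
    pi_rej = inclusion(rejective_design(p, n), len(p))
    print(pi_succ, pi_rej, majorizes(pi_succ, pi_rej))
```

With `p = [0.4, 0.3, 0.2, 0.1]` and `n = 2`, the successive inclusion vector majorizes the rejective one, and every joint inclusion probability under successive sampling is below the product of the corresponding marginals.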

3. Algorithms, Computational Complexity, and Streaming Schemes

Computing inclusion probabilities in PPSWOR depends on the chosen variant and the population size.

  • Rejective sampling: Efficient dynamic programming via recursion on symmetric polynomials computes all inclusion probabilities in time polynomial in $N$ and $n$ (Yu, 2010).
  • Successive sampling: The sum-of-products formula requires potentially exponential time; practical methods include Monte Carlo simulation, sandwiching by majorization bounds, or iterative refinement (Yu, 2010).
  • Streaming EB-PPS: The EB-PPS algorithm maintains a “latent sample” representation, updating inclusion probabilities and sample content in $O(1)$ amortized time per item. At each step a proportionality factor is updated, and each item's inclusion probability is set to that factor times its weight (Hentschel et al., 2021).
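The Monte Carlo route mentioned for successive sampling can be sketched in a few lines: simulate the sequential draw many times and record how often each unit is selected. This is an illustrative estimate, not the procedure of any cited paper; the name `mc_inclusion`, the trial count, and the seed are assumptions, and all drawing probabilities are assumed strictly positive.

```python
import random

def mc_inclusion(p, n, trials=20000, seed=0):
    """Estimate first-order inclusion probabilities of successive PPSWOR
    by repeated simulation of the sequential draw."""
    rng = random.Random(seed)
    counts = [0] * len(p)
    for _ in range(trials):
        remaining = list(range(len(p)))
        for _ in range(n):
            w = [p[i] for i in remaining]
            # Pick a remaining unit with probability proportional to its p_i.
            pick = rng.choices(range(len(remaining)), weights=w, k=1)[0]
            counts[remaining.pop(pick)] += 1
    return [c / trials for c in counts]

if __name__ == "__main__":
    print(mc_inclusion([0.5, 0.3, 0.2], n=2))
```

The standard error of each estimate shrinks like one over the square root of the number of trials, so a few tens of thousands of trials typically give two reliable decimal places.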

Modern applications employ sketch-based algorithms, such as bottom-$k$ and $\ell_q$-residual heavy hitter sketches, to perform PPSWOR efficiently in high-dimensional streaming or distributed environments (Cohen et al., 2020).
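A bottom-$k$ sample can be illustrated with exponential ranks: give each item an independent exponential random variable with rate equal to its weight and keep the $k$ items with the smallest ranks, which reproduces the successive PPSWOR distribution. The sketch below assumes each key appears once in the stream with its full weight; handling keys whose weight is spread over many updates requires the sketches of Cohen et al. and is not attempted here. The function name is illustrative.

```python
import heapq
import random

def bottom_k_ppswor(stream, k, rng=None):
    """One-pass bottom-k sample over (key, weight) pairs using exponential ranks."""
    rng = rng or random.Random()
    heap = []  # stores (-rank, key), so heap[0] is the largest rank kept so far
    for key, w in stream:
        if w <= 0:
            continue  # zero-weight items are never sampled
        rank = rng.expovariate(w)  # rate w: heavier items tend to get smaller ranks
        if len(heap) < k:
            heapq.heappush(heap, (-rank, key))
        elif rank < -heap[0][0]:
            heapq.heapreplace(heap, (-rank, key))
    # Return keys ordered from smallest rank to largest.
    return [key for _, key in sorted(heap, reverse=True)]

if __name__ == "__main__":
    items = [("a", 5.0), ("b", 3.0), ("c", 2.0), ("d", 1.0)]
    print(bottom_k_ppswor(items, k=2, rng=random.Random(7)))
```

Only the $k$ retained ranks are stored, so memory stays proportional to the sample size regardless of stream length.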

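| Algorithm Type | Input Model | Complexity |
|---|---|---|
| Rejective (Hájek) | Static | Polynomial in $N$ and $n$ (dynamic programming) |
| Successive (sequential) | Static | Potentially exponential (exact); Monte Carlo feasible for large $N$ |
| EB-PPS (streaming) | Streaming | $O(1)$ amortized per item |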

4. Statistical Inference: Estimation, Variance, and Design-based/Bayesian Methods

PPSWOR underpins the design of unbiased estimators in finite population sampling, most notably the Horvitz-Thompson (HT) estimator. For population mean or total estimation:

  • HT estimator: If $\pi_i$ is the inclusion probability of unit $i$, the estimator of the population total is $\hat{Y}_{\mathrm{HT}} = \sum_{i \in S} y_i / \pi_i$, where $y_i$ is the value observed on unit $i$.
  • Variance: The variance involves both first- and second-order inclusion probabilities $\pi_{ij}$. For example, in cluster-randomized experiments, closed forms for the variance are available under PPSWOR, and conservative unbiased estimators (Sen-Yates-Grundy) can be employed even when joint probabilities are approximated (Xiong et al., 2020).
  • Bayesian model-based inference: For two-stage cluster sampling by PPSWOR, Bayesian frameworks account for uncertainty in cluster size (for units not sampled), providing integrated, design-aware posterior inference with efficiency gains over classical approaches (Makela et al., 2017).

Practical computation of joint inclusion probabilities and variances can rely on analytical approximations (e.g., Hájek’s, Berger-Sitter) or simulation; specialized software, such as the R package TeachingSampling, supports standard PPSWOR algorithms including Sunter’s method (Xiong et al., 2020).
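The Horvitz-Thompson total and the Sen-Yates-Grundy variance estimate reduce to short sums once the inclusion probabilities are known. The following is a minimal sketch assuming a fixed-size design whose first- and second-order inclusion probabilities are supplied by the caller; the function names and the dictionary layout for `pi_joint` are illustrative choices.

```python
def horvitz_thompson_total(sample, y, pi):
    """HT estimate of the population total: sum of y_i / pi_i over the sample."""
    return sum(y[i] / pi[i] for i in sample)

def syg_variance_estimate(sample, y, pi, pi_joint):
    """Sen-Yates-Grundy variance estimate for a fixed-size design.
    pi_joint maps a pair (i, j) with i < j to its joint inclusion probability."""
    s = sorted(sample)
    total = 0.0
    for a in range(len(s)):
        for b in range(a + 1, len(s)):
            i, j = s[a], s[b]
            pij = pi_joint[(i, j)]
            diff = y[i] / pi[i] - y[j] / pi[j]
            total += (pi[i] * pi[j] - pij) / pij * diff ** 2
    return total

if __name__ == "__main__":
    # Toy design with N = 3 and n = 2: each pair is one possible sample,
    # so the joint inclusion probability of a pair equals its sample probability.
    pi_joint = {(0, 1): 0.15 / 0.31, (0, 2): 0.10 / 0.31, (1, 2): 0.06 / 0.31}
    pi = [0.25 / 0.31, 0.21 / 0.31, 0.16 / 0.31]
    y = [10.0, 4.0, 7.0]
    for s in pi_joint:
        print(s, horvitz_thompson_total(s, y, pi),
              syg_variance_estimate(s, y, pi, pi_joint))
```

When the observed values are exactly proportional to the inclusion probabilities, every term $y_i/\pi_i$ is the same and the variance estimate is zero, which is the usual motivation for choosing sizes correlated with the outcome.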

5. Large Deviations, Concentration, and New Theoretical Bounds

Recent advances provide sharp deviation inequalities for PPSWOR:

  • Martingale and pivotal methods: The Deville–Tillé pivotal method generates the fixed-size sample as a vector-martingale, providing the basis for Freedman (Bennett-type), Chernoff-type (Kullback-Leibler), and Hoeffding-Azuma concentration bounds for the number of sampled units falling in any subset $A$ of the population (Foster et al., 2024). The bounds are expressed in terms of the weight of the subset, a variance parameter, and the sample size $n$. They provide quantitative guarantees for the concentration of PPSWOR samples and adapt classical hypergeometric tail estimates to the unequal-probability setting.

  • Relation to classical SRSWOR: For equal weights, these new martingale-based bounds recover the established hypergeometric large deviations.

6. Applications in Experimental Design, Urn Models, and Combinatorial Optimization

PPSWOR is the de facto standard in a variety of domains:

  • Complex surveys and randomized experiments: In cluster-randomized trials, PPSWOR at the cluster level (with further SRS within clusters) produces unbiased and location-invariant estimators of treatment effects, with robust variance properties (Xiong et al., 2020).
  • Generalized urn models: PPSWOR is realized as sequential sampling in two-color (or multi-color) Pólya–Eggenberger urn models, with arbitrary weight sequences. This framework unifies the classical hypergeometric case and allows both normal and non-normal (theta-function) limits, depending on weight structure (Kuba, 2010).
  • Online algorithms and rounding: In online bipartite (and stochastic) matching, PPSWOR serves as an efficient rounding scheme for fractional matchings, yielding competitive guarantees (≥0.707 in stochastic and ≥0.513 in adversarial models), and is essential for negative dependence and correlated selection properties (Huang et al., 2024).
  • Sketch-based data summarization: Bottom-$k$ and order-statistics-based PPSWOR constructs, extended to $\ell_p$-weighted objectives and signed data, underlie scalable randomized algorithms for large-scale machine learning and data analytics (Cohen et al., 2020).

7. Extensions, Limitations, and Recent Algorithmic Advances

While classic PPSWOR achieves a fixed sample size and size-proportional marginal probabilities asymptotically, exact proportionality cannot, in general, be maintained together with a strict sample size constraint if some weights are large relative to $W/n$, where $W$ is the total weight (Hentschel et al., 2021). The EB-PPS algorithm introduces a controlled tradeoff: it enforces exact PPS in marginal probabilities, caps (but does not always fix) the sample size, and provides a simple one-pass streaming implementation with $O(1)$ amortized time per item. This method strictly generalizes fixed-size and pure-PPS sampling, yielding maximal expected sample size and minimal variance under an explicit sampling budget.
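The conflict between exact proportionality and a fixed sample size is a matter of arithmetic: the target inclusion probability $n w_i / W$ exceeds 1 whenever $w_i > W/n$. The sketch below illustrates that check and the largest expected sample size compatible with exact proportionality under a cap of $n$. It is not the EB-PPS algorithm itself, only the feasibility calculation behind the tradeoff, and the function names are hypothetical.

```python
def pps_targets(weights, n):
    """Target inclusion probabilities n * w_i / W, plus the indices where the
    target exceeds 1 and so cannot be met with a fixed sample size n."""
    total = sum(weights)
    targets = [n * w / total for w in weights]
    too_large = [i for i, t in enumerate(targets) if t > 1.0]
    return targets, too_large

def capped_exact_pps(weights, n):
    """Largest proportionality factor rho with rho * w_i <= 1 for every unit and
    expected sample size rho * W <= n; returns the probabilities and that size."""
    total = sum(weights)
    rho = min(n / total, 1.0 / max(weights))
    return [rho * w for w in weights], rho * total

if __name__ == "__main__":
    w = [10.0, 1.0, 1.0, 1.0, 1.0]
    print(pps_targets(w, n=2))       # unit 0 has a target above 1
    print(capped_exact_pps(w, n=2))  # exact PPS with expected size below 2
```

When no weight exceeds $W/n$, the factor equals $n/W$ and the expected size equals $n$, so the capped scheme coincides with ordinary fixed-size PPS targets.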

Modern sketch-based PPSWOR algorithms are composable, memory-efficient (space linear in the sample size), and extend naturally to non-linear and signed weighting scenarios. These advances guarantee worst-case bias and variance bounds near those of classical PPSWOR, and in certain cases outperform with-replacement sampling in both estimator stability and representation diversity (Cohen et al., 2020).

References

  • (Yu, 2010) On the inclusion probabilities in some unequal probability sampling plans without replacement
  • (Kuba, 2010) On Sampling without replacement and OK-Corral urn models
  • (Hentschel et al., 2021) Exact PPS Sampling with Bounded Sample Size
  • (Xiong et al., 2020) The Benefits of Probability-Proportional-to-Size Sampling in Cluster-Randomized Experiments
  • (Makela et al., 2017) Bayesian Inference under Cluster Sampling with Probability Proportional to Size
  • (Foster et al., 2024) Large Deviations Inequalities for Unequal Probability Sampling Without Replacement
  • (Huang et al., 2024) Online Matching Meets Sampling Without Replacement
  • (Cohen et al., 2020) WOR and $p$'s: Sketches for $\ell_p$-Sampling Without Replacement
