Probability Proportional to Size Without Replacement
- PPSWOR is a statistical sampling method that selects units without replacement, with selection probability proportional to user-defined weights.
- It employs algorithms like rejective and successive sampling to ensure proportional inclusion probabilities and induce negative correlations among units.
- The method is pivotal in survey design, randomized experiments, and streaming data applications for efficient variance reduction and scalable computation.
Probability Proportional to Size Without Replacement (PPSWOR) is a fundamental statistical sampling scheme wherein units are drawn sequentially from a finite population without replacement, and at each draw, the probability of selection is proportional to a user-defined “size” or weight associated with each unit. PPSWOR is central to a spectrum of applications, including complex survey design, randomized experiments, distributed data summarization, scalable learning algorithms, and theoretical analyses of dependent selection processes.
1. Formal Definitions, Probabilistic Structure, and Canonical Algorithms
Given a population $U = \{1, \dots, N\}$ with non-negative weights $w_1, \dots, w_N$, PPSWOR refers to sampling a subset of fixed size $n$ such that, at each selection step, an element $i$ is chosen with probability $w_i$ divided by the sum of weights of the remaining (unselected) elements. Formally, if $R_t$ denotes the set of as-yet-unsampled units at step $t$, the conditional probability of selecting $i \in R_t$ is:
$$\Pr(\text{select } i \mid R_t) = \frac{w_i}{\sum_{j \in R_t} w_j}.$$
Classic PPSWOR schemes include:
- Hájek’s rejective (conditional Poisson) sampling: Equivalent to repeatedly drawing Bernoulli indicators with Poissonized marginal probabilities, then conditioning on the sample having fixed size; yields computationally tractable expressions for inclusion probabilities via elementary symmetric polynomials (Yu, 2010).
- Successive (sequential) PPSWOR: Sampling with replacement, discarding repeats, until $n$ distinct items are drawn; admits an explicit, albeit complex, sum-of-products representation for inclusion probabilities.
- Streaming and algorithmic variants: Schemes such as the EB-PPS algorithm provide exact PPSWOR (i.e., sticking to prescribed proportionality in marginal inclusion probabilities) while controlling sample size in a single pass over weighted streams (Hentschel et al., 2021).
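The step-by-step definition above can be sketched in a few lines of Python. This is a minimal illustration of successive PPSWOR only, assuming that eligible units have strictly positive weights; the function name `successive_ppswor` and its parameters are hypothetical and not taken from any of the cited works.

```python
import random

def successive_ppswor(weights, n, rng=None):
    """Draw n distinct indices; each draw picks one of the remaining units
    with probability proportional to its weight (hypothetical helper)."""
    rng = rng or random.Random()
    # Units with zero weight can never be selected, so drop them up front.
    remaining = {i: w for i, w in enumerate(weights) if w > 0}
    if n > len(remaining):
        raise ValueError("n exceeds the number of units with positive weight")
    sample = []
    for _ in range(n):
        units = list(remaining)
        # Conditional selection probability: w_i / (sum of remaining weights).
        pick = rng.choices(units, weights=[remaining[u] for u in units], k=1)[0]
        sample.append(pick)
        del remaining[pick]
    return sample

# Example: four units, sample three of them in draw order.
print(successive_ppswor([5.0, 1.0, 1.0, 3.0], 3, random.Random(0)))
```

The returned list preserves the order of the draws, which is sometimes needed by order-dependent estimators; callers that only need the sample can convert it to a set.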
For any unit $i$, the first-order inclusion probability $\pi_i$ under PPSWOR with sample size $n$ and weights $p_1, \dots, p_N$ (normalized, so $\sum_i p_i = 1$) is:
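- Rejective: $\pi_i^{R} = p_i \, e_{n-1}(p_{-i}) / e_n(p)$, where $e_k$ is the $k$th elementary symmetric sum and $p_{-i}$ denotes the weights with unit $i$ removed.
- Successive: $\pi_i^{S}$ has no comparably compact form; it is given by the intricate sum detailed in (Yu, 2010).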
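As a rough illustration of the rejective case, the sketch below assumes the standard conditional-Poisson form in which a size-$n$ sample $S$ has probability proportional to $\prod_{i \in S} p_i$, so that $\pi_i = p_i \, e_{n-1}(p_{-i}) / e_n(p)$. The helper names are hypothetical, and the leave-one-out loop is deliberately naive rather than optimized.

```python
def elementary_symmetric(values, k):
    """Return [e_0, ..., e_k] for `values` via the usual in-place recursion
    e_j <- e_j + v * e_{j-1}, applied once per value."""
    e = [1.0] + [0.0] * k
    for v in values:
        # Go downward so each value is used at most once per product.
        for j in range(k, 0, -1):
            e[j] += v * e[j - 1]
    return e

def rejective_inclusion_probs(p, n):
    """First-order inclusion probabilities, assuming P(S) is proportional
    to the product of p_i over i in S (hypothetical helper)."""
    p = list(p)
    e_n = elementary_symmetric(p, n)[n]
    probs = []
    for i, p_i in enumerate(p):
        others = p[:i] + p[i + 1:]          # weights with unit i removed
        probs.append(p_i * elementary_symmetric(others, n - 1)[n - 1] / e_n)
    return probs

pi = rejective_inclusion_probs([0.5, 0.3, 0.2], 2)
print(pi, sum(pi))
```

For `p = [0.5, 0.3, 0.2]` and `n = 2` the probabilities sum to 2 (up to rounding), as they must for any fixed-size design.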
2. Key Theoretical Properties: Uniformity, Proportionality, and Majorization
PPSWOR methods, in contrast to with-replacement sampling, induce dependencies and negative correlations among the inclusion indicators of different units. The inclusion probabilities inherit important structural properties:
- Majorization order: For both rejective and successive PPSWOR, as $n$ increases, the normalized inclusion vector $\pi/n$ becomes more uniform, i.e., it is majorized by the corresponding vector at any smaller $n$. For fixed $n$ and draw probabilities $p$, the inclusion probabilities produced by rejective sampling are always more uniform (in the majorization sense) than those from successive sampling, confirming a conjecture of Hájek (Yu, 2010).
- Kullback-Leibler divergence: Inclusion probabilities from successive PPSWOR are always closer (in KL divergence) to the original drawing probabilities $p$ than those from rejective PPSWOR. That is, $\pi^{S}/n$ is more “proportional” to $p$, while $\pi^{R}/n$ is more “uniform” (Yu, 2010).
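- Negative correlation: In any PPSWOR scheme, inclusion indicators for distinct units exhibit negative dependence; $\operatorname{Cov}(I_i, I_j) \le 0$ for $i \ne j$, where $I_i$ indicates that unit $i$ is in the sample (Huang et al., 2024).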
These facts have critical implications for variance reduction, estimator design, and the analysis of sampling-based algorithms.
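For very small populations these properties can be checked exactly by brute force. The sketch below enumerates every ordered draw of successive PPSWOR and accumulates first- and second-order inclusion probabilities; the function name is hypothetical, and the enumeration grows factorially, so it is meant only as a sanity check.

```python
from itertools import permutations

def successive_inclusion_exact(p, n):
    """Exact pi_i and pi_ij under successive sampling, by enumerating all
    ordered draws of n distinct units (feasible only for tiny N)."""
    N = len(p)
    pi = [0.0] * N
    pij = [[0.0] * N for _ in range(N)]
    for order in permutations(range(N), n):
        prob, remaining = 1.0, sum(p)
        for i in order:
            prob *= p[i] / remaining   # conditional draw probability
            remaining -= p[i]
        for i in order:
            pi[i] += prob
            for j in order:
                if i != j:
                    pij[i][j] += prob
    return pi, pij

pi, pij = successive_inclusion_exact([0.5, 0.3, 0.2], 2)
for i in range(3):
    for j in range(i + 1, 3):
        # Negative correlation: joint probability below the product.
        print(i, j, pij[i][j] <= pi[i] * pi[j])
```

With drawing probabilities $(0.5, 0.3, 0.2)$ and $n = 2$, the inclusion probabilities come out at approximately $(0.839, 0.675, 0.486)$, so $\pi/n$ is visibly flatter than $p$, and every pairwise check prints `True`.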
3. Algorithms, Computational Complexity, and Streaming Schemes
Computing inclusion probabilities in PPSWOR depends on the chosen variant and the population size.
- Rejective sampling: Dynamic programming via the recursion for elementary symmetric polynomials computes all inclusion probabilities in time polynomial in $N$ and $n$ (Yu, 2010).
- Successive sampling: The sum-of-products formula requires potentially exponential time; practical methods include Monte Carlo simulation, sandwiching by majorization bounds, or iterative refinement (Yu, 2010).
- Streaming EB-PPS: The EB-PPS algorithm maintains a “latent sample” representation, updating inclusion probabilities and sample content in $O(1)$ amortized time per item. At each step $t$, the proportionality factor $\rho_t$ is updated as $\rho_t = \min\{1/\max_{j \le t} w_j,\; n/W_t\}$ with $W_t = \sum_{j \le t} w_j$, and inclusion probabilities are set as $\pi_{i,t} = \rho_t w_i$ (Hentschel et al., 2021).
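The role of the proportionality factor can be illustrated without the latent-sample machinery. The sketch below assumes the factor is the largest value $\rho$ satisfying both $\rho w_i \le 1$ for every item seen so far and $\rho W_t \le n$; that closed form is an assumption derived from those two constraints, and the code tracks only the factor and the implied expected sample size, not the EB-PPS sample itself. The function name is hypothetical.

```python
def pps_scale_factors(stream_weights, n):
    """Yield (rho_t, expected_sample_size_t) after each arrival, where
    rho_t = min(1 / max weight so far, n / total weight so far).
    Weights are assumed strictly positive."""
    total, w_max = 0.0, 0.0
    for w in stream_weights:
        total += w
        w_max = max(w_max, w)
        rho = min(1.0 / w_max, n / total)
        # Sum of inclusion probabilities rho * w_i over items seen so far.
        yield rho, rho * total

print(list(pps_scale_factors([1.0, 1.0, 1.0, 1.0], 2)))
print(list(pps_scale_factors([8.0, 1.0, 1.0], 2)))
```

In the second stream the heavy item forces $\rho = 1/8$, so the expected sample size stays below the target of 2: exact proportionality is kept and the sample size is only capped, not fixed.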
Modern applications employ sketch-based algorithms, such as bottom-$k$ and $\ell_q$-residual heavy hitter sketches, to perform PPSWOR efficiently in high-dimensional streaming or distributed environments (Cohen et al., 2020).
| Algorithm Type | Input Model | Complexity |
|---|---|---|
| Rejective (Hájek) | Static | Polynomial in $N$ and $n$ (DP) |
| Successive (sequential) | Static | Potentially exponential (exact sum-of-products); MC feasible for large $N$ |
| EB-PPS (streaming) | Streaming | $O(1)$ amortized per item |
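The bottom-$k$ idea mentioned above can be sketched for the simplest setting, where each key appears once with a known positive weight. Each item receives an independent exponential rank with rate equal to its weight, and the $k$ smallest ranks are kept; the resulting sample has the same distribution as successive PPSWOR. This is an illustrative one-pass version only, not the composable sketches of the cited work, and the function name is hypothetical.

```python
import heapq
import random

def bottom_k_ppswor(items, k, rng=None):
    """Keep the k items with the smallest ranks, rank ~ Exp(rate=weight).
    `items` yields (item, weight) pairs; non-positive weights are skipped."""
    rng = rng or random.Random()
    # Store (-rank, arrival index, item) so heap[0] holds the largest kept rank.
    heap = []
    for idx, (item, w) in enumerate(items):
        if w <= 0:
            continue
        rank = rng.expovariate(w)
        if len(heap) < k:
            heapq.heappush(heap, (-rank, idx, item))
        elif rank < -heap[0][0]:
            heapq.heapreplace(heap, (-rank, idx, item))
    # Return items by increasing rank, which mirrors the order of draws.
    return [item for _, _, item in sorted(heap, reverse=True)]

stream = [("a", 5.0), ("b", 1.0), ("c", 1.0), ("d", 3.0)]
print(bottom_k_ppswor(stream, 2, random.Random(1)))
```

Memory use is proportional to `k`, and two such summaries built with consistent ranks can be merged by keeping the `k` smallest ranks overall, which is what makes the approach attractive in distributed settings.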
4. Statistical Inference: Estimation, Variance, and Design-based/Bayesian Methods
PPSWOR underpins the design of unbiased estimators in finite population sampling, most notably the Horvitz-Thompson (HT) estimator. For population mean or total estimation:
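- HT estimator: If $\pi_i$ is the inclusion probability of unit $i$ and $y_i$ its observed value, the estimator of the population total is $\hat{T}_{HT} = \sum_{i \in S} y_i / \pi_i$, where $S$ is the sample.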
- Variance: The variance involves both first- and second-order inclusion probabilities $\pi_i$ and $\pi_{ij}$. For example, in cluster-randomized experiments, closed forms for the variance are available under PPSWOR, and conservative unbiased estimators (Sen-Yates-Grundy) can be employed even when joint probabilities are approximated (Xiong et al., 2020).
- Bayesian model-based inference: For two-stage cluster sampling by PPSWOR, Bayesian frameworks account for uncertainty in cluster size (for units not sampled), providing integrated, design-aware posterior inference with efficiency gains over classical approaches (Makela et al., 2017).
Practical computation of joint inclusion probabilities and variances can rely on analytical approximations (e.g., Hájek’s, Berger-Sitter) or simulation; specialized software, such as the R package TeachingSampling, supports standard PPSWOR algorithms including Sunter’s method (Xiong et al., 2020).
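The estimators discussed in this section are short to write down once the inclusion probabilities are available. The sketch below assumes a fixed-size design and uses the textbook Sen-Yates-Grundy form $\sum_{i<j \in S} \frac{\pi_i \pi_j - \pi_{ij}}{\pi_{ij}} \left(\frac{y_i}{\pi_i} - \frac{y_j}{\pi_j}\right)^2$; the function names are hypothetical, and the inputs are indexed by position within the sample.

```python
def horvitz_thompson_total(y, pi):
    """HT estimate of the population total from sampled values y and their
    first-order inclusion probabilities pi (same order as y)."""
    return sum(y_i / pi_i for y_i, pi_i in zip(y, pi))

def syg_variance(y, pi, pij):
    """Sen-Yates-Grundy variance estimate for a fixed-size design.
    pij[i][j] is the joint inclusion probability of sampled units i and j."""
    m = len(y)
    total = 0.0
    for i in range(m):
        for j in range(i + 1, m):
            diff = y[i] / pi[i] - y[j] / pi[j]
            total += (pi[i] * pi[j] - pij[i][j]) / pij[i][j] * diff ** 2
    return total

# Equal-probability check: N = 4, n = 2, so pi = 1/2 and pi_ij = 1/6.
y, pi = [3.0, 5.0], [0.5, 0.5]
pij = [[0.0, 1 / 6], [1 / 6, 0.0]]
print(horvitz_thompson_total(y, pi), syg_variance(y, pi, pij))
```

In the equal-probability check the two functions reduce to the familiar simple-random-sampling results, a total estimate of 16 and a variance estimate of 8 up to floating-point rounding, which is a convenient sanity test before plugging in approximated joint probabilities.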
5. Large Deviations, Concentration, and New Theoretical Bounds
Recent advances provide sharp deviation inequalities for PPSWOR:
- Martingale and pivotal methods: The Deville–Tillé pivotal method generates the fixed-size sample as a vector-martingale, providing the basis for Freedman (Bennett-type), Chernoff-type (Kullback-Leibler), and Hoeffding-Azuma concentration bounds for the number of sampled units falling in any subset $A$ of the population (Foster et al., 2024). The bounds are expressed in terms of the weight of the subset, a variance parameter, and the sample size $n$. They provide quantitative guarantees for the concentration of PPSWOR samples and adapt classical hypergeometric tail estimates to the unequal-probability setting.
- Relation to classical SRSWOR: For equal weights, these new martingale-based bounds recover the established hypergeometric large deviations.
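For the equal-weight case, the count of sampled units in a subset is hypergeometric, and its exact tail can be compared with a classical bound. The sketch below uses Hoeffding's inequality $\Pr(X - \mathbb{E}X \ge t) \le \exp(-2t^2/n)$, which is known to hold for sampling without replacement; it is not the sharper martingale-based bound of the cited work, and the function names and example numbers are illustrative assumptions.

```python
from math import comb, exp

def hypergeom_upper_tail(N, K, n, x):
    """P(X >= x), where X counts members of a K-subset in a simple random
    sample of size n drawn without replacement from N units."""
    denom = comb(N, n)
    return sum(comb(K, k) * comb(N - K, n - k)
               for k in range(x, min(K, n) + 1)) / denom

def hoeffding_bound(n, t):
    """Hoeffding's upper bound on P(X - E[X] >= t) for n draws."""
    return exp(-2.0 * t * t / n)

N, K, n = 100, 30, 20          # E[X] = n * K / N = 6
t = 5
exact = hypergeom_upper_tail(N, K, n, 6 + t)
print(exact, hoeffding_bound(n, t))
```

The exact tail sits well below the Hoeffding bound here, which shows how much room there is for variance-sensitive (Bennett-type or KL-type) inequalities to improve on it.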
6. Applications in Experimental Design, Urn Models, and Combinatorial Optimization
PPSWOR is the de facto standard in a variety of domains:
- Complex surveys and randomized experiments: In cluster-randomized trials, PPSWOR at the cluster level (with further SRS within clusters) produces unbiased and location-invariant estimators of treatment effects, with robust variance properties (Xiong et al., 2020).
- Generalized urn models: PPSWOR is realized as sequential sampling in two-color (or multi-color) Pólya–Eggenberger urn models, with arbitrary weight sequences. This framework unifies the classical hypergeometric case and allows both normal and non-normal (theta-function) limits, depending on weight structure (Kuba, 2010).
- Online algorithms and rounding: In online bipartite (and stochastic) matching, PPSWOR serves as an efficient rounding scheme for fractional matchings, yielding competitive guarantees (≥0.707 in stochastic and ≥0.513 in adversarial models), and is essential for negative dependence and correlated selection properties (Huang et al., 2024).
- Sketch-based data summarization: Bottom-$k$ and order-statistics-based PPSWOR constructs, extended to $\ell_p$-weighted objectives and signed data, underlie scalable randomized algorithms for large-scale machine learning and data analytics (Cohen et al., 2020).
7. Extensions, Limitations, and Recent Algorithmic Advances
While classic PPSWOR achieves fixed sample size and size-proportional marginal probabilities asymptotically, exact proportionality cannot, in general, be maintained simultaneously with a strict sample size constraint if some weight is large relative to $W/n$, where $W$ is the total weight: the proportional inclusion probability $n w_i / W$ would then exceed 1 (Hentschel et al., 2021). The EB-PPS algorithm introduces a controlled tradeoff: it enforces exact PPS in marginal probabilities, caps (but does not always fix) the sample size, and provides a simple one-pass streaming implementation with amortized $O(1)$ time per item. This method strictly generalizes fixed-size and pure-PPS sampling, yielding maximal expected sample size and minimal variance under an explicit sampling budget.
Modern sketch-based PPSWOR algorithms are composable, memory-efficient (space growing linearly with the sample size), and extend naturally to non-linear and signed weighting scenarios. These advances guarantee worst-case bias and variance bounds near those of classical PPSWOR, and in certain cases outperform with-replacement sampling in both estimator stability and representation diversity (Cohen et al., 2020).
References
- (Yu, 2010) On the inclusion probabilities in some unequal probability sampling plans without replacement
- (Kuba, 2010) On Sampling without replacement and OK-Corral urn models
- (Hentschel et al., 2021) Exact PPS Sampling with Bounded Sample Size
- (Xiong et al., 2020) The Benefits of Probability-Proportional-to-Size Sampling in Cluster-Randomized Experiments
- (Makela et al., 2017) Bayesian Inference under Cluster Sampling with Probability Proportional to Size
- (Foster et al., 2024) Large Deviations Inequalities for Unequal Probability Sampling Without Replacement
- (Huang et al., 2024) Online Matching Meets Sampling Without Replacement
- (Cohen et al., 2020) WOR and $p$’s: Sketches for $\ell_p$-Sampling Without Replacement