Weighted Sampling Strategy

Updated 18 April 2026

Weighted sampling strategy is a method where each item is selected with a probability proportional to its assigned weight, ensuring representativeness in diverse data applications.
It employs algorithmic techniques such as reservoir, without-replacement (WOR), and coordinated sampling to balance computational constraints against statistical accuracy.
Applications span multiple domains—from temporal knowledge graphs and subgraph counting to language modeling—yielding improvements in estimation accuracy and model performance.

Weighted sampling strategy refers to any method in which items, sets, or interactions are drawn at random from a population with probabilities proportional to specified weights associated with the items. Weighted sampling appears as a central component in numerous subfields (e.g., streaming algorithms, graph analysis, machine learning, temporal knowledge graphs, survey inference, privacy-preserving computation). It encompasses a diverse array of algorithmic approaches depending on theoretical goals, structural constraints, and computational architectures.

1. Mathematical Foundations and Weight Construction

Weighted sampling is formally defined by associating with each unit $i$ a positive weight $w_i>0$. For a population of $N$ items, item $i$ can be drawn in one of two ways:

With replacement: $p(i)=w_i/\sum_{j=1}^N w_j$ for each draw, or
Without replacement: items are drawn one by one, each time with probability proportional to its weight among the remaining (unpicked) items, so the exact probability that $i$ appears in a sample of size $n$ is more intricate and involves a sum over sampling orders (Ben-Hamou et al., 2016; Hübschle-Schneider et al., 2019).
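The two sampling modes above can be illustrated with a minimal Python sketch. The function names are hypothetical and the without-replacement routine is the naive sequential procedure (quadratic in the worst case), shown only to make the definition concrete rather than as an efficient implementation.

```python
import random

def sample_with_replacement(weights, n, rng):
    """Draw n indices i.i.d. with p(i) = w_i / sum_j w_j."""
    return rng.choices(range(len(weights)), weights=weights, k=n)

def sample_without_replacement(weights, n, rng):
    """Draw n distinct indices one by one, each proportional to the
    weight that remains among the unpicked items."""
    if n > len(weights):
        raise ValueError("sample size exceeds population size")
    remaining = list(range(len(weights)))
    chosen = []
    for _ in range(n):
        rem_weights = [weights[i] for i in remaining]
        pick = rng.choices(remaining, weights=rem_weights, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)  # the picked item can never be drawn again
    return chosen

rng = random.Random(0)
weights = [5.0, 3.0, 1.0, 1.0]
with_repl = sample_with_replacement(weights, 10, rng)      # may repeat indices
without_repl = sample_without_replacement(weights, 3, rng)  # 3 distinct indices
```

With replacement, every draw uses the same distribution; without replacement, the distribution is renormalized after each pick, which is why inclusion probabilities depend on the order in which items could have been drawn.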

In many applications, the weighting function is not static but determined dynamically by properties of the data (e.g., frequency, importance, RL-predicted utility) or by statistical requirements (e.g., inverse-probability weights, stratification, sample design adjustments). For example, in temporal knowledge graphs, the weight of a quadruple $q=(s,r,o,t)$ is set as a symmetric function of the inverse frequencies of $s$ and $o$, using statistics over the training stream so that rare entities are up-sampled (Mirtaheri et al., 25 Jul 2025):

$w(q) = g\!\left(\frac{1}{f(s)},\ \frac{1}{f(o)}\right)$

where $f(\cdot)$ counts an entity's occurrences in the training stream and $g$ is a symmetric aggregator (e.g., min, max, or mean).
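As a rough illustration of the inverse-frequency weighting of quadruples described above, the following sketch counts entity occurrences and combines the two inverse frequencies with a symmetric function. The function name, the choice of the mean as the default combiner, and the toy data are assumptions for illustration, not the cited authors' implementation.

```python
from collections import Counter

def inverse_frequency_weights(quads, combine=lambda a, b: (a + b) / 2):
    """Weight each (s, r, o, t) by a symmetric function of 1/freq(s), 1/freq(o).

    `combine` must be symmetric in its two arguments (mean, min, max, ...).
    """
    counts = Counter()
    for s, _r, o, _t in quads:
        counts[s] += 1  # entity seen as subject
        counts[o] += 1  # entity seen as object
    return [combine(1.0 / counts[s], 1.0 / counts[o]) for s, _r, o, _t in quads]

quads = [
    ("a", "r", "b", 0),
    ("a", "r", "c", 1),
    ("a", "r", "b", 2),
    ("d", "r", "c", 3),  # "d" is rare, so this quadruple is up-weighted
]
weights = inverse_frequency_weights(quads)
# weights[3] == 0.75, larger than the weight of the quadruples involving "a"
```

Passing `combine=min` or `combine=max` gives the other symmetric variants; in a streaming setting the counter would be updated incrementally rather than computed in one pass.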

In streaming subgraph counting, edge weights are determined by local and temporal feature vectors, and the weighting function may be optimized using RL to minimize estimation error for downstream tasks (Wang et al., 2022).

2. Weighted Sampling Algorithms: Core Procedures

Weighted sampling algorithms are distinguished by both sampling paradigm (with/without replacement, sequential/parallel) and by the structure of the population (flat sets, graphs, streams, joins, key‐value maps):

Reservoir Sampling for Streams: In sequential streaming, weighted reservoir sampling ensures that at all times the reservoir contains $k$ i.i.d. samples drawn proportionally to the current weights (Meligrana, 2024; Jayaram et al., 2019). For with‐replacement sampling, each new arrival $x_t$ with weight $w_t$ replaces each existing reservoir entry with probability $w_t/W_t$, where $W_t=\sum_{j\le t} w_j$ is the running total. A skip-based generalization computes how many items to skip before the next replacement, greatly increasing efficiency for small sample-to-population ratios.
Without Replacement ("WOR") Sampling: Statistically, the concentration behavior of weighted sampling without replacement is controlled via martingale couplings and submartingale inequalities (Ben-Hamou et al., 2016). Algorithmically, WOR is often implemented using bottom-$k$ or priority-key constructions: assign each item $i$ a random key, e.g. $u_i^{1/w_i}$ with $u_i \sim \mathrm{Uniform}(0,1)$ or a related transform, and select the top $k$ keys as the sample (Cohen et al., 2020; Hübschle-Schneider et al., 2019).
Parallel/Distributed Settings: Efficient constructions (e.g., distributed alias tables, mapping-based reductions) support shared/distributed-memory for high-velocity streaming or large populations, achieving near-linear speedup (Hübschle-Schneider et al., 2019).
Batch/Minibatch Sampling in ML: Sampling batches in which a fixed fraction of examples is chosen according to a weighted distribution (e.g., inverse frequency) and the remainder uniformly is used in temporal knowledge graph (TKG) completion and masked language modeling to prioritize rare or poorly learned items while maintaining generalization (Mirtaheri et al., 25 Jul 2025; Zhang et al., 2023).
Coordinated/Correlated Sampling: For multiple related weight assignments (e.g., multi-period, multi-objective, multi-attribute data), coordinated bottom-$k$ sampling via shared random seeds provides order-of-magnitude variance reduction for estimating aggregate functions involving max, min, or $L_1$ differences (0906.4560).
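The priority-key construction for without-replacement sampling listed above can be sketched in a single streaming pass. This is a minimal illustration using the well-known key $u^{1/w}$ with $u$ uniform on $(0,1)$ and a min-heap of size $k$; the function name and the example stream are hypothetical, and the sketch is not taken from any of the cited papers.

```python
import heapq
import random

def weighted_reservoir_wor(stream, k, rng):
    """One pass over (item, weight) pairs; keep the k items with the
    largest keys u ** (1 / w), where u is uniform on [0, 1)."""
    heap = []  # min-heap of (key, arrival_index, item); root = smallest kept key
    for idx, (item, w) in enumerate(stream):
        if w <= 0:
            continue  # zero or negative weights are never sampled
        key = rng.random() ** (1.0 / w)
        if len(heap) < k:
            heapq.heappush(heap, (key, idx, item))
        elif key > heap[0][0]:
            # New key beats the smallest key in the reservoir: swap it in.
            heapq.heapreplace(heap, (key, idx, item))
    return [item for _key, _idx, item in heap]

rng = random.Random(42)
stream = [("heavy", 100.0)] + [(f"x{i}", 1.0) for i in range(20)]
sample = weighted_reservoir_wor(stream, k=3, rng=rng)  # 3 distinct items
```

Because each item receives exactly one key, no item can appear twice, and heavier items receive keys closer to 1 and so are more likely to survive. The arrival index in the heap tuple only breaks ties so that items themselves are never compared.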

3. Adaptive and Optimized Weighted Sampling

Optimizing weighted sampling schedules is essential for efficiency and variance reduction. Typical adaptive strategies include:

Variance-driven bin allocation (weighted ensemble sampling): In multiscale/Markov chain contexts, particles/replicas are allocated according to the square root of local variance (as estimated from a coarse model), minimizing mean squared error in time- or steady-state averages. The allocation rule takes the form (Aristoff et al., 2018; Aristoff, 2016):

$N_u \propto \sqrt{v_u}$

where $N_u$ is the number of particles allocated to bin $u$ and $v_u$ is an estimate of the local mutation variance in bin $u$.

Reinforcement-learning optimized weights: In online streaming, RL is used to adapt edge weights dynamically for subgraph-reservoir sampling, balancing the value of immediate vs. future subgraph closures (Wang et al., 2022).
Active and stratified weighted walks: In high-skew graphs, stratified weighted random walks modulate edge weights according to strata and variance proxies, efficiently oversampling small or important categories while controlling Markov chain mixing (Kurant et al., 2011).
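The variance-driven allocation in the first item above (particles assigned according to the square root of a local variance estimate) can be sketched as follows. This is a simplified illustration under stated assumptions: the function name is hypothetical, the variance estimates are given as plain numbers, and largest-remainder rounding is one arbitrary way to obtain integer counts. The cited works may include additional factors in the allocation rule.

```python
import math

def allocate_particles(variances, total):
    """Split `total` particles across bins proportionally to sqrt(variance)."""
    roots = [math.sqrt(max(v, 0.0)) for v in variances]
    norm = sum(roots)
    if norm == 0:
        # No variance information: fall back to an even split.
        shares = [total / len(variances)] * len(variances)
    else:
        shares = [total * r / norm for r in roots]
    counts = [math.floor(s) for s in shares]
    # Hand out the particles lost to rounding, largest fractional part first.
    leftover = total - sum(counts)
    by_remainder = sorted(range(len(shares)),
                          key=lambda u: shares[u] - counts[u], reverse=True)
    for u in by_remainder[:leftover]:
        counts[u] += 1
    return counts

counts = allocate_particles([4.0, 1.0, 0.25, 0.0], total=100)
# counts == [57, 29, 14, 0]
```

In practice a minimum of one particle per occupied bin is often desirable so that no region is abandoned entirely; that safeguard is omitted here for brevity.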

4. Applications Across Domains

Weighted sampling serves as a fundamental primitive in many research areas:

Domain	Objective	Weighted Sampling Role
Streaming/Sketches	Sketch-based estimates of aggregates, heavy hitters	Bottom-$k$, Poisson, $\ell_p$-norm, and reservoir techniques (Cohen et al., 2020; Hübschle-Schneider et al., 2019)
Survey Inference	Design-based estimation with unequal inclusion probabilities	Weighted likelihood bootstrap, sandwich variance adjustment (Das et al., 15 Apr 2025)
Knowledge Graphs	Robust link prediction in long-tail, incremental graphs	Batch selection favoring rare-entity quadruples (Mirtaheri et al., 25 Jul 2025)
LLMs	Unbiased token embedding for rare-word representations	Token-masking probability proportional to inverse frequency or loss (Zhang et al., 2023)
Differential Privacy	Release of private samples/summary statistics	Post-processing nonprivate samples with DP-optimally adjusted weights (Cohen et al., 2020)
Graph Sampling	Extraction of representative subgraphs in massive graphs	Adaptive edge weighting and local update rules (Yousuf et al., 2019)
Multi-Criteria Optimization	Pareto front approximation in MCDM	Systematic grid, Dirichlet, stratified simple sampling (Williams et al., 2024)
Joins and Relational Data	Sampling from huge relational joins	Dynamic-programming weights, join-tree sampling (Shekelyan et al., 2022)

Each setting tailors the notion of "importance" or "rarity" to a problem-specific signal measured by the weighting scheme, and the sampling algorithm is correspondingly adapted to exploit computational structure (e.g., streaming, batch, parallel).

5. Empirical Impact and Trade-offs

Numerous studies report gains from weighted sampling in estimation accuracy, model performance, and computational efficiency. For example, upweighting rare entities in TKG completion methods yields MRR improvements over uniform sampling, with negligible overhead when applied at the data-loader level (Mirtaheri et al., 25 Jul 2025). In streaming subgraph estimation, fine-tuned RL-based weighting delivers lower relative error and faster updates compared to uniform sampling of edges (Wang et al., 2022). In unsupervised LLM training, dynamic or frequency-based weighted masking raises sentence-representation quality (Spearman's $\rho$) in STS tasks, mainly via improved rare-token embeddings (Zhang et al., 2023).

Key trade-offs include:

Tuning the fraction of weighted vs. uniform sampling to balance focus on rare examples against generalizability.
Computational complexity vs. statistical benefit: skip-based reservoir sampling improves on the naive per-update cost when the sample-to-population ratio is small, but its overhead dominates at high ratios (Meligrana, 2024).
Memory and message complexity: distributed weighted sampling without replacement (SWOR) achieves near-optimal communication, in contrast to naive global coordination (Jayaram et al., 2019).
Redundancy vs. coverage in weight simplex sampling: grid-based approaches guarantee uniformity but scale poorly as the number of objectives grows; random Dirichlet or stratified LHS/LHHS offer scalable alternatives with stochastic coverage (Williams et al., 2024).
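The first trade-off above, mixing weighted and uniform draws within a batch, can be made concrete with a short sketch. The function and parameter names (`mixed_batch`, `weighted_fraction`) are hypothetical, and drawing with replacement inside a batch is an assumption made for simplicity.

```python
import random

def mixed_batch(items, weights, batch_size, weighted_fraction, rng):
    """Build a batch where a fraction of slots is filled by weighted draws
    and the rest by uniform draws (both with replacement)."""
    if not 0.0 <= weighted_fraction <= 1.0:
        raise ValueError("weighted_fraction must lie in [0, 1]")
    n_weighted = round(weighted_fraction * batch_size)
    weighted_part = rng.choices(items, weights=weights, k=n_weighted)
    uniform_part = rng.choices(items, k=batch_size - n_weighted)
    batch = weighted_part + uniform_part
    rng.shuffle(batch)  # avoid a fixed weighted-then-uniform ordering
    return batch

rng = random.Random(7)
items = list(range(10))
weights = [1.0] * 9 + [20.0]  # item 9 plays the role of a rare, up-weighted example
batch = mixed_batch(items, weights, batch_size=8, weighted_fraction=0.5, rng=rng)
```

Setting `weighted_fraction` to 0 recovers uniform sampling and 1 recovers fully weighted sampling, so the parameter can be swept directly when tuning.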

6. Theoretical Guarantees and Statistical Properties

Weighted sampling algorithms are subject to rigorous unbiasedness and concentration guarantees:

Horvitz-Thompson estimators: For any function $f$, the estimator $\hat{S}=\sum_{i\in \mathcal{S}} f(i)/\pi_i$ of the population total $\sum_{i} f(i)$ is unbiased when each item $i$ is included in the sample $\mathcal{S}$ with known probability $\pi_i>0$ (0906.4560).
Martingale/submartingale coupling: Sums sampled without replacement exhibit sub-Gaussian concentration similar to the with-replacement case, and the variance improves as the unsampled mass decreases (Ben-Hamou et al., 2016).
Bounds on sample complexity for sum estimation: In the proportional-sampling model, $\Theta(\sqrt{N}/\varepsilon)$ samples suffice and are necessary for estimating $W=\sum_{i} w_i$ to relative error $\varepsilon$ with constant probability (Beretta et al., 2021).
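The unbiasedness of the Horvitz-Thompson estimator in the first item above can be checked numerically. The sketch below assumes Poisson sampling (each item included independently with its own known probability), which is only one of several designs the estimator applies to; the function names and toy values are hypothetical.

```python
import random

def poisson_sample(values, incl_probs, rng):
    """Include each value independently with its own known probability."""
    return [(v, p) for v, p in zip(values, incl_probs) if rng.random() < p]

def horvitz_thompson_sum(sample):
    """Estimate the population total by inverse-probability weighting."""
    return sum(v / p for v, p in sample)

values = [10.0, 20.0, 30.0, 40.0]   # true total is 100
incl_probs = [0.2, 0.4, 0.6, 0.8]   # known inclusion probabilities
rng = random.Random(1)

trials = 20000
mean_estimate = sum(
    horvitz_thompson_sum(poisson_sample(values, incl_probs, rng))
    for _ in range(trials)
) / trials
# mean_estimate is close to the true total of 100, although single estimates vary widely
```

Each individual estimate can be far from the truth (an empty sample gives 0), but dividing by the inclusion probability exactly compensates for how often each item is missed, so the average over repeated samples converges to the total.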

For ensemble and stratified methods, optimizing the allocation variables yields provably minimal variance subject to budget constraints (Aristoff et al., 2018; Aristoff, 2016; Kurant et al., 2011). For private weighted sampling, the calibrated inclusion probabilities maximize the probability of reporting each item subject to $(\varepsilon,\delta)$-DP constraints and provably outperform baseline histogram methods (Cohen et al., 2020).

7. Implementation, Tuning, and Best Practices

Best practices for weighted sampling depend on the application context and computational regime:

Maintain efficient data structures (alias tables, Fenwick trees, hash-based maps) for $O(1)$ or $O(\log N)$ draw/update operations on static or slowly changing populations (Hübschle-Schneider et al., 2019).
In streaming/minibatch contexts, update weighting statistics incrementally, avoiding reliance on full-data precomputation (Mirtaheri et al., 25 Jul 2025; Zhang et al., 2023).
Empirically tune the weighted-sampling fraction, weighting functions (min, max, mean), smoothing parameters, and batch sizes to optimize out-of-sample performance or estimation error.
In parallel/distributed settings, partition sampling responsibilities (e.g., a multinomial split over total weights, independent local skip-based sampling (Meligrana, 2024)), then merge the results so that the output distribution is correct.
For multiple objectives/attributes, construct coordinated sketches with shared randomization, and always use the inclusive estimator for multi-assignment aggregates (0906.4560).
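As an example of the data structures recommended in the first item above, a Fenwick (binary indexed) tree supports both weight updates and weighted draws in $O(\log N)$ time. The class below is a minimal sketch with hypothetical names; it is not taken from the cited work, and an alias table would be the usual choice when weights never change.

```python
import random

class FenwickSampler:
    """Weighted sampling with O(log N) draws and O(log N) weight updates."""

    def __init__(self, weights):
        self.n = len(weights)
        self.tree = [0.0] * (self.n + 1)  # 1-based Fenwick array of partial sums
        self.weights = [0.0] * self.n
        for i, w in enumerate(weights):
            self.update(i, w)

    def update(self, i, new_weight):
        """Set the weight of item i (0-based)."""
        delta = new_weight - self.weights[i]
        self.weights[i] = new_weight
        j = i + 1
        while j <= self.n:
            self.tree[j] += delta
            j += j & -j

    def total(self):
        """Sum of all weights, read off the tree."""
        s, j = 0.0, self.n
        while j > 0:
            s += self.tree[j]
            j -= j & -j
        return s

    def sample(self, rng):
        """Return index i with probability w_i / total()."""
        target = rng.random() * self.total()
        pos = 0
        step = 1 << self.n.bit_length()
        while step:  # binary descent: largest pos with prefix_sum(pos) <= target
            nxt = pos + step
            if nxt <= self.n and self.tree[nxt] <= target:
                target -= self.tree[nxt]
                pos = nxt
            step >>= 1
        return min(pos, self.n - 1)  # guard against floating-point overshoot

rng = random.Random(3)
sampler = FenwickSampler([1.0, 0.0, 3.0])
draw = sampler.sample(rng)   # index 0 or 2; index 1 has zero weight
sampler.update(0, 0.0)       # after this, only index 2 can be drawn
```

The update path makes this structure a natural fit when weighting statistics change incrementally, as in the streaming and minibatch settings mentioned above.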

Weighted sampling is thus a unifying paradigm underpinning variance reduction, fairness, rare-event capture, and scalable analytics in modern computational data science. Its rigorous theoretical footing and broad empirical success make it foundational in both classical statistical and modern machine learning pipelines.