Complexity-Based Weighted Sampling

Updated 9 December 2025
  • Complexity-Based Weighted Sampling is an approach that allocates computational effort based on intrinsic data complexity and structural irregularity.
  • It leverages adaptive thresholding, hash-based partitioning, and domain decomposition to efficiently sample under resource and logical constraints.
  • The methodology achieves near-optimal performance with provable bounds in distributed streaming, multi-criteria decision making, SAT solving, and graph learning applications.

A complexity-based weighted sampling strategy encompasses algorithmic frameworks that adaptively allocate computational, statistical, or communication resources for weighted sampling according to the intrinsic complexity, sparsity, or geometric irregularity of the underlying data or problem domain. “Complexity” in this context refers both to computational cost (e.g., time, messages, or memory required for sampling) and to the combinatorial structure (e.g., distribution skew, Pareto-front nonlinearity, or logical constraints) that governs how sampling effort should be distributed for statistical or solution representativity. Leading complexity-based frameworks include message-optimal distributed reservoir sampling; adaptive, Pareto-front–refining weight-sample allocation for multi-criteria decision making; and techniques for efficient weighted-sample generation under logical, structural, or high-dimensional constraints. These strategies achieve provable optimality—often matching lower bounds—by selectively increasing sampling density or communication in regions of high information, geometric curvature, or weight heterogeneity, and reducing redundant effort in simpler regions.

1. Problem Characterization and Formal Definitions

Complexity-based weighted sampling arises in multiple domains where the objective is to select elements, assignments, or models from a large or distributed set, such that the probability of selection is proportional to item weights, subject to resource constraints or specific structure. Several core settings include:

  • Weighted Reservoir Sampling in Distributed Streams: $k$ sites each observe a stream $S_i$ of (item, weight) pairs $(e, w)$, and the goal is for a coordinator to continually maintain an exact weighted sample without replacement (SWOR) of size $s$—sampling each subset $R$ with

$$\Pr[R] = \prod_{j=1}^{s} \frac{w_{i_j}}{W - \sum_{t < j} w_{i_t}},$$

where $W = \sum_{(e, w)\in S} w$—with minimal communication and storage (Jayaram et al., 2019). A minimal single-machine sketch of this sampling rule is given at the end of this section.

  • MCDM Weighted-Sum Scalarisation: Given $p$ objectives and the weighted-sum minimization

$$\min_{x\in X} \sum_{i=1}^{p} \lambda_i f_i(x), \quad \lambda\in\Delta_p,$$

select a set $\{\lambda^{(1)}, \dots, \lambda^{(N)}\}$ to efficiently recover a representative set of Pareto-efficient solutions, adapting the weight-sampling density to the complexity (e.g., curvature or coverage gaps) of the Pareto front (Williams et al., 2024).

  • SAT Model Sampling and Counting: Given a CNF formula $F$ and a weight function $w : \{0,1\}^n \to (0,1]$, sample satisfying assignments $y$ with probability proportional to $w(y)$, optimizing the number of oracle/SAT queries in the presence of skewed weights (“tilt”) and logical support partitioning (Chakraborty et al., 2014, Riguzzi, 2024).
  • First-Order Weighted Model Sampling: For function-free FO$^2$ or C$^2$ sentences, sample models by decomposing the domain assignment (“types”) and recursing, with complexity only polynomial in the domain size, navigating the combinatorial configuration space (Wang et al., 2023).

Problem complexity is thus jointly determined by the distribution of weights, the structural properties of the domain or constraint system, and the performance metrics (e.g., message count, sample size, coverage or representativity).
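
To make the SWOR target distribution above concrete, the following is a minimal single-machine sketch in Python, using the exponential-key device described in Section 2: each item receives priority $v = w/t$ with $t \sim \mathrm{Exp}(1)$, and the $s$ items with the largest priorities form an exact weighted SWOR. The distributed coordination, epochs, and level batching that make the full protocol message-optimal are omitted.

```python
import heapq
import random

def weighted_swor(stream, s, rng=random):
    """Exact weighted sampling without replacement from a single stream.
    Each item (e, w) gets priority v = w / t with t ~ Exp(1); the s items
    with the largest priorities realise the successive-sampling law Pr[R]."""
    reservoir = []                      # min-heap of (priority, item); root holds the s-th largest key u
    for e, w in stream:
        t = rng.expovariate(1.0)        # t ~ Exp(1)
        v = w / t                       # heavier items tend to receive larger keys
        if len(reservoir) < s:
            heapq.heappush(reservoir, (v, e))
        elif v > reservoir[0][0]:       # only items beating the current threshold u can enter
            heapq.heapreplace(reservoir, (v, e))
    return [e for _, e in sorted(reservoir, reverse=True)]

# Example: heavier items are selected into the size-2 sample with higher probability.
sample = weighted_swor([("a", 1.0), ("b", 2.0), ("c", 5.0), ("d", 0.5)], s=2)
```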

2. Key Methodological Principles

Complexity-based weighted sampling algorithms leverage several fundamental techniques to minimize computational resources, measurement error, or communication given “hard” underlying distributions or structure. The dominant methodological motifs are:

  • Prioritization and Filtering by Thresholds (“Epochs”, “Levels”): Incoming items or weight vectors are only processed or forwarded if they exceed adaptive thresholds derived from the distribution of previously seen priorities, dramatically reducing unnecessary resource consumption. For streaming SWOR, priorities are set as $v = w/t$ with $t \sim \mathrm{Exp}(1)$, and only items that might enter the reservoir (those exceeding the current $s$-th largest key $u$) or that saturate a “level” (weight interval) are sent. Epoch transitions and level saturations are globally coordinated to synchronize threshold changes and redistribute responsibility across sites (Jayaram et al., 2019).
  • Adaptive Refinement in Weight Space (“Structured Adaptive Sampling”): In MCDM, initial coarse weight grids identify regions on the Pareto front with large objective-space gaps ($\|n(a) - n(b)\|_2 > \tau$). Only those intervals are subdivided, biasing new samples toward regions where the front is complex or currently under-represented. Sampling halts when either the redundancy (ratio of distinct Pareto points to total weights tried) drops below a threshold $\rho$ or all interpolated distances fall below $\tau$ (Williams et al., 2024).
  • Hash-based Decomposition for Distribution-Aware Model Sampling: For SAT, random low-independence XOR hashes $h$ partition the solution space into cells. Sampling is performed only within cells whose total weight lies in a target window $[\mathrm{loThresh}, \mathrm{hiThresh}]$, tailored to the current estimate of the model-space “tilt” and the available solver resources. This partitioning equalizes expected cell weight and optimally reduces enumeration complexity (Chakraborty et al., 2014); a toy sketch is given at the end of this section.
  • Domain Decomposition and Lifted Recursion: In symmetric FOL/C$^2$ model sampling, complexity is managed via a two-tiered partition: assignment of 1-types (domain-level partitions), followed by 2-table (pairwise relation) assignment via domain recursion. Each step processes only $O(\mathrm{poly}(n))$ configurations, with complexity controlled by the logic fragment and the number of quantifiers (Wang et al., 2023).
  • Alias/Index Structures for Constant-time Sampling: When sampling in weighted graphs (e.g., for network embedding), precomputed alias tables indexed by vertices, edges, or other context features enable $O(1)$ per-sample effort independent of weight-distribution complexity. This allows unbiased SGD updates despite highly skewed weighting (Chen et al., 2017); a minimal sketch of the alias construction follows below.
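
The constant-time behaviour in the last bullet can be illustrated with a short sketch of the alias construction (Vose's variant), assuming Python; the vertex/edge/context indexing and SGD integration of the cited embedding systems are omitted.

```python
import random

def build_alias_table(weights):
    """Vose's alias method: O(n) preprocessing, then O(1) per weighted draw."""
    n = len(weights)
    total = float(sum(weights))
    scaled = [w * n / total for w in weights]          # rescale so the average column mass is 1
    prob, alias = [1.0] * n, list(range(n))
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l               # column s keeps mass scaled[s], tops up from l
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    return prob, alias                                 # leftover columns keep prob = 1

def alias_draw(prob, alias, rng=random):
    """One weighted draw in O(1): pick a column uniformly, then flip a biased coin."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]

# Example: index 0 is returned roughly half the time for weights [5, 1, 1, 3].
prob, alias = build_alias_table([5.0, 1.0, 1.0, 3.0])
vertex = alias_draw(prob, alias)
```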

These frameworks are unified by their explicit attempt to direct computational or statistical effort to “difficult” regions—whether these are high-weight, high-curvature, or low-redundancy—while suppressing wasteful or redundant effort in “simple” or flat regions.
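
The hash-based decomposition in the third bullet can be caricatured in a few lines. The sketch below is a toy, not any tool's API: brute-force enumeration over all $2^n$ assignments stands in for SAT-solver calls, the parity constraints are simply random rather than drawn from an explicitly 3-wise independent family, and the weight-window acceptance test is shown explicitly; all names and parameters are illustrative.

```python
import itertools
import random

def hash_cell_sample(n, is_satisfying, weight, num_xors, lo, hi, rng=random):
    """Toy distribution-aware sampler: random XOR (parity) constraints partition the
    assignment space into cells; a cell is accepted only if its total weight lies in
    [lo, hi], and a solution is then drawn from it proportionally to its weight."""
    # Each parity constraint pairs a random subset of the n variables with a random bit.
    xors = [([rng.random() < 0.5 for _ in range(n)], rng.random() < 0.5)
            for _ in range(num_xors)]
    def in_cell(assign):
        return all(sum(m and a for m, a in zip(mask, assign)) % 2 == bit
                   for mask, bit in xors)
    cell = [a for a in itertools.product([False, True], repeat=n)
            if is_satisfying(a) and in_cell(a)]
    cell_weights = [weight(a) for a in cell]
    if not (lo <= sum(cell_weights) <= hi):
        return None                                    # reject this hash; the caller retries with a fresh one
    return rng.choices(cell, weights=cell_weights, k=1)[0]
```

A driver loop would sweep the number of XOR constraints (the cell-count grid $2^i$) and retry with fresh hashes until some cell is accepted.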

3. Optimality Results, Complexity Bounds, and Lower Limits

Complexity-based weighted sampling strategies are designed to match or approach worst-case lower bounds for resource metrics under adversarial or pathological distributions. Significant results include:

  • Distributed Weighted SWOR Communication: Total message complexity is

$$O\!\left(k\,\frac{\log(W/s)}{\log(1 + k/s)}\right)$$

for $k$ sites, total weight $W$, and sample size $s$; a matching lower bound holds even in the unweighted case, so no asymptotic gap remains to be closed (Jayaram et al., 2019). The protocol is message-, space-, and time-optimal up to universal constants (a worked numeric instance follows this list).

  • Weighted Sum Estimation Sample Complexity: For proportional-only sampling, $O(\sqrt{n}/\varepsilon)$ samples are necessary and sufficient for $(1\pm\varepsilon)$ accuracy; for hybrid proportional+uniform sampling, the bound is $O(n^{1/3}/\varepsilon^{4/3})$. Matching lower bounds are established via hypothesis-testing arguments between hard-to-distinguish weight mixtures (Beretta et al., 2021).
  • MCDM Adaptive Sampling Cost: Systematic grid search scales as $\binom{d+p-1}{p-1}$ in the grid depth $d$ and objective count $p$, while complexity-based adaptive refinement targets only “curved” regions, yielding far fewer required weights for front coverage at equivalent resolution. The worst case, in which every interval must be subdivided, can be exponential in depth, but empirical results indicate that the number of sampled weights satisfies $M \ll \binom{d+p-1}{p-1}$ when the front is simple (Williams et al., 2024).
  • Weighted Model Sampling (SAT, Weighted Constrained Sampling): The expected number of SAT-oracle queries required by WeightGen is $O(r/\varepsilon^2)$, where $r$ is the tilt, but this can be reduced to $O(\log r)$ when the weights are factored. Quantum algorithms achieve an oracle complexity of $O(2^{n/2}/\sqrt{\mathrm{WMC}})$, quadratically faster than the classical $\Omega(2^n/\mathrm{WMC})$ bound (Chakraborty et al., 2014, Riguzzi, 2024).
  • Variance–Cost Trade-offs in Adaptive MCMC: For variable-complexity weighted Gibbs sampling, the variance of the Rao–Blackwellized estimator scales as $O\!\big((P/S)^2 \tfrac{\log T}{T}\big)$ for signal dimension $P$, per-iteration computational budget $S$, and total iteration count $T$—formally quantifying the cost–variance trade-off (Truong, 2023).
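
To make the first bound concrete (ignoring the constant hidden in the $O(\cdot)$), take $k = 100$ sites, sample size $s = 10$, and $W/s = 10^6$:

$$k\,\frac{\log(W/s)}{\log(1 + k/s)} = 100 \cdot \frac{\ln 10^6}{\ln 11} \approx 100 \cdot \frac{13.8}{2.4} \approx 5.8 \times 10^{2},$$

i.e., on the order of a few hundred messages in total, independent of the stream length, whereas forwarding every update to the coordinator would cost one message per item.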

These complexity guarantees render the strategies robust to worst-case scenarios and justify their aggressive biasing of resource allocation to “difficult” subspaces.

4. Algorithmic Instantiations and Pseudocode

The complexity-based weighted sampling regime encompasses several concrete algorithmic instantiations:

Distributed weighted SWOR (Jayaram et al., 2019):

  1. Priority assignment: For each incoming $(e, w)$, draw $t \sim \mathrm{Exp}(1)$ and set $v = w/t$.
  2. Epoch synchronization: Update the $s$-th largest key $u$ at the coordinator; propagate a new epoch when $u$ crosses powers of $r$, where $r = \max\{2, k/s\}$.
  3. Level-based batch handling: Weight levels $[r^j, r^{j+1})$ accumulate $4rs$ unkeyed entries, after which keys are assigned retroactively in a batch and regular keying resumes.

Structured adaptive weight sampling for MCDM (Williams et al., 2024):

  1. Initialize weights on a coarse grid (uniform increment $d_0$).
  2. Solve each subproblem and collect the nondominated points $N$.
  3. While the redundancy or coverage-gap criteria exceed their thresholds:
    • Identify intervals with $\|n(a) - n(b)\|_2 > \tau$.
    • Subdivide those intervals; sample new $\lambda$ in the subintervals.
    • Update $N$ and repeat.
  4. Stop when all significant coverage gaps are eliminated or the redundancy criterion is met (a biobjective code sketch is given below).

Hash-based weighted model sampling (WeightGen; Chakraborty et al., 2014):

  1. Estimate the total solution weight $W(R_F)$ via WeightMC.
  2. Over a grid of cell counts $2^i$, select a random 3-wise independent XOR hash $h$; for a random target cell $\alpha$, enumerate all solutions in $R_{F, h, \alpha}$, accepting the cell only if its total weight lies within the prescribed bounds.
  3. Select a solution from the accepted cell with probability proportional to its weight and output it.

Lifted first-order model sampling (Wang et al., 2023):

  1. Enumerate all possible 1-type partitions (size configurations) of the domain.
  2. For each, recursively sample 2-tables (“cell–partition enumeration”) by induction, updating the problem at each step.

Pseudocode for each method is provided in the respective literature and is central for practical deployment and transferability.
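
As an illustration of the second outline (structured adaptive weight sampling), the following is a biobjective ($p = 2$) Python sketch; `solve` is a hypothetical black box returning the objective vector of a weighted-sum minimiser, midpoint subdivision stands in for whatever refinement rule an implementation prefers, and the redundancy criterion $\rho$ is omitted for brevity.

```python
import math

def adaptive_biobjective_weights(solve, tau, max_rounds=20):
    """Adaptive weight refinement for p = 2 objectives (minimisation).
    solve(lmbda) should return the objective vector (f1, f2) of a minimiser of
    lmbda*f1 + (1 - lmbda)*f2; only weight intervals whose endpoint solutions are
    more than tau apart in objective space are subdivided."""
    weights = [0.0, 0.5, 1.0]                          # coarse initial grid on the 1-simplex
    points = {w: tuple(solve(w)) for w in weights}
    for _ in range(max_rounds):
        gaps = [0.5 * (a + b) for a, b in zip(weights, weights[1:])
                if math.dist(points[a], points[b]) > tau]   # coverage gaps on the front
        if not gaps:
            break                                      # every interpolated distance is within tau
        for w in gaps:                                 # refine only the flagged intervals
            points[w] = tuple(solve(w))
        weights = sorted(set(weights) | set(gaps))
    vals = list(points.values())
    # Keep only nondominated points (componentwise minimisation).
    pareto = [p for p in vals
              if not any(q != p and q[0] <= p[0] and q[1] <= p[1] for q in vals)]
    return weights, pareto
```

Intervals whose endpoints map to nearby points on the front are never revisited, which is the source of the savings over a uniform grid at the same resolution.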

5. Applications and Empirical Findings

Complexity-based weighted sampling strategies have found deployment across multiple application domains with empirical validation:

  • Data Streams and Heavy Hitters: Used for tracking distributed heavy hitters with residual error, $\ell_1$ tracking, and dynamic frequency estimation in streaming data (Jayaram et al., 2019).
  • Multi-Objective Optimization: MCDM scalarisation with adaptive weight refinement efficiently recovers diverse Pareto sets with minimal redundancy, outperforming naive or grid-only approaches in head-to-head computational studies (Williams et al., 2024).
  • SAT and Model Sampling: In weighted model counting and sampling for probabilistic reasoning, hardware verification, and constraint satisfaction, distribution-aware and complexity-biased sampling algorithms scale to tens of thousands of variables with tight empirical agreement to theoretical frequency guarantees (Chakraborty et al., 2014).
  • Graph Machine Learning: Weighted-vertex sampling via alias-structures is employed in large-scale graph embedding and recommendation (e.g., MovieLens, KKBOX), yielding both higher-quality embeddings and improved computational efficiency compared to uniform schemes (Chen et al., 2017).
  • Kernel Approximation: Leverage-based weighted sampling in Random Fourier Features matches statistical guarantees of kernel approximation while dramatically reducing computation time (Liu et al., 2019).

Typically, such strategies not only improve computational efficiency but also attain provable representativity or unbiasedness, provided resource thresholds and stopping criteria are tuned according to theoretical guidance.

6. Trade-offs, Interpretation, and Practical Guidance

Implementing complexity-based weighted sampling involves explicit resource–accuracy trade-offs and parameter choices:

  • Communication vs. Sample Quality: In streaming/reservoir sampling, increasing the sample size $s$ relative to $k$ reduces communication cost, but only via slow logarithmic effects. Level-based batching is mandatory for handling heavy-tailed weight distributions (Jayaram et al., 2019).
  • Adaptive Refinement Thresholds: In weighted-sum scalarisation, the choice of tolerance $\tau$ and redundancy threshold $\rho$ governs the exploration-vs-exploitation balance; empirical evidence suggests starting with coarse grids and refining only high-gap regions is nearly optimal (Williams et al., 2024).
  • Tilt and Partitioning: In hash-based SAT sampling, complexity scales linearly or logarithmically with tilt, depending on the weight representation; careful selection of partition parameters is required for efficiency and success probability (Chakraborty et al., 2014).
  • Variance–Cost in MCMC: The per-iteration budget $S$ in variable-complexity weighted Gibbs sampling controls the variance–time trade-off, with empirical studies indicating $S/P \approx 0.1$–$0.2$ as a practical rule for balancing resource usage and accuracy in high dimensions (Truong, 2023).

In summary, complexity-based weighted sampling strategies generalize and subsume fixed, uniform approaches by responding intelligently to provably “hard” structure in the weighting or solution space, achieving both optimality in resource metrics and practical advantage in real-world deployments across disciplines.
