Complexity-Based Weighted Sampling
- Complexity-Based Weighted Sampling is an approach that allocates computational effort based on intrinsic data complexity and structural irregularity.
- It leverages adaptive thresholding, hash-based partitioning, and domain decomposition to efficiently sample under resource and logical constraints.
- The methodology achieves near-optimal performance with provable bounds in distributed streaming, multi-criteria decision making, SAT solving, and graph learning applications.
Complexity-based weighted sampling strategies are algorithmic frameworks that adaptively allocate computational, statistical, or communication resources for weighted sampling according to the intrinsic complexity, sparsity, or geometric irregularity of the underlying data or problem domain. “Complexity” in this context refers both to computational cost (e.g., time, messages, or memory required for sampling) and to the combinatorial structure (e.g., distribution skew, Pareto-front nonlinearity, or logical constraints) that governs how sampling effort should be distributed for statistical or solution representativity. Leading complexity-based frameworks include message-optimal distributed reservoir sampling; adaptive, Pareto-front–refining weight-sample allocation for multi-criteria decision making; and techniques for efficient weighted-sample generation under logical, structural, or high-dimensional constraints. These strategies achieve provable optimality, often matching lower bounds, by selectively increasing sampling density or communication in regions of high information content, geometric curvature, or weight heterogeneity, and by reducing redundant effort in simpler regions.
1. Problem Characterization and Formal Definitions
Complexity-based weighted sampling arises in multiple domains where the objective is to select elements, assignments, or models from a large or distributed set, such that the probability of selection is proportional to item weights, subject to resource constraints or specific structure. Several core settings include:
- Weighted Reservoir Sampling in Distributed Streams: $k$ sites each observe a stream of (item, weight) pairs, and the goal is for a coordinator to continually maintain an exact weighted sample without replacement (SWOR) of size $s$, sampling each size-$s$ sequence of distinct items $(i_1,\dots,i_s)$ with probability $\prod_{j=1}^{s} \frac{w_{i_j}}{W - \sum_{l<j} w_{i_l}}$, where $W = \sum_i w_i$, with minimal communication and storage (Jayaram et al., 2019); a minimal centralized sketch appears at the end of this section.
- MCDM Weighted-Sum Scalarisation: Given $p$ objectives $f_1,\dots,f_p$ and the weighted-sum minimization $\min_{x \in X} \sum_{i=1}^{p} \lambda_i f_i(x)$ with $\lambda_i \ge 0$ and $\sum_i \lambda_i = 1$, select a set of weight vectors that efficiently recovers a representative set of Pareto-efficient solutions, adapting the weight-sampling density to the complexity (e.g., curvature or coverage gaps) of the Pareto front (Williams et al., 2024).
- SAT Model Sampling and Counting: Given a CNF formula $F$ and a weight function $w(\cdot)$ over assignments, sample satisfying assignments $\sigma$ with probability proportional to $w(\sigma)$, optimizing the number of oracle/SAT queries in the presence of skewed weights (“tilt”) and logical support partitioning (Chakraborty et al., 2014, Riguzzi, 2024).
- First-Order Weighted Model Sampling: For function-free FO$^2$ or C$^2$ sentences, sample models by decomposing domain assignments (“1-types”) and recursing, with complexity only polynomial in the domain size, navigating the combinatorial configuration space (Wang et al., 2023).
Problem complexity is thus jointly determined by the distribution of weights, the structural properties of the domain or constraint system, and the performance metrics (e.g., message count, sample size, coverage or representativity).
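To make the first setting concrete, the following minimal, centralized sketch (the distributed protocol of Jayaram et al. is considerably more involved, adding the epoch and level machinery described later) maintains a weighted SWOR of size $s$ by assigning each item the key $u^{1/w}$ with $u \sim \mathrm{Unif}(0,1)$ and keeping the $s$ largest keys, a standard construction due to Efraimidis and Spirakis; the function name and stream format are illustrative assumptions.

```python
import heapq
import random

def weighted_swor(stream, s, seed=None):
    """Maintain a weighted sample without replacement (SWOR) of size s
    from a stream of (item, weight) pairs, using one pass and O(s) memory.

    Each item receives the key u**(1/w) with u ~ Uniform(0,1); the s items
    with the largest keys form an exact weighted SWOR (Efraimidis-Spirakis).
    """
    rng = random.Random(seed)
    heap = []  # min-heap of (key, item); the smallest key is evicted first
    for item, w in stream:
        if w <= 0:
            continue  # zero/negative weights are never sampled
        key = rng.random() ** (1.0 / w)
        if len(heap) < s:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            # The new key beats the current s-th largest key: replace it.
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]

if __name__ == "__main__":
    stream = [("a", 10.0), ("b", 1.0), ("c", 5.0), ("d", 0.5), ("e", 2.0)]
    print(weighted_swor(stream, s=2, seed=0))
```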
2. Key Methodological Principles
Complexity-based weighted sampling algorithms leverage several fundamental techniques to minimize computational resources, measurement error, or communication given “hard” underlying distributions or structure. The dominant methodological motifs are:
- Prioritization and Filtering by Thresholds (“Epochs”, “Levels”): Incoming items or weight vectors are only processed or forwarded if they exceed adaptive thresholds derived from the distribution of previously seen priorities, dramatically reducing unnecessary resource consumption. For streaming SWOR, each item of weight $w$ receives a random, weight-scaled priority (e.g., the key $v = u^{1/w}$ with $u \sim \mathrm{Unif}(0,1)$), and only items that might enter the reservoir (those exceeding the current $s$-th largest key) or that saturate a “level” (weight interval) are sent. Epoch transitions and level saturations are globally coordinated to synchronize threshold changes and redistribute responsibility across sites (Jayaram et al., 2019).
- Adaptive Refinement in Weight Space (“Structured Adaptive Sampling”): In MCDM, an initial coarse weight grid identifies regions of the Pareto front with large objective-space gaps (adjacent nondominated points farther apart than a tolerance $\epsilon$). Only those weight intervals are subdivided, biasing new samples toward regions where the front is complex or currently under-represented. Sampling halts either when redundancy (the ratio of distinct Pareto points to total weights tried) drops below a threshold or when all interpolated distances fall below $\epsilon$ (Williams et al., 2024).
- Hash-based Decomposition for Distribution-Aware Model Sampling: For SAT, random low-independence XOR hashes partition the solution space into cells. Sampling is performed only within cells whose total weight lies in a prescribed target window (bounded above and below by thresholds derived from an estimate of the total model weight), tailored to the current estimate of model-space “tilt” and the available solver resources. This partitioning equalizes expected cell weight and optimally reduces enumeration complexity (Chakraborty et al., 2014).
- Domain Decomposition and Lifted Recursion: In symmetric FO$^2$/C$^2$ model sampling, complexity is managed via a two-tiered partition: assignment of 1-types (domain-level partitions), followed by 2-table (pairwise relation) assignment via domain recursion. Each step processes only polynomially many configurations, with the polynomial degree controlled by the logic fragment and the number of quantifiers (Wang et al., 2023).
- Alias/Index Structures for Constant-time Sampling: When sampling in weighted graphs (e.g., for network embedding), precomputed alias tables indexed by vertices, edges, or other context features enable per-sample effort independent of the weight distribution's complexity. This allows unbiased SGD updates despite highly skewed weighting (Chen et al., 2017); a minimal alias-table sketch follows this list.
These frameworks are unified by their explicit attempt to direct computational or statistical effort to “difficult” regions—whether these are high-weight, high-curvature, or low-redundancy—while suppressing wasteful or redundant effort in “simple” or flat regions.
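As a concrete instance of the alias-structure motif (see the bullet on constant-time sampling above), the sketch below implements Vose's alias method, a standard $O(n)$-preprocessing, $O(1)$-per-draw weighted sampler. It is a generic illustration rather than the specific vertex/edge index structure of Chen et al. (2017), and the class name is an assumption.

```python
import random

class AliasSampler:
    """O(1)-per-draw sampling from a fixed discrete distribution
    proportional to `weights`, after O(n) preprocessing (Vose's alias method)."""

    def __init__(self, weights, seed=None):
        self.rng = random.Random(seed)
        n = len(weights)
        total = float(sum(weights))
        scaled = [w * n / total for w in weights]  # mean-normalized weights
        self.prob = [0.0] * n
        self.alias = [0] * n
        small = [i for i, p in enumerate(scaled) if p < 1.0]
        large = [i for i, p in enumerate(scaled) if p >= 1.0]
        while small and large:
            s, l = small.pop(), large.pop()
            self.prob[s] = scaled[s]      # probability of keeping column s
            self.alias[s] = l             # otherwise redirect to column l
            scaled[l] -= 1.0 - scaled[s]  # column l donates the leftover mass
            (small if scaled[l] < 1.0 else large).append(l)
        for i in large + small:           # numerical leftovers get probability 1
            self.prob[i] = 1.0

    def draw(self):
        i = self.rng.randrange(len(self.prob))
        return i if self.rng.random() < self.prob[i] else self.alias[i]

if __name__ == "__main__":
    sampler = AliasSampler([0.5, 3.0, 1.5, 95.0], seed=0)  # heavily skewed weights
    counts = [0, 0, 0, 0]
    for _ in range(100_000):
        counts[sampler.draw()] += 1
    print(counts)  # roughly proportional to the weights
```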
3. Optimality Results, Complexity Bounds, and Lower Limits
Complexity-based weighted sampling strategies are designed to match or approach worst-case lower bounds for resource metrics under adversarial or pathological distributions. Significant results include:
- Distributed Weighted SWOR Communication: Total message complexity is
for sites, total weight , and sample size ; no factor remains to be closed, even in the unweighted case (Jayaram et al., 2019). The protocol is message-, space-, and time-optimal up to universal constants.
- Weighted Sum Estimation Sample Complexity: For proportional-only sampling over a universe of $n$ items, on the order of $\sqrt{n}$ samples (up to the dependence on the accuracy parameter $\epsilon$) are necessary and sufficient; with hybrid proportional-plus-uniform sampling, the bound improves to roughly $n^{1/3}$. Matching lower bounds are established via hypothesis-testing arguments between hard-to-distinguish weight mixtures (Beretta et al., 2021).
- MCDM Adaptive Sampling Cost: Systematic grid search grows combinatorially with the grid depth and the number of objectives, whereas complexity-based adaptive refinement targets only “curved” regions, requiring far fewer weight evaluations for equivalent front coverage at the same resolution. The worst case, in which all intervals must be subdivided, can be exponential in the refinement depth, but empirical results indicate that far fewer subdivisions are needed when the front is simple (Williams et al., 2024).
- Weighted Model Sampling (SAT, Weighted Constrained Sampling): The expected number of SAT queries required by WeightGen grows linearly in the tilt $\rho$ (the ratio of the maximum to the minimum weight of a satisfying assignment), and can be reduced to a logarithmic dependence on $\rho$ when weights are given in factored form. Quantum algorithms achieve an oracle complexity that is quadratically smaller than the classical bound (Chakraborty et al., 2014, Riguzzi, 2024).
- Variance–Cost Trade-offs in Adaptive MCMC: For variable-complexity weighted Gibbs sampling, the variance of the Rao–Blackwellized estimator is characterized explicitly as a function of the signal dimension, the per-iteration computational budget, and the total number of iterations, formally quantifying the cost–variance trade-off (Truong, 2023).
These complexity guarantees render the strategies robust to worst-case scenarios and justify their aggressive biasing of resource allocation to “difficult” subspaces.
4. Algorithmic Instantiations and Pseudocode
The complexity-based weighted sampling regime encompasses several concrete algorithmic instantiations:
- Distributed Weighted SWOR (Jayaram et al., 2019):
- Priority assignment: For each incoming pair $(i, w_i)$, draw $u_i \sim \mathrm{Unif}(0,1)$ and set a weight-scaled key (e.g., $v_i = u_i^{1/w_i}$); the current sample is the set of $s$ items with the largest keys.
- Epoch synchronization: Update the $s$-th largest key at the coordinator; propagate a new epoch whenever the induced weight threshold crosses successive constant-factor levels (e.g., powers of $2$).
- Level-based batch handling: Each weight level accumulates $4 r s$ unkeyed entries, after which keys are assigned retroactively in a batch and regular keying resumes.
- Structured Adaptive (Complexity-Based) Sampling for MCDM (Williams et al., 2024):
- Initialize weight vectors on a coarse simplex grid (uniform increment $\delta$).
- Solve each scalarised subproblem and collect the set of nondominated points $P$.
- While the redundancy measure or the largest coverage gap exceeds its threshold:
- Identify weight intervals whose adjacent Pareto points are separated by more than the gap tolerance $\epsilon$.
- Subdivide those intervals; sample new weight vectors in the subintervals.
- Update $P$ and repeat.
- Stop when all significant coverage gaps are eliminated or the redundancy criterion is met (a minimal bi-objective sketch of this loop appears after this list).
- Model Sampling via Hash Partitioning (Chakraborty et al., 2014):
- Estimate the total solution weight $W$ via WeightMC.
- For each candidate number of cells $2^m$, select a random 3-wise independent XOR hash $h:\{0,1\}^n \to \{0,1\}^m$; for a random target cell $\alpha \in \{0,1\}^m$, enumerate all solutions in $h^{-1}(\alpha)$, conditional on the cell's total weight residing in prescribed bounds.
- Select a solution from the accepted cell with probability proportional to its weight and output it (a toy end-to-end sketch appears after this list).
- Domain Recursion in FO$^2$/C$^2$ (Wang et al., 2023):
- Enumerate all possible 1-type partitions (size configurations) of the domain.
- For each, recursively sample 2-tables (“cell–partition enumeration”) by induction, updating the problem at each step.
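Returning to the structured adaptive sampling steps listed above, the following minimal bi-objective sketch illustrates the refinement loop on a toy discrete front. The candidate set, the oracle `solve_weighted`, the gap tolerance `eps`, and the bisection rule are illustrative assumptions rather than the exact procedure of Williams et al. (2024).

```python
import math

# Toy candidate set: points on a convex Pareto front f2 = 1/f1.
CANDIDATES = [(x, 1.0 / x) for x in [0.2 * k for k in range(1, 26)]]

def solve_weighted(lam):
    """Hypothetical single-objective oracle: minimize lam*f1 + (1-lam)*f2
    over the candidate set and return the chosen objective vector."""
    return min(CANDIDATES, key=lambda p: lam * p[0] + (1.0 - lam) * p[1])

def adaptive_weight_refinement(eps=0.3, init_grid=5, max_rounds=20):
    """Coarse-to-fine weight sampling: subdivide only weight intervals whose
    adjacent Pareto points are farther apart than `eps` in objective space."""
    lams = [i / (init_grid - 1) for i in range(init_grid)]
    front = {lam: solve_weighted(lam) for lam in lams}
    for _ in range(max_rounds):
        ordered = sorted(front.items(), key=lambda kv: kv[1][0])  # order by f1
        new_lams = []
        for (lam_a, pa), (lam_b, pb) in zip(ordered, ordered[1:]):
            gap = math.dist(pa, pb)
            if gap > eps and abs(lam_a - lam_b) > 1e-6:
                new_lams.append(0.5 * (lam_a + lam_b))  # bisect the weight interval
        if not new_lams:
            break  # all coverage gaps closed
        for lam in new_lams:
            front[lam] = solve_weighted(lam)
    # Distinct nondominated points found (redundant weights map to duplicates).
    return sorted(set(front.values()))

if __name__ == "__main__":
    pareto_points = adaptive_weight_refinement()
    print(f"{len(pareto_points)} distinct Pareto points recovered")
```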
Pseudocode for each method is provided in the respective literature and is central to practical deployment and transfer.
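As a toy end-to-end illustration of the hash-partitioning instantiation above (not the WeightGen implementation, which relies on a SAT solver and calibrated thresholds), the sketch below brute-forces the models of a small CNF, splits them into cells via random XOR parity constraints, accepts a cell only when its total weight falls in a target window, and samples within the accepted cell proportionally to weight; all names and parameters are illustrative.

```python
import itertools
import random

def models(cnf, n):
    """Brute-force all satisfying assignments of a CNF over n variables.
    A clause is a list of signed ints, e.g. [1, -3] means (x1 OR NOT x3)."""
    sats = []
    for bits in itertools.product([0, 1], repeat=n):
        if all(any((bits[abs(l) - 1] == 1) == (l > 0) for l in clause) for clause in cnf):
            sats.append(bits)
    return sats

def xor_cell(bits, xors):
    """Cell index of an assignment under random XOR (parity) constraints."""
    return tuple(sum(bits[i] for i in idx) % 2 for idx in xors)

def hash_partition_sample(cnf, n, weight, m, lo, hi, rng):
    """Repeatedly draw m random XOR constraints and a random target cell;
    accept the cell if its total weight lies in [lo, hi], then sample
    within it proportionally to weight."""
    sats = models(cnf, n)
    while True:
        xors = [[i for i in range(n) if rng.random() < 0.5] for _ in range(m)]
        target = tuple(rng.randint(0, 1) for _ in range(m))
        cell = [s for s in sats if xor_cell(s, xors) == target]
        total = sum(weight(s) for s in cell)
        if lo <= total <= hi:
            return rng.choices(cell, weights=[weight(s) for s in cell])[0]

if __name__ == "__main__":
    rng = random.Random(0)
    cnf = [[1, 2], [-1, 3], [2, -3, 4]]       # toy formula over 4 variables
    weight = lambda s: 2.0 ** sum(s)          # skewed weights: heavier if more 1s
    print(hash_partition_sample(cnf, n=4, weight=weight, m=1, lo=1.0, hi=30.0, rng=rng))
```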
5. Applications and Empirical Findings
Complexity-based weighted sampling strategies have found deployment across multiple application domains with empirical validation:
- Data Streams and Heavy Hitters: Used for tracking distributed heavy hitters with residual error, $L_1$ tracking, and dynamic frequency estimation in streaming data (Jayaram et al., 2019).
- Multi-Objective Optimization: MCDM scalarisation with adaptive weight refinement efficiently recovers diverse Pareto sets with minimal redundancy, outperforming naive or grid-only approaches in head-to-head computational studies (Williams et al., 2024).
- SAT and Model Sampling: In weighted model counting and sampling for probabilistic reasoning, hardware verification, and constraint satisfaction, distribution-aware and complexity-biased sampling algorithms scale to tens of thousands of variables with tight empirical agreement to theoretical frequency guarantees (Chakraborty et al., 2014).
- Graph Machine Learning: Weighted-vertex sampling via alias-structures is employed in large-scale graph embedding and recommendation (e.g., MovieLens, KKBOX), yielding both higher-quality embeddings and improved computational efficiency compared to uniform schemes (Chen et al., 2017).
- Kernel Approximation: Leverage-based weighted sampling in Random Fourier Features matches the statistical guarantees of standard kernel approximation while dramatically reducing computation time (Liu et al., 2019); an importance-weighted RFF sketch follows this list.
Typically, such strategies not only improve computational efficiency but also attain provable representativity or unbiasedness, provided resource thresholds and stopping criteria are tuned according to theoretical guidance.
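To illustrate the weighted-frequency idea behind leverage-based Random Fourier Features, the sketch below draws frequencies from a broadened Gaussian proposal and importance-reweights the features so that feature inner products remain unbiased kernel estimates. The Gaussian proposal is a stand-in assumption for a true leverage-weighted density, and the sketch does not reproduce the estimator of Liu et al. (2019).

```python
import numpy as np

def gaussian_kernel(X, Y, lengthscale=1.0):
    """Exact Gaussian (RBF) kernel, for reference."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * lengthscale ** 2))

def weighted_rff(X, m, lengthscale=1.0, proposal_scale=1.5, rng=None):
    """Random Fourier features with frequencies drawn from a broadened
    Gaussian proposal q and importance-reweighted by sqrt(p/q)/sqrt(m),
    so that phi(X) @ phi(Y).T is an unbiased estimate of the RBF kernel."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    p_std = 1.0 / lengthscale        # spectral density of the RBF kernel is N(0, p_std^2 I)
    q_std = proposal_scale * p_std   # broadened proposal, standing in for a leverage density
    omega = rng.normal(0.0, q_std, size=(m, d))
    # Log importance ratio p(omega)/q(omega) for isotropic Gaussians (constants cancel).
    log_p = -0.5 * (omega ** 2).sum(1) / p_std ** 2 - d * np.log(p_std)
    log_q = -0.5 * (omega ** 2).sum(1) / q_std ** 2 - d * np.log(q_std)
    scale = np.sqrt(np.exp(log_p - log_q) / m)
    proj = X @ omega.T               # (n, m) random projections
    return np.hstack([scale * np.cos(proj), scale * np.sin(proj)])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    K_exact = gaussian_kernel(X, X)
    Phi = weighted_rff(X, m=2000, rng=0)
    K_approx = Phi @ Phi.T
    print("max abs error:", np.abs(K_exact - K_approx).max())
```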
6. Trade-offs, Interpretation, and Practical Guidance
Implementing complexity-based weighted sampling involves explicit resource–accuracy trade-offs and parameter choices:
- Communication vs. Sample Quality: In streaming/reservoir sampling, increasing the sample size $s$ relative to the number of sites $k$ reduces communication cost, but only via slow logarithmic effects. Level-based batching is essential for handling heavy-tailed weight distributions (Jayaram et al., 2019).
- Adaptive Refinement Thresholds: In weighted-sum scalarisation, the choice of gap tolerance $\epsilon$ and redundancy threshold governs the exploration-vs-exploitation balance; empirical evidence suggests that starting with coarse grids and refining only high-gap regions is nearly optimal (Williams et al., 2024).
- Tilt and Partitioning: In hash-based SAT sampling, complexity scales linearly or logarithmically with tilt, depending on the weight representation; careful selection of partition parameters is required for efficiency and success probability (Chakraborty et al., 2014).
- Variance–Cost in MCMC: The per-iteration budget in variable-complexity weighted Gibbs sampling controls the variance–time trade-off, with empirical studies indicating practical budget settings that balance resource usage and accuracy in high dimensions (Truong, 2023).
In summary, complexity-based weighted sampling strategies generalize and subsume fixed, uniform approaches by responding intelligently to provably “hard” structure in the weighting or solution space, achieving both optimality in resource metrics and practical advantage in real-world deployments across disciplines.