Unsupervised Sampling Strategies

Updated 17 April 2026

Unsupervised sampling strategies are algorithmic methods that select informative and diverse data points without labeled feedback.
Techniques include surprise sampling based on gradient norms, diversity sampling with DPPs, and learned proposals in particle filtering to minimize estimator variance.
These strategies are applied in large-scale learning, data summarization, and contrastive representation learning, though challenges such as computational cost and hyperparameter sensitivity remain.

Unsupervised sampling strategies are algorithmic and statistical methods for selecting data subsets or sampling distributions in contexts where ground-truth labels, supervised feedback, or response variables are absent. Their goal is to maximize informativeness, diversity, efficiency, or estimation quality using only the intrinsic structure of the data or the workings of the unsupervised task’s optimization process. These strategies are essential in large-scale learning, data summarization, model approximation, representation learning, and structure discovery, spanning applications from kernel methods to dynamical system inference and contrastive representation learning.

1. Core Principles of Unsupervised Sampling

Unsupervised sampling is defined by the absence of explicit label or response feedback. Rather, data selection is governed by criteria derived from:

Statistical properties (e.g., variance, leverage, surprise measures)
Geometric structure (e.g., clustering, diversity)
Task-driven surrogates (e.g., loss gradients, modeled uncertainty)
Synthetic augmentation and feature manipulations

Optimal sampling policies are often justified through asymptotic efficiency, error minimization, or control of downstream estimator variance. Theoretically motivated strategies such as “surprise sampling” reduce variance by prioritizing points where the loss gradient norm is large relative to a local metric, while diversity-maximization schemes such as determinantal point processes (DPPs) promote spread in the sampled subset to avoid redundancy and increase coverage (Shen et al., 2020; Fanuel et al., 2020).

2. Adaptive and Learned Proposals in Particle Filtering

Sequential Monte Carlo methods, particularly particle filters, are sensitive to the choice of proposal (sampling) distribution. Poorly matched, easy-to-sample proposals frequently cause weight degeneracy, in which a single particle carries most of the total weight and the variance of the estimate is inflated. “Unrolling Particles” introduces an unsupervised learning approach in which the sequence of proposal kernels $q_t(x_t|x_{t-1},y_t)$ is parameterized as a multivariate normal whose mean and covariance are both produced by neural networks (Gama et al., 2021). The core innovation is an algorithm-unrolling view: each time step is formulated as a differentiable layer, so backpropagation can adjust the kernel, without labels, to reduce the skew of the normalized importance weights through the objective $J = \sum_{t=0}^{T-1} \sum_{k=1}^{K} \log \tilde w_t^{(k)}, \quad \tilde w_t^{(k)} = \frac{w_t^{(k)}}{\sum_m w_t^{(m)}}$, where $w_t^{(k)}$ is the weight of particle $k$ at step $t$. Because the normalized weights at each step sum to one, $J$ is largest when they are uniform and decreases without bound as the weight concentrates on a single particle.

Learning is fully unsupervised: only the model likelihoods $p(y_t|x_t)$ and the available observation sequence are required. The approach yields superior accuracy across linear-Gaussian, nonlinear, and model-mismatch regimes compared to classical, hand-designed sampling distributions (Gama et al., 2021).
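The weight-based objective can be illustrated numerically. The following is a minimal sketch, not the training procedure of Gama et al.: it only evaluates $J$ for two hand-made weight configurations and compares it with the effective sample size, a standard degeneracy diagnostic that is an addition here and does not appear in the text above. The function names and the toy log-weights are assumptions made for illustration.

```python
import numpy as np

def normalized_weights(log_w):
    """Normalize log-weights per time step (one row per step) with a stable softmax."""
    shifted = log_w - log_w.max(axis=1, keepdims=True)
    w = np.exp(shifted)
    return w / w.sum(axis=1, keepdims=True)

def log_weight_objective(log_w):
    """J = sum_t sum_k log(w~_t^(k)), computed from unnormalized log-weights."""
    shifted = log_w - log_w.max(axis=1, keepdims=True)
    log_norm = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(log_norm.sum())

def effective_sample_size(log_w):
    """ESS_t = 1 / sum_k (w~_t^(k))^2; equals K for uniform weights, 1 when degenerate."""
    w = normalized_weights(log_w)
    return 1.0 / (w ** 2).sum(axis=1)

T, K = 3, 4                                         # toy horizon and particle count
uniform = np.zeros((T, K))                          # equal weights at every step
skewed = np.tile([0.0, -8.0, -8.0, -8.0], (T, 1))   # one dominant particle per step

J_uniform = log_weight_objective(uniform)           # equals -T * K * log(K)
J_skewed = log_weight_objective(skewed)             # much more negative
ess_uniform = effective_sample_size(uniform)
ess_skewed = effective_sample_size(skewed)
```

In this toy setting the uniform configuration attains the largest possible value of $J$ and an effective sample size of $K$, while the skewed configuration has a far lower $J$ and an effective sample size close to one.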

3. Subset Selection: Surprise and Diversity Sampling

Subsampling from large unsupervised datasets is often required for scalability or efficient estimation. Surprise sampling, based on gradient-norm scoring, selects points in proportion to their influence on the estimator’s variance, as measured by the Hessian-weighted gradient length: $\pi^*(d) = \min\left\{ c \|A^{-1/2}g(d;\theta^*)\|, 1 \right\}$, where $g(d;\theta) = \nabla_\theta \ell(d;\theta)$, $A = \mathbb{E}[\nabla^2_\theta \ell(D;\theta^*)]$, and the constant $c$ sets the overall sampling rate (Shen et al., 2020). This method provably minimizes asymptotic variance among all schemes with the same sampling rate, and works generically for any unsupervised learning problem of the form $\min_\theta \mathbb{E}\,\ell(D;\theta)$, including PCA and $k$-means.
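As a rough illustration of the sampling rule $\pi^*(d)$, the sketch below applies it to the simplest loss of this form, $\ell(d;\theta) = \tfrac12\|d-\theta\|^2$ (mean estimation, i.e. $k$-means with a single center), for which $g = \theta - d$ and $A = I$. The pilot estimate from a small uniform subset, the bisection used to pick $c$, and the inverse-probability reweighting are assumptions of this sketch rather than details taken from Shen et al.; the function assumes at least one nonzero gradient.

```python
import numpy as np

def surprise_probabilities(G, A, rate):
    """pi_i = min(c * ||A^{-1/2} g_i||, 1), with c chosen so that mean(pi) ~= rate."""
    # ||A^{-1/2} g||^2 = g^T A^{-1} g, so a linear solve replaces the matrix square root.
    scores = np.sqrt(np.einsum("ij,ij->i", G, np.linalg.solve(A, G.T).T))
    lo, hi = 0.0, 1.0 / scores[scores > 0].min()   # at c = hi every nonzero score is clipped to 1
    for _ in range(100):                            # bisection: mean(pi) is nondecreasing in c
        c = 0.5 * (lo + hi)
        if np.minimum(c * scores, 1.0).mean() < rate:
            lo = c
        else:
            hi = c
    return np.minimum(hi * scores, 1.0)

rng = np.random.default_rng(0)
data = rng.normal(loc=[1.0, -2.0], scale=1.0, size=(20000, 2))

theta_pilot = data[:200].mean(axis=0)   # crude pilot estimate standing in for theta*
G = theta_pilot - data                  # per-point gradients of 0.5 * ||d - theta||^2
A = np.eye(2)                           # Hessian of that loss

pi = surprise_probabilities(G, A, rate=0.1)
keep = rng.random(len(data)) < pi       # independent Bernoulli(pi_i) selection
w = 1.0 / pi[keep]                      # inverse-probability weights

# Weighted estimating equation sum_i w_i * (theta - d_i) = 0 over the kept points.
theta_hat = (w[:, None] * data[keep]).sum(axis=0) / w.sum()
```

Points close to the pilot estimate have small gradients and are rarely kept, while outlying points are kept more often and down-weighted accordingly.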

Diversity-based approaches, particularly those using Determinantal Point Processes (DPPs), select subsets with maximized volume in feature or kernel space, ensuring spread (leverage score balancing) and tightly controlling approximation error in kernel methods such as the Nyström approximation (Fanuel et al., 2020). Greedy high-determinant swap heuristics are used for efficient DPP sampling in large-scale settings.
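The link between determinant-driven selection and Nyström error can be sketched as follows. This is not exact DPP sampling and not the swap heuristic mentioned above: it uses plain greedy determinant maximization (equivalent to a pivoted Cholesky factorization) as a simple deterministic stand-in. The RBF kernel, its bandwidth, the toy data, and the "clumped" baseline are all assumptions chosen for illustration.

```python
import numpy as np

def rbf_kernel(X, gamma=0.5):
    """Gaussian kernel matrix exp(-gamma * ||x_i - x_j||^2)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def greedy_logdet_subset(K, m):
    """Greedily add the index that most increases det(K[S, S]) (pivoted Cholesky)."""
    n = K.shape[0]
    selected = []
    residual = K.diagonal().copy()   # Schur-complement diagonal given the current subset
    C = np.zeros((n, m))             # partial Cholesky factor, one column per pick
    for j in range(m):
        i = int(np.argmax(residual))
        selected.append(i)
        C[:, j] = (K[:, i] - C[:, :j] @ C[i, :j]) / np.sqrt(residual[i])
        residual = residual - C[:, j] ** 2
        residual[selected] = -np.inf  # never pick the same index twice
    return selected

def nystrom_trace_error(K, S):
    """trace(K - K[:, S] K[S, S]^+ K[S, :]) for a landmark set S."""
    K_hat = K[:, S] @ np.linalg.pinv(K[np.ix_(S, S)]) @ K[S, :]
    return float(np.trace(K - K_hat))

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(200, 2))
K = rbf_kernel(X)
m = 20

S_greedy = greedy_logdet_subset(K, m)
S_clumped = list(np.argsort((X ** 2).sum(axis=1))[:m])   # m points nearest the origin

err_greedy = nystrom_trace_error(K, S_greedy)
err_clumped = nystrom_trace_error(K, S_clumped)
```

The greedy landmarks spread over the whole domain, whereas the clumped landmarks are redundant with one another, which is the kind of redundancy that diversity sampling is designed to avoid.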

4. Structure-Aware and Dynamic Sampling Algorithms

For problems involving underlying latent structures (e.g., clustering, model fitting), efficient one-shot and dynamic sampling strategies have been derived:

The “one-time grab” algorithm computes the minimal uniform sample size required (via hypergeometric tail bounds) to ensure, with a prescribed probability, that every latent structure is hit with at least a prescribed minimum number of inliers, forming a reliable coreset for downstream inference (Jaberi et al., 2018). This reduces sample complexity by factors of 2–5 compared to prior methods.
In adaptive imaging, U-SLADS fits a hierarchical GMM to measured intensities in dynamic imaging tasks (e.g., dendritic growth). Each batch of new samples is chosen to most effectively propagate and refine the current mixture components, focusing acquisition on regions of high variance or scientific interest without supervision (Zhang et al., 2018).
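The sample-size computation behind the one-time grab idea can be sketched with exact hypergeometric tails. The version below is an illustrative simplification, not the bound derived by Jaberi et al.: it assumes the structure sizes are known and combines the per-structure failure probabilities with a union bound. The names `min_sample_size`, `k_min`, and `delta`, and the example sizes, are hypothetical.

```python
from math import comb

def hypergeom_cdf(k, N, n_i, m):
    """P(X <= k) for X ~ Hypergeometric(population N, n_i inliers, m draws without replacement)."""
    total = comb(N, m)
    upper = min(k, m)
    return sum(comb(n_i, j) * comb(N - n_i, m - j) for j in range(upper + 1)) / total

def miss_probability(N, structure_sizes, k_min, m):
    """Union bound on the chance that some structure receives fewer than k_min inliers."""
    return sum(hypergeom_cdf(k_min - 1, N, n_i, m) for n_i in structure_sizes)

def min_sample_size(N, structure_sizes, k_min, delta):
    """Smallest uniform sample size m whose union-bound miss probability is <= delta."""
    for m in range(k_min, N + 1):
        if miss_probability(N, structure_sizes, k_min, m) <= delta:
            return m
    return None   # unreachable target, e.g. a structure with fewer than k_min points

N = 1000                       # total number of points
sizes = [100, 150, 200]        # assumed inlier counts of three latent structures
k_min, delta = 5, 0.05         # at least 5 inliers each, failure probability at most 5%

m_star = min_sample_size(N, sizes, k_min, delta)
```

Because the miss probability decreases as the sample grows, the first `m` that satisfies the bound is the minimal one under this union-bound criterion; the smallest structure dominates the result.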

5. Sampling in Contrastive and Representation Learning

Contrastive learning relies critically on the selection and construction of positive and negative pairs, particularly in graph and text domains. Unsupervised strategies address several issues:

In sentence representations, ClusterNS uses per-batch K-means clustering to identify hard negatives (nearest neighbor clusters) and false negatives (samples within the same cluster), applying dedicated loss components for both (Deng et al., 2023).
In graph unsupervised learning, negative sampling is refined via pseudo-label margin filtering to avoid class collision, DPPs for diversity among negatives, and explicit importance weighting of hard negatives (Chen et al., 2021). MeCoLe further introduces virtual node generation: negatives are not merely random distant points, but synthetic constructions differing only in class-dependent features, thereby producing hard negative pairs that are maximally informative for partition boundaries (Cui et al., 2024).
In unsupervised speech learning, flattening the sampling distribution over discrete types (Zipf compression) improves pair coverage and overall sample efficiency (Riad et al., 2018).
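The effect of flattening a Zipf-like type distribution, as in the speech example above, can be sketched with a power-law compression $p_i \propto f_i^{\alpha}$. The exponent form and the value $\alpha = 0.5$ are assumptions for illustration; the exact compression function used by Riad et al. is not specified here, and the counts are made up.

```python
import numpy as np

def flattened_type_distribution(counts, alpha):
    """p_i proportional to counts_i ** alpha: alpha=1 keeps the empirical law, alpha=0 is uniform."""
    counts = np.asarray(counts, dtype=float)
    p = counts ** alpha
    return p / p.sum()

def entropy(p):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return float(-(p * np.log(p)).sum())

counts = np.array([1000, 500, 333, 250, 200, 10, 5, 1])   # toy Zipf-like type frequencies

p_empirical = flattened_type_distribution(counts, alpha=1.0)
p_flat = flattened_type_distribution(counts, alpha=0.5)
p_uniform = flattened_type_distribution(counts, alpha=0.0)

# Draw the types from which training pairs would be built under the flattened law.
rng = np.random.default_rng(0)
sampled_types = rng.choice(len(counts), size=1000, p=p_flat)
```

Lowering `alpha` shifts probability mass from the most frequent types toward rare ones, so rare types appear in more sampled pairs.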

6. Advanced Strategies: Rare Event and Trajectory Sampling

Rare event simulation, molecular generation, and implicit function copying require sampling strategies that focus on high-value, low-probability outcomes:

For rare event path sampling, FlowRES learns an unsupervised normalizing flow proposal embedded in a Metropolis–Hastings scheme to generate efficient (nonlocal) proposals while preserving unbiasedness. All training is unsupervised, and the method obviates the need for collective variables or prior data (Asghar et al., 2024).
In molecular diffusion generative models, the StoMax (maximally stochastic) sampler discards ancestral (Markov) mean interpolation and instead re-estimates the clean sample afresh at each step, injecting maximal noise and avoiding the drift and error accumulation of standard samplers. StoMax yields significant improvements in chemical validity and stability of generated molecules (Ni et al., 19 Jun 2025).
In copying classifiers with only black-box oracle access, boundary- and variance-aware (Fast Bayesian) sampling empirically minimizes the number of queries needed to achieve high-fidelity mimics, outperforming naive random or localized Jacobian-based strategies (Unceta et al., 2019).
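The rare-event item above relies on embedding a global proposal in a Metropolis–Hastings scheme so that the chain still targets the correct distribution. The sketch below shows that correction in its simplest form, an independence sampler on a one-dimensional bimodal target. It is only an analogy: a fixed wide Gaussian stands in for the learned normalizing flow, and scalars stand in for paths, so nothing here reproduces FlowRES itself. The target, proposal scale, and chain length are assumptions.

```python
import numpy as np

def log_target(x):
    """Unnormalized log-density of an equal mixture of N(-2, 0.5^2) and N(2, 0.5^2)."""
    return np.logaddexp(-0.5 * ((x + 2.0) / 0.5) ** 2, -0.5 * ((x - 2.0) / 0.5) ** 2)

def log_proposal(x, scale):
    """Log-density, up to a constant, of the global proposal N(0, scale^2)."""
    return -0.5 * (x / scale) ** 2

def independence_mh(n_steps, rng, scale=3.0):
    """Metropolis-Hastings with proposals drawn independently of the current state."""
    x = 0.0
    samples = np.empty(n_steps)
    accepted = 0
    for t in range(n_steps):
        x_new = rng.normal(0.0, scale)
        # Acceptance ratio p(x') q(x) / (p(x) q(x')) keeps the target invariant.
        log_ratio = (log_target(x_new) - log_target(x)
                     + log_proposal(x, scale) - log_proposal(x_new, scale))
        if np.log(rng.random()) < log_ratio:
            x = x_new
            accepted += 1
        samples[t] = x
    return samples, accepted / n_steps

samples, acceptance_rate = independence_mh(20000, np.random.default_rng(0))
fraction_right_mode = float((samples > 0).mean())
```

Because each proposal is nonlocal, the chain jumps directly between the two modes instead of having to cross the low-probability region between them, which a local random-walk proposal would rarely do.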

7. Empirical Impact and Limitations

Unsupervised sampling strategies consistently deliver concrete benefits:

Domain	Strategy Example	Main Benefit
Particle Filtering	Unrolled, learned proposal	Robustness, accuracy
PCA, $k$-means, Regression	Surprise/DPP sampling	5–10$\times$ speedup, same accuracy
Graph contrastive learning	Hard negative sampling	+10–15 ppt clustering accuracy
Diffusion generative models	StoMax	+4–8% stability/validity
Copying predictors	Boundary/Bayesian	Lower sample complexity

Benefits are most pronounced in data-efficient, structure-extraction, or long-tail scenarios, and less so when label or diversity constraints are relaxed or when computational/architectural bottlenecks intervene.

Limitations span computational cost (e.g., inverting Hessians, DPP sampling, and normalizing-flow training), sensitivity to hyperparameters (e.g., leverage cutoffs, batch sizes), and the requirement for meaningful features (clustering-based selection degrades when the input space has poor semantic organization). The theory remains incomplete for sampling in non-Euclidean, conditional, or highly multi-modal settings.

8. Theoretical Foundations and Future Directions

Foundational results demonstrate that many unsupervised sampling strategies attain minimax-optimality under natural constraints:

Surprise sampling minimizes asymptotic variance under Horvitz–Thompson or similar weighted estimators (Shen et al., 2020).
DPP diversity yields tight data-dependent bounds on trace and spectral error for subsampled kernel matrices (Fanuel et al., 2020).
Learned proposal mechanisms in particle filters asymptotically minimize weight degeneracy, outperforming classical minimum-variance kernels when likelihoods are misspecified (Gama et al., 2021).

Emerging trends include hybrid and adaptive samplers interpolating between deterministic and maximally stochastic limits (as in StoMax), unsupervised generative sampling for rare events and long-tail prediction, and efficient negative sampling for high-dimensional contrastive objectives.

Open questions include scaling DPP and hard-negative selection to extreme data regimes, synthesizing theory for adaptive variance schedules in diffusion models, and integrating unsupervised sampling in fully online or streaming settings.

References:

(Gama et al., 2021) "Unrolling Particles: Unsupervised Learning of Sampling Distributions"
(Shen et al., 2020) "Surprise sampling: improving and extending the local case-control sampling"
(Fanuel et al., 2020) "Diversity sampling is an implicit regularization for kernel methods"
(Asghar et al., 2024) "Efficient Rare Event Sampling with Unsupervised Normalising Flows"
(Deng et al., 2023) "Clustering-Aware Negative Sampling for Unsupervised Sentence Representation"
(Ni et al., 19 Jun 2025) "Revisiting Sampling Strategies for Molecular Generation"
(Jaberi et al., 2018) "Sparse One-Time Grab Sampling of Inliers"
(Roy et al., 2022) "Impact of Strategic Sampling and Supervision Policies on Semi-supervised Learning"
(Chen et al., 2021) "Probing Negative Sampling Strategies to Learn Graph Representations via Unsupervised Contrastive Learning"
(Cui et al., 2024) "Unsupervised node clustering via contrastive hard sampling"
(Riad et al., 2018) "Sampling strategies in Siamese Networks for unsupervised speech representation learning"
(Unceta et al., 2019) "Sampling Unknown Decision Functions to Build Classifier Copies"
(Zhang et al., 2018) "U-SLADS: Unsupervised Learning Approach for Dynamic Dendrite Sampling"