Exploratory Sampling (ESamp) Overview

Updated 4 July 2026

Exploratory Sampling (ESamp) is a guided sampling strategy that targets underrepresented but significant regions in structured search spaces.
It and closely related methods are applied in domains such as large language model decoding, inverse learning, and robotic recovery to enhance diversity and robustness.
Depending on the task, these methods aim to preserve multimodality, neighborhood structure, or calibrated uncertainty by tailoring the sampling rule to task-specific constraints.

Exploratory Sampling (ESamp) denotes a class of sampling procedures in which the sampling rule is used to improve exploration of a structured space rather than merely to generate interchangeable random draws. In the most explicit recent usage, ESamp is a test-time decoding method for LLMs that reweights token candidates by a latent novelty signal derived from depth-wise prediction error, with the stated aim of producing semantically diverse generations rather than only lexical variation (Zeng et al., 27 Apr 2026). Across adjacent literatures, closely related mechanisms appear in inverse learning, interactive pattern mining, landscape analysis, network traversal, sequential balanced sampling, posterior uncertainty propagation, and robotic data augmentation, where the shared concern is to preserve multimodality, neighbourhood structure, user interest, calibrated uncertainty, or recovery behavior under constrained sample budgets (Zhang et al., 2022).

1. Terminological range and scope

The term is not used uniformly across the literature. One paper explicitly introduces Exploratory Sampling (ESamp) as a decoding approach for LLMs (Zeng et al., 27 Apr 2026). Other papers use neighboring but non-identical formulations. In network analysis, the relevant category is exploration-based sampling, defined by traversal through local connectivity rather than independent node or edge selection; representative methods include random walks, Metropolis–Hastings walks, and snowball sampling (Nguyen, 24 Apr 2025). In ensemble learning, Evolutionary Sampling (ES) denotes a genetic-algorithm search over training-set samples or feature subspaces and is explicitly distinguished from Exploratory Sampling (Nisar et al., 2016).

This terminological dispersion matters because the objects being explored differ sharply across domains. In the large-language-model setting, the explored object is a decoding trajectory in latent representation space. In landscape analysis, it is a sequence of sampled candidate solutions over a bounded search space. In inverse learning, it is the posterior support of a one-to-many inverse map. In pattern mining, it is a constrained combinatorial pattern family. In robotics, it is an out-of-distribution recovery manifold induced by policy rollouts. This suggests that ESamp is best understood as a research motif—guided sampling for exploration under task-specific constraints—rather than a single universally standardized algorithm.

2. ESamp as latent-distilled decoding for LLMs

In "LLMs Explore by Latent Distilling" (Zeng et al., 27 Apr 2026), ESamp is defined as a test-time decoding method that explicitly encourages semantic diversity during generation. The method is motivated by the observation that standard stochastic decoding often yields surface-form variation while remaining trapped in a small set of underlying reasoning paths. ESamp addresses this by training a lightweight Latent Distiller online during decoding to predict the final-layer hidden representation from an early-layer hidden representation. If $h_t^1$ is the hidden state after the first transformer layer and $h_t^L$ is the final-layer hidden state, the Distiller predicts

$\hat{h}_t^L = f_\phi(h_t^1),$

and is trained with the mean-squared-error objective

$\mathcal{L}(\phi) = \frac{1}{|B|}\sum_{i \in B}\|h_{t,i}^L - f_\phi(h_{t,i}^1)\|_2^2.$
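The online training loop can be sketched as follows. This is a minimal sketch under assumptions not taken from the paper: a linear map stands in for the 2-layer SwiGLU Distiller, hidden states are synthetic vectors generated by a hypothetical fixed map `T`, and plain SGD replaces whatever optimizer the authors use. In ESamp, the error that such a distiller still makes on a new state is what serves as the novelty signal.

```python
import random

def predict(A, h1):
    # Linear stand-in for the Distiller f_phi: h_hat_L = A @ h1.
    return [sum(a * x for a, x in zip(row, h1)) for row in A]

def batch_mse(A, batch):
    # (1/|B|) * sum_i ||h_L,i - f_phi(h_1,i)||_2^2, matching the objective above.
    return sum(
        sum((y - p) ** 2 for y, p in zip(h_L, predict(A, h1)))
        for h1, h_L in batch
    ) / len(batch)

def sgd_step(A, batch, lr):
    # One gradient step on batch_mse; A is updated in place.
    grad = [[0.0] * len(row) for row in A]
    for h1, h_L in batch:
        for i, (y, p) in enumerate(zip(h_L, predict(A, h1))):
            r = -2.0 * (y - p) / len(batch)  # d(batch_mse)/d(prediction_i)
            for j, x in enumerate(h1):
                grad[i][j] += r * x
    for i, row in enumerate(A):
        for j in range(len(row)):
            row[j] -= lr * grad[i][j]

def simulate(steps=300, dim=3, batch_size=8, lr=0.1, seed=0):
    rng = random.Random(seed)
    # Hypothetical map T from early-layer to final-layer states (unknown to the distiller).
    T = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
    A = [[0.0] * dim for _ in range(dim)]  # distiller starts uninformed

    def draw_batch():
        states = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(batch_size)]
        return [(h1, predict(T, h1)) for h1 in states]

    probe = draw_batch()               # held-out states for measuring error
    before = batch_mse(A, probe)
    for _ in range(steps):
        sgd_step(A, draw_batch(), lr)  # online update on freshly generated states
    return before, batch_mse(A, probe)

before, after = simulate()
print(f"probe MSE before: {before:.4f}, after: {after:.6f}")
```

In this noiseless toy the error shrinks toward zero; with a real model, the residual stays large on states unlike those seen so far, which is the behavior the novelty signal relies on.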

The prediction error becomes a novelty signal. Let $\pi_{\mathrm{ref}}$ be the base model’s token distribution and $q_{\mathrm{dist}}$ the token distribution induced by the Distiller’s predicted hidden state. ESamp defines the intrinsic reward for token $z$ under state $s$ as

$r(s,z) = \log \pi_{\mathrm{ref}}(z \mid s) - \log q_{\mathrm{dist}}(z \mid s),$

and plugs this into a KL-regularized policy objective. The resulting decoding rule is

$\pi_{\mathrm{new}}(z \mid s) \propto \pi_{\mathrm{ref}}(z \mid s)\exp(\beta r(s,z)) \propto \frac{\pi_{\mathrm{ref}}(z \mid s)^{1+\beta}}{q_{\mathrm{dist}}(z \mid s)^\beta}.$
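A minimal sketch of the decoding rule over a toy three-token vocabulary, assuming `p_ref` and `q_dist` are already-normalized probability vectors (the values are illustrative, not from the paper). It applies the exponential tilt $\pi_{\mathrm{ref}}\exp(\beta r)$ in log space and renormalizes:

```python
import math

def normalize_log_weights(log_w):
    # Numerically stable softmax over unnormalized log-weights.
    m = max(log_w)
    exps = [math.exp(v - m) for v in log_w]
    total = sum(exps)
    return [e / total for e in exps]

def esamp_reweight(p_ref, q_dist, beta):
    # pi_new(z) ∝ pi_ref(z) * exp(beta * r(z)), with r(z) = log pi_ref(z) - log q_dist(z).
    log_w = [
        math.log(p) + beta * (math.log(p) - math.log(q))
        for p, q in zip(p_ref, q_dist)
    ]
    return normalize_log_weights(log_w)

p_ref = [0.5, 0.3, 0.2]   # base model's next-token distribution (toy values)
q_dist = [0.7, 0.2, 0.1]  # distribution induced by the Distiller's prediction (toy values)
p_new = esamp_reweight(p_ref, q_dist, beta=1.0)
print([round(p, 3) for p in p_new])  # ≈ [0.296, 0.373, 0.331]
```

With $\beta = 1$, the token the Distiller already predicted confidently (index 0) loses probability mass, while the tokens it underpredicted gain mass; setting $\beta = 0$ recovers $\pi_{\mathrm{ref}}$.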

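In logit space, writing $W$ for the unembedding matrix (rows $w_z$) through which both $h_t^L$ and $\hat{h}_t^L$ are mapped to token logits, so that $\ell_{\mathrm{ref}}(z) - \ell_{\mathrm{dist}}(z) = w_z^\top (h_t^L - \hat{h}_t^L)$, the update takes the form

$\ell_{\mathrm{new}}(z) = \ell_{\mathrm{ref}}(z) + \beta\, w_z^\top (h_t^L - \hat{h}_t^L) = \ell_{\mathrm{ref}}(z) + \beta\, \|w_z\|_2\, \|h_t^L - \hat{h}_t^L\|_2 \cos\theta_z,$

so the exploration term is aligned with the latent error vector $e_t = h_t^L - \hat{h}_t^L$, where $\theta_z$ is the angle between $w_z$ and $e_t$. The paper interprets $\|e_t\|_2$ as a novelty magnitude and the cosine term $\cos\theta_z$ as a semantic-direction factor.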
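The logit-space identity can be checked numerically. The sketch below assumes that both distributions are produced by one shared, hypothetical unembedding matrix `W` (rows $w_z$, no bias), with toy two-dimensional hidden states; it compares the logit update against the probability-space rule and splits each token's bonus into novelty magnitude, weight norm, and cosine.

```python
import math

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def esamp_logits(W, h_L, h_hat, beta):
    # l_new(z) = l_ref(z) + beta * w_z . e_t, with e_t = h_L - h_hat (latent error vector).
    e = [a - b for a, b in zip(h_L, h_hat)]
    return [
        l_ref + beta * sum(w * ei for w, ei in zip(w_z, e))
        for l_ref, w_z in zip(matvec(W, h_L), W)
    ]

def novelty_and_direction(w_z, e):
    # Factor w_z . e as ||e|| (novelty) * ||w_z|| * cos(theta_z) (semantic direction).
    ne = math.sqrt(sum(x * x for x in e))
    nw = math.sqrt(sum(x * x for x in w_z))
    dot = sum(a * b for a, b in zip(w_z, e))
    cos = dot / (ne * nw) if ne > 0 and nw > 0 else 0.0
    return ne, nw, cos

# Toy values (hypothetical): 3-token vocabulary, 2-dimensional hidden states.
W = [[1.0, 0.0], [0.0, 1.0], [0.7, -0.7]]
h_L = [0.9, 0.4]    # final-layer state
h_hat = [1.0, 0.1]  # Distiller's prediction of that state
beta = 0.8

p_new = softmax(esamp_logits(W, h_L, h_hat, beta))
p_ref = softmax(matvec(W, h_L))
q_dist = softmax(matvec(W, h_hat))
tilted = [p ** (1 + beta) / q ** beta for p, q in zip(p_ref, q_dist)]
p_check = [t / sum(tilted) for t in tilted]  # probability-space rule, renormalized
```

Because log-normalizers are constant across tokens, the two routes agree to floating-point precision.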

The implementation is explicitly asynchronous. The Distiller is a 2-layer MLP with gated SwiGLU blocks, hidden size 384, and residual connections. Its forward pass is overlapped with the main transformer computation, and its update is deferred to slack periods. Reported overhead is less than 5% in the worst case, with 1.2% in the optimized release; the main paper reports overheads of 0.3%, 1.81%, and 4.25% across three evaluated settings. Empirically, the method is evaluated on AIME 2024, AIME 2025, GPQA-Diamond, LiveCodeBench v5, and creative-writing continuation. The reported results include improved Pass@$k$ efficiency for reasoning models, strong generalization across mathematics, science, and code generation, a creative-writing comparison of Vendi score, similarity, and perplexity, and Pass@16 reported alongside the Vendi score on AIME25 (Zeng et al., 27 Apr 2026).

3. Guided exploration in inverse learning, interactive mining, and robotic recovery

A closely related use of exploratory sampling appears in "Accelerating Inverse Learning via Intelligent Localization with Exploratory Sampling" (Zhang et al., 2022). There, the task is inverse learning for a forward operator $y = f(x)$ whose inverse is one-to-many, and the challenge is to localize valid inverse solutions without missing disconnected modes. The method, iPage, uses an invertible neural network $g$ to induce an approximate posterior $p(x \mid y)$, but replaces naïve latent sampling with Latin Hypercube Sampling and, in the best-performing variant, maximin LHS in latent space. The latent prior is $z \sim \mathcal{N}(0, I)$, posterior candidates are generated by running the network backward, $\hat{x} = g^{-1}(y, z)$, and maximin LHS selects, among Latin hypercube designs $Z = \{z_1, \dots, z_m\}$, the design with the largest minimum pairwise distance:

$Z^\ast = \arg\max_{Z} \min_{i \neq j} \|z_i - z_j\|_2.$

The sampled posterior candidates serve as “intelligent priors” for subsequent gradient-based localization. On the 2D sinewave benchmark, the paper reports that iPage with simple random sampling still missed some modes, whereas maximin LHS covered all 9 local modes.
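A maximin LHS design can be approximated by a best-of-many search, sketched below. This is an illustrative substitute, not iPage's code: it draws several random Latin hypercube designs, keeps the one with the largest minimum pairwise distance, and then maps it to a standard-Gaussian latent through the inverse normal CDF (the Gaussian prior and the `candidates=50` budget are assumptions).

```python
import math
import random
from statistics import NormalDist

def latin_hypercube(n, d, rng):
    # n points in [0,1)^d with exactly one point per 1/n stratum in every dimension.
    columns = []
    for _ in range(d):
        strata = list(range(n))
        rng.shuffle(strata)  # independent stratum permutation per dimension
        columns.append([(k + rng.random()) / n for k in strata])
    return [[columns[j][i] for j in range(d)] for i in range(n)]

def min_pairwise_distance(points):
    return min(
        math.dist(a, b) for i, a in enumerate(points) for b in points[i + 1:]
    )

def maximin_lhs(n, d, candidates=50, seed=0):
    # Best-of-many search: keep the LHS design with the largest minimum distance.
    rng = random.Random(seed)
    best, best_score = None, -1.0
    for _ in range(candidates):
        design = latin_hypercube(n, d, rng)
        score = min_pairwise_distance(design)
        if score > best_score:
            best, best_score = design, score
    return best, best_score

def to_gaussian_latent(design, eps=1e-12):
    # Map uniform LHS coordinates to N(0, I) samples via the inverse normal CDF.
    std = NormalDist()
    return [[std.inv_cdf(min(max(u, eps), 1 - eps)) for u in row] for row in design]

design, score = maximin_lhs(12, 2)
latent = to_gaussian_latent(design)  # candidates to push through the inverse network
```

The stratification guarantees marginal coverage in every latent dimension, and the maximin criterion discourages clumps, which is the property that helps reach isolated modes.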

In exploratory data mining, "Learning what matters - Sampling interesting patterns" (Dzyuba et al., 2017) formulates exploration as an interactive sampling loop. The LETSIP system combines weighted constrained sampling with preference learning under the Mine, Interact, Learn, Repeat framework. Patterns are sampled with probability proportional to a learned quality function,

$P(p) \propto \varphi(p),$

subject to the pattern constraints. The user provides total orders over small sampled queries; preferences are converted into pairwise examples and fitted with Stochastic Coordinate Descent under an $\ell_1$-regularized logistic loss. The system explicitly treats the exploration–exploitation trade-off as a design variable, introducing both a cell-sampling strategy and query retention. The reported conclusion is that LETSIP provides efficient, interleaved learning and sampling, user-specific anytime exploration, and favorable quality–diversity and exploitation–exploration trade-offs.
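Two ingredients of the loop, turning a user's total order into pairwise training examples and sampling in proportion to a learned quality, can be sketched as below. The logistic quality function, the feature vectors, and the explicit enumeration of patterns are simplifying assumptions; LETSIP itself uses a weighted constrained sampler over the pattern language, and the SCD fitting step is omitted here.

```python
import math
import random

def pairwise_examples(ranked):
    # Turn a user's total order (best first) into pairwise logistic-regression examples.
    examples = []
    for i in range(len(ranked)):
        for j in range(i + 1, len(ranked)):
            diff = [a - b for a, b in zip(ranked[i], ranked[j])]
            examples.append((diff, 1))                # ranked[i] preferred over ranked[j]
            examples.append(([-x for x in diff], 0))  # mirrored example
    return examples

def quality(w, features):
    # Hypothetical learned quality: logistic function of a linear score, in (0, 1).
    return 1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(w, features))))

def sample_patterns(patterns, features, w, k, rng, satisfies=lambda p: True):
    # Draw k patterns with probability proportional to quality, among constraint-satisfying ones.
    allowed = [p for p in patterns if satisfies(p)]
    weights = [quality(w, features[p]) for p in allowed]
    return rng.choices(allowed, weights=weights, k=k)

# Toy pattern language (hypothetical itemsets and features).
patterns = ["ab", "abc", "bd", "cd"]
features = {"ab": [1.0, 0.0], "abc": [1.0, 1.0], "bd": [0.0, 1.0], "cd": [0.5, 0.5]}
w = [2.0, -1.0]
draws = sample_patterns(patterns, features, w, k=5, rng=random.Random(0),
                        satisfies=lambda p: len(p) <= 2)  # example constraint
examples = pairwise_examples([features["ab"], features["cd"], features["bd"]])
```

Sampling rather than taking the top-scoring patterns keeps lower-quality but still plausible patterns in play, which is what lets the learned quality function be corrected by later feedback.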

In robotic manipulation, "RESample: A Robust Data Augmentation Framework via Exploratory Sampling for Robotic Manipulation" (Xue et al., 20 Oct 2025) uses exploratory sampling to synthesize out-of-distribution recovery data from expert-only imitation-learning corpora. An offline RL critic $Q(s, a)$ identifies actions that are likely under the current policy but low-value under the critic. Candidate actions $a_1, \dots, a_K$ are drawn from the policy $\pi(\cdot \mid s)$, and the exploratory subset keeps those whose critic value falls below a threshold $\tau$:

$\mathcal{A}_{\mathrm{exp}}(s) = \{\, a_i : Q(s, a_i) < \tau,\ i = 1, \dots, K \,\}.$

If this set is non-empty, the algorithm executes the highest-likelihood low-value action, thereby steering rollouts into likely failure regions. The generated trajectories are added to the training set as an OOD recovery dataset. On LIBERO-Spatial with a DiT Policy backbone, the ablation reports 68.5 for the raw strategy, 74.8 for augmentation without sampling, 73.0 for augmentation with random sampling, and 76.5 for the proposed method; a separate mixing-ratio study reports the best result at 40% augmented data.
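The action-selection rule reduces to a filter followed by an argmax. In this sketch the policy is a hypothetical one-dimensional Gaussian, the critic is a toy function `q_value`, and the threshold `tau` is an assumed parameter; returning `None` when no candidate qualifies is likewise an assumption about the fallback.

```python
import math

def gaussian_log_prob(a, mean=0.0, std=1.0):
    # Log-density of a toy 1-D Gaussian policy.
    return -0.5 * ((a - mean) / std) ** 2 - math.log(std * math.sqrt(2 * math.pi))

def select_exploratory_action(candidates, log_prob, q_value, tau):
    # Among policy samples with critic value below tau, return the most likely one.
    low_value = [a for a in candidates if q_value(a) < tau]
    if not low_value:
        return None  # assumption: caller falls back to ordinary policy execution
    return max(low_value, key=log_prob)

candidates = [-1.0, -0.2, 0.1, 0.8, 1.5]  # toy actions drawn from the policy
chosen = select_exploratory_action(candidates, gaussian_log_prob,
                                   q_value=lambda a: a, tau=0.5)
print(chosen)  # → 0.1
```

Picking the most likely low-value action, rather than any low-value action, keeps the perturbed rollout close to what the policy would plausibly do, so the resulting recovery data stays near the policy's own state distribution.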

4. Landscape analysis and black-box optimization

In black-box optimization, exploratory sampling is central because landscape features are estimated from a finite set of sample points rather than from an explicit analytic representation. "Exploratory Landscape Analysis is Strongly Sensitive to the Sampling Strategy" (Renau et al., 2020) studies ELA feature approximation for an objective function $f$ from sample sets $X = \{x_1, \dots, x_n\}$ processed by flacco. The study uses 46 features from six groups (dispersion, information content, nearest-better clustering, meta-model, $y$-distribution, and PCA) and compares five sampling designs: pseudo-random uniform sampling with Mersenne Twister, RANDU, LHS, improved LHS, and Sobol' sequences, each evaluated at several sample sizes $n$. The principal result is that larger $n$ reduces the dispersion of estimated features and improves classification accuracy, but feature approximations from different sampling methods do not converge to the same value as $n$ grows. The paper therefore concludes that ELA feature values cannot be interpreted independently of the sampling strategy, and that the sampling method used in training must match that used at deployment. As a side result, classifiers trained on features approximated by Sobol' sequences achieve the highest median and mean accuracy across the reported settings.

The neighbourhood-sampling problem is addressed directly in "Hilbert curves for efficient exploratory landscape analysis neighbourhood sampling" (Pienaar et al., 2024). Information-content features require a sequence of neighbouring solutions, so the sample must be both spatially correlated and broadly distributed. The paper proposes Hilbert space-filling curves both as samplers and as ordering devices. A space-filling curve is defined as a surjective continuous function $H: [0,1] \to [0,1]^d$, and for a Hilbert curve of order $p$ in dimension $d$, the number of vertices is $2^{pd}$. Because that growth is exponential, the practical method randomly sub-samples from the vertices of a Hilbert curve with minimum order 3. For ordering an LHS sample of size $n$, the curve order is set to

$p = \left\lceil \frac{\log_2 n}{d} \right\rceil.$

The paper reports that Hilbert-curve and LHS sampling have similar Hausdorff distances, that random walk performs worst, and that the null hypothesis of no difference is rejected in the reported tests. In classification experiments on the 24 BBOB functions, Hilbert sampling is competitive with LHS and markedly better than random walk; with random forest, the reported overall accuracies are 97.38% for Hilbert, 97.64% for LHS, and 93.60% for random walk. As an ordering strategy, Hilbert ordering is significantly faster than nearest-neighbour ordering without sacrificing the saliency of the extracted information-content features.
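Hilbert ordering of a sample can be sketched in two dimensions with the classic bit-manipulation mapping from grid cells to curve positions. The paper works in general dimension $d$; this 2-D version, the grid-snapping step, and the rule that picks the smallest order with at least as many cells as points are assumptions for illustration.

```python
def _rotate(n, x, y, rx, ry):
    # Rotate/flip a quadrant so the sub-curve has the canonical orientation.
    if ry == 0:
        if rx == 1:
            x, y = n - 1 - x, n - 1 - y
        x, y = y, x
    return x, y

def hilbert_index(n, x, y):
    # Position of grid cell (x, y) along a 2-D Hilbert curve on an n x n grid (n a power of two).
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        x, y = _rotate(n, x, y, rx, ry)
        s //= 2
    return d

def curve_order(num_points, dim=2):
    # Smallest order p >= 1 with 2**(p*dim) >= num_points (assumed rule).
    p = 1
    while 2 ** (p * dim) < num_points:
        p += 1
    return p

def hilbert_order(points):
    # Sort points in [0,1]^2 by the Hilbert index of the grid cell containing them.
    n = 2 ** curve_order(len(points))
    def key(pt):
        x = min(int(pt[0] * n), n - 1)
        y = min(int(pt[1] * n), n - 1)
        return hilbert_index(n, x, y)
    return sorted(points, key=key)

sample = [(0.9, 0.1), (0.1, 0.1), (0.6, 0.7), (0.2, 0.8), (0.4, 0.4)]
walk = hilbert_order(sample)  # neighbouring entries tend to be spatially close
```

Ordering costs one index computation per point plus a sort, which is why it is cheaper than building a nearest-neighbour chain over the whole sample.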

5. Exploration over graphs, streams, posteriors, and adaptive networks

Graph sampling provides another important setting in which exploration is topological rather than Euclidean. "Network Sampling: An Overview and Comparative Analysis" (Nguyen, 24 Apr 2025) formalizes the general objective as producing a sampled subgraph $G_s = (V_s, E_s)$ from a graph $G = (V, E)$ such that properties such as the degree distribution, clustering coefficient, or shortest-path lengths are preserved within acceptable error bounds. The paper distinguishes node-based, edge-based, and exploration-based sampling. The exploration-based class is traversal-driven and includes random walks, Metropolis–Hastings random walk sampling, and snowball sampling. These methods are described as especially useful for large, evolving, and partially observable networks, but also as sensitive to seed choice and biased toward dense or high-centrality regions. The empirical conclusion is that no single method consistently outperforms the others: advanced methods tend to be stronger on static networks, while simpler methods can be more effective on temporal networks.
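The difference between a plain random walk and a Metropolis–Hastings random walk is easiest to see on a star graph, where the hub is the high-centrality region. The sketch uses the standard MHRW acceptance probability $\min(1, \deg(u)/\deg(v))$, which targets a uniform distribution over nodes; the graph, step count, and seed are illustrative choices.

```python
import random

def simple_random_walk(adj, start, steps, rng):
    node, visited = start, [start]
    for _ in range(steps):
        node = rng.choice(adj[node])  # move to a uniformly chosen neighbour
        visited.append(node)
    return visited

def metropolis_hastings_walk(adj, start, steps, rng):
    # Accept a move u -> v with probability min(1, deg(u)/deg(v)); otherwise stay at u.
    node, visited = start, [start]
    for _ in range(steps):
        proposal = rng.choice(adj[node])
        if rng.random() < min(1.0, len(adj[node]) / len(adj[proposal])):
            node = proposal
        visited.append(node)
    return visited

def induced_subgraph(adj, nodes):
    keep = set(nodes)
    return {u: [v for v in adj[u] if v in keep] for u in keep}

# Star graph: hub 0 connected to leaves 1..10.
star = {0: list(range(1, 11)), **{leaf: [0] for leaf in range(1, 11)}}
rng = random.Random(0)
srw = simple_random_walk(star, 0, 20000, rng)
mhrw = metropolis_hastings_walk(star, 0, 20000, rng)
hub_share_srw = srw.count(0) / len(srw)     # about 0.5: the walk alternates hub and leaf
hub_share_mhrw = mhrw.count(0) / len(mhrw)  # close to 1/11 in expectation
sampled = induced_subgraph(star, mhrw)
```

The plain walk spends half its time at the hub, illustrating the bias toward high-centrality regions; the MH correction removes that degree bias at the cost of repeated (rejected) steps.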

Sequential designs arise when the population is not fully known in advance. "Sequential Spatially Balanced Sampling" (Jauslin et al., 2021) proposes a new algorithm that sequentially selects a balanced sample while respecting equal and unequal inclusion probabilities, and that can also produce a spatially balanced sample when coordinates are available. The method updates inclusion probabilities inside a moving pool of currently available units by solving a constrained linear program so that balancing equations remain satisfied when the current unit is set to 0 or 1. If spatial coordinates are observed, the pool is reordered by distance so that selecting a current unit pushes nearby units away from subsequent selection. The paper’s simulation study reports that the proposed method outperforms other methods.

In uncertainty-aware NLP, exploratory sampling is tied to posterior calibration. "Posterior calibration and exploratory analysis for natural language processing models" (Nguyen et al., 2015) argues that posterior probabilities should be evaluated directly for calibration and then propagated into exploratory data analysis. The paper defines an RMS calibration error and estimates it with adaptive binning; for downstream sampling it exploits a coreference model whose antecedent decisions factorize as

$P(a_1, \dots, a_n \mid x) = \prod_{i=1}^{n} P(a_i \mid x),$

where $a_i$ is the antecedent chosen for mention $i$ (an earlier mention or none) and $x$ is the document. Because of this factorization, antecedents can be sampled independently and converted to clusterings by connected components, yielding exact independent samples from the posterior over clusterings. Pairwise coreference probabilities are estimated from 1000 posterior samples, and the paper reports less than 1% calibration error for the coreference model. Those samples are then pushed through a deterministic event-extraction pipeline to obtain posterior distributions over country-level event counts.
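The sampling step can be sketched as follows: each mention independently draws an antecedent (an earlier mention or a new entity), union-find merges linked mentions into clusters, and pairwise coreference probabilities are Monte Carlo frequencies. The input format, where the last entry of each probability list means "new entity", is a hypothetical convention.

```python
import random
from itertools import combinations

def sample_antecedents(antecedent_probs, rng):
    # antecedent_probs[i] has length i + 1: indices 0..i-1 are earlier mentions, index i = new entity.
    return [rng.choices(range(len(p)), weights=p, k=1)[0] for p in antecedent_probs]

def clusters_from_antecedents(ante):
    # Connected components of the antecedent links, via union-find.
    parent = list(range(len(ante)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, a in enumerate(ante):
        if a != i:  # linked to an earlier mention
            parent[find(i)] = find(a)
    groups = {}
    for i in range(len(ante)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

def pairwise_coref_probs(antecedent_probs, num_samples=1000, seed=0):
    # Fraction of posterior samples in which each mention pair ends up in the same cluster.
    rng = random.Random(seed)
    n = len(antecedent_probs)
    counts = {pair: 0 for pair in combinations(range(n), 2)}
    for _ in range(num_samples):
        label = {}
        for cid, members in enumerate(clusters_from_antecedents(sample_antecedents(antecedent_probs, rng))):
            for m in members:
                label[m] = cid
        for i, j in counts:
            if label[i] == label[j]:
                counts[(i, j)] += 1
    return {pair: c / num_samples for pair, c in counts.items()}

toy = [[1.0], [0.6, 0.4], [0.3, 0.5, 0.2]]  # hypothetical antecedent distributions
estimates = pairwise_coref_probs(toy)
```

Because each sample is a complete clustering, any downstream quantity (such as an event count) can be computed per sample, which is how the uncertainty propagates into later analysis.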

A distinct biological formulation appears in "Exploratory Adaptation in Large Random Networks" (Schreier et al., 2016). There, exploratory sampling is implemented as a mismatch-gated random walk in network interaction strengths. The system state obeys

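$\dot{\mathbf{x}} = W\,\phi(\mathbf{x}) - \mathbf{x},$

with phenotype $y$, a readout of the state $\mathbf{x}$, constrained to lie within a comfort zone around a target $y^\ast$. When the mismatch function $M(y)$ is nonzero, the interaction matrix performs stochastic drift,

$\frac{dW_{ij}}{dt} = \sqrt{D}\, M(y)\, \eta_{ij}(t),$

where $\eta_{ij}$ is white noise and $D$ sets the diffusion rate.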

Exploration ceases when a stable attractor satisfies the constraint. The main structural result is that successful convergence in high dimensions requires outgoing network hubs and is enhanced by their auto-regulation.
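A toy simulation of the mismatch-gated drift is sketched below. The $\tanh$ nonlinearity, the linear phenotype readout `b`, the hard 0/1 mismatch, the Euler discretization, and all parameter values are assumptions; whether a given run settles into the comfort zone depends on the network and parameters, and the sketch does not reproduce the paper's hub analysis.

```python
import math
import random

def mismatch(y, y_star, eps):
    # Hypothetical hard-threshold mismatch: 0 inside the comfort zone |y - y*| <= eps, 1 outside.
    return 0.0 if abs(y - y_star) <= eps else 1.0

def state_step(x, W, dt):
    # Euler step of dx/dt = W tanh(x) - x.
    phi = [math.tanh(v) for v in x]
    return [xi + dt * (sum(w * p for w, p in zip(row, phi)) - xi) for xi, row in zip(x, W)]

def drift_step(W, mask, m, D, dt, rng):
    # Mismatch-gated random walk on existing interactions only; W is frozen when m == 0.
    if m == 0.0:
        return W
    scale = math.sqrt(D * dt) * m
    return [
        [w + scale * rng.gauss(0.0, 1.0) if on else w for w, on in zip(row, mrow)]
        for row, mrow in zip(W, mask)
    ]

def explore(N=10, p_edge=0.3, y_star=1.0, eps=0.3, D=0.1, dt=0.05, steps=4000, seed=0):
    rng = random.Random(seed)
    mask = [[rng.random() < p_edge for _ in range(N)] for _ in range(N)]
    W = [[rng.gauss(0.0, 1.0) if mask[i][j] else 0.0 for j in range(N)] for i in range(N)]
    b = [rng.gauss(0.0, 1.0) / math.sqrt(N) for _ in range(N)]  # assumed linear readout
    x = [rng.gauss(0.0, 1.0) for _ in range(N)]
    y = m = None
    for _ in range(steps):
        x = state_step(x, W, dt)
        y = sum(bi * xi for bi, xi in zip(b, x))
        m = mismatch(y, y_star, eps)
        W = drift_step(W, mask, m, D, dt, rng)
    return y, m, W

final_y, final_m, final_W = explore()
```

The gating is the essential design choice: noise acts on the interactions only while the phenotype is out of bounds, so a configuration that lands on a satisfying stable attractor is no longer perturbed.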

6. Methodological themes, distinctions, and recurrent misconceptions

A recurrent misconception is to equate exploratory sampling with unguided randomness. The cited literature points in the opposite direction. ESamp for LLMs uses a latent novelty signal derived from structured depth-wise prediction error. iPage uses space-filling maximin LHS in latent space rather than naïve random draws. LETSIP learns a user-specific sampling distribution from ordered feedback. RESample filters policy actions through an offline critic before rollout. ELA work shows that even apparently simple choices such as LHS, Sobol', or uniform pseudo-random sampling induce systematically different feature distributions. This suggests that exploratory sampling is typically about how to bias sampling, not about replacing structure with noise.

A second misconception is to treat all named variants as interchangeable. "Evolutionary Sampling" searches the space of training subsets or feature subspaces with a standard genetic algorithm implemented in DEAP, using population size 30, ensemble size 10, and 30 generations, and evaluates candidate ensembles with fitness functions FEMPO, FEMPT, and FEGT (Nisar et al., 2016). "SWAY," by contrast, is a divide-and-conquer baseline optimizer for search-based software engineering that begins with 10,000 randomly generated candidates and recursively prunes them by geometric splitting, needing only $\mathcal{O}(\log N)$ candidate evaluations for $N$ initial candidates in the paper's account (Chen et al., 2016). Both are sampling-centered exploration strategies, but neither is the same method as ESamp in latent-distilled LLM decoding.
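SWAY's recursive splitting can be sketched as follows. This is a simplified single-objective version under stated assumptions: `better` is a hypothetical comparator that stands for evaluating the two poles, candidates are projected onto the axis between two distant points via the cosine rule, and recursion stops at $\sqrt{N}$ survivors. Only two evaluations are spent per level, which is where the logarithmic budget comes from.

```python
import math
import random

def sway(candidates, better, enough, rng, evaluations=0):
    # Recursively keep the half of the candidates nearer the better pole.
    # Returns (surviving candidates, number of objective evaluations spent).
    if len(candidates) <= enough:
        return candidates, evaluations
    anchor = rng.choice(candidates)
    east = max(candidates, key=lambda c: math.dist(c, anchor))  # far from a random point
    west = max(candidates, key=lambda c: math.dist(c, east))    # far from east
    c = math.dist(east, west)

    def projection(p):
        # Cosine-rule position of p on the east-west axis (0 at east, c at west).
        a, b = math.dist(p, east), math.dist(p, west)
        return (a * a + c * c - b * b) / (2 * c) if c > 0 else 0.0

    ordered = sorted(candidates, key=projection)
    half = len(ordered) // 2
    # Only the two poles are evaluated at each level.
    keep = ordered[:half] if better(east, west) else ordered[half:]
    return sway(keep, better, enough, rng, evaluations + 2)

rng = random.Random(0)
pool = [[rng.random() for _ in range(3)] for _ in range(10_000)]

def better(a, b):
    return sum(a) < sum(b)  # hypothetical single objective: minimize the coordinate sum

survivors, cost = sway(pool, better, enough=int(math.sqrt(len(pool))), rng=rng)
print(len(survivors), cost)  # cost → 14
```

With 10,000 candidates and a stopping size of 100, seven halvings are needed, so the call spends 14 evaluations regardless of which halves are kept.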

A third recurring issue is deployment mismatch. In ELA, classifiers work best when training and testing use the same sampling strategy, and cross-strategy mismatch can substantially reduce accuracy (Renau et al., 2020). In network sampling, the best method depends on whether the target is static or temporal and on which graph properties must be preserved (Nguyen, 24 Apr 2025). In robotics, the amount of augmented exploratory data must be tuned, with 40% reported as best in the studied setting (Xue et al., 20 Oct 2025). Inference from these studies should therefore distinguish between exploration quality, preservation target, and downstream task. The preservation target varies by domain: semantic diversity with coherence in text generation, mode coverage in inverse problems, quality-diversity in pattern mining, neighbourhood structure in landscape analysis, calibrated uncertainty in NLP, component structure in networks, balance and spatial spread in sequential sampling, and recovery behavior in robotic control.

Taken together, these works indicate that exploratory sampling is a domain-general methodological pattern whose most precise implementations are domain-specific. The exact object being explored may be a token distribution, a latent posterior, a pattern language, a graph, a stream, or a dynamical system. What remains stable across these settings is the use of task-aware sampling to expose underrepresented but consequential regions of the search space while preserving some notion of usefulness, validity, or interpretability.