Local Entropy Search (LES) Overview
- Local Entropy Search (LES) is an approach that quantifies the log-density of nearby solutions to favor flat, robust basins over isolated sharp optima.
- LES techniques employ Gaussian-averaged free energies and modeled descent sequences to enhance sample efficiency and reduce regret in Bayesian optimization and constraint satisfaction problems.
- By regularizing the optimization landscape, LES improves algorithmic tractability in combinatorial tasks and generalization in deep neural network training.
Local Entropy Search (LES) is a family of algorithms and analysis techniques that leverage local entropy—the log-density or free energy of solutions in a neighborhood of the state or parameter space—to bias search (optimization or sampling) toward “flat” or dense basins, as opposed to isolated sharp optima. LES methodologies have been developed for Bayesian optimization, constraint satisfaction problems, discrete sampling, and deep neural network training, providing both algorithmic tools and theoretical insights into the structure and tractability of complex energy landscapes.
1. Local Entropy: Formalism and Motivations
Local entropy quantifies the log-weighted volume of solutions or low-cost configurations in the proximity of a specified reference point. For a generic differentiable loss $f(x')$ with parameters $x \in \mathbb{R}^d$, the isotropic local entropy is given by a Gaussian-averaged free energy:
$$F(x;\gamma) \;=\; \log \int_{\mathbb{R}^d} \exp\!\Bigl(-f(x') - \tfrac{\gamma}{2}\,\|x - x'\|^2\Bigr)\, dx',$$
or, equivalently, up to an additive constant, as the log of a Gaussian convolution, $F(x;\gamma) = \log\bigl(G_\gamma * e^{-f}\bigr)(x)$, with $G_\gamma$ the Gaussian kernel of variance $\gamma^{-1}$.
In discrete spaces, for example for $x \in \{\pm 1\}^n$ and log-probability $\log \pi(x)$, the local entropy centered at $\tilde{x}$ is:
$$\mathcal{E}(\tilde{x};\eta) \;=\; \log \sum_{x \in \{\pm 1\}^n} \exp\!\Bigl(\log \pi(x) - \tfrac{1}{2\eta}\,\|x - \tilde{x}\|^2\Bigr).$$
These forms measure not only the “depth” of the cost/energy at the reference point but also its “flatness” in the surrounding neighborhood, favoring solutions that remain robust under local perturbations (a minimal Monte Carlo sketch of the continuous form follows the list below). This concept has been exploited across domains:
- Training neural networks to seek flat minima associated with better generalization (Chaudhari et al., 2016, Musso, 2020),
- Designing solvers for CSPs that target ultra-dense solution clusters (Baldassi et al., 2015, Stojnic, 24 Jun 2025),
- Crafting acquisition functions for sample-efficient local Bayesian optimization (Stenger et al., 24 Nov 2025),
- Steering discrete samplers toward “flat modes” in combinatorial optimization or generative modeling (Mohanty et al., 5 May 2025).
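As a concrete illustration of the continuous definition, the following minimal Python sketch estimates $F(x;\gamma)$ by Monte Carlo, drawing the perturbations $x'$ from the Gaussian kernel itself. The toy loss, function names, and sample counts are illustrative assumptions, not taken from any of the cited papers.

```python
import numpy as np

def local_entropy_mc(f, x, gamma, n_samples=10_000, rng=None):
    """Monte Carlo estimate of F(x; gamma) up to an additive constant:
    sampling x' ~ N(x, gamma^{-1} I) absorbs the Gaussian penalty term,
    leaving a log-average of exp(-f(x'))."""
    rng = np.random.default_rng(rng)
    xs = x + rng.standard_normal((n_samples, x.shape[0])) / np.sqrt(gamma)
    log_w = -np.array([f(xi) for xi in xs])
    m = log_w.max()
    return m + np.log(np.mean(np.exp(log_w - m)))   # stable log-mean-exp

if __name__ == "__main__":
    # Toy 1-D loss with two equally deep minima: a wide basin at -2, a sharp one at +2.
    f = lambda z: min((z[0] + 2.0) ** 2, 50.0 * (z[0] - 2.0) ** 2)
    wide, sharp = np.array([-2.0]), np.array([2.0])
    for gamma in (10.0, 1.0):
        print(gamma, local_entropy_mc(f, wide, gamma, rng=0),
                     local_entropy_mc(f, sharp, gamma, rng=0))
    # The wide basin scores higher, and the gap grows as gamma shrinks.
```

Because both toy minima have the same depth, any difference in the estimates is attributable purely to basin width, which is exactly what local entropy is designed to detect.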
2. LES in Bayesian Optimization: Descent-Sequences and Mutual Information
In “Local Entropy Search over Descent Sequences for Bayesian Optimization,” LES is defined as a sample-efficient, local Bayesian optimization strategy that explicitly models the probability distribution over descent sequences produced by a local optimizer (e.g., gradient descent) acting on the Gaussian process (GP) surrogate posterior (Stenger et al., 24 Nov 2025).
Central to this approach is the acquisition function
$$I\bigl((x, y);\, Q_{x_0} \mid \mathcal{D}_t\bigr),$$
where $Q_{x_0}$ is the random descent-sequence trajectory produced by the optimizer initialized at $x_0$, and $\mathcal{D}_t$ is the current data. This mutual information can be written as
$$I\bigl((x,y);Q\mid\mathcal{D}_t\bigr) \;=\; H\bigl[y(x)\mid\mathcal{D}_t\bigr] \;-\; \mathbb{E}_{f\sim \mathrm{GP}(\mathcal{D}_t)}\!\left[ H\bigl( y(x)\mid \mathcal{D}_t,\, Q_{x_0}(f) \bigr) \right].$$
Using posterior function samples $f \sim \mathrm{GP}(\mathcal{D}_t)$, descent trajectories $Q_{x_0}(f)$ are simulated under the chosen local optimizer, and the acquisition function is efficiently computed using analytic GP entropies and Monte Carlo averaging. Key empirical findings include:
- Markedly lower simple and cumulative regret on high-complexity, high-dimensional synthetic and real-world benchmarks, relative to both global and other local BO methods.
- Theoretical guarantees: density of query points; a rigorous probabilistic stopping rule based on local optimality certificates.
- Practicalities: supports any differentiable local optimizer, is robust to inner heuristics (conditioning on function values vs. gradients), and is extensible to batched, constrained, or multi-fidelity BO.
LES, as applied here, operationalizes the entropy search principle not globally (“where is the optimum?”) but over the local descent basin, directly reducing uncertainty over reachable optima via descent (Stenger et al., 24 Nov 2025).
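The following Python sketch illustrates the descent-sequence information gain on a 1-D grid. It is a simplification under stated assumptions, not the method as implemented in the paper: posterior functions are drawn jointly on the grid from a zero-mean RBF GP, a greedy downhill walk stands in for the differentiable local optimizer, and the inner entropy is computed by conditioning the GP on each simulated trajectory. All names (rbf, gp_posterior, greedy_descent, les_acquisition) and hyperparameters are illustrative.

```python
import numpy as np

def rbf(a, b, ls=0.2, var=1.0):
    """Squared-exponential kernel matrix between 1-D point sets a and b."""
    return var * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls**2)

def gp_posterior(X, y, Xs, noise=1e-4):
    """Posterior mean and covariance of a GP(0, rbf) at Xs given data (X, y)."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks, Kss = rbf(X, Xs), rbf(Xs, Xs)
    sol = np.linalg.solve(K, Ks)
    return sol.T @ y, Kss - Ks.T @ sol

def greedy_descent(f_vals, start):
    """Stand-in for a local optimizer: greedy downhill walk on the grid."""
    path = [start]
    while True:
        i = path[-1]
        nbrs = [j for j in (i - 1, i + 1) if 0 <= j < len(f_vals)]
        j = min(nbrs, key=lambda k: f_vals[k])
        if f_vals[j] >= f_vals[i]:
            return path
        path.append(j)

def les_acquisition(X, y, grid, start, n_samples=64, noise=1e-4, seed=0):
    """Monte Carlo estimate of H[y(x)|D] - E_f[H[y(x)|D, descent trajectory of f]]
    for every grid point x."""
    rng = np.random.default_rng(seed)
    mu, cov = gp_posterior(X, y, grid, noise)
    h_prior = 0.5 * np.log(2 * np.pi * np.e * (np.clip(np.diag(cov), 1e-12, None) + noise))
    L = np.linalg.cholesky(cov + 1e-6 * np.eye(len(grid)))
    h_post = np.zeros(len(grid))
    for _ in range(n_samples):
        f = mu + L @ rng.standard_normal(len(grid))   # joint posterior sample
        path = greedy_descent(f, start)               # simulated descent sequence
        Xq = np.concatenate([X, grid[path]])          # condition on the sampled
        yq = np.concatenate([y, f[path]])             # trajectory's function values
        _, cov_q = gp_posterior(Xq, yq, grid, noise)
        h_post += 0.5 * np.log(2 * np.pi * np.e * (np.clip(np.diag(cov_q), 1e-12, None) + noise))
    return h_prior - h_post / n_samples

grid = np.linspace(0.0, 1.0, 101)
X = np.array([0.15, 0.5, 0.85]); y = np.sin(6 * X)
start = int(np.abs(grid - X[np.argmin(y)]).argmin())  # incumbent's grid index
alpha = les_acquisition(X, y, grid, start)
print("next query:", grid[int(alpha.argmax())])
```

Conditioning on the sampled trajectory's function values is what distinguishes this acquisition from global entropy search: points whose predictive entropy is most reduced by knowing where descent from the incumbent leads score highest.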
3. Local Entropy Search in Constraint Satisfaction and Combinatorial Landscapes
LES has been developed as an explicit solver framework for CSPs in the form of Entropy-driven Monte Carlo (EdMC) (Baldassi et al., 2015). Here, local entropy is defined as the logarithmic density of solutions at a fixed distance (overlap) from a reference configuration $\tilde{x}$:
$$\mathcal{S}(\tilde{x}, d) \;=\; \frac{1}{N}\,\log \mathcal{N}(\tilde{x}, d),$$
where
$$\mathcal{N}(\tilde{x}, d) \;=\; \sum_{x} \mathbb{X}(x)\;\delta\!\Bigl(\textstyle\sum_i x_i \tilde{x}_i,\; N(1 - 2d)\Bigr)$$
counts the configurations $x$ that satisfy all constraints (indicator $\mathbb{X}$) and lie at Hamming distance $dN$ from $\tilde{x}$.
Large-deviation analysis of the binary perceptron problem reveals that, for constraint densities below a threshold near the critical capacity, ultra-dense subdominant solution clusters exist, even when typical, isolated solutions are algorithmically inaccessible. EdMC, using a zero-temperature Metropolis search on the local entropy $\mathcal{S}$, efficiently identifies these clusters, outperforming both standard Simulated Annealing and belief-propagation-based solvers on both the perceptron and random $K$-SAT.
Algorithmically, EdMC proceeds by:
- Picking a reference configuration $\tilde{x}$.
- Running belief propagation (BP) to estimate the local entropy $\mathcal{S}(\tilde{x}, d)$.
- Proposing local flips of $\tilde{x}$; accepting based on the change in $\mathcal{S}$.
- Applying “scoping” (progressively tightening the distance $d$, i.e., increasing the target overlap) and optional annealing of the temperature (a simplified sketch of this loop follows the list).
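Below is a heavily simplified Python sketch of this loop. A crude Monte Carlo average over a soft-constraint neighborhood stands in for the BP estimate of $\mathcal{S}$ (the BP step is what makes the actual EdMC efficient), the target is a toy binary perceptron, and all function names and hyperparameters are illustrative assumptions.

```python
import numpy as np

def local_entropy_est(x_ref, patterns, dist, beta=2.0, n_samples=200, rng=None):
    """Crude Monte Carlo stand-in for the BP estimate of the local (free) entropy:
    log-average of exp(-beta * #violated constraints) over random configurations
    at Hamming distance `dist` from x_ref (soft-constraint version, up to a
    distance-dependent constant)."""
    rng = np.random.default_rng(rng)
    N = len(x_ref)
    log_w = np.empty(n_samples)
    for s in range(n_samples):
        x = x_ref.copy()
        x[rng.choice(N, size=dist, replace=False)] *= -1
        log_w[s] = -beta * np.sum(patterns @ x <= 0)   # violated: xi_mu . x <= 0
    m = log_w.max()
    return m + np.log(np.mean(np.exp(log_w - m)))      # stable log-mean-exp

def edmc_sketch(patterns, steps=5000, rng=0):
    """Zero-temperature Metropolis search on the estimated local entropy,
    with 'scoping' (periodically shrinking the neighborhood distance)."""
    rng = np.random.default_rng(rng)
    N = patterns.shape[1]
    x = rng.choice(np.array([-1, 1]), size=N)
    dist = max(1, N // 4)
    S = local_entropy_est(x, patterns, dist, rng=rng)
    for t in range(steps):
        i = rng.integers(N)
        x[i] *= -1                                     # propose a single flip
        S_new = local_entropy_est(x, patterns, dist, rng=rng)
        if S_new >= S:
            S = S_new                                  # accept: entropy did not drop
        else:
            x[i] *= -1                                 # reject: undo the flip
        if (t + 1) % 1000 == 0 and dist > 1:           # scoping step
            dist = max(1, dist // 2)
            S = local_entropy_est(x, patterns, dist, rng=rng)
    return x
```

In the actual EdMC, BP supplies the local entropy in a single pass, and collective moves plus annealing replace the purely greedy acceptance used here.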
Empirically, EdMC scales polynomially in the problem size where SA scales exponentially or fails. For asymmetric binary perceptrons, analysis using lifted random duality theory (fl RDT and sfl LD RDT) confirms that the LES breakdown closely matches the empirical algorithmic threshold for tractability: as the constraint density increases past a narrow window, clusters with positive local entropy vanish, leading to computational hardness (Stojnic, 24 Jun 2025). This supports the interpretation that rare, exponentially large dense clusters, rather than typical isolated solutions, govern the practical tractability of large CSPs (Baldassi et al., 2015, Stojnic, 24 Jun 2025).
4. LES in Discrete Sampling and Combinatorial Optimization
Local entropy has been adopted as a regularizer in Markov Chain Monte Carlo for discrete spaces to favor “flat” (voluminous) modes, important in robust combinatorial optimization and probabilistic modeling (Mohanty et al., 5 May 2025). The Entropic Discrete Langevin Proposal (EDLP) introduces a continuous auxiliary variable $\tilde{x}$ tied to the discrete state $x$, creating a joint distribution
$$\pi_\eta(x, \tilde{x}) \;\propto\; \exp\!\Bigl(\log \pi(x) - \tfrac{1}{2\eta}\,\|x - \tilde{x}\|^2\Bigr),$$
whose marginal density of $\tilde{x}$ (summing over $x$) is proportional to $\exp\bigl(\mathcal{E}(\tilde{x};\eta)\bigr)$, the exponentiated local entropy defined above.
Sampling is performed by alternating (i) discrete proposals biased by the local-entropy gradient and (ii) Langevin (Gaussian) updates in the auxiliary variable. This strategy:
- Directs the chain toward neighborhoods with high local entropy (robust, flat modes) rather than narrow isolated peaks.
- Guarantees non-asymptotic convergence under locally log-concave assumptions, and admits exact mixing bounds for both uncorrected (EDULA) and Metropolis-Hastings-corrected (EDMALA) variants.
- Empirically dominates standard discrete Langevin and Metropolis samplers for synthetic Bernoulli distributions, restricted Boltzmann machines, TSPs, and Bayesian neural networks, specifically in capturing wide, robust solution sets (Mohanty et al., 5 May 2025).
LES variants in this class thus reshape the MCMC transition kernel to reflect not just pointwise probability but also the local “entropic volume” associated with each state.
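To make the alternation concrete, here is a minimal Python sketch of one joint EDLP-style update for a $\pm 1$ quadratic (Boltzmann-machine-like) target. Exact single-flip scores stand in for the paper's gradient-based discrete proposal, and the names (edlp_step, log_pi, eta, step) and the layout of the MH correction are illustrative assumptions, not the authors' API.

```python
import numpy as np

def log_pi(x, W, b):
    """Unnormalized log-probability of a +/-1 configuration."""
    return 0.5 * x @ W @ x + b @ x

def flip(x, i):
    y = x.copy()
    y[i] *= -1
    return y

def softmax(v):
    v = v - v.max()
    e = np.exp(v)
    return e / e.sum()

def edlp_step(x, x_aux, W, b, eta=1.0, step=0.1, rng=None):
    """One sweep on the joint p(x, x_aux) prop. to exp(log_pi(x) - ||x - x_aux||^2 / (2*eta)):
    (i)  an entropy-biased, MH-corrected single-flip move on the discrete state,
    (ii) an unadjusted Langevin step on the continuous anchor x_aux."""
    rng = np.random.default_rng(rng)
    n = len(x)
    joint = lambda xx, aa: log_pi(xx, W, b) - np.sum((xx - aa) ** 2) / (2 * eta)

    # (i) discrete proposal: softmax over single-flip changes in the joint density
    cur = joint(x, x_aux)
    d_fwd = np.array([joint(flip(x, i), x_aux) - cur for i in range(n)])
    p_fwd = softmax(d_fwd / 2)
    i = rng.choice(n, p=p_fwd)
    x_prop = flip(x, i)
    cur_prop = joint(x_prop, x_aux)
    d_rev = np.array([joint(flip(x_prop, j), x_aux) - cur_prop for j in range(n)])
    p_rev = softmax(d_rev / 2)
    if np.log(rng.random()) < (cur_prop - cur) + np.log(p_rev[i]) - np.log(p_fwd[i]):
        x = x_prop                                    # Metropolis-Hastings accept

    # (ii) Langevin (Gaussian) update of the auxiliary anchor
    grad = -(x_aux - x) / eta                         # grad of log p(x_aux | x)
    x_aux = x_aux + 0.5 * step * grad + np.sqrt(step) * rng.standard_normal(n)
    return x, x_aux
```

Dropping the accept/reject test gives an uncorrected (EDULA-like) variant; the coupling scale eta plays the role of the neighborhood radius in the local-entropy definition above.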
5. LES as Optimization Regularizer: Deep Learning and Anisotropy
In deep learning, local entropy objectives have been proposed and extensively analyzed as regularizers that bias training toward wide optima, empirically improving generalization and robustness (Chaudhari et al., 2016, Musso, 2020). The basic approach is to modify standard SGD so that it optimizes the negative local entropy $-F(x;\gamma)$ defined above in place of the raw loss $f(x)$.
The gradient of this local entropy is given by
$$\nabla_x F(x;\gamma) \;=\; -\gamma\bigl(x - \langle x'\rangle\bigr), \qquad \langle x'\rangle = \mathbb{E}\bigl[x'\bigr]\ \text{under}\ P(x';x) \propto \exp\!\Bigl(-f(x') - \tfrac{\gamma}{2}\,\|x - x'\|^2\Bigr).$$
This expectation is approximated using inner-loop Langevin dynamics (as in SGLD), leading to Entropy-SGD:
- Outer loop: SGD on $-F(x;\gamma)$ with gradient $\gamma\,(x - \mu)$ (where $\mu$ is the Langevin estimate of $\langle x'\rangle$).
- Inner loop: several steps of SGLD on $x'$, initialized at $x$, using the composite energy $f(x') + \tfrac{\gamma}{2}\|x - x'\|^2$ (a minimal sketch follows).
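The following minimal Python sketch implements this outer/inner structure for a generic stochastic gradient oracle grad_f; the hyperparameter values and the simple running average used for the Langevin mean are illustrative assumptions.

```python
import numpy as np

def entropy_sgd(grad_f, x0, gamma=1e-2, lr=0.1, n_outer=100,
                inner_steps=20, inner_lr=0.1, temperature=1e-3, rng=0):
    """Minimal Entropy-SGD sketch.
    Inner loop: SGLD on x' with composite energy f(x') + gamma/2 ||x - x'||^2,
    initialized at x, to estimate the Gibbs mean mu (the Langevin mean).
    Outer loop: gradient step on -F(x; gamma), whose gradient is gamma * (x - mu)."""
    rng = np.random.default_rng(rng)
    x = np.array(x0, dtype=float)
    for _ in range(n_outer):
        xp, mu = x.copy(), x.copy()
        for k in range(1, inner_steps + 1):
            g = grad_f(xp) + gamma * (xp - x)          # gradient of the composite energy
            noise = np.sqrt(2 * inner_lr * temperature) * rng.standard_normal(x.shape)
            xp = xp - inner_lr * g + noise             # SGLD step
            mu += (xp - mu) / (k + 1)                  # running average of x'
        x = x - lr * gamma * (x - mu)                  # outer step on -F(x; gamma)
    return x

# Usage on a toy quadratic f(x) = 0.5 ||x||^2, whose gradient is x itself:
x_flat = entropy_sgd(lambda z: z, x0=np.ones(5))
```

Small gamma widens the smoothing neighborhood and strengthens the bias toward broad valleys; as gamma grows, the objective approaches the raw loss.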
Partial local entropy regularization refines this framework by smoothing only along selected, typically high-anisotropy directions (e.g., per-layer or blockwise), using a block-diagonal covariance to define the smoothing kernel (Musso, 2020). This setup:
- Adapts regularization to the observed anisotropic gradient noise statistics of modern networks.
- Confers persistent test-time accuracy gains when applied to select layers (e.g., second or last, non-convolutional layers), with negligible or negative impact when smoothing sharp or structurally-constrained parameters.
Empirical studies confirm that layerwise temperature (the variance of gradients within a layer) decays jointly across all layers late in training, and that partial LES can be tuned to these anisotropic dynamics for optimal regularization benefit (Musso, 2020).
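One possible reading of the partial scheme is sketched below under simplifying assumptions: the block-diagonal smoothing covariance is specialized to an isotropic kernel restricted to a single boolean mask (e.g., one layer's parameters), while the remaining coordinates keep their plain loss gradient. The function name and defaults are illustrative, not Musso's implementation.

```python
import numpy as np

def partial_entropy_grad(grad_f, x, mask, gamma=1e-2, inner_steps=20,
                         inner_lr=0.1, temperature=1e-3, rng=None):
    """Gradient estimate that smooths only the coordinates selected by `mask`:
    an inner SGLD loop perturbs the masked block (the rest stays pinned at x),
    and the returned direction mixes the local-entropy gradient on the block
    with the ordinary loss gradient elsewhere."""
    rng = np.random.default_rng(rng)
    xp, mu = x.copy(), x.copy()
    for k in range(1, inner_steps + 1):
        g = grad_f(xp) + gamma * (xp - x)
        noise = np.sqrt(2 * inner_lr * temperature) * rng.standard_normal(x.shape)
        xp = np.where(mask, xp - inner_lr * g + noise, x)   # only the block moves
        mu += (xp - mu) / (k + 1)
    return np.where(mask, gamma * (x - mu), grad_f(x))
```

Choosing the mask per layer and tuning it to the observed gradient-noise anisotropy mirrors the layer-selection idea described above.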
6. Algorithmic and Practical Considerations
Across domains, the implementation of LES requires choices (reference points, hyperparameters) and tradeoffs:
| Domain | LES Reference / Region | Main Computational Step | Practical Notes |
|---|---|---|---|
| Bayesian Optimization | Current incumbent; descent trajectory | GP posterior propagation, entropy computation | Local optimizer (GD/ADAM/CMA-ES) interchangeable (Stenger et al., 24 Nov 2025) |
| CSP/SAT/Perceptron | Reference state $\tilde{x}$ at fixed overlap/distance | BP, Metropolis/collective moves on $\mathcal{S}$ | Soft/hard constraint focus; scoping and annealing (Baldassi et al., 2015) |
| Discrete Sampling | Auxiliary anchor variable $\tilde{x}$ | Langevin update in $\tilde{x}$, entropy-biased discrete proposals | MCMC mixing proved for log-concave settings (Mohanty et al., 5 May 2025) |
| Neural Nets | Parameter vector or sub-block | Inner SGLD loop, entropy gradient over local region | Smoothing covariance tailored to layerwise anisotropy (Chaudhari et al., 2016, Musso, 2020) |
Key parameters include the scope (radius) of the entropy measure ($\gamma$, $\eta$, or the distance $d$ above), the number of samples for the expectation or descent-sequence simulation, and, for partial entropy, the selection of subspaces (layers or coordinates). Computational cost is dominated by the stochastic inner loop (SGLD, BP, or optimizer trajectories), but scale is controllable via parallelism, subsampling, or analytic entropy evaluations.
7. Theoretical Insights and Implications
LES rigorously relates tractability and performance to the existence of large, dense solution clusters in complex stochastic landscapes:
- For CSPs, high local entropy near solution clusters correlates with polynomial-time tractability; its breakdown marks the onset of hard phases where only isolated solutions remain (Baldassi et al., 2015, Stojnic, 24 Jun 2025).
- In Bayesian optimization, the density of query points and the decrease of mutual information yield a certificate of local optimality and facilitate principled stopping rules (Stenger et al., 24 Nov 2025).
- Discrete sampling with LES augments classical mixing guarantees by explicitly controlling mode volume, yielding uniform ergodicity under mild regularity.
- In deep learning, smoothing via local entropy strictly decreases the spectral norm of the effective Hessian, improving stability and uniform generalization bounds (Chaudhari et al., 2016). Anisotropic or partial LES aligns the smoothing directions to those with highest informative value for test performance (Musso, 2020).
A unified implication is that search or learning dynamics, when biased by local entropy, are governed less by rare, isolated optima and more by the structure and prevalence of robust, volumetric clusters—a phenomenon consistent across synthetic, combinatorial, and machine learning problems.
References
- “Local Entropy Search over Descent Sequences for Bayesian Optimization” (Stenger et al., 24 Nov 2025)
- “Local entropy as a measure for sampling solutions in Constraint Satisfaction Problems” (Baldassi et al., 2015)
- “Rare dense solutions clusters in asymmetric binary perceptrons -- local entropy via fully lifted RDT” (Stojnic, 24 Jun 2025)
- “Entropy-Guided Sampling of Flat Modes in Discrete Spaces” (Mohanty et al., 5 May 2025)
- “Partial local entropy and anisotropy in deep weight spaces” (Musso, 2020)
- “Entropy-SGD: Biasing Gradient Descent Into Wide Valleys” (Chaudhari et al., 2016)