
Local Entropy Search (LES) Overview

Updated 1 December 2025
  • Local Entropy Search (LES) is an approach that quantifies the log-density of nearby solutions to favor flat, robust basins over isolated sharp optima.
  • LES techniques employ Gaussian-averaged free energies and modeled descent sequences to improve sample efficiency and reduce regret in Bayesian optimization, and to locate dense solution clusters in constraint satisfaction problems.
  • By regularizing the optimization landscape, LES improves algorithmic tractability in combinatorial tasks and generalization in deep neural network training.

Local Entropy Search (LES) is a family of algorithms and analysis techniques that leverage local entropy—the log-density or free energy of solutions in a neighborhood of the state or parameter space—to bias search (optimization or sampling) toward “flat” or dense basins, as opposed to isolated sharp optima. LES methodologies have been developed for Bayesian optimization, constraint satisfaction problems, discrete sampling, and deep neural network training, providing both algorithmic tools and theoretical insights into the structure and tractability of complex energy landscapes.

1. Local Entropy: Formalism and Motivations

Local entropy quantifies the log-weighted volume of solutions or low-cost configurations in the proximity of a specified reference point. For a generic differentiable loss $L(\theta)$ with parameters $\theta \in \mathbb{R}^d$, the isotropic local entropy is given by a Gaussian-averaged free energy:

$S_{\mathrm{iso}}(\theta;\beta,\sigma) = -\frac{1}{\beta} \log \int_{\mathbb{R}^d} \exp\bigl[-\beta L(\theta+\xi)\bigr]\,\mathcal{N}(\xi;0, \sigma^2 I)\,d\xi$

or, equivalently, using $\gamma = 1/\sigma^2$,

$F(\theta; \beta,\gamma) = -\log \int_{\mathbb{R}^d} \exp\left[-\beta L(w') - \frac{\gamma}{2}\|w' - \theta\|^2\right]\,dw'$

In discrete spaces, for example for $x \in \{0,1\}^d$ with log-probability $U(x)$, the local entropy centered at $z$ is:

$S(z;\eta) = \log \sum_{x} \exp\left\{U(x) - \frac{1}{2\eta} \|x-z\|^2\right\}$

These forms measure not only the “depth” of the cost or energy at $\theta$ but also its “flatness” in a neighborhood, favoring solutions that are robust under local perturbations. The concept has been exploited across the domains reviewed in the following sections; a minimal numerical illustration of the isotropic form is given below.
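
As a concrete illustration, the isotropic local entropy can be estimated by plain Monte Carlo over Gaussian perturbations. The following minimal Python sketch is not taken from any of the cited papers; the `local_entropy_mc` helper, the two toy losses, and all hyperparameters are illustrative assumptions. It shows that, at equal depth, a sharp basin yields a larger (worse) Gaussian-averaged free energy than a flat one.

```python
# Minimal sketch (assumption, not from the cited papers): Monte Carlo estimate of
# the isotropic local entropy S_iso(theta; beta, sigma) for an arbitrary loss.
import numpy as np

def local_entropy_mc(loss, theta, beta=1.0, sigma=0.1, n_samples=4096, rng=None):
    """Estimate S_iso(theta) = -(1/beta) * log E_xi[exp(-beta * L(theta + xi))],
    with xi ~ N(0, sigma^2 I), using a log-sum-exp for numerical stability."""
    rng = np.random.default_rng(0) if rng is None else rng
    theta = np.asarray(theta, dtype=float)
    xi = rng.normal(scale=sigma, size=(n_samples, theta.size))     # Gaussian perturbations
    losses = np.array([loss(theta + x) for x in xi])               # L(theta + xi) per sample
    a = -beta * losses
    log_mean = np.max(a) + np.log(np.mean(np.exp(a - np.max(a))))  # log E[exp(-beta * L)]
    return -log_mean / beta

# A sharp minimum at 0 and a flat (wide) minimum at 3, both of depth zero.
sharp = lambda th: 50.0 * (th[0] - 0.0) ** 2
flat  = lambda th: 0.5  * (th[0] - 3.0) ** 2
print(local_entropy_mc(sharp, [0.0], sigma=0.3))  # larger value: the sharp basin is penalized
print(local_entropy_mc(flat,  [3.0], sigma=0.3))  # smaller value: the flat basin is favored
```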

2. LES in Bayesian Optimization: Descent-Sequences and Mutual Information

In “Local Entropy Search over Descent Sequences for Bayesian Optimization,” LES is defined as a sample-efficient, local Bayesian optimization strategy that explicitly models the probability distribution over descent sequences produced by a local optimizer $\mathcal{O}$ (e.g., gradient descent) acting on the Gaussian process (GP) surrogate posterior (Stenger et al., 24 Nov 2025).

Central to this approach is the acquisition function

$\alpha_{\mathrm{LES}}(x) = I\bigl( (x, y(x)); Q_{x_0} \mid \mathcal{D}_t \bigr)$

where $Q_{x_0}$ is the random descent-sequence trajectory under the optimizer $\mathcal{O}$ initialized at $x_0$, and $\mathcal{D}_t$ is the current data. This mutual information can be written as

$I\bigl((x,y);Q_{x_0}\mid\mathcal{D}_t\bigr) = H\bigl[y(x)\mid\mathcal{D}_t\bigr] - \mathbb{E}_{f\sim \mathrm{GP}(\mathcal{D}_t)}\left[ H\bigl( y(x)\mid \mathcal{D}_t, Q_{x_0}(f) \bigr) \right]$

Using posterior function samples, descent trajectories are simulated under $\mathcal{O}$, and the acquisition function is efficiently computed using analytic GP entropies and Monte Carlo averaging. Key empirical findings include:

  • Markedly lower simple and cumulative regret on high-complexity, high-dimensional synthetic and real-world benchmarks relative to global/other local BO methods.
  • Theoretical guarantees: density of query points; a rigorous probabilistic stopping rule based on local optimality certificates.
  • Practicalities: supports any differentiable local optimizer, is robust to inner heuristics (conditioning on function values vs. gradients), and is extensible to batched, constrained, or multi-fidelity BO.

LES, as applied here, operationalizes the entropy search principle not globally (“where is the optimum?”) but over the local descent basin, directly reducing uncertainty over reachable optima via descent (Stenger et al., 24 Nov 2025).
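
To make the mechanics concrete, the following toy Python sketch mimics the structure of this acquisition on a one-dimensional grid. It is not the authors' implementation and rests on several simplifying assumptions: the GP uses a fixed RBF kernel, the local optimizer is greedy descent of a posterior sample over the grid, conditioning on the descent sequence $Q_{x_0}(f)$ is approximated by treating the visited points as additional (nearly noiseless) observations, and the entropies are the analytic Gaussian ones. All helper names (`rbf`, `gp_posterior`, `greedy_descent`, `les_acquisition`) are hypothetical.

```python
# Hedged sketch of a LES-style acquisition alpha(x) = H[y(x)|D] - E_f[H(y(x)|D, Q_x0(f))]
# on a 1-D grid, with a hand-rolled GP and greedy descent as the local optimizer.
import numpy as np

def rbf(a, b, ls=0.2, var=1.0):
    d = a[:, None] - b[None, :]
    return var * np.exp(-0.5 * (d / ls) ** 2)

def gp_posterior(X, y, Xs, noise=1e-4):
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks, Kss = rbf(X, Xs), rbf(Xs, Xs)
    sol = np.linalg.solve(K, Ks)
    return sol.T @ y, Kss - Ks.T @ sol              # posterior mean and covariance on Xs

def greedy_descent(fvals, start_idx):
    """Descent sequence of grid indices under the sampled function values."""
    path, i = [start_idx], start_idx
    while True:
        nbrs = [j for j in (i - 1, i + 1) if 0 <= j < len(fvals)]
        j = min(nbrs, key=lambda k: fvals[k])
        if fvals[j] >= fvals[i]:
            return path
        path.append(j); i = j

def les_acquisition(X, y, grid, x0_idx, n_samples=32, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    mu, cov = gp_posterior(X, y, grid)
    prior_var = np.diag(cov).copy()
    L = np.linalg.cholesky(cov + 1e-6 * np.eye(len(grid)))
    cond_var = np.zeros_like(prior_var)
    for _ in range(n_samples):
        f = mu + L @ rng.standard_normal(len(grid))  # posterior function sample
        path = greedy_descent(f, x0_idx)             # simulated descent sequence Q_x0(f)
        Xq = np.concatenate([X, grid[path]])         # condition on data + trajectory
        yq = np.concatenate([y, f[path]])
        _, c = gp_posterior(Xq, yq, grid)
        cond_var += np.diag(c) / n_samples
    # Difference of Gaussian entropies: how much each x would tell us about the descent sequence.
    return 0.5 * np.log(np.maximum(prior_var, 1e-12) / np.maximum(cond_var, 1e-12))

# Usage: start the descent at the incumbent and query the acquisition maximizer next.
grid = np.linspace(0.0, 1.0, 50)
X, y = np.array([0.1, 0.5, 0.9]), np.sin(6.0 * np.array([0.1, 0.5, 0.9]))
x0_idx = int(np.abs(grid - X[np.argmin(y)]).argmin())
x_next = grid[int(np.argmax(les_acquisition(X, y, grid, x0_idx)))]
```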

3. Local Entropy Search in Constraint Satisfaction and Combinatorial Landscapes

LES has been developed as an explicit solver framework for CSPs in the form of Entropy-driven Monte Carlo (EdMC) (Baldassi et al., 2015). Here, local entropy is defined as the logarithmic density of solutions at a fixed overlap $S$ with a reference configuration $\tilde{x}$:

$\mathcal{M}(S) = \frac{1}{N}\bigl\langle \log N(\tilde{x}, S) \bigr\rangle$

where

$N(\tilde{x}, S) = \sum_x \left[\prod_\mu \psi_\mu(x_{\partial\mu})\right] \delta(x \cdot \tilde{x} - S N)$

Large-deviation analysis of the binary perceptron problem reveals that, for constraint densities $\alpha < \alpha_c$, ultra-dense subdominant solution clusters exist, even when isolated global solutions are algorithmically inaccessible. EdMC, using a Metropolis zero-temperature search on $F(\tilde{x}, \gamma)$, efficiently identifies these clusters, outperforming both standard Simulated Annealing and belief-propagation-based solvers for both the perceptron and random $K$-SAT.

Algorithmically, EdMC proceeds by the following steps (a simplified sketch follows the list):

  • Picking a reference configuration $\tilde{x}$.
  • Running BP to estimate $F(\tilde{x}, \gamma)$.
  • Proposing local flips and accepting them based on the change in $F$.
  • Applying “scoping” (increasing $\gamma$) and, optionally, annealing the temperature.
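
A heavily simplified sketch of this loop is given below. It is not the authors' EdMC implementation: for brevity, the belief-propagation estimate of $F(\tilde{x},\gamma)$ is replaced by a crude Monte Carlo proxy over configurations sampled near the reference, and the clause encoding, instance size, and hyperparameters are illustrative assumptions suitable only for tiny toy problems.

```python
# Hedged sketch of EdMC-style search: zero-temperature Metropolis moves of the
# reference configuration on a local free energy F(x_ref, gamma), with "scoping"
# (progressively larger gamma). BP is replaced here by a noisy Monte Carlo proxy.
import numpy as np

def n_violated(x, clauses):
    """Number of violated clauses; `clauses` holds tuples of signed 1-based literals."""
    return sum(all((x[abs(l) - 1] == 1) != (l > 0) for l in cl) for cl in clauses)

def local_free_energy(x_ref, clauses, gamma, n_samples=500, flip_p=0.1, rng=None):
    """Crude proxy (up to an additive constant) for
    -log sum_x exp(-E(x) - (gamma/2) * dist(x, x_ref)), sampling x near x_ref."""
    rng = np.random.default_rng(0) if rng is None else rng
    logs = []
    for _ in range(n_samples):
        x = np.where(rng.random(x_ref.size) < flip_p, 1 - x_ref, x_ref)  # perturb the reference
        logs.append(-float(n_violated(x, clauses)) - 0.5 * gamma * np.sum(x != x_ref))
    m = max(logs)
    return -(m + np.log(np.mean(np.exp(np.array(logs) - m))))

def edmc(clauses, n_vars, gammas=(0.5, 1.0, 2.0), sweeps=60, rng=None):
    rng = np.random.default_rng(1) if rng is None else rng
    x_ref = rng.integers(0, 2, n_vars)
    for gamma in gammas:                          # scoping: progressively tighten the neighborhood
        F = local_free_energy(x_ref, clauses, gamma, rng=rng)
        for _ in range(sweeps):
            i = rng.integers(n_vars)
            cand = x_ref.copy(); cand[i] ^= 1     # propose a single flip of the reference
            F_new = local_free_energy(cand, clauses, gamma, rng=rng)
            if F_new <= F:                        # zero-temperature acceptance on Delta F
                x_ref, F = cand, F_new
    return x_ref

# Usage: a tiny random 3-SAT instance (15 variables, 40 clauses).
rng = np.random.default_rng(2)
clauses = [tuple(int(s * (v + 1)) for v, s in zip(rng.choice(15, 3, replace=False),
                                                  rng.choice([-1, 1], 3))) for _ in range(40)]
x = edmc(clauses, n_vars=15)
print(n_violated(x, clauses))   # clauses still violated by the returned reference
```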

Empirically, EdMC scales polynomially in the size $N$ where SA scales exponentially or fails. For asymmetric binary perceptrons, analysis using lifted random duality theory (fl RDT and sfl LD RDT) confirms that the LES breakdown closely matches the empirical algorithmic threshold for tractability: as the constraint density $\alpha$ increases past a narrow window ($0.77 < \alpha < 0.78$), clusters with positive local entropy vanish, leading to computational hardness (Stojnic, 24 Jun 2025). This supports the interpretation that rare, exponentially large dense clusters, rather than typical isolated solutions, govern the practical tractability of large CSPs (Baldassi et al., 2015, Stojnic, 24 Jun 2025).

4. LES in Discrete Sampling and Combinatorial Optimization

Local entropy has been adopted as a regularizer in Markov Chain Monte Carlo for discrete spaces to favor “flat” (voluminous) modes, important in robust combinatorial optimization and probabilistic modeling (Mohanty et al., 5 May 2025). The Entropic Discrete Langevin Proposal (EDLP) introduces a continuous auxiliary variable $z$ tied to the discrete state $x$, creating a joint distribution:

$\pi(x, z) \propto \exp\left\{U(x) - \frac{1}{2\eta}\|x-z\|^2\right\}$

Sampling is performed by alternating (i) discrete proposals biased by the local-entropy gradient and (ii) Langevin (Gaussian) updates in the auxiliary variable. This strategy:

  • Directs the chain toward neighborhoods with high local entropy (robust, flat modes) rather than narrow isolated peaks.
  • Guarantees non-asymptotic convergence under locally log-concave assumptions, and admits exact mixing bounds for both uncorrected (EDULA) and Metropolis-Hastings-corrected (EDMALA) variants.
  • Empirically dominates standard discrete Langevin and Metropolis samplers for synthetic Bernoulli distributions, restricted Boltzmann machines, TSPs, and Bayesian neural networks, specifically in capturing wide, robust solution sets (Mohanty et al., 5 May 2025).

LES variants in this class thus change the exploration/transitions of MCMC to reflect not just pointwise probability but also the local “entropic volume” associated with each state.
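
A minimal stand-in for this alternation is sketched below. It is not the EDLP implementation from the paper: the discrete move is a single-bit, locally informed Metropolis-Hastings proposal on the joint density at fixed $z$, the $z$ update is an unadjusted Langevin step on its Gaussian conditional, and the toy target $U(x) = \theta^\top x$ together with all hyperparameters are assumptions.

```python
# Hedged sketch of an EDLP-like sampler on pi(x, z) propto exp{U(x) - ||x - z||^2 / (2*eta)}
# for x in {0,1}^d. Marginally, z follows the local-entropy distribution exp(S(z; eta)).
import numpy as np

def joint_logp(x, z, U, eta):
    return U(x) - np.sum((x - z) ** 2) / (2.0 * eta)

def edlp_step(x, z, U, eta, step=0.05, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    d = x.size
    # (i) locally informed single-bit flip, Metropolis-Hastings corrected at fixed z
    cur = joint_logp(x, z, U, eta)
    deltas = np.array([joint_logp(np.where(np.arange(d) == i, 1 - x, x), z, U, eta) - cur
                       for i in range(d)])
    probs = np.exp(0.5 * deltas); probs /= probs.sum()
    i = rng.choice(d, p=probs)
    x_new = x.copy(); x_new[i] = 1 - x_new[i]
    cur_new = joint_logp(x_new, z, U, eta)
    deltas_rev = np.array([joint_logp(np.where(np.arange(d) == j, 1 - x_new, x_new), z, U, eta)
                           - cur_new for j in range(d)])
    probs_rev = np.exp(0.5 * deltas_rev); probs_rev /= probs_rev.sum()
    if np.log(rng.random()) < (cur_new - cur) + np.log(probs_rev[i]) - np.log(probs[i]):
        x = x_new
    # (ii) unadjusted Langevin step on z: grad_z log pi(x, z) = (x - z) / eta
    z = z + 0.5 * step * (x - z) / eta + np.sqrt(step) * rng.standard_normal(d)
    return x, z

# Usage on a toy independent-coordinate model U(x) = theta . x
theta = np.array([1.0, -0.5, 0.2, 0.0])
U = lambda x: float(theta @ x)
x, z = np.zeros(4, dtype=int), np.zeros(4)
for _ in range(1000):
    x, z = edlp_step(x, z, U, eta=0.5)
```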

5. LES as Optimization Regularizer: Deep Learning and Anisotropy

In deep learning, local entropy objectives have been proposed and extensively analyzed as regularizers that bias training toward wide optima, empirically improving generalization and robustness (Chaudhari et al., 2016, Musso, 2020). The basic approach is to modify standard SGD to optimize the local-entropy-averaged loss $F(x; \gamma)$, as detailed above.

The gradient of this local entropy is given by

$\nabla_x F(x; \gamma) = \frac{1}{\gamma} \mathbb{E}_{w \sim p(w|x)}[w - x], \qquad p(w|x) \propto \exp\left(-f(w) - \frac{1}{2\gamma}\|w-x\|^2\right)$

This expectation is approximated using inner-loop Langevin dynamics (as in SGLD), leading to Entropy-SGD (a simplified sketch follows the two bullets below):

  • Outer loop: SGD on $x$ with gradient $x - \mu$ (where $\mu$ is the Langevin mean).
  • Inner loop: $L$ steps of SGLD on $w$, initialized at $x$, using the composite energy $f(w) + \frac{1}{2\gamma}\|w-x\|^2$.
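
A compact sketch of these two loops on a toy one-dimensional loss is given below. It follows the structure above but is not the authors' code: the running-mean estimate of $\mu$, the thermal-noise scale, and the double-well toy loss are illustrative assumptions.

```python
# Hedged sketch of Entropy-SGD: an inner (S)GLD loop on the composite energy
# f(w) + ||w - x||^2 / (2*gamma) tracks the Langevin mean mu; the outer loop
# then descends along (x - mu) / gamma, an estimate of the local-entropy gradient.
import numpy as np

def entropy_sgd(grad_f, x0, gamma=0.1, outer_lr=0.1, inner_lr=0.01,
                inner_steps=20, noise_scale=1e-3, n_outer=100, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_outer):
        w, mu = x.copy(), x.copy()
        for k in range(1, inner_steps + 1):
            g = grad_f(w) + (w - x) / gamma                       # composite-energy gradient
            w = w - inner_lr * g \
                + np.sqrt(2.0 * inner_lr) * noise_scale * rng.standard_normal(x.size)
            mu += (w - mu) / (k + 1)                              # running mean of the chain
        x -= outer_lr * (x - mu) / gamma                          # outer local-entropy step
    return x

# Toy loss with a sharp minimum near -1 and a wide minimum near +2 (equal depth);
# f is shown for reference, only its gradient is used.
f      = lambda w: 1.0 - np.exp(-50.0 * (w + 1.0) ** 2) - np.exp(-0.5 * (w - 2.0) ** 2)
grad_f = lambda w: 100.0 * (w + 1.0) * np.exp(-50.0 * (w + 1.0) ** 2) \
                   + (w - 2.0) * np.exp(-0.5 * (w - 2.0) ** 2)
print(entropy_sgd(grad_f, x0=np.array([0.4])))   # settles near the wide minimum, roughly [2.0]
```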

Partial local entropy regularization refines this framework by smoothing only along selected, typically high-anisotropy directions (e.g., per-layer or blockwise), using a block-diagonal covariance $\Sigma_p$ to define the smoothing kernel (Musso, 2020). This setup:

  • Adapts regularization to the observed anisotropic gradient noise statistics of modern networks.
  • Confers persistent test-time accuracy gains when applied to select layers (e.g., second or last, non-convolutional layers), with negligible or negative impact when smoothing sharp or structurally-constrained parameters.

Empirical studies confirm that layerwise temperature (the variance of gradients within a layer) decays jointly across all layers late in training, and that partial LES can be tuned to these anisotropic dynamics for optimal regularization benefit (Musso, 2020).
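
The block-restricted smoothing can be illustrated in a few lines. The sketch below is not from the paper: it uses the simpler Gaussian-convolved loss (the small-$\beta$ limit of local entropy) rather than the full Gibbs average, and the mask, sample count, and toy quadratic loss are assumptions.

```python
# Hedged sketch of partial (block-restricted) smoothing: Gaussian perturbations are
# applied only to a chosen coordinate block, i.e., a block-diagonal smoothing covariance.
import numpy as np

def partial_smoothed_grad(grad_f, theta, block_mask, sigma=0.05, n_samples=64, rng=None):
    """Monte Carlo estimate of grad E_xi[f(theta + xi)] with xi ~ N(0, sigma^2 diag(mask))."""
    rng = np.random.default_rng(0) if rng is None else rng
    g = np.zeros_like(theta)
    for _ in range(n_samples):
        xi = sigma * rng.standard_normal(theta.size) * block_mask   # perturb only the block
        g += grad_f(theta + xi) / n_samples
    return g

# Usage: smooth only the last two of four parameters (a stand-in for one selected layer).
theta = np.array([0.3, -0.2, 1.5, 0.7])
block_mask = np.array([0.0, 0.0, 1.0, 1.0])
grad_f = lambda th: 2.0 * th                # toy quadratic loss f(theta) = ||theta||^2
theta -= 0.1 * partial_smoothed_grad(grad_f, theta, block_mask)   # one smoothed update
```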

6. Algorithmic and Practical Considerations

Across domains, implementing LES requires choices of reference points and hyperparameters, and involves tradeoffs, summarized in the table below:

| Domain | LES Reference / Region | Main Computational Step | Practical Notes |
|---|---|---|---|
| Bayesian Optimization | Current incumbent; descent trajectory | GP posterior propagation, entropy computation | Local optimizer (GD/ADAM/CMA-ES) interchangeable (Stenger et al., 24 Nov 2025) |
| CSP/SAT/Perceptron | Reference state at overlap $S$ | BP, Metropolis/collective moves on $F(\tilde{x},\gamma)$ | Soft/hard constraint focus; scoping and annealing (Baldassi et al., 2015) |
| Discrete Sampling | Auxiliary anchor variable $z$ | Langevin update in $z$, entropy-biased discrete proposals | MCMC mixing proved for log-concave settings (Mohanty et al., 5 May 2025) |
| Neural Nets | Parameter vector or sub-block | Inner SGLD loop, entropy gradient over local region | Smoothing covariance tailored to layerwise anisotropy (Chaudhari et al., 2016, Musso, 2020) |

Key parameters include the scope (radius) of the entropy measure ($\sigma$, $\gamma$, $\eta$), the number of samples for expectation or descent-sequence simulation, and, for partial entropy, the selection of subspaces (layers or coordinates). Computational cost is dominated by the stochastic inner loop (SGLD, BP, or optimizer trajectories), but scale is controllable via parallelism, subsampling, or analytic entropy evaluations.

7. Theoretical Insights and Implications

LES rigorously relates tractability and performance to the existence of large, dense solution clusters in complex stochastic landscapes:

  • For CSPs, high local entropy near solution clusters correlates with polynomial-time tractability; its breakdown marks the onset of hard phases where only isolated solutions remain (Baldassi et al., 2015, Stojnic, 24 Jun 2025).
  • In Bayesian optimization, the density of queries and the decrease in mutual information yield an $(\varepsilon,\delta)$ certificate of local optimality and facilitate principled stopping rules (Stenger et al., 24 Nov 2025).
  • Discrete sampling with LES augments classical mixing guarantees by explicitly controlling mode volume, yielding uniform ergodicity under mild regularity.
  • In deep learning, smoothing via local entropy strictly decreases the spectral norm of the effective Hessian, improving stability and uniform generalization bounds (Chaudhari et al., 2016). Anisotropic or partial LES aligns the smoothing directions to those with highest informative value for test performance (Musso, 2020).

A unified implication is that search or learning dynamics, when biased by local entropy, are governed less by rare, isolated optima and more by the structure and prevalence of robust, volumetric clusters—a phenomenon consistent across synthetic, combinatorial, and machine learning problems.
