
Landscape-Smoothed SAM (LSAM)

Updated 5 September 2025
  • Landscape-Smoothed SAM is an optimization framework that combines adversarial perturbation, kernel smoothing, and asynchronous sampling to enhance scalability and robustness in distributed deep learning.
  • It mitigates traditional sharpness-aware minimization limitations by smoothing the loss landscape and decoupling adversarial computation from synchronization, enabling efficient large-batch training.
  • Empirical results on benchmarks like SVHN and CIFAR demonstrate that LSAM achieves lower test errors and faster convergence compared to classical methods.

Landscape-Smoothed SAM (LSAM) is an optimization framework within sharpness-aware minimization that integrates adversarial perturbation, kernel-based smoothing, and asynchronous distributed sampling to achieve efficient and robust generalization for deep learning, particularly in large-batch distributed settings. LSAM addresses the scalability limits of traditional SAM by smoothing the adversarial loss landscape and enabling efficient training without synchronization bottlenecks in distributed systems.

1. Motivation, Overview, and Distinction from Classical SAM

Sharpness-Aware Minimization (SAM) augments standard empirical risk minimization by searching for model parameters that minimize the worst-case loss within a local neighborhood, effectively steering the optimization toward flat minima with empirically verified generalization benefits. However, the SAM adversarial step, which finds $\epsilon^*(x)$ maximizing $f(x + \epsilon^*(x))$ subject to $\|\epsilon\| \leq \rho$, relies on synchronous, centralized computation and is inefficient in large-batch, distributed scenarios. In conventional data-parallel frameworks, increasing the number of workers either decreases the per-worker batch size or increases the effective batch size, both of which degrade SAM's convergence and generalization properties.
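
For concreteness, the classical SAM step described above can be sketched in a few lines. The following is a minimal NumPy illustration with a toy loss and hand-picked values of $\rho$ and the learning rate $\eta$; it is a schematic of the two-step SAM update, not code from the LSAM paper.

```python
import numpy as np

# Toy differentiable loss and its gradient; stand-ins for a minibatch loss.
def f(x):
    return 0.5 * np.sum(x ** 2) + np.sum(np.sin(x))

def grad_f(x):
    return x + np.cos(x)

def sam_step(x, rho=0.05, eta=0.1):
    """One classical SAM update: climb to the (first-order) worst-case point
    in a rho-ball, then descend with the gradient taken at that point."""
    g = grad_f(x)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # approximate epsilon*(x)
    return x - eta * grad_f(x + eps)             # descend using the perturbed gradient

x = np.ones(4)
for _ in range(200):
    x = sam_step(x)
```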

Landscape-Smoothed SAM addresses these deficiencies by introducing a smoothed sharpness-aware objective, integrating kernel convolution and asynchronous sampling, and decoupling adversarial computation from synchronization constraints. LSAM can be interpreted as a convolutional Boltzmann-like objective that both preserves the adversarial structure of SAM and enables scalable distributed learning.

2. Mathematical Formulation and Kernel Smoothing

The core formulation of LSAM reinterprets SAM's maximization via a probabilistic view. SAM's update can be framed as

$$\pi_{SAM}(x) \propto \exp\left( - \mathbb{E}_{\xi \sim \mathcal{D}} \left[ f(x + \epsilon^*(x); \xi) \right] \right),$$

where $\epsilon^*(x)$ is the solution to the inner maximization problem over a $\rho$-radius ball in parameter space.

LSAM generalizes this by introducing a landscape-smoothing convolution. Concretely, the optimized distribution is

$$\pi_{LSAM}(y) \propto \int \exp\big(-f(T_{\rho,\gamma}(x))\big)\, \exp\big(-k(x, y)\big)\, dx,$$

where

  • $T_{\rho,\gamma}(x) = x + \frac{\rho\, \nabla f(x)}{\|\nabla f(x)\| + \gamma}$ is the adversarially perturbed parameter (with $\gamma > 0$ for numerical stability),
  • $k(x, y)$ is a symmetric positive-definite kernel, typically Gaussian, and
  • $f(\cdot)$ is the task loss.

This convolution produces a smoothed surrogate landscape that inherits both the adversarial (sharpness-aware) structure and a local-averaging effect, mitigating spurious minima and the instability induced by high-curvature, sharp regions.
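
Read literally, these definitions translate into a few lines of code. The sketch below is an illustrative NumPy rendering, not the reference implementation: it assumes the Gaussian choice enters through $\exp(-k(x, y))$ (so $k$ itself is a quadratic exponent with an assumed bandwidth $\sigma$), and `grad_f` stands for any gradient oracle of the task loss.

```python
import numpy as np

def T(x, grad_f, rho=0.05, gamma=1e-12):
    """Adversarial perturbation map T_{rho,gamma}(x) = x + rho * grad / (||grad|| + gamma)."""
    g = grad_f(x)
    return x + rho * g / (np.linalg.norm(g) + gamma)

def k(x, y, sigma=1.0):
    """Quadratic exponent ||x - y||^2 / (2 sigma^2), so that exp(-k) is a Gaussian bump.
    (Interpretive assumption: the Gaussian kernel enters the objective via exp(-k).)"""
    return np.sum((x - y) ** 2) / (2.0 * sigma ** 2)

def smoothed_potential(x, y, f, grad_f, rho=0.05, sigma=1.0):
    """Negative log of the integrand in the LSAM convolution: f(T(x)) + k(x, y)."""
    return f(T(x, grad_f, rho)) + k(x, y, sigma)
```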

3. Asynchronous Distributed Sampling and Optimization

LSAM is designed for asynchronous distributed environments. Its computational mechanism consists of two principal loops:

a. Inner Sampling Loop

To evaluate gradients with respect to the kernel-smoothed distribution, LSAM employs Langevin dynamics or stochastic gradient Langevin dynamics (SGLD) to sample from the conditional

$$q(x \mid y) \propto \exp\big( -f(T_{\rho,\gamma}(x)) - k(x, y) \big).$$

Each worker, operating asynchronously, samples $x$ conditioned on the current "center" parameter $y$, thereby avoiding the need for synchronous communication after every minibatch.

Given sampled $x \sim q(\cdot \mid y)$, the score for updating $y$ is approximated as

$$\nabla_y \log \pi_{LSAM}(y) = - \mathbb{E}_{x \sim q(\cdot \mid y)} \big[ \nabla_y k(x, y) \big].$$
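
A single worker's inner loop can then be sketched as a short SGLD chain on the potential $f(T_{\rho,\gamma}(x)) + k(x, y)$, followed by a Monte Carlo estimate of $-\mathbb{E}[\nabla_y k(x, y)]$. The sketch below reuses the hypothetical `T` and quadratic `k` from the previous example, applies the usual first-order SAM approximation to the gradient of $f(T(x))$, and treats the step size, chain length, and burn-in as illustrative choices rather than values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sgld_sample_and_score(y, grad_f, rho=0.05, sigma=1.0, n_steps=50, step=1e-3):
    """Run SGLD targeting q(x|y) ~ exp(-f(T(x)) - k(x, y)), then return a
    Monte Carlo estimate of the score  -E_{x ~ q(.|y)}[grad_y k(x, y)]."""
    x = y.copy()
    samples = []
    for _ in range(n_steps):
        g_adv = grad_f(T(x, grad_f, rho))       # ~ grad_x f(T(x)), first-order approximation
        g_k = (x - y) / sigma ** 2              # grad_x k(x, y) for the quadratic k above
        noise = np.sqrt(2.0 * step) * rng.standard_normal(x.shape)
        x = x - step * (g_adv + g_k) + noise    # Langevin / SGLD step
        samples.append(x.copy())
    xs = np.stack(samples[n_steps // 2:])       # discard the first half as burn-in
    return (xs - y).mean(axis=0) / sigma ** 2   # -grad_y k(x, y) = (x - y) / sigma^2, averaged
```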

b. Outer Optimization Loop

After aggregating scores from all workers, the center parameter $y$ is updated via an accelerated gradient step. One instance of the LSAM update is

$$\begin{aligned} x_{t+1} &= x_t - \eta_t \big( g_t + \lambda (x_t - y_t) \big), \\ y_{t+1} &= \alpha\, x_{t+1} + (1-\alpha)\, y_t, \end{aligned}$$

where $g_t$ is the aggregated score, $\lambda$ and $\alpha$ are hyperparameters, and the gradient can be momentum-augmented via

$$g_t \leftarrow g'_t + \beta\, (g'_t - g'_{t-1}),$$

with $\beta$ the momentum parameter and $g'_t$ the newly computed score from the current iteration.
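
Written out in code, one outer step that follows the displayed update rule might look as below; the values of $\eta$, $\lambda$, $\alpha$, and $\beta$ are placeholders, not recommended settings.

```python
def outer_update(x, y, g_new, g_prev, eta=0.1, lam=0.1, alpha=0.5, beta=0.9):
    """One outer LSAM step following the update rule above:
       g_t     = g'_t + beta * (g'_t - g'_{t-1})          (momentum-augmented score)
       x_{t+1} = x_t - eta * (g_t + lam * (x_t - y_t))
       y_{t+1} = alpha * x_{t+1} + (1 - alpha) * y_t
    """
    g = g_new + beta * (g_new - g_prev)
    x_next = x - eta * (g + lam * (x - y))
    y_next = alpha * x_next + (1.0 - alpha) * y
    return x_next, y_next
```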

Synchronization among workers is performed only every $\tau$ iterations, significantly reducing communication frequency and overhead compared to standard data-parallel approaches.

4. Algorithmic Implementation and Workflow

The practical workflow of LSAM is summarized as follows:

| Phase | Operation |
|---|---|
| Inner Sampling | Each worker generates samples by running Langevin (or SGLD) dynamics for $q(x \mid y)$, conditioned on the latest center parameter $y$. |
| Score Estimation | Workers independently estimate $-\nabla_y k(x, y)$ over their local samples and report it to the central aggregator. |
| Score Aggregation | The aggregator collects the scores, applies moving-average or momentum updates as needed, and produces the global update vector. |
| Parameter Update | The center parameter $y$ (and optionally the local replicas $x$) is updated asynchronously using the aggregated score and, possibly, acceleration. |
| Synchronization | Communication occurs only every $\tau$ iterations, after which all workers refresh their center parameter to the latest global $y$. |

This workflow ensures that the exploration/exploitation of the loss landscape is robust to runtime and network heterogeneity, and that parameter updates fully reflect both adversarial perturbation and landscape smoothing.
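
To tie the phases together, the following single-process sketch simulates the workflow, iterating over workers sequentially as a stand-in for true asynchrony and reusing the hypothetical helpers from the earlier sketches (`sgld_sample_and_score`, `outer_update`); the worker count, round count, and $\tau$ are arbitrary illustrative values.

```python
import numpy as np

def lsam_train(grad_f, dim=4, n_workers=4, n_rounds=20, tau=5,
               eta=0.1, lam=0.1, alpha=0.5, beta=0.9, rho=0.05, sigma=1.0):
    """Schematic LSAM loop: workers estimate scores from their SGLD samples,
    the aggregator averages them, the center pair (x, y) is updated with the
    accelerated rule, and y is broadcast to the workers every tau iterations."""
    y_global = np.ones(dim)
    x = y_global.copy()
    worker_y = [y_global.copy() for _ in range(n_workers)]
    g_prev = np.zeros(dim)
    for t in range(n_rounds):
        # Inner sampling + score estimation (sequential stand-in for async workers).
        scores = [sgld_sample_and_score(worker_y[w], grad_f, rho, sigma)
                  for w in range(n_workers)]
        g_new = np.mean(scores, axis=0)              # score aggregation
        x, y_global = outer_update(x, y_global, g_new, g_prev,
                                   eta, lam, alpha, beta)
        g_prev = g_new
        if (t + 1) % tau == 0:                       # synchronization every tau iterations
            worker_y = [y_global.copy() for _ in range(n_workers)]
    return y_global
```

Calling `lsam_train(lambda x: x + np.cos(x))` runs the sketch end to end on the toy loss used in the first example.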

5. Performance, Scalability, and Empirical Results

Empirical evaluations demonstrate that LSAM achieves state-of-the-art performance with improved scalability:

  • On SVHN with CNN-5 and ResNet-18 architectures, LSAM achieved test errors of approximately 4.62% and 3.04% respectively, lower than baselines including data-parallel SAM (DP-SAM), LSGD, and EASGD.
  • On CIFAR-10 and CIFAR-100, LSAM consistently delivered lower final test errors and accelerated convergence compared to other distributed sharpness-aware or kernel-based optimizers.

The asynchronous sampling mechanism eliminates performance degradation under large-batch conditions common in modern distributed deep learning. By decoupling gradient estimation and parameter synchronization, LSAM sidesteps the mini-batch inflation and per-worker data starvation problems that hamper standard SAM.

Theoretical analysis in the cited work establishes that LSAM's iterates converge to stationary points at the same rate as classical SGD, preserving the efficiency guarantees needed for practical large-scale deployment.

6. Generalization and Landscape Properties

LSAM's generalization advantage is twofold. First, the retained adversarial perturbation mechanism of SAM ensures strong sharpness-awareness, pushing solutions toward wide, flat minima correlated with improved population performance. Second, the kernel smoothing (via convolution) prevents convergence to narrow, sharp local minima, thereby biasing optimization toward broader and deeper wells in the loss landscape. This is validated both by empirical sharpness metrics and by loss variance reduction in the neighborhood of converged solutions.

7. Limitations, Extensions, and Future Directions

While LSAM presents significant advances, several aspects are potential avenues for further research:

  • Communication minimization: Reducing the synchronization frequency (i.e., increasing the interval $\tau$) or employing adaptive communication scheduling could yield further improvements in efficiency, particularly in heterogeneous or bandwidth-constrained settings.
  • Heterogeneous data and federated learning: Adapting LSAM to non-i.i.d. data or to federated environments introduces new challenges related to loss geometry diversity and local-global parameter divergence.
  • Kernel adaptation: Exploring alternative kernel choices beyond Gaussian and adaptive kernel bandwidths may yield improved smoothing tailored to the local structure of the loss landscape.
  • Adaptive adversarial radius: Integration with adaptive perturbation size similar to recent "parameter-agnostic" approaches could further enhance robustness and reduce hyperparameter tuning.

In summary, Landscape-Smoothed SAM (LSAM) synthesizes sharpness-aware adversarial optimization with landscape smoothing and asynchronous distributed sampling, yielding an optimizer that is simultaneously robust, scalable, efficient, and capable of delivering strong generalization in large-scale deep learning (Teng et al., 3 Sep 2025).

