
Landscape-Smoothed SAM (LSAM)

Updated 5 September 2025
  • Landscape-Smoothed SAM is an optimization framework that combines adversarial perturbation, kernel smoothing, and asynchronous sampling to enhance scalability and robustness in distributed deep learning.
  • It mitigates traditional sharpness-aware minimization limitations by smoothing the loss landscape and decoupling adversarial computation from synchronization, enabling efficient large-batch training.
  • Empirical results on benchmarks like SVHN and CIFAR demonstrate that LSAM achieves lower test errors and faster convergence compared to classical methods.

Landscape-Smoothed SAM (LSAM) is an optimization framework within sharpness-aware minimization that integrates adversarial perturbation, kernel-based smoothing, and asynchronous distributed sampling to achieve efficient and robust generalization for deep learning, particularly in large-batch distributed settings. LSAM addresses the scalability limits of traditional SAM by smoothing the adversarial loss landscape and enabling efficient training without synchronization bottlenecks in distributed systems.

1. Motivation, Overview, and Distinction from Classical SAM

Sharpness-Aware Minimization (SAM) augments standard empirical risk minimization by searching for model parameters that minimize the worst-case loss within a local neighborhood, effectively steering the optimization toward flat minima with empirically verified generalization benefits. However, the SAM adversarial step, which finds $\epsilon^*(x)$ maximizing $f(x + \epsilon^*(x))$ subject to $\|\epsilon\| \leq \rho$, relies on synchronous, centralized computation and is inefficient in large-batch, distributed scenarios. In conventional data-parallel frameworks, increasing the number of workers either decreases the per-worker batch size or increases the effective batch size, both of which degrade SAM's convergence and generalization properties.
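
For concreteness, the classical SAM step described above can be sketched in a few lines. The following is a minimal NumPy illustration with a toy loss and hand-picked values of $\rho$ and the learning rate $\eta$; it is a schematic of the two-step SAM update, not code from the LSAM paper.

```python
import numpy as np

# Toy differentiable loss and its gradient; stand-ins for a minibatch loss.
def f(x):
    return 0.5 * np.sum(x ** 2) + np.sum(np.sin(x))

def grad_f(x):
    return x + np.cos(x)

def sam_step(x, rho=0.05, eta=0.1):
    """One classical SAM update: climb to the (first-order) worst-case point
    in a rho-ball, then descend with the gradient taken at that point."""
    g = grad_f(x)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # approximate epsilon*(x)
    return x - eta * grad_f(x + eps)             # descend using the perturbed gradient

x = np.ones(4)
for _ in range(200):
    x = sam_step(x)
```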

Landscape-Smoothed SAM addresses these deficiencies by introducing a smoothed sharpness-aware objective, integrating kernel convolution and asynchronous sampling, and decoupling adversarial computation from synchronization constraints. LSAM can be interpreted as a convolutional Boltzmann-like objective that both preserves the adversarial structure of SAM and enables scalable distributed learning.

2. Mathematical Formulation and Kernel Smoothing

The core formulation of LSAM reinterprets SAM's maximization via a probabilistic view. SAM's update can be framed as

$$\pi_{SAM}(x) \propto \exp\left( - \mathbb{E}_{\xi \sim \mathcal{D}} \left[ f(x + \epsilon^*(x); \xi) \right] \right),$$

where $\epsilon^*(x)$ is the solution to the inner maximization problem over a $\rho$-radius ball in parameter space.

LSAM generalizes this by introducing a landscape-smoothing convolution. Concretely, the optimized distribution is

$$\pi_{LSAM}(y) \propto \int \exp\big(-f(T_{\rho,\gamma}(x))\big)\, \exp\big(-k(x, y)\big)\, dx,$$

where

  • $T_{\rho,\gamma}(x) = x + \frac{\rho\, \nabla f(x)}{\|\nabla f(x)\| + \gamma}$ is the adversarially perturbed parameter (with $\gamma > 0$ for numerical stability),
  • $k(x, y)$ is a symmetric positive-definite kernel, typically Gaussian, and
  • $f(\cdot)$ is the task loss.

This convolution produces a smoothed surrogate landscape that inherits both the adversarial (sharpness-aware) structure and a local-averaging effect, mitigating spurious minima and the instability induced by high-curvature, sharp regions.
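
Read literally, these definitions translate into a few lines of code. The sketch below is an illustrative NumPy rendering, not the reference implementation: it assumes the Gaussian choice enters through $\exp(-k(x, y))$ (so $k$ itself is a quadratic exponent with an assumed bandwidth $\sigma$), and `grad_f` stands for any gradient oracle of the task loss.

```python
import numpy as np

def T(x, grad_f, rho=0.05, gamma=1e-12):
    """Adversarial perturbation map T_{rho,gamma}(x) = x + rho * grad / (||grad|| + gamma)."""
    g = grad_f(x)
    return x + rho * g / (np.linalg.norm(g) + gamma)

def k(x, y, sigma=1.0):
    """Quadratic exponent ||x - y||^2 / (2 sigma^2), so that exp(-k) is a Gaussian bump.
    (Interpretive assumption: the Gaussian kernel enters the objective via exp(-k).)"""
    return np.sum((x - y) ** 2) / (2.0 * sigma ** 2)

def smoothed_potential(x, y, f, grad_f, rho=0.05, sigma=1.0):
    """Negative log of the integrand in the LSAM convolution: f(T(x)) + k(x, y)."""
    return f(T(x, grad_f, rho)) + k(x, y, sigma)
```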

3. Asynchronous Distributed Sampling and Optimization

LSAM is designed for asynchronous distributed environments. Its computational mechanism consists of two principal loops:

a. Inner Sampling Loop

To evaluate gradients with respect to the kernel-smoothed distribution, LSAM employs Langevin dynamics or stochastic gradient Langevin dynamics (SGLD) to sample from the conditional

$$q(x \mid y) \propto \exp\big( -f(T_{\rho,\gamma}(x)) - k(x, y) \big).$$

Each worker, operating asynchronously, samples $x$ conditioned on the current "center" parameter $y$, thereby avoiding the need for synchronous communication after every minibatch.

Given sampled $x \sim q(\cdot \mid y)$, the score for updating $y$ is approximated as

$$\nabla_y \log \pi_{LSAM}(y) = - \mathbb{E}_{x \sim q(\cdot \mid y)} \big[ \nabla_y k(x, y) \big].$$
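
A single worker's inner loop can then be sketched as a short SGLD chain on the potential $f(T_{\rho,\gamma}(x)) + k(x, y)$, followed by a Monte Carlo estimate of $-\mathbb{E}[\nabla_y k(x, y)]$. The sketch below reuses the hypothetical `T` and quadratic `k` from the previous example, applies the usual first-order SAM approximation to the gradient of $f(T(x))$, and treats the step size, chain length, and burn-in as illustrative choices rather than values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sgld_sample_and_score(y, grad_f, rho=0.05, sigma=1.0, n_steps=50, step=1e-3):
    """Run SGLD targeting q(x|y) ~ exp(-f(T(x)) - k(x, y)), then return a
    Monte Carlo estimate of the score  -E_{x ~ q(.|y)}[grad_y k(x, y)]."""
    x = y.copy()
    samples = []
    for _ in range(n_steps):
        g_adv = grad_f(T(x, grad_f, rho))       # ~ grad_x f(T(x)), first-order approximation
        g_k = (x - y) / sigma ** 2              # grad_x k(x, y) for the quadratic k above
        noise = np.sqrt(2.0 * step) * rng.standard_normal(x.shape)
        x = x - step * (g_adv + g_k) + noise    # Langevin / SGLD step
        samples.append(x.copy())
    xs = np.stack(samples[n_steps // 2:])       # discard the first half as burn-in
    return (xs - y).mean(axis=0) / sigma ** 2   # -grad_y k(x, y) = (x - y) / sigma^2, averaged
```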

b. Outer Optimization Loop

After aggregating scores from all workers, the center parameter $y$ is updated via an accelerated gradient step. One instance of the LSAM update is

$$\begin{aligned} x_{t+1} &= x_t - \eta_t \big( g_t + \lambda (x_t - y_t) \big), \\ y_{t+1} &= \alpha\, x_{t+1} + (1-\alpha)\, y_t, \end{aligned}$$

where $g_t$ is the aggregated score, $\lambda$ and $\alpha$ are hyperparameters, and the gradient can be momentum-augmented via

$$g_t \leftarrow g'_t + \beta\, (g'_t - g'_{t-1}),$$

with $\beta$ the momentum parameter and $g'_t$ the newly computed score from the current iteration.
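
Written out in code, one outer step that follows the displayed update rule might look as below; the values of $\eta$, $\lambda$, $\alpha$, and $\beta$ are placeholders, not recommended settings.

```python
def outer_update(x, y, g_new, g_prev, eta=0.1, lam=0.1, alpha=0.5, beta=0.9):
    """One outer LSAM step following the update rule above:
       g_t     = g'_t + beta * (g'_t - g'_{t-1})          (momentum-augmented score)
       x_{t+1} = x_t - eta * (g_t + lam * (x_t - y_t))
       y_{t+1} = alpha * x_{t+1} + (1 - alpha) * y_t
    """
    g = g_new + beta * (g_new - g_prev)
    x_next = x - eta * (g + lam * (x - y))
    y_next = alpha * x_next + (1.0 - alpha) * y
    return x_next, y_next
```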

Synchronization among workers is performed only every $\tau$ iterations, significantly reducing communication frequency and overhead compared to standard data-parallel approaches.

4. Algorithmic Implementation and Workflow

The practical workflow of LSAM is summarized as follows:

| Phase | Operation |
|---|---|
| Inner Sampling | Each worker generates samples by running Langevin (or SGLD) dynamics for $q(x \mid y)$, conditioned on the latest center parameter $y$. |
| Score Estimation | Workers independently estimate $-\nabla_y k(x, y)$ over their local samples and report it to the central aggregator. |
| Score Aggregation | The aggregator collects the scores, applies moving-average or momentum updates as needed, and produces the global update vector. |
| Parameter Update | The center parameter $y$ (and optionally the local replicas $x$) is updated asynchronously using the aggregated score and, possibly, acceleration. |
| Synchronization | Communication occurs only every $\tau$ iterations, after which all workers refresh their center parameter to the latest global $y$. |

This workflow ensures that the exploration/exploitation of the loss landscape is robust to runtime and network heterogeneity, and that parameter updates fully reflect both adversarial perturbation and landscape smoothing.
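
To tie the phases together, the following single-process sketch simulates the workflow, iterating over workers sequentially as a stand-in for true asynchrony and reusing the hypothetical helpers from the earlier sketches (`sgld_sample_and_score`, `outer_update`); the worker count, round count, and $\tau$ are arbitrary illustrative values.

```python
import numpy as np

def lsam_train(grad_f, dim=4, n_workers=4, n_rounds=20, tau=5,
               eta=0.1, lam=0.1, alpha=0.5, beta=0.9, rho=0.05, sigma=1.0):
    """Schematic LSAM loop: workers estimate scores from their SGLD samples,
    the aggregator averages them, the center pair (x, y) is updated with the
    accelerated rule, and y is broadcast to the workers every tau iterations."""
    y_global = np.ones(dim)
    x = y_global.copy()
    worker_y = [y_global.copy() for _ in range(n_workers)]
    g_prev = np.zeros(dim)
    for t in range(n_rounds):
        # Inner sampling + score estimation (sequential stand-in for async workers).
        scores = [sgld_sample_and_score(worker_y[w], grad_f, rho, sigma)
                  for w in range(n_workers)]
        g_new = np.mean(scores, axis=0)              # score aggregation
        x, y_global = outer_update(x, y_global, g_new, g_prev,
                                   eta, lam, alpha, beta)
        g_prev = g_new
        if (t + 1) % tau == 0:                       # synchronization every tau iterations
            worker_y = [y_global.copy() for _ in range(n_workers)]
    return y_global
```

Calling `lsam_train(lambda x: x + np.cos(x))` runs the sketch end to end on the toy loss used in the first example.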

5. Performance, Scalability, and Empirical Results

Empirical evaluations demonstrate that LSAM achieves state-of-the-art performance with improved scalability:

  • On SVHN with CNN-5 and ResNet-18 architectures, LSAM achieved test errors of approximately 4.62% and 3.04% respectively, lower than baselines including data-parallel SAM (DP-SAM), LSGD, and EASGD.
  • On CIFAR-10 and CIFAR-100, LSAM consistently delivered lower final test errors and accelerated convergence compared to other distributed sharpness-aware or kernel-based optimizers.

The asynchronous sampling mechanism eliminates performance degradation under large-batch conditions common in modern distributed deep learning. By decoupling gradient estimation and parameter synchronization, LSAM sidesteps the mini-batch inflation and per-worker data starvation problems that hamper standard SAM.

Theoretical analysis in the cited work establishes that LSAM's iterates converge to stationary points at the same rate as classical SGD, preserving the efficiency guarantees needed for practical large-scale deployment.

6. Generalization and Landscape Properties

LSAM's generalization advantage is twofold. First, the retained adversarial perturbation mechanism of SAM ensures strong sharpness-awareness, pushing solutions toward wide, flat minima correlated with improved population performance. Second, the kernel smoothing (via convolution) prevents convergence to narrow, sharp local minima, thereby biasing optimization toward broader and deeper wells in the loss landscape. This is validated both by empirical sharpness metrics and by loss variance reduction in the neighborhood of converged solutions.

7. Limitations, Extensions, and Future Directions

While LSAM presents significant advances, several aspects are potential avenues for further research:

  • Communication minimization: Reducing the synchronization frequency (i.e., increasing the interval $\tau$) or employing adaptive communication scheduling could yield further improvements in efficiency, particularly in heterogeneous or bandwidth-constrained settings.
  • Heterogeneous data and federated learning: Adapting LSAM to non-i.i.d. data or to federated environments introduces new challenges related to loss geometry diversity and local-global parameter divergence.
  • Kernel adaptation: Exploring alternative kernel choices beyond Gaussian and adaptive kernel bandwidths may yield improved smoothing tailored to the local structure of the loss landscape.
  • Adaptive adversarial radius: Integration with adaptive perturbation size similar to recent "parameter-agnostic" approaches could further enhance robustness and reduce hyperparameter tuning.

In summary, Landscape-Smoothed SAM (LSAM) synthesizes sharpness-aware adversarial optimization with landscape smoothing and asynchronous distributed sampling, yielding an optimizer that is simultaneously robust, scalable, efficient, and capable of delivering strong generalization in large-scale deep learning (Teng et al., 3 Sep 2025).

