AutoClip: Adaptive Clipping & Vision-Language Inference

Updated 9 December 2025
  • AutoClip refers to two adaptive algorithms that calibrate gradient clipping thresholds and vision-language prompt weights using data-driven statistics.
  • The gradient-clipping variant replaces a static norm threshold with a percentile-based bound, yielding smoother optimization, improved SI-SDR, and minimal manual tuning.
  • AutoCLIP for vision-language inference dynamically derives per-image prompt weights, consistently enhancing classification accuracy with negligible overhead.

AutoClip refers to two distinct adaptive algorithms developed for neural network training and inference: one for adaptive gradient clipping (AutoClip for source separation networks) and another for auto-tuning zero-shot classifiers in vision-language models (AutoCLIP for vision-language prompt ensembling). Both approaches autonomously adapt critical hyperparameters (clipping thresholds and prompt-ensemble weights) based on data-driven statistics, offering improved stability or accuracy with minimal manual tuning. The following entry details both algorithms, each rooted in a different subdomain, unified by the principle of automatic, empirical adaptation.

1. Adaptive Gradient Clipping via Percentile Norm Estimation

AutoClip, as introduced for gradient clipping in source separation networks, is a data-driven scheme that replaces the manually selected global norm threshold in clip-by-norm procedures with an adaptively chosen percentile-based bound. At each optimization step $t$, the clipping threshold $\eta_c(t)$ is set to the $p$-th percentile $n_p(t)$ of the historical gradient norm statistics $G_h(t)$ accumulated up to that point, where $p$ is a user-chosen percentile parameter. Specifically, for loss $f(X_t; \theta_{t-1})$ and current gradient $\nabla_\theta f(X_t; \theta_{t-1})$, the update is

$$\theta_t = \theta_{t-1} - \lambda h_c \nabla_\theta f(X_t; \theta_{t-1}), \quad h_c = \min\left\{\frac{\eta_c(t)}{\|\nabla_\theta f(X_t;\theta_{t-1})\|},\, 1\right\}$$

with $\eta_c(t) = n_p(t)$ based on $G_h(t)$, the empirical history of all prior gradient norms. Only gradients with norm in the top $(100-p)\%$ are clipped. This mechanism obviates the need to tune the absolute clipping threshold, instead requiring only a single percentile parameter $p$ (Seetharaman et al., 2020).
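
As a concrete instance of the rule (with hypothetical numbers): if $p=10$ and the current percentile estimate is $n_p(t) = 0.5$, a gradient with norm $2.0$ gives

$$h_c = \min\left\{\frac{0.5}{2.0},\, 1\right\} = 0.25,$$

so the gradient is rescaled to norm $0.5$ before the update, whereas a gradient with norm $0.3 < n_p(t)$ passes through unchanged ($h_c = 1$).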

2. Implementation of AutoClip in Neural Network Training

AutoClip’s algorithmic simplicity allows insertion into existing training routines. At each iteration, the gradient norm is appended to $G_h(t)$, the running percentile $n_p(t)$ is computed, and gradients are rescaled only if their norm exceeds this adaptive threshold. The following pseudocode summarizes the core logic:

import numpy as np

G_history = []          # running history of gradient norms
p = 10                  # percentile cutoff: only norms above the p-th percentile are clipped
for t in range(1, T + 1):
    X_t = next_minibatch()                   # placeholder: draw a minibatch
    loss = compute_loss(X_t, theta)          # placeholder: forward pass
    grads = backprop(loss, theta)            # placeholder: gradient of the loss w.r.t. theta
    grad_norm = norm(grads)                  # global L2 norm of the gradient
    G_history.append(grad_norm)
    eta_c = np.percentile(G_history, p)      # adaptive threshold: p-th percentile of the history
    if grad_norm > eta_c:
        grads = grads * (eta_c / grad_norm)  # rescale so the clipped norm equals eta_c
    theta = optimizer_step(theta, grads)     # placeholder: SGD/Adam update

This method decouples hyperparameter sensitivity from problem-specific scale, integrates with optimizers such as SGD or Adam, and generalizes across domains.
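
For concreteness, below is a minimal PyTorch-style sketch of such an integration. It is not the authors' reference implementation; the class name AutoClipper is illustrative, and model, loader, loss_fn, and optimizer in the usage comment are placeholders for a standard training loop.

import numpy as np
import torch

class AutoClipper:
    """Clip gradients to the p-th percentile of all gradient norms observed so far."""

    def __init__(self, percentile=10.0):
        self.percentile = percentile
        self.grad_history = []

    def __call__(self, parameters):
        params = [p for p in parameters if p.grad is not None]
        # Global L2 norm over all parameter gradients
        total_norm = torch.norm(torch.stack([p.grad.detach().norm(2) for p in params]), 2).item()
        self.grad_history.append(total_norm)
        # Adaptive threshold: p-th percentile of the norm history, then standard clip-by-norm
        clip_value = np.percentile(self.grad_history, self.percentile)
        torch.nn.utils.clip_grad_norm_(params, clip_value)
        return clip_value

# Sketch of usage inside a standard loop (model, loader, loss_fn, optimizer are placeholders):
# clipper = AutoClipper(percentile=10)
# for X, y in loader:
#     optimizer.zero_grad()
#     loss_fn(model(X), y).backward()
#     clipper(model.parameters())   # adaptive clipping just before the step
#     optimizer.step()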

3. Empirical Evaluation in Source Separation Networks

AutoClip was evaluated on the WSJ0-2mix speech separation dataset using a 4-layer bidirectional LSTM architecture with multiple objective functions: deep clustering ($\mathcal{L}_{\mathrm{DC}}$), whitened k-means ($\mathcal{L}_{\mathrm{WKM}}$), mask-inference phase-sensitive loss ($\mathcal{L}_{\mathrm{MI}}$), multi-task Chimera ($\mathcal{L}_{\mathrm{MI+WKM}}$), and time-domain SNR ($\mathcal{L}_{\mathrm{SNR}}$). Models were trained with Adam (learning rate $10^{-3}$), batch size 25, sequence length 400, for 100 epochs.

The effect of percentile $p$ on SI-SDR test performance (dB) is summarized as follows:

Loss | $p=0$ | $p=1$ | $p=10$ | $p=25$ | $p=50$ | $p=90$ | $p=100$
$\mathcal{L}_{\mathrm{DC}}$ | 10.7 | 10.7 | 10.8 | 10.7 | 10.7 | 10.5 | 10.2
$\mathcal{L}_{\mathrm{WKM}}$ | 11.1 | 11.2 | 11.0 | 11.0 | 11.0 | 11.0 | 10.8
$\mathcal{L}_{\mathrm{MI}}$ | 10.0 | 10.3 | 10.2 | 9.9 | 9.2 | 8.7 | 8.5
$\mathcal{L}_{\mathrm{MI+WKM}}$ | 11.2 | 11.3 | 11.3 | 11.3 | 11.2 | 11.1 | 10.9
$\mathcal{L}_{\mathrm{SNR}}$ | 9.9 | 10.2 | 10.4 | 10.3 | 9.9 | 9.5 | 8.3

Performance substantially deteriorates without clipping ($p=100$), particularly for $\mathcal{L}_{\mathrm{MI}}$ and $\mathcal{L}_{\mathrm{SNR}}$ (up to $\approx 2$ dB loss). Percentile $p=10$ is near-optimal across objectives and robust to extreme settings, outperforming prior static-threshold baselines (Seetharaman et al., 2020).

4. Dynamics and Loss Landscape Analysis

AutoClip’s effect on optimization dynamics was probed by tracking the step size $\|\theta_t - \theta_{t-1}\|$, the empirical Lipschitz constant of the gradient, and the gradient norm. With AutoClip ($p=10$), the step-size trajectory is smoother and exhibits built-in warmup and decay behavior. The Pearson correlation $r=0.86$ (versus $r=0.62$ without clipping) between gradient norm and local smoothness demonstrates that AutoClip confines the optimizer to flatter regions of the loss landscape. Restricting updates with large gradients mitigates erratic jumps and enhances generalization (final SI-SDR improved from 8.1 dB to 9.2 dB under $\mathcal{L}_{\mathrm{MI}}$) (Seetharaman et al., 2020).
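
These diagnostics can be reproduced with simple per-step bookkeeping. A minimal sketch follows; the finite-difference ratio used for the smoothness estimate is a common choice and an assumption here, not necessarily the paper's exact protocol.

import numpy as np

def local_smoothness(grad_t, grad_prev, theta_t, theta_prev, eps=1e-12):
    # Finite-difference estimate of the local Lipschitz constant of the gradient:
    # ||grad_t - grad_prev|| / ||theta_t - theta_prev||
    return np.linalg.norm(grad_t - grad_prev) / (np.linalg.norm(theta_t - theta_prev) + eps)

def norm_smoothness_correlation(grad_norms, smoothness_values):
    # Pearson r between the gradient-norm trajectory and the local-smoothness trajectory
    return np.corrcoef(grad_norms, smoothness_values)[0, 1]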

5. General Applicability, Simplicity, and Broader Relevance

The percentile-based thresholding in AutoClip is not tied to a specific optimizer or loss function. It is optimizer-agnostic, scale-invariant, and requires only a single percentile parameter—no manual tuning of absolute clipping thresholds on a per-network or per-task basis. Applicability extends beyond audio source separation to language modeling (where exploding gradients may arise), image classifiers (to avoid sharp minima), and RL or any stochastic optimization scenario (Seetharaman et al., 2020). The method is “set-and-forget,” implemented with a running list or histogram of gradient norms and a percentile computation.
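
A minimal sketch of such a set-and-forget helper is shown below; the optional bounded window is an assumption to cap memory, whereas the method as described keeps the full history.

from collections import deque
import numpy as np

class BoundedAutoClipThreshold:
    # Percentile threshold over a (optionally bounded) window of recent gradient norms.
    # window=None reproduces the full-history behavior described above.
    def __init__(self, percentile=10.0, window=None):
        self.percentile = percentile
        self.norms = deque(maxlen=window)

    def update(self, grad_norm):
        self.norms.append(grad_norm)
        return float(np.percentile(self.norms, self.percentile))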

6. AutoCLIP for Vision-Language Model Inference

In the domain of zero-shot vision-language classification, AutoCLIP introduces automatic tuning of ensemble prompt weights per image at inference time. Given a set of $K$ prompt templates per class, the baseline CLIP strategy uniformly averages class-descriptor similarities for classification. AutoCLIP instead derives per-image weights $w \in \Delta^{K-1}$ over the $K$ prompts using statistics of the descriptor-image cosine similarities $s_i^j = \cos(f(x), g(t_j(c_i)))$, where $f$ is the image encoder, $g$ the text encoder, and $t_j(c_i)$ the $j$-th prompt template instantiated with class $c_i$.

For each image, aggregated match qualities $a_j$ over all classes are computed by a smooth-max (logsumexp with temperature $\tau_{\text{text}}$). The weights $w_j$ are then produced via a softmax over $a_j / \tau_w$, balancing prompt informativeness. The final class score is $S_i = \sum_{j=1}^K w_j s_i^j$, and classification proceeds by $\arg\max_i S_i$.
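
A minimal NumPy sketch of this weighting rule follows. The function name, the exact placement of the temperature inside the smooth-max, and the example values of tau_text and tau_w are illustrative assumptions, not the paper's reported settings.

import numpy as np

def autoclip_prompt_weights(sim, tau_text=0.01, tau_w=0.85):
    # sim: (num_classes, num_prompts) matrix of cosine similarities s_i^j for one image
    # Smooth-max over classes for each prompt j, via a numerically stable logsumexp
    t = sim / tau_text
    m = t.max(axis=0)
    a = tau_text * (m + np.log(np.exp(t - m).sum(axis=0)))   # match quality a_j per prompt
    # Softmax over a_j / tau_w gives per-image prompt weights w on the simplex
    z = a / tau_w
    w = np.exp(z - z.max())
    w /= w.sum()
    # Weighted class scores S_i = sum_j w_j * s_i^j; predict argmax_i S_i
    scores = sim @ w
    return w, int(scores.argmax())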

This approach yields consistent accuracy improvements across CLIP-style backbone models, datasets, and prompt ensemble strategies, with gains of 0.5–3 percentage points typical for sufficiently large $K$ and negligible computational overhead. AutoCLIP is suitable whenever prompt-ensemble effects are nontrivial and can be implemented as a short wrapper around the standard inference pipeline (Metzen et al., 2023).

References

Seetharaman, P., Wichern, G., Pardo, B., and Le Roux, J. (2020). AutoClip: Adaptive Gradient Clipping for Source Separation Networks. IEEE International Workshop on Machine Learning for Signal Processing (MLSP).

Metzen, J. H., Saranrittichai, P., and Mummadi, C. K. (2023). AutoCLIP: Auto-Tuning Zero-Shot Classifiers for Vision-Language Models.
