
STL Robustness-Guided Bayesian Optimization

Updated 25 December 2025
  • The paper introduces a framework that embeds STL constraints into Bayesian optimization to preserve linguistic properties during neural model compression.
  • It employs layer-wise quantization and unstructured pruning guided by quantitative robustness measures and Gaussian Process surrogates.
  • Empirical results demonstrate significant compression gains for LLMs with formal guarantees, achieving up to 68.8% compression without retraining.

STL Robustness-Guided Bayesian Optimization is a formal framework for neural model compression that integrates Signal Temporal Logic (STL) specifications into the search for efficient quantization and pruning policies. In the context of LLM compression, the methodology uses quantitative STL robustness measures as constraints, steering Bayesian optimization (BO) toward the most aggressive compression configurations that still preserve the specified properties. The technique is exemplified in the TOGGLE framework, which systematically produces compressed LLMs formally guaranteed to satisfy user-defined linguistic properties, all without retraining or fine-tuning (Khalil et al., 18 Dec 2025).

1. Formal Encoding of Linguistic Properties Using Signal Temporal Logic

At the core of STL Robustness-Guided Bayesian Optimization is Signal Temporal Logic, which provides a rigorous grammar for specifying temporal properties over model outputs. An inference signal, denoted $\mathbf{O}_{d,M}(t)\in\mathbb{R}^m$ for each input prompt $d$ and model $M$, is monitored across discrete time steps. STL formulas are constructed using the syntax

$$\varphi ::= \mu \mid \neg\varphi \mid \varphi_1\wedge\varphi_2 \mid \mathbf{G}_{[a,b]}\varphi \mid \mathbf{F}_{[a,b]}\varphi,$$

where $\mu$ is an atomic predicate (e.g., $f(\mathbf{O}(t))\geq 0$), $\mathbf{G}_{[a,b]}$ denotes "always" over $[a,b]$, and $\mathbf{F}_{[a,b]}$ denotes "eventually" over the same interval.
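As an illustrative instance (an assumption made here for concreteness, not a formula quoted from the paper), a sequential-coherence requirement bounding the Jensen–Shannon divergence between the compressed and baseline next-token distributions by the threshold $\epsilon$ used later could be written

$$\varphi_{\mathrm{coh}} = \mathbf{G}_{[1,T']}\bigl(\epsilon - \mathrm{JSD}\bigl(\mathbf{O}_{d,M_{\mathrm{comp}}}(t),\, \mathbf{O}_{d,M}(t)\bigr) \geq 0\bigr),$$

which holds whenever the divergence stays below $\epsilon$ at every monitored step.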

The robustness of a trace $\mathbf{O}$ with respect to $\varphi$ at time $t$, $\rho(\varphi, \mathbf{O}, t)$, quantifies the margin by which the property holds. The semantics are recursively defined:

  • $\rho(\mu, t) = f(\mathbf{O}(t))$
  • $\rho(\neg\varphi, t) = -\rho(\varphi, t)$
  • $\rho(\varphi_1 \wedge \varphi_2, t) = \min(\rho(\varphi_1, t), \rho(\varphi_2, t))$
  • $\rho(\mathbf{G}_{[a,b]} \varphi, t) = \min_{t'\in[t+a,\, t+b]} \rho(\varphi, t')$
  • $\rho(\mathbf{F}_{[a,b]} \varphi, t) = \max_{t'\in[t+a,\, t+b]} \rho(\varphi, t')$

This formalism enables specification of high-level properties such as sequential coherence, long-range dependency, contextual consistency, and factual accuracy on the output trajectories of LLMs.
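A minimal Python sketch of these recursive semantics over a discrete trace is given below; it is illustrative only, and the nested-tuple formula representation is an assumption rather than the paper's implementation.

```python
import numpy as np

# Formulas are nested tuples (assumed representation, for illustration):
#   ("mu", f)            atomic predicate with f: signal value -> float
#   ("not", phi)         negation
#   ("and", phi1, phi2)  conjunction
#   ("G", a, b, phi)     "always" over [t+a, t+b]
#   ("F", a, b, phi)     "eventually" over [t+a, t+b]

def robustness(phi, trace, t):
    """Quantitative STL robustness rho(phi, O, t) over a discrete trace O[0..T]."""
    op = phi[0]
    if op == "mu":
        _, f = phi
        return f(trace[t])
    if op == "not":
        return -robustness(phi[1], trace, t)
    if op == "and":
        return min(robustness(phi[1], trace, t), robustness(phi[2], trace, t))
    if op in ("G", "F"):
        _, a, b, sub = phi
        window = range(t + a, min(t + b, len(trace) - 1) + 1)  # clamp to trace end
        values = [robustness(sub, trace, tp) for tp in window]
        return min(values) if op == "G" else max(values)
    raise ValueError(f"unknown operator {op!r}")

# Example: require the first output coordinate to stay above 0.5 at all steps 0..3.
trace = np.array([[0.9], [0.8], [0.7], [0.6]])
phi = ("G", 0, 3, ("mu", lambda o: o[0] - 0.5))
print(robustness(phi, trace, 0))  # 0.1 = margin of the worst step
```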

2. Layer-Wise Compression Search Space

The compression policy search space is defined per layer $l$ and component $c$. Each configuration $K$ assigns values to quantization bit-widths $b_{l,c}\in B\subseteq\{2,\dots,16\}$ and unstructured pruning ratios $p_{l,c}\in P\subseteq\{0.0,0.1,\dots,0.5\}$:

$$K = \{ (l,c) \mapsto (b_{l,c}, p_{l,c}) \mid l\in L,\ c\in C_{\mathrm{components}} \}.$$

For bit-widths $b_{l,c} \geq 3$, Learned Step-size Quantization (LSQ) is used, whereas $b_{l,c} < 3$ invokes StretchedElasticQuant. Unstructured pruning is performed by eliminating the lowest-magnitude weights in the targeted parameters. The configuration space thus scales exponentially in the product $|L|\times|C_{\mathrm{components}}|$.
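A minimal illustration of such a configuration and of magnitude-based unstructured pruning in PyTorch follows; the component names and helper are hypothetical, and the quantizers (LSQ, StretchedElasticQuant) are omitted.

```python
import torch

# Hypothetical per-(layer, component) configuration K: (bit-width, pruning ratio).
K = {
    (0, "attn.q_proj"): (8, 0.1),
    (0, "mlp.up_proj"): (4, 0.3),
    (1, "attn.q_proj"): (3, 0.2),   # b >= 3 -> LSQ; b < 3 -> StretchedElasticQuant
}

def magnitude_prune(weight: torch.Tensor, ratio: float) -> torch.Tensor:
    """Zero out the `ratio` fraction of weights with the smallest magnitude."""
    k = int(ratio * weight.numel())
    if k == 0:
        return weight
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight * (weight.abs() > threshold)

# Example: prune 30% of a random weight matrix.
w = torch.randn(16, 16)
w_pruned = magnitude_prune(w, 0.3)
print((w_pruned == 0).float().mean())  # roughly 0.30
```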

3. Robustness-Guided Bayesian Optimization Procedure

The optimization problem is to minimize computational cost $E(K)$ (estimated in FLOPs) subject to STL robustness constraints across all properties:

$$K^* = \arg\min_{K \in \mathcal{C}} E(K) \quad \text{s.t.}\quad \rho(\varphi_i, \mathbf{O}_{d, M_{\mathrm{comp}(K)}}, 0) \geq p_{\mathrm{th}}(\varphi_i) \quad \forall i, d.$$

Gaussian Process (GP) surrogates model both $E(K)$ and the minimum robustness $p_{\min,i}(K)$ for each property $i$. The acquisition function is

$$\alpha(K) = \mathbb{E}\left[\max(0,\, E_{\min} - E(K))\right] \times \prod_i \Pr\left(\rho_i(K) \ge 0\right),$$

which reflects expected improvement under the constraint that all STL properties are satisfied (robustness thresholds $p_{\mathrm{th}}(\varphi_i) = 0$). The loop proposes a new $K_k$, evaluates $E(K_k)$ and $p_{\min,i}(K_k)$, checks feasibility, and iteratively updates the GP models.
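A compact sketch of this constrained BO loop using scikit-learn GP surrogates is shown below; the flat-vector encoding of $K$, the candidate sampler, and the black-box evaluators `eval_cost` and `eval_min_robustness` are illustrative assumptions, not the TOGGLE implementation.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Assumed black-box evaluators (stand-ins for compressing the model under K
# and monitoring STL robustness over the evaluation prompts).
def eval_cost(x):            # E(K): FLOPs of the compressed model (toy stand-in)
    return float(np.sum(x))
def eval_min_robustness(x):  # min_d rho(phi, O_{d,M_comp(K)}, 0) for one property
    return float(0.5 - np.max(x))

def sample_candidates(dim, n, rng):
    return rng.uniform(0.0, 1.0, size=(n, dim))  # toy encoding of K in [0,1]^dim

def constrained_ei(X, gp_cost, gp_rob, best_feasible_cost):
    mu_c, sd_c = gp_cost.predict(X, return_std=True)
    mu_r, sd_r = gp_rob.predict(X, return_std=True)
    sd_c = np.maximum(sd_c, 1e-9)
    z = (best_feasible_cost - mu_c) / sd_c
    ei = (best_feasible_cost - mu_c) * norm.cdf(z) + sd_c * norm.pdf(z)  # expected cost reduction
    p_feas = norm.cdf(mu_r / np.maximum(sd_r, 1e-9))                     # Pr(rho_i(K) >= 0)
    return ei * p_feas

rng = np.random.default_rng(0)
dim, X, y_cost, y_rob = 4, [], [], []
for x in sample_candidates(dim, 5, rng):          # initial design
    X.append(x); y_cost.append(eval_cost(x)); y_rob.append(eval_min_robustness(x))

gp_cost = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp_rob = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

for _ in range(20):                               # the paper reports 200 iterations per model
    gp_cost.fit(np.array(X), np.array(y_cost))
    gp_rob.fit(np.array(X), np.array(y_rob))
    feasible_costs = [c for c, r in zip(y_cost, y_rob) if r >= 0.0]
    best = min(feasible_costs) if feasible_costs else max(y_cost)
    cand = sample_candidates(dim, 256, rng)
    x_next = cand[np.argmax(constrained_ei(cand, gp_cost, gp_rob, best))]
    X.append(x_next); y_cost.append(eval_cost(x_next)); y_rob.append(eval_min_robustness(x_next))

feasible = [(c, x) for c, r, x in zip(y_cost, y_rob, X) if r >= 0.0]
print(min(feasible, key=lambda t: t[0])[0] if feasible else "no feasible K found")
```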

4. Operating-Mode Selection and Evaluation Metrics

Three representative operating modes are defined by targeting Average Property Preservation (AvgPP) scores of approximately 99% (Strict), 95% (Optimal), and 85% (Relaxed); within each mode, the lowest-cost feasible configuration is selected. AvgPP is computed as

$$\mathrm{AvgPP}(K) = \frac{1}{n}\sum_{i=1}^n \mathrm{PS}_i(K)\times 100\%,$$

where each $\mathrm{PS}_i(K)$ measures relative property preservation versus the uncompressed baseline.
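Mode selection can then be expressed as a small post-processing step over the evaluated configurations; the sketch below assumes per-property scores $\mathrm{PS}_i(K)$ in $[0,1]$ and hypothetical configuration data, and is illustrative only.

```python
def avg_pp(ps_scores):
    """Average Property Preservation in percent, given PS_i(K) values in [0, 1]."""
    return 100.0 * sum(ps_scores) / len(ps_scores)

def select_mode(evaluated, target_pp):
    """Pick the lowest-cost configuration whose AvgPP meets the mode target.

    `evaluated` is a list of (config, cost, ps_scores) triples; feasibility with
    respect to the STL constraints is assumed to have been checked already.
    """
    eligible = [(cost, cfg) for cfg, cost, ps in evaluated if avg_pp(ps) >= target_pp]
    return min(eligible)[1] if eligible else None

# Hypothetical evaluated configurations: (name, FLOPs cost, PS_i scores).
evaluated = [
    ("K_a", 9.0e11, [0.990, 0.995, 0.990, 0.992]),
    ("K_b", 6.5e11, [0.970, 0.950, 0.960, 0.940]),
    ("K_c", 4.0e11, [0.880, 0.860, 0.840, 0.870]),
]
for mode, target in [("Strict", 99.0), ("Optimal", 95.0), ("Relaxed", 85.0)]:
    print(mode, select_mode(evaluated, target))  # K_a, K_b, K_c respectively
```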

Key experimental parameters:

  • Predicate thresholds: $\epsilon=0.25$ (JSD), $\delta=0.70$ (attention similarity), $\gamma=0.70$ (embedding similarity), $\tau=0.70$ (probability ratio); one possible reading of these predicates is sketched after this list.
  • Optimization budget: 200 iterations per model (~360 GPU-hours on 4× NVIDIA A100 GPUs).
  • Evaluation datasets: LAMBADA (sequential coherence), WikiText-2 (long-range dependency), multi-turn dialogs (contextual consistency), TruthfulQA (factual accuracy).
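The sketch below gives one plausible reading of the four thresholded predicates as atomic-predicate robustness functions; the sign conventions and exact quantities compared are assumptions for illustration, not the paper's definitions.

```python
import numpy as np

# Thresholds from the experimental setup.
EPS, DELTA, GAMMA, TAU = 0.25, 0.70, 0.70, 0.70

def jsd(p, q):
    """Jensen-Shannon divergence between two next-token distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    def kl(a, b):
        ratio = np.where(a > 0, a / np.maximum(b, 1e-12), 1.0)  # zero entries contribute 0
        return float(np.sum(a * np.log(ratio)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def cosine(u, v):
    u, v = np.asarray(u, float), np.asarray(v, float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Atomic-predicate robustness values: f(O(t)) >= 0 iff the predicate holds
# (assumed sign conventions).
rob_jsd        = lambda p_comp, p_base: EPS - jsd(p_comp, p_base)       # divergence below epsilon
rob_attention  = lambda a_comp, a_base: cosine(a_comp, a_base) - DELTA  # attention similarity above delta
rob_embedding  = lambda e_comp, e_base: cosine(e_comp, e_base) - GAMMA  # embedding similarity above gamma
rob_prob_ratio = lambda p_comp, p_base: p_comp / p_base - TAU           # target-token probability ratio above tau
```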

5. Empirical Results: Compression Gains and Formal Guarantees

Compression results for GPT-2, DeepSeek-V2 7B, LLaMA 3 8B, and Mistral 7B across three operating modes are summarized as follows:

| Model | Mode | Compression (%) | Model Size Reduction (MB) | FLOPs Reduction (×) |
|---|---|---|---|---|
| GPT-2 | Strict | 15.0 | 42.2 | 1.2 |
| GPT-2 | Optimal | 43.8 | 71.4 | 2.0 |
| GPT-2 | Relaxed | 60.9 | 113.9 | 2.8 |
| DeepSeek-V2 7B | Strict | 18.8 | 4,500 | 1.3 |
| DeepSeek-V2 7B | Optimal | 48.1 | 6,800 | 2.1 |
| DeepSeek-V2 7B | Relaxed | 65.0 | 9,100 | 3.0 |
| LLaMA 3 8B | Strict | 10.0 | 1,330 | 1.1 |
| LLaMA 3 8B | Optimal | 40.0 | 5,320 | 1.8 |
| LLaMA 3 8B | Relaxed | 59.4 | 7,900 | 2.6 |
| Mistral 7B | Strict | 21.9 | 2,100 | 1.3 |
| Mistral 7B | Optimal | 53.1 | 3,500 | 2.3 |
| Mistral 7B | Relaxed | 68.8 | 6,550 | 3.3 |

For all models, the feasibility constraint $p_{\min, i}(K) \geq 0$ holds for all properties. This confers formal guarantees: for any compressed configuration $M_{\mathrm{comp}(K)}$ produced by the optimization, every STL-specified linguistic property holds throughout inference on all evaluated prompts, with no weight updates to the model.

6. Theoretical Properties and Guarantees

The principal theoretical contribution is a guarantee that any compressed LLM generated via this STL-constrained process upholds all the formalized properties: if $\rho(\varphi_i, \mathbf{O}, t) \geq 0$ for all $t \in [1, T']$ and all $\varphi_i$, then the compressed model $M_{\mathrm{comp}(K)}$ satisfies those properties for all data $d$. The proof is by induction on the structure of the STL formula and relies on the monotonic effect of quantization and pruning on the monitored output traces.
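The inductive argument can appeal to the standard soundness property of quantitative STL semantics, namely that the sign of the robustness determines satisfaction:

$$\rho(\varphi, \mathbf{O}, t) > 0 \;\Rightarrow\; (\mathbf{O}, t) \models \varphi, \qquad \rho(\varphi, \mathbf{O}, t) < 0 \;\Rightarrow\; (\mathbf{O}, t) \not\models \varphi.$$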

No retraining or fine-tuning is required; the compression is purely post hoc. As a result, model deployment on resource-constrained edge hardware is possible with provable preservation of selected high-level behaviors.

7. Limitations and Prospective Extensions

Major limitations of current STL Robustness-Guided Bayesian Optimization as instantiated in TOGGLE include:

  • Static STL thresholds (ϵ,δ,γ,τ\epsilon, \delta, \gamma, \tau) are fixed a priori and not adaptively recalibrated.
  • Only unstructured pruning is supported; structured options such as head-level or channel pruning are not incorporated.
  • The cost model employs FLOPs as a proxy for resource efficiency; real-world latency or memory footprint is not directly optimized and may diverge from the FLOPs metric.
  • BO search is challenged in very high-dimensional spaces, with finite query budgets potentially missing exotic or highly sparse solutions.
  • STL monitors are limited to local traces of length $T'$ and do not capture global, long-horizon behaviors outside this window.
  • No explicit hardware-in-the-loop evaluation; actual performance on diverse edge devices may differ from optimized metrics.

Potential directions for future enhancements involve integrating latency and memory-aware objectives into BO, supporting structured pruning modalities, extending to multi-modal or multi-task models, and enabling online adaptation of STL thresholds in response to runtime statistics (Khalil et al., 18 Dec 2025).
