STL Robustness-Guided Bayesian Optimization
- The paper introduces a framework that embeds STL constraints into Bayesian optimization to preserve linguistic properties during neural model compression.
- It employs layer-wise quantization and unstructured pruning guided by quantitative robustness measures and Gaussian Process surrogates.
- Empirical results demonstrate significant compression gains for LLMs with formal guarantees, achieving up to 68.8% compression without retraining.
STL Robustness-Guided Bayesian Optimization is a formal framework for neural model compression that integrates Signal Temporal Logic (STL) specifications into the search for efficient quantization and pruning policies. In the context of LLM compression, the methodology uses quantitative STL robustness measures as constraints, steering Bayesian optimization (BO) toward the most aggressive compression configurations that still preserve the specified properties. The technique is exemplified in the TOGGLE framework, which systematically produces compressed LLMs formally guaranteed to satisfy user-defined linguistic properties, all without retraining or fine-tuning (Khalil et al., 18 Dec 2025).
1. Formal Encoding of Linguistic Properties Using Signal Temporal Logic
At the core of STL Robustness-Guided Bayesian Optimization is Signal Temporal Logic, which provides a rigorous grammar for specifying temporal properties over model outputs. An inference signal, denoted $s_{x,M}(t)$ for each input prompt $x$ and model $M$, is monitored across discrete time steps $t \in \{0, 1, \dots, T\}$. STL formulas are constructed using the syntax

$$\varphi ::= \mu \mid \lnot\varphi \mid \varphi_1 \land \varphi_2 \mid \mathbf{G}_{[a,b]}\varphi \mid \mathbf{F}_{[a,b]}\varphi,$$

where $\mu$ is an atomic predicate (e.g., $f(s(t)) > c$), $\mathbf{G}_{[a,b]}$ denotes "always" over the interval $[a,b]$, and $\mathbf{F}_{[a,b]}$ denotes "eventually" over the same interval.
The robustness of a trace $s$ with respect to $\varphi$ at time $t$, written $\rho(\varphi, s, t)$, quantifies the margin by which the property holds. The semantics are recursively defined:

$$\begin{aligned}
\rho(\mu, s, t) &= f(s(t)) - c,\\
\rho(\lnot\varphi, s, t) &= -\rho(\varphi, s, t),\\
\rho(\varphi_1 \land \varphi_2, s, t) &= \min\bigl(\rho(\varphi_1, s, t),\, \rho(\varphi_2, s, t)\bigr),\\
\rho(\mathbf{G}_{[a,b]}\varphi, s, t) &= \min_{t' \in [t+a,\, t+b]} \rho(\varphi, s, t'),\\
\rho(\mathbf{F}_{[a,b]}\varphi, s, t) &= \max_{t' \in [t+a,\, t+b]} \rho(\varphi, s, t').
\end{aligned}$$
This formalism enables specification of high-level properties such as sequential coherence, long-range dependency, contextual consistency, and factual accuracy on the output trajectories of LLMs.
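A minimal sketch of how these robustness semantics can be evaluated on a discrete inference trace; the JSD-based predicate, the helper names, and the synthetic trace values are illustrative, not taken from the paper.

```python
import numpy as np

def rho_atomic(signal, t, f, c):
    """Robustness of an atomic predicate mu := f(s(t)) > c."""
    return f(signal[t]) - c

def rho_always(signal, t, a, b, rho_phi):
    """G_[a,b] phi: minimum robustness of phi over the window [t+a, t+b]."""
    window = range(t + a, min(t + b, len(signal) - 1) + 1)
    return min(rho_phi(signal, tp) for tp in window)

def rho_eventually(signal, t, a, b, rho_phi):
    """F_[a,b] phi: maximum robustness of phi over the window [t+a, t+b]."""
    window = range(t + a, min(t + b, len(signal) - 1) + 1)
    return max(rho_phi(signal, tp) for tp in window)

# Example property: "per-step JSD to the uncompressed model stays below 0.1 at
# every step", encoded as G_[0,T](0.1 - jsd(t) > 0) on a synthetic trace.
jsd_trace = np.array([0.02, 0.04, 0.03, 0.06, 0.05])          # hypothetical per-step JSD values
atom = lambda s, t: rho_atomic(s, t, lambda v: 0.1 - v, 0.0)   # 0.1 - jsd(t) > 0
robustness = rho_always(jsd_trace, 0, 0, len(jsd_trace) - 1, atom)
print(robustness)  # positive margin => the property holds on this trace
```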
2. Layer-Wise Compression Search Space
The compression policy search space is defined per layer $\ell \in \{1,\dots,L\}$ and component $c$ (e.g., attention and feed-forward blocks). Each configuration $\theta$ assigns a quantization bit-width $b_{\ell,c}$ and an unstructured pruning ratio $p_{\ell,c}$ to every (layer, component) pair. Higher bit-widths are quantized with Learned Step-size Quantization (LSQ), whereas the lowest bit-width setting invokes StretchedElasticQuant. Unstructured pruning is performed by eliminating the lowest-magnitude weights in the targeted parameters. The configuration space thus scales exponentially in the product $L \times C$ of layers and components, as illustrated in the sketch below.
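A minimal sketch of this layer-wise search space; the candidate bit-width and pruning-ratio grids below are placeholders, since the exact sets used in the paper are not reproduced above.

```python
from dataclasses import dataclass
from itertools import product

# Placeholder grids; the paper's exact bit-width and pruning-ratio sets may differ.
BIT_WIDTHS = (16, 8, 4, 2)           # hypothetical quantization levels per (layer, component)
PRUNE_RATIOS = (0.0, 0.25, 0.5)      # hypothetical unstructured pruning ratios

@dataclass(frozen=True)
class LayerChoice:
    layer: int
    component: str        # e.g. "attn" or "ffn"
    bits: int             # quantization bit-width for this (layer, component)
    prune_ratio: float    # fraction of lowest-magnitude weights removed

def enumerate_layer_choices(num_layers, components=("attn", "ffn")):
    """Per-(layer, component) options; a full configuration picks one choice per pair."""
    for layer, comp, bits, p in product(range(num_layers), components, BIT_WIDTHS, PRUNE_RATIOS):
        yield LayerChoice(layer, comp, bits, p)

def search_space_size(num_layers, components=("attn", "ffn")):
    """The space scales exponentially in layers x components:
    (|bit-widths| * |pruning ratios|) ** (L * C)."""
    return (len(BIT_WIDTHS) * len(PRUNE_RATIOS)) ** (num_layers * len(components))

print(search_space_size(num_layers=32))  # already astronomically large for a 32-layer model
```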
3. Robustness-Guided Bayesian Optimization Procedure
The optimization problem is to minimize computational cost $\mathrm{Cost}(\theta)$ (estimated in FLOPs) subject to STL robustness constraints across all properties:

$$\min_{\theta}\ \mathrm{Cost}(\theta) \quad \text{s.t.} \quad R_i(\theta) := \min_{x \in \mathcal{D}} \rho(\varphi_i, s_{x, M_\theta}, 0) \ge \varepsilon_i \quad \forall i.$$

Gaussian Process (GP) surrogates model both $\mathrm{Cost}(\theta)$ and the minimum robustness $R_i(\theta)$ for each property $\varphi_i$. The acquisition function is

$$\alpha(\theta) = \mathrm{EI}(\theta) \cdot \prod_i \Pr\!\left[R_i(\theta) \ge \varepsilon_i\right],$$

which reflects expected improvement under the constraint that all STL properties are satisfied (robustness thresholds $\varepsilon_i$). The loop proposes a new $\theta$, evaluates $\mathrm{Cost}(\theta)$ and $R_i(\theta)$, checks feasibility, and iteratively updates the GP models.
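A minimal sketch of such a constrained acquisition, assuming the cost and per-property robustness surrogates are fitted scikit-learn-style GP regressors exposing `predict(X, return_std=True)`; the function and argument names are illustrative rather than the paper's API.

```python
import numpy as np
from scipy.stats import norm

def constrained_ei(theta_candidates, cost_gp, robustness_gps, epsilons, best_feasible_cost):
    """Expected improvement on cost, weighted by the posterior probability that
    every STL robustness surrogate clears its threshold epsilon_i."""
    mu_c, sigma_c = cost_gp.predict(theta_candidates, return_std=True)
    sigma_c = np.maximum(sigma_c, 1e-9)

    # Expected improvement for cost minimization over the best feasible value so far.
    z = (best_feasible_cost - mu_c) / sigma_c
    ei = (best_feasible_cost - mu_c) * norm.cdf(z) + sigma_c * norm.pdf(z)

    # Probability of feasibility: product over properties of P[R_i(theta) >= eps_i].
    prob_feasible = np.ones(len(theta_candidates))
    for gp_i, eps_i in zip(robustness_gps, epsilons):
        mu_r, sigma_r = gp_i.predict(theta_candidates, return_std=True)
        sigma_r = np.maximum(sigma_r, 1e-9)
        prob_feasible *= 1.0 - norm.cdf((eps_i - mu_r) / sigma_r)

    return ei * prob_feasible  # the next proposal is the argmax over candidates
```

The next configuration is the candidate maximizing this score; once its true cost and minimum robustness values are measured, the surrogates are refit and the loop repeats.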
4. Operating-Mode Selection and Evaluation Metrics
Three representative compression modes are selected by targeting Average Property Preservation (AvgPP) scores of approximately 99% (Strict), 95% (Optimal), and 85% (Relaxed), and selecting the lowest-cost feasible configuration in each mode. AvgPP is computed as

$$\mathrm{AvgPP} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{PP}_i,$$

where each $\mathrm{PP}_i$ measures relative preservation of property $\varphi_i$ versus the uncompressed baseline.
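A minimal sketch of the AvgPP aggregation, assuming each property is scored by a metric where higher is better and $\mathrm{PP}_i$ is the compressed-to-baseline ratio capped at 100%; the exact per-property normalization is an assumption here, not reproduced from the paper.

```python
def avg_property_preservation(compressed_scores, baseline_scores):
    """AvgPP = (1/N) * sum_i PP_i, where PP_i is the compressed model's score on
    property i relative to the uncompressed baseline, capped at 100%."""
    pps = [min(c / b, 1.0) * 100.0 for c, b in zip(compressed_scores, baseline_scores)]
    return sum(pps) / len(pps)

# Hypothetical per-property scores (sequential coherence, long-range dependency,
# contextual consistency, factual accuracy) for a compressed model vs. its baseline.
print(avg_property_preservation([0.71, 0.64, 0.88, 0.40], [0.74, 0.66, 0.90, 0.42]))
```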
Key experimental parameters:
- Predicate thresholds $\varepsilon_i$ fixed per property: bounds on Jensen–Shannon divergence (JSD), attention similarity, embedding similarity, and output probability ratio.
- Optimization budget: 200 iterations per model (360 GPU-hours on 4×NVIDIA A100 GPUs).
- Evaluation datasets: LAMBADA (sequential coherence), WikiText-2 (long-range dependency), multi-turn dialogs (contextual consistency), TruthfulQA (factual accuracy).
5. Empirical Results: Compression Gains and Formal Guarantees
Compression results for GPT-2, DeepSeek-V2 7B, LLaMA 3 8B, and Mistral 7B across three operating modes are summarized as follows:
| Model | Mode | Compression (%) | Model Size Reduction (MB) | FLOPs Reduction (×) |
|---|---|---|---|---|
| GPT-2 | Strict | 15.0 | 42.2 | 1.2 |
| GPT-2 | Optimal | 43.8 | 71.4 | 2.0 |
| GPT-2 | Relaxed | 60.9 | 113.9 | 2.8 |
| DeepSeek-V2 7B | Strict | 18.8 | 4,500 | 1.3 |
| DeepSeek-V2 7B | Optimal | 48.1 | 6,800 | 2.1 |
| DeepSeek-V2 7B | Relaxed | 65.0 | 9,100 | 3.0 |
| LLaMA 3 8B | Strict | 10.0 | 1,330 | 1.1 |
| LLaMA 3 8B | Optimal | 40.0 | 5,320 | 1.8 |
| LLaMA 3 8B | Relaxed | 59.4 | 7,900 | 2.6 |
| Mistral 7B | Strict | 21.9 | 2,100 | 1.3 |
| Mistral 7B | Optimal | 53.1 | 3,500 | 2.3 |
| Mistral 7B | Relaxed | 68.8 | 6,550 | 3.3 |
For all models, the feasibility constraint $R_i(\theta) \ge \varepsilon_i$ is satisfied for every property $\varphi_i$. This confers formal guarantees: for any compressed configuration produced by the optimization, every STL-specified linguistic property holds throughout inference on all evaluated prompts, with no weight updates to the model.
6. Theoretical Properties and Guarantees
The principal theoretical contribution is a guarantee that any compressed LLM generated via this STL-constrained process upholds all the formalized properties: if $\rho(\varphi_i, s_{x, M_\theta}, 0) \ge \varepsilon_i$ for all properties $\varphi_i$ and all $x \in \mathcal{D}$, then the compressed model $M_\theta$ satisfies those properties for all data $x \in \mathcal{D}$. The proof is by induction on the structure of the STL formula and relies on the monotonic effect of quantization and pruning on the monitored output traces.
No retraining or fine-tuning is required; the compression is purely post hoc. As a result, model deployment on resource-constrained edge hardware is possible with provable preservation of selected high-level behaviors.
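Operationally, the guarantee corresponds to a post-hoc feasibility check: a compressed configuration is accepted only if the minimum robustness over all prompts in the evaluation data clears every threshold. A minimal sketch follows; `compute_trace` and `robustness` are hypothetical callables standing in for the paper's trace extraction and STL monitor.

```python
def certify_configuration(model, prompts, properties, epsilons, compute_trace, robustness):
    """Accept a compressed model only if the min-over-prompts robustness of every
    STL property phi_i meets its threshold eps_i (no weight updates involved)."""
    for phi, eps in zip(properties, epsilons):
        worst = min(robustness(phi, compute_trace(model, x), 0) for x in prompts)
        if worst < eps:
            return False, phi, worst   # constraint violated: configuration rejected
    return True, None, None            # all properties formally hold on the evaluation data
```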
7. Limitations and Prospective Extensions
Major limitations of current STL Robustness-Guided Bayesian Optimization as instantiated in TOGGLE include:
- Static STL thresholds ($\varepsilon_i$) are fixed a priori and not adaptively recalibrated.
- Only unstructured pruning is supported; structured options such as head-level or channel pruning are not incorporated.
- The cost model employs FLOPs as a proxy for resource efficiency; real-world latency or memory footprint is not directly optimized and may diverge from the FLOPs metric.
- BO search is challenged in very high-dimensional spaces, with finite query budgets potentially missing exotic or highly sparse solutions.
- STL monitors are limited to local traces over a bounded time window and do not capture global, long-horizon behaviors outside that window.
- No explicit hardware-in-the-loop evaluation; actual performance on diverse edge devices may differ from optimized metrics.
Potential directions for future enhancements involve integrating latency and memory-aware objectives into BO, supporting structured pruning modalities, extending to multi-modal or multi-task models, and enabling online adaptation of STL thresholds in response to runtime statistics (Khalil et al., 18 Dec 2025).