STL Robustness-Guided Bayesian Optimization
- The paper introduces a framework that embeds STL constraints into Bayesian optimization to preserve linguistic properties during neural model compression.
- It employs layer-wise quantization and unstructured pruning guided by quantitative robustness measures and Gaussian Process surrogates.
- Empirical results demonstrate significant compression gains for LLMs with formal guarantees, achieving up to 68.8% compression without retraining.
STL Robustness-Guided Bayesian Optimization is a formal framework for neural model compression that integrates Signal Temporal Logic (STL) specifications into the search for efficient quantization and pruning policies. In the context of LLM compression, the methodology uses quantitative STL robustness measures as constraints, steering Bayesian optimization (BO) toward the most aggressive compression configurations that still preserve the specified properties. The technique is exemplified in the TOGGLE framework, which systematically produces compressed LLMs formally guaranteed to satisfy user-defined linguistic properties, all without retraining or fine-tuning (Khalil et al., 18 Dec 2025).
1. Formal Encoding of Linguistic Properties Using Signal Temporal Logic
At the core of STL Robustness-Guided Bayesian Optimization is Signal Temporal Logic, which provides a rigorous grammar for specifying temporal properties over model outputs. An inference signal, denoted $s_{x,M}(t)$ for each input prompt $x$ and model $M$, is monitored across discrete time steps $t \in \{0, 1, \dots, T\}$. STL formulas are constructed using the syntax

$$\varphi ::= \mu \mid \lnot\varphi \mid \varphi_1 \land \varphi_2 \mid \mathbf{G}_{[a,b]}\varphi \mid \mathbf{F}_{[a,b]}\varphi,$$

where $\mu$ is an atomic predicate (e.g., $f(s(t)) > c$), $\mathbf{G}_{[a,b]}$ denotes "always" over the interval $[a,b]$, and $\mathbf{F}_{[a,b]}$ denotes "eventually" over the same interval.
The robustness of a trace $s$ with respect to $\varphi$ at time $t$, written $\rho(\varphi, s, t)$, quantifies the margin by which the property holds. The semantics are recursively defined:

$$\begin{aligned}
\rho(\mu, s, t) &= f(s(t)) - c,\\
\rho(\lnot\varphi, s, t) &= -\rho(\varphi, s, t),\\
\rho(\varphi_1 \land \varphi_2, s, t) &= \min\bigl(\rho(\varphi_1, s, t),\, \rho(\varphi_2, s, t)\bigr),\\
\rho(\mathbf{G}_{[a,b]}\varphi, s, t) &= \min_{t' \in [t+a,\, t+b]} \rho(\varphi, s, t'),\\
\rho(\mathbf{F}_{[a,b]}\varphi, s, t) &= \max_{t' \in [t+a,\, t+b]} \rho(\varphi, s, t').
\end{aligned}$$
This formalism enables specification of high-level properties such as sequential coherence, long-range dependency, contextual consistency, and factual accuracy on the output trajectories of LLMs.
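A minimal sketch of how these robustness semantics can be evaluated on a discrete inference trace; the JSD-based predicate, the helper names, and the synthetic trace values are illustrative, not taken from the paper.

```python
import numpy as np

def rho_atomic(signal, t, f, c):
    """Robustness of an atomic predicate mu := f(s(t)) > c."""
    return f(signal[t]) - c

def rho_always(signal, t, a, b, rho_phi):
    """G_[a,b] phi: minimum robustness of phi over the window [t+a, t+b]."""
    window = range(t + a, min(t + b, len(signal) - 1) + 1)
    return min(rho_phi(signal, tp) for tp in window)

def rho_eventually(signal, t, a, b, rho_phi):
    """F_[a,b] phi: maximum robustness of phi over the window [t+a, t+b]."""
    window = range(t + a, min(t + b, len(signal) - 1) + 1)
    return max(rho_phi(signal, tp) for tp in window)

# Example property: "per-step JSD to the uncompressed model stays below 0.1 at
# every step", encoded as G_[0,T](0.1 - jsd(t) > 0) on a synthetic trace.
jsd_trace = np.array([0.02, 0.04, 0.03, 0.06, 0.05])          # hypothetical per-step JSD values
atom = lambda s, t: rho_atomic(s, t, lambda v: 0.1 - v, 0.0)   # 0.1 - jsd(t) > 0
robustness = rho_always(jsd_trace, 0, 0, len(jsd_trace) - 1, atom)
print(robustness)  # positive margin => the property holds on this trace
```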
2. Layer-Wise Compression Search Space
The compression policy search space is defined per layer $\ell \in \{1,\dots,L\}$ and component $c$ (e.g., attention and feed-forward blocks). Each configuration $\theta$ assigns a quantization bit-width $b_{\ell,c}$ and an unstructured pruning ratio $p_{\ell,c}$ to every (layer, component) pair. Higher bit-widths are quantized with Learned Step-size Quantization (LSQ), whereas the lowest bit-width setting invokes StretchedElasticQuant. Unstructured pruning is performed by eliminating the lowest-magnitude weights in the targeted parameters. The configuration space thus scales exponentially in the product $L \times C$ of layers and components, as illustrated in the sketch below.
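A minimal sketch of this layer-wise search space; the candidate bit-width and pruning-ratio grids below are placeholders, since the exact sets used in the paper are not reproduced above.

```python
from dataclasses import dataclass
from itertools import product

# Placeholder grids; the paper's exact bit-width and pruning-ratio sets may differ.
BIT_WIDTHS = (16, 8, 4, 2)           # hypothetical quantization levels per (layer, component)
PRUNE_RATIOS = (0.0, 0.25, 0.5)      # hypothetical unstructured pruning ratios

@dataclass(frozen=True)
class LayerChoice:
    layer: int
    component: str        # e.g. "attn" or "ffn"
    bits: int             # quantization bit-width for this (layer, component)
    prune_ratio: float    # fraction of lowest-magnitude weights removed

def enumerate_layer_choices(num_layers, components=("attn", "ffn")):
    """Per-(layer, component) options; a full configuration picks one choice per pair."""
    for layer, comp, bits, p in product(range(num_layers), components, BIT_WIDTHS, PRUNE_RATIOS):
        yield LayerChoice(layer, comp, bits, p)

def search_space_size(num_layers, components=("attn", "ffn")):
    """The space scales exponentially in layers x components:
    (|bit-widths| * |pruning ratios|) ** (L * C)."""
    return (len(BIT_WIDTHS) * len(PRUNE_RATIOS)) ** (num_layers * len(components))

print(search_space_size(num_layers=32))  # already astronomically large for a 32-layer model
```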
3. Robustness-Guided Bayesian Optimization Procedure
The optimization problem is to minimize computational cost $\mathrm{Cost}(\theta)$ (estimated in FLOPs) subject to STL robustness constraints across all properties:

$$\min_{\theta}\ \mathrm{Cost}(\theta) \quad \text{s.t.} \quad R_i(\theta) := \min_{x \in \mathcal{D}} \rho(\varphi_i, s_{x, M_\theta}, 0) \ge \varepsilon_i \quad \forall i.$$

Gaussian Process (GP) surrogates model both $\mathrm{Cost}(\theta)$ and the minimum robustness $R_i(\theta)$ for each property $\varphi_i$. The acquisition function is

$$\alpha(\theta) = \mathrm{EI}(\theta) \cdot \prod_i \Pr\!\left[R_i(\theta) \ge \varepsilon_i\right],$$

which reflects expected improvement under the constraint that all STL properties are satisfied (robustness thresholds $\varepsilon_i$). The loop proposes a new $\theta$, evaluates $\mathrm{Cost}(\theta)$ and $R_i(\theta)$, checks feasibility, and iteratively updates the GP models.
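A minimal sketch of such a constrained acquisition, assuming the cost and per-property robustness surrogates are fitted scikit-learn-style GP regressors exposing `predict(X, return_std=True)`; the function and argument names are illustrative rather than the paper's API.

```python
import numpy as np
from scipy.stats import norm

def constrained_ei(theta_candidates, cost_gp, robustness_gps, epsilons, best_feasible_cost):
    """Expected improvement on cost, weighted by the posterior probability that
    every STL robustness surrogate clears its threshold epsilon_i."""
    mu_c, sigma_c = cost_gp.predict(theta_candidates, return_std=True)
    sigma_c = np.maximum(sigma_c, 1e-9)

    # Expected improvement for cost minimization over the best feasible value so far.
    z = (best_feasible_cost - mu_c) / sigma_c
    ei = (best_feasible_cost - mu_c) * norm.cdf(z) + sigma_c * norm.pdf(z)

    # Probability of feasibility: product over properties of P[R_i(theta) >= eps_i].
    prob_feasible = np.ones(len(theta_candidates))
    for gp_i, eps_i in zip(robustness_gps, epsilons):
        mu_r, sigma_r = gp_i.predict(theta_candidates, return_std=True)
        sigma_r = np.maximum(sigma_r, 1e-9)
        prob_feasible *= 1.0 - norm.cdf((eps_i - mu_r) / sigma_r)

    return ei * prob_feasible  # the next proposal is the argmax over candidates
```

The next configuration is the candidate maximizing this score; once its true cost and minimum robustness values are measured, the surrogates are refit and the loop repeats.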
4. Operating-Mode Selection and Evaluation Metrics
Three representative compression modes are selected by targeting Average Property Preservation (AvgPP) scores of approximately 99% (Strict), 95% (Optimal), and 85% (Relaxed), and selecting the lowest-cost feasible configuration in each mode. AvgPP is computed as

$$\mathrm{AvgPP} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{PP}_i,$$

where each $\mathrm{PP}_i$ measures relative preservation of property $\varphi_i$ versus the uncompressed baseline.
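A minimal sketch of the AvgPP aggregation, assuming each property is scored by a metric where higher is better and $\mathrm{PP}_i$ is the compressed-to-baseline ratio capped at 100%; the exact per-property normalization is an assumption here, not reproduced from the paper.

```python
def avg_property_preservation(compressed_scores, baseline_scores):
    """AvgPP = (1/N) * sum_i PP_i, where PP_i is the compressed model's score on
    property i relative to the uncompressed baseline, capped at 100%."""
    pps = [min(c / b, 1.0) * 100.0 for c, b in zip(compressed_scores, baseline_scores)]
    return sum(pps) / len(pps)

# Hypothetical per-property scores (sequential coherence, long-range dependency,
# contextual consistency, factual accuracy) for a compressed model vs. its baseline.
print(avg_property_preservation([0.71, 0.64, 0.88, 0.40], [0.74, 0.66, 0.90, 0.42]))
```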
Key experimental parameters:
- Predicate thresholds $\varepsilon_i$ fixed per property: bounds on Jensen–Shannon divergence (JSD), attention similarity, embedding similarity, and output probability ratio.
- Optimization budget: 200 iterations per model (360 GPU-hours on 4×NVIDIA A100 GPUs).
- Evaluation datasets: LAMBADA (sequential coherence), WikiText-2 (long-range dependency), multi-turn dialogs (contextual consistency), TruthfulQA (factual accuracy).
5. Empirical Results: Compression Gains and Formal Guarantees
Compression results for GPT-2, DeepSeek-V2 7B, LLaMA 3 8B, and Mistral 7B across three operating modes are summarized as follows:
| Model | Mode | Compression (%) | Model Size Reduction (MB) | FLOPs Reduction (×) |
|---|---|---|---|---|
| GPT-2 | Strict | 15.0 | 42.2 | 1.2 |
| GPT-2 | Optimal | 43.8 | 71.4 | 2.0 |
| GPT-2 | Relaxed | 60.9 | 113.9 | 2.8 |
| DeepSeek-V2 7B | Strict | 18.8 | 4,500 | 1.3 |
| DeepSeek-V2 7B | Optimal | 48.1 | 6,800 | 2.1 |
| DeepSeek-V2 7B | Relaxed | 65.0 | 9,100 | 3.0 |
| LLaMA 3 8B | Strict | 10.0 | 1,330 | 1.1 |
| LLaMA 3 8B | Optimal | 40.0 | 5,320 | 1.8 |
| LLaMA 3 8B | Relaxed | 59.4 | 7,900 | 2.6 |
| Mistral 7B | Strict | 21.9 | 2,100 | 1.3 |
| Mistral 7B | Optimal | 53.1 | 3,500 | 2.3 |
| Mistral 7B | Relaxed | 68.8 | 6,550 | 3.3 |
For all models, the feasibility constraint $R_i(\theta) \ge \varepsilon_i$ is satisfied for every property $\varphi_i$. This confers formal guarantees: for any compressed configuration produced by the optimization, every STL-specified linguistic property holds throughout inference on all evaluated prompts, with no weight updates to the model.
6. Theoretical Properties and Guarantees
The principal theoretical contribution is a guarantee that any compressed LLM generated via this STL-constrained process upholds all the formalized properties: if $\rho(\varphi_i, s_{x, M_\theta}, 0) \ge \varepsilon_i$ for all properties $\varphi_i$ and all $x \in \mathcal{D}$, then the compressed model $M_\theta$ satisfies those properties for all data $x \in \mathcal{D}$. The proof is by induction on the structure of the STL formula and relies on the monotonic effect of quantization and pruning on the monitored output traces.
No retraining or fine-tuning is required; the compression is purely post hoc. As a result, model deployment on resource-constrained edge hardware is possible with provable preservation of selected high-level behaviors.
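Operationally, the guarantee corresponds to a post-hoc feasibility check: a compressed configuration is accepted only if the minimum robustness over all prompts in the evaluation data clears every threshold. A minimal sketch follows; `compute_trace` and `robustness` are hypothetical callables standing in for the paper's trace extraction and STL monitor.

```python
def certify_configuration(model, prompts, properties, epsilons, compute_trace, robustness):
    """Accept a compressed model only if the min-over-prompts robustness of every
    STL property phi_i meets its threshold eps_i (no weight updates involved)."""
    for phi, eps in zip(properties, epsilons):
        worst = min(robustness(phi, compute_trace(model, x), 0) for x in prompts)
        if worst < eps:
            return False, phi, worst   # constraint violated: configuration rejected
    return True, None, None            # all properties formally hold on the evaluation data
```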
7. Limitations and Prospective Extensions
Major limitations of current STL Robustness-Guided Bayesian Optimization as instantiated in TOGGLE include:
- Static STL thresholds ($\varepsilon_i$) are fixed a priori and not adaptively recalibrated.
- Only unstructured pruning is supported; structured options such as head-level or channel pruning are not incorporated.
- The cost model employs FLOPs as a proxy for resource efficiency; real-world latency or memory footprint is not directly optimized and may diverge from the FLOPs metric.
- BO search is challenged in very high-dimensional spaces, with finite query budgets potentially missing exotic or highly sparse solutions.
- STL monitors are limited to local traces over a bounded time window and do not capture global, long-horizon behaviors outside that window.
- No explicit hardware-in-the-loop evaluation; actual performance on diverse edge devices may differ from optimized metrics.
Potential directions for future enhancements involve integrating latency and memory-aware objectives into BO, supporting structured pruning modalities, extending to multi-modal or multi-task models, and enabling online adaptation of STL thresholds in response to runtime statistics (Khalil et al., 18 Dec 2025).