Temporal Logic-Guided LLM Compression (TOGGLE)
- The paper presents TOGGLE, a framework that uses Signal Temporal Logic (STL) specifications and robustness-guided Bayesian Optimization to compress LLMs without retraining.
- It achieves up to 3.3× FLOPs reduction and 68.8% model size decrease while ensuring critical linguistic properties like coherence and factual accuracy.
- The method optimizes layer-wise quantization and pruning via Gaussian Processes, facilitating efficient model deployment on resource-constrained edge hardware.
Temporal Logic-Guided LLM Compression (TOGGLE) is a formal-methods-based framework designed to produce compressed LLMs for edge deployment. It employs Signal Temporal Logic (STL) to explicitly specify and guarantee the preservation of essential linguistic properties during the compression process. Unlike standard quantization, pruning, or knowledge distillation techniques, TOGGLE uses robustness-guided Bayesian Optimization to explore the layer-wise quantization and pruning configuration space, yielding models that obey user-defined linguistic constraints without retraining or fine-tuning. By leveraging STL-based formal specification, TOGGLE delivers compressed LLMs with up to 3.3× computational cost reduction and 68.8% model size reduction, while ensuring all specified properties are maintained, enabling deployability on resource-constrained edge hardware (Khalil et al., 18 Dec 2025).
1. Formal Foundations: Signal Temporal Logic and Linguistic Properties
TOGGLE integrates Signal Temporal Logic (STL) as its core formalism for specifying and verifying linguistic properties during compression. STL expresses temporal properties over real-valued signals, using atomic predicates of the form $\mu \equiv f(x(t)) \geq 0$ applied to time-dependent signal vectors $x(t)$. Formulae are recursively constructed:
- $\varphi ::= \mu \mid \lnot\varphi \mid \varphi_1 \wedge \varphi_2 \mid \Box_{[a,b]}\varphi \mid \Diamond_{[a,b]}\varphi \mid \varphi_1\,\mathcal{U}_{[a,b]}\,\varphi_2$, with $\Box$, $\Diamond$, and $\mathcal{U}$ denoting the “always”, “eventually”, and “until” modalities on intervals $[a,b]$.
Quantitative semantics are defined by the robustness function $\rho(\varphi, x, t)$, computed inductively (a code sketch follows the list):
- Atomic $\mu \equiv f(x) \geq 0$: $\rho(\mu, x, t) = f(x(t))$.
- Negation: $\rho(\lnot\varphi, x, t) = -\rho(\varphi, x, t)$.
- Conjunction: $\rho(\varphi_1 \wedge \varphi_2, x, t) = \min\big(\rho(\varphi_1, x, t), \rho(\varphi_2, x, t)\big)$.
- Always: $\rho(\Box_{[a,b]}\varphi, x, t) = \min_{t' \in [t+a,\,t+b]} \rho(\varphi, x, t')$.
- Eventually: $\rho(\Diamond_{[a,b]}\varphi, x, t) = \max_{t' \in [t+a,\,t+b]} \rho(\varphi, x, t')$.
- Until: $\rho(\varphi_1\,\mathcal{U}_{[a,b]}\,\varphi_2, x, t) = \max_{t' \in [t+a,\,t+b]} \min\big(\rho(\varphi_2, x, t'),\ \min_{t'' \in [t,\,t']} \rho(\varphi_1, x, t'')\big)$.
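The following is a minimal, discrete-time sketch of these quantitative semantics; the helper names and the array-based signal representation are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Discrete-time STL robustness sketch (illustrative, not the paper's code).
# A "signal" is a 1-D array of predicate values f(x[t]); robustness of the
# atomic predicate f(x) >= 0 at time t is just f(x[t]).

def rho_atomic(f_values, t):
    return f_values[t]

def rho_not(rho):                      # negation: sign flip
    return -rho

def rho_and(rho1, rho2):               # conjunction: minimum
    return min(rho1, rho2)

def rho_always(f_values, t, a, b):     # G_[a,b]: minimum over the window
    return np.min(f_values[t + a : t + b + 1])

def rho_eventually(f_values, t, a, b): # F_[a,b]: maximum over the window
    return np.max(f_values[t + a : t + b + 1])

def rho_until(f1, f2, t, a, b):        # U_[a,b]: max over t' of min(f2 at t', min of f1 up to t')
    best = -np.inf
    for tp in range(t + a, t + b + 1):
        best = max(best, min(f2[tp], np.min(f1[t : tp + 1])))
    return best
```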
In TOGGLE, STL predicates are constructed for:
- Sequential coherence: Jensen–Shannon divergence on next-token distributions,
- Long-range dependency: cosine similarity on attention maps,
- Contextual consistency: cosine similarity on hidden-state embeddings,
- Factual accuracy: probability-ratio on ground-truth token probabilities.
For a model configuration $\theta$ and dataset $D$, property $\varphi_i$ yields the minimum robustness
$$r_i(\theta) = \min_{d \in D} \rho\big(\varphi_i, x_{\theta}(d)\big),$$
where $x_{\theta}(d)$ denotes the monitored model outputs for sample $d$. A sketch of one such predicate and this aggregation follows.
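Below is a minimal sketch of the sequential-coherence predicate and the dataset-level aggregation, assuming per-token next-token distributions from the baseline and compressed models; the threshold value `jsd_max` and all function names are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Sequential coherence as an STL atomic predicate: the robustness signal is
# jsd_max - JSD(baseline, compressed) per token, so positive values mean the
# divergence stays under the (hypothetical) threshold.

def coherence_signal(p_base, p_comp, jsd_max=0.1):
    """p_base, p_comp: arrays of next-token distributions, one row per position."""
    # scipy returns the JS distance (sqrt of the divergence); square it to get JSD.
    jsd = np.array([jensenshannon(pb, pc) ** 2 for pb, pc in zip(p_base, p_comp)])
    return jsd_max - jsd

def min_robustness(signals):
    """r_i(theta): worst-case robustness over all samples (each sample monitored
    with an 'always' over its token positions)."""
    return min(float(np.min(sig)) for sig in signals)
```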
2. Robustness-Guided Bayesian Optimization Procedure
Compression configurations are selected via Bayesian Optimization (BO) guided by STL robustness. The optimization problem is
$$\min_{\theta}\; C(\theta) \quad \text{subject to} \quad r_i(\theta) \geq \epsilon_i \;\; \text{for all } i,$$
where $C(\theta)$ is the estimated FLOPs cost and the $\epsilon_i$ are robustness thresholds (typically $0$).
Independent Gaussian Processes (GPs) model:
- $\hat{C}(\theta)$ (cost surrogate)
- $\hat{r}_i(\theta)$ for each property $\varphi_i$ (property robustness surrogates)
At each BO iteration, the acquisition function
$$\alpha(\theta) = \mathrm{EI}_{C}(\theta) \cdot \prod_i \Pr\big[\hat{r}_i(\theta) \geq \epsilon_i\big]$$
is maximized, balancing cost reduction with formal property satisfaction: $\mathrm{EI}_{C}(\theta)$ computes the expected improvement on FLOPs, and the product term is the joint probability that all $r_i$ exceed their thresholds.
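A sketch of this acquisition under the stated assumptions (independent GP surrogates, expected improvement for cost minimization, per-property probability of feasibility); the surrogate objects are expected to expose a sklearn-style `predict(X, return_std=True)`, and all names are illustrative.

```python
import numpy as np
from scipy.stats import norm

# Robustness-guided acquisition: EI on the cost surrogate times the joint
# probability that each property surrogate clears its threshold.
# gp_cost / gp_props are fitted regressors (e.g. sklearn GaussianProcessRegressor).

def expected_improvement(mu, sigma, best_cost):
    """Expected improvement for *minimizing* cost."""
    sigma = max(sigma, 1e-9)
    z = (best_cost - mu) / sigma
    return (best_cost - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def acquisition(theta, gp_cost, gp_props, eps, best_cost):
    x = np.asarray(theta, dtype=float).reshape(1, -1)
    mu_c, sd_c = gp_cost.predict(x, return_std=True)
    ei = expected_improvement(mu_c[0], sd_c[0], best_cost)

    # Probability of feasibility, property by property (independence assumption).
    p_feasible = 1.0
    for gp_r, eps_i in zip(gp_props, eps):
        mu_r, sd_r = gp_r.predict(x, return_std=True)
        p_feasible *= 1.0 - norm.cdf(eps_i, loc=mu_r[0], scale=max(sd_r[0], 1e-9))
    return ei * p_feasible
```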
3. Compression Parameterization and Cost Modeling
LLMs are parameterized by layer-wise configurations:
- For layer $l$ and compressible component $c$:
  - Bit-width $b_{l,c}$
  - Pruning ratio $p_{l,c}$
A configuration vector $\theta$ concatenates all $(b_{l,c}, p_{l,c})$ pairs, forming a discrete search space with exponential cardinality.
The computational cost of a configuration is estimated as
$$C(\theta) = \sum_{l,c} 2\, N_{l,c}\, S\, (1 - p_{l,c})\, \frac{b_{l,c}}{16},$$
where $N_{l,c}$ is the parameter count of component $c$ in layer $l$, $S$ the sequence length, $(1 - p_{l,c})\,b_{l,c}/16$ the effective compression factor, and the factor of $2$ accounts for multiply-accumulate ops.
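A sketch of this cost model together with the layer-wise configuration encoding; the component keys, bit-width grid, and the $(1-p)\,b/16$ scaling are assumptions layered on the description above, not the paper's exact search space.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Illustrative layer-wise configuration encoding (assumed grids, not the
# paper's exact search space).

@dataclass(frozen=True)
class LayerConfig:
    bit_width: int      # e.g. one of {4, 8, 16}
    prune_ratio: float  # fraction of weights removed, e.g. 0.0-0.75

Config = Dict[Tuple[int, str], LayerConfig]   # (layer index, component) -> setting

def flatten(config: Config) -> list:
    """Concatenate all (bit_width, prune_ratio) pairs into one search vector."""
    return [v for key in sorted(config)
              for v in (config[key].bit_width, config[key].prune_ratio)]

def estimate_cost(config: Config, param_counts: Dict[Tuple[int, str], int],
                  seq_len: int) -> float:
    """FLOPs-style proxy: 2 ops per multiply-accumulate, discounted by pruning
    and by the (assumed) bit-width scaling relative to 16-bit weights."""
    return sum(2 * param_counts[key] * seq_len
               * (1.0 - cfg.prune_ratio) * (cfg.bit_width / 16.0)
               for key, cfg in config.items())
```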
Evaluating the constraint $r_i(\theta) \geq \epsilon_i$ involves running a forward pass of the candidate model for each data sample, collecting the monitored outputs, and computing the STL robustness.
4. Algorithmic Workflow and Operating Modes
TOGGLE’s pipeline consists of:
- Linguistic Property Specification: Encode coherence, dependency, consistency, and factuality as STL formulae $\varphi_1, \dots, \varphi_4$, each over the monitored output signals $x_\theta$.
- Initial Sampling: Randomly sample an initial set of configurations to fit the GP priors.
- BO Loop (sketched in code after this list): For each of up to a fixed budget of iterations:
  - Maximize the acquisition function $\alpha(\theta)$ and select the next candidate $\theta_t$.
  - Evaluate $C(\theta_t)$ and $r_i(\theta_t)$ via model instantiation and inference on $D$.
  - Update the GPs with the new cost and robustness observations.
- Feasible Set and Pareto Front Discovery: Identify all $\theta$ with $r_i(\theta) \geq \epsilon_i$ for every property; plot FLOPs against minimum robustness.
- Operating Mode Selection: For a user-specified AvgPP target (Average Property Preservation: 99%, 95%, or 85%), select the feasible configuration with the desired property guarantee and minimal cost; no retraining occurs.
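A compact sketch of this workflow, reusing the `acquisition` helper sketched in Section 2; the candidate pool, `evaluate_model` callback, iteration budget, and sklearn-based surrogates are all illustrative assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

# Illustrative TOGGLE-style search loop (not the authors' code).
# evaluate_model(theta) -> (FLOPs cost, [r_1, ..., r_4]) instantiates the
# compressed model and monitors the STL robustness values on the dataset.

def toggle_search(candidates, evaluate_model, eps, n_init=10, n_iter=50):
    X, costs, robs = [], [], []

    # Initial random sampling to seed the GP priors.
    for theta in candidates[:n_init]:
        c, r = evaluate_model(theta)
        X.append(theta); costs.append(c); robs.append(r)

    for _ in range(n_iter):
        gp_cost = GaussianProcessRegressor(normalize_y=True).fit(X, costs)
        gp_props = [GaussianProcessRegressor(normalize_y=True)
                    .fit(X, [r[i] for r in robs]) for i in range(len(eps))]

        # Best feasible cost so far (fall back to the worst cost if none feasible).
        best = min((c for c, r in zip(costs, robs)
                    if all(ri >= ei for ri, ei in zip(r, eps))),
                   default=max(costs))

        # Pick the candidate maximizing the robustness-guided acquisition.
        theta_next = max(candidates, key=lambda t: acquisition(
            np.asarray(t, dtype=float), gp_cost, gp_props, eps, best))

        c, r = evaluate_model(theta_next)
        X.append(theta_next); costs.append(c); robs.append(r)

    # Feasible set: every sampled configuration clearing all robustness thresholds.
    return [(x, c, r) for x, c, r in zip(X, costs, robs)
            if all(ri >= ei for ri, ei in zip(r, eps))]
```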
5. Empirical Evaluation Across Architectures and Properties
TOGGLE was evaluated on GPT-2 (124M), DeepSeek-V2 7B, LLAMA 3 8B, and Mistral 7B. Datasets and thresholds per property:
| Property | Dataset | Metric / Threshold |
|---|---|---|
| Sequential coherence | LAMBADA | JSD threshold |
| Long-range dependency | WikiText-2 | cosine-similarity threshold |
| Contextual consistency | Dialogue corpus | cosine-similarity threshold |
| Factual accuracy | TruthfulQA | probability-ratio threshold |
Representative compression results (“Optimal mode”):
| Model | Baseline FLOPs/Tok | FLOPs/Tok (× reduc.) | Model Size ↓ (%) |
|---|---|---|---|
| GPT-2 | 0.08 | 0.05 (1.6×) | 43.8 |
| DeepSeek-7B | 9.5 | 5.9 (1.6×) | 48.1 |
| LLAMA 8B | 14.6 | 8.1 (1.8×) | 40.0 |
| Mistral 7B | 9.5 | 5.4 (1.8×) | 53.1 |
“Relaxed mode” achieves up to 3.3× FLOPs and 68.8% size reduction (Mistral 7B).
In all selected configurations, $r_i(\theta) \geq \epsilon_i$ holds for every specified property, formally certifying property satisfaction; AvgPP aligns with the 99%/95%/85% targets.
6. Baseline Comparisons and Formal Guarantees
TOGGLE was explicitly compared to:
- Uniform 8-bit quantization and 50% pruning: failed to meet robustness thresholds for long-range or factual metrics.
- Distillation (student models): required retraining, lacked STL-based formal satisfaction, and produced lower quality on LAMBADA and TruthfulQA.
TOGGLE uniquely provides:
- Verifiable satisfaction ($r_i(\theta) \geq \epsilon_i$) for all specified properties.
- Larger compression ratios for equal drops in task performance.
- Compression with no retraining overhead (Khalil et al., 18 Dec 2025).
7. Edge Deployment and Operational Considerations
On ARM-based SoCs with ≤4GB on-chip memory and no large FP16 units, TOGGLE-compressed models exploit INT4/INT8 operations and sparsity for efficient inference. The observed 2–3.3× reduction in GFLOPs/token yields 1.5–2.5× speedups on INT-enabled NPUs and reduces memory footprint by 40–69%, enabling deployment within constrained local caches.
Deployment process:
- Compressed weight/mask files require no retraining.
- STL verification runs as an offline step during BO; released models carry a formal guarantee (verify-stamp).
- Run-time quantization/pruning switches allow for “Strict” (quality-preserving) vs. “Relaxed” (energy-saving) operation without reprocessing; a selection sketch follows this list.
- Kernel compatibility with accelerated inference engines (e.g., TensorRT, ONNX Runtime) is achieved via bespoke mixed-precision and sparse kernel export.
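A minimal sketch of how such a mode switch could select among pre-verified configurations; the AvgPP tiers follow the 99%/95%/85% targets above, while the record fields, mode names, and function are hypothetical.

```python
# Illustrative operating-mode selection over a pre-verified feasible set
# (produced offline by the BO loop); everything here except the AvgPP tiers
# is an assumption.

MODE_TARGETS = {"strict": 0.99, "optimal": 0.95, "relaxed": 0.85}

def select_mode(feasible, mode):
    """feasible: list of dicts with keys 'config', 'cost', and 'avg_pp'.
    Returns the cheapest verified configuration meeting the mode's AvgPP target."""
    target = MODE_TARGETS[mode]
    eligible = [f for f in feasible if f["avg_pp"] >= target]
    return min(eligible, key=lambda f: f["cost"])
```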
By integrating formal linguistic guarantees, TOGGLE enables reliable, efficient, and formally verified LLM deployment on edge hardware without compromising critical model behaviors (Khalil et al., 18 Dec 2025).