Temporal Logic-Guided LLM Compression (TOGGLE)

Updated 25 December 2025
  • The paper presents TOGGLE, a framework that uses Signal Temporal Logic (STL) and robustness-guided Bayesian Optimization to compress LLMs without retraining.
  • It achieves up to a 3.3× FLOPs reduction and a 68.8% model-size decrease while preserving critical linguistic properties such as coherence and factual accuracy.
  • The method optimizes layer-wise quantization and pruning via Gaussian Processes, enabling efficient model deployment on resource-constrained edge hardware.

Temporal Logic-Guided LLM Compression (TOGGLE) is a formal-methods-based framework designed to produce compressed LLMs for edge deployment. It employs Signal Temporal Logic (STL) to explicitly specify and guarantee the preservation of essential linguistic properties during the compression process. Unlike standard quantization, pruning, or knowledge distillation techniques, TOGGLE uses robustness-guided Bayesian Optimization to explore the layer-wise quantization and pruning configuration space, yielding models that obey user-defined linguistic constraints without retraining or fine-tuning. By leveraging STL-based formal specification, TOGGLE delivers compressed LLMs with up to 3.3× computational cost reduction and 68.8% model size reduction, while ensuring all specified properties are maintained, enabling deployability on resource-constrained edge hardware (Khalil et al., 18 Dec 2025).

1. Formal Foundations: Signal Temporal Logic and Linguistic Properties

TOGGLE integrates Signal Temporal Logic (STL) as its core formalism for specifying and verifying linguistic properties during compression. STL expresses temporal properties over real-valued signals, using atomic predicates $\mu ::= f(x(t)) \geq 0$ applied to time-dependent signal vectors $x(t)$. Formulae are constructed recursively:

  • $\varphi ::= \mu \mid \neg\varphi \mid \varphi_1 \land \varphi_2 \mid G_{[a,b]}\varphi \mid F_{[a,b]}\varphi \mid \varphi_1\, U_{[a,b]}\, \varphi_2$, where $G$, $F$, and $U$ denote the “always”, “eventually”, and “until” modalities over intervals $[a,b]$.

Quantitative semantics are defined by the robustness function $\rho_\varphi(x,t)$, computed inductively (a minimal code sketch follows the list below):

  • Atomic: $\rho_{f(x)\geq 0}(x,t) = f(x(t))$.
  • Negation: $\rho_{\neg\varphi}(x,t) = -\rho_\varphi(x,t)$.
  • Conjunction: $\rho_{\varphi_1\land\varphi_2}(x,t) = \min\big(\rho_{\varphi_1}(x,t),\, \rho_{\varphi_2}(x,t)\big)$.
  • Always: $\rho_{G_{[a,b]}\varphi}(x,t) = \min_{t'\in[t+a,\,t+b]} \rho_\varphi(x,t')$.
  • Eventually: $\rho_{F_{[a,b]}\varphi}(x,t) = \max_{t'\in[t+a,\,t+b]} \rho_\varphi(x,t')$.
  • Until: $\rho_{\varphi_1 U_{[a,b]}\varphi_2}(x,t) = \sup_{t'\in[t+a,\,t+b]} \min\big(\rho_{\varphi_2}(x,t'),\, \inf_{\tau\in[t,t']} \rho_{\varphi_1}(x,\tau)\big)$.
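
The pointwise min/max structure of these semantics is straightforward to implement. Below is a minimal Python sketch (illustrative only, not the paper's monitor) for the atomic, always, and eventually cases over a discrete-time signal; the example predicate is an arbitrary choice:

```python
# A minimal sketch of STL quantitative semantics over a discrete-time signal.
# Function names and the example predicate are illustrative, not taken from the paper.
import numpy as np

def rho_atomic(f, x, t):
    """rho_{f(x) >= 0}(x, t) = f(x(t))."""
    return f(x[t])

def rho_always(f, x, t, a, b):
    """rho_{G_[a,b] (f(x) >= 0)}(x, t): min of f(x(t')) over t' in [t+a, t+b]."""
    return min(f(x[tp]) for tp in range(t + a, t + b + 1))

def rho_eventually(f, x, t, a, b):
    """rho_{F_[a,b] (f(x) >= 0)}(x, t): max of f(x(t')) over t' in [t+a, t+b]."""
    return max(f(x[tp]) for tp in range(t + a, t + b + 1))

# Example: scalar signal x(t) and predicate f(x) = 0.25 - x ("x stays below 0.25").
x = np.array([0.10, 0.12, 0.30, 0.08, 0.05])
f = lambda v: 0.25 - v
print(rho_atomic(f, x, t=0))                # 0.15: predicate holds at t = 0
print(rho_always(f, x, t=0, a=0, b=4))      # -0.05: "always" violated (at t = 2)
print(rho_eventually(f, x, t=0, a=0, b=4))  #  0.20: "eventually" satisfied
```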

In TOGGLE, STL predicates are constructed for:

  • Sequential coherence: Jensen–Shannon divergence on next-token distributions,
  • Long-range dependency: cosine similarity on attention maps,
  • Contextual consistency: cosine similarity on hidden-state embeddings,
  • Factual accuracy: probability-ratio on ground-truth token probabilities.

For a model configuration $K$ and dataset $D$, property $\varphi_i$ yields the minimum robustness

$$\rho_i^{\min}(K) = \min_{d\in D}\ \min_{t=1,\dots,T'} \rho_{\varphi_i}\big(O^{d,K}, t\big)$$

where $O^{d,K}(t)$ denotes the monitored model outputs.
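
As an illustration of how $\rho_i^{\min}(K)$ might be evaluated for the coherence property (JSD on next-token distributions, with the threshold $\epsilon = 0.25$ used in Section 5), the sketch below assumes the predicate compares the compressed model's next-token distribution against the uncompressed baseline's at each step; that comparison target and all function names are assumptions, not the paper's implementation:

```python
# Hypothetical evaluation of rho_i^min(K) for a JSD-based coherence predicate
# (eps - JSD >= 0). The baseline-vs-compressed comparison is an assumption.
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def coherence_robustness(base_dists, comp_dists, threshold=0.25):
    """Per-step robustness rho(t) = threshold - JSD(P_base(t), P_comp(t))."""
    return np.array([threshold - js_divergence(p, q)
                     for p, q in zip(base_dists, comp_dists)])

def min_robustness(dataset_outputs, threshold=0.25):
    """rho_i^min(K): min over samples d in D of the min over time steps t."""
    return min(coherence_robustness(base, comp, threshold).min()
               for base, comp in dataset_outputs)

# Toy example: one sample, three time steps, vocabulary of size 4.
base = [[0.7, 0.1, 0.1, 0.1], [0.4, 0.3, 0.2, 0.1], [0.25, 0.25, 0.25, 0.25]]
comp = [[0.6, 0.2, 0.1, 0.1], [0.3, 0.4, 0.2, 0.1], [0.25, 0.25, 0.25, 0.25]]
print(min_robustness([(base, comp)]))  # positive -> coherence predicate holds
```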

2. Robustness-Guided Bayesian Optimization Procedure

Compression configurations are selected via Bayesian Optimization (BO) guided by STL robustness. The optimization problem is:

$$K^* = \arg\min_{K\in\mathcal{C}} E(K) \quad \text{s.t.}\ \forall i,\ \rho_i^{\min}(K) \geq \theta_i$$

where $E(K)$ is the estimated FLOPs cost and the $\theta_i$ are robustness thresholds (typically $0$).

Independent Gaussian Processes (GPs) model:

  • $f(K) = E(K)$ (cost surrogate)
  • $g_i(K) = \rho_i^{\min}(K)$ (property-robustness surrogates)

At each BO iteration, the acquisition function

$$\alpha(K) = EI_f(K) \times P\Big[\bigwedge_i g_i(K) \geq \theta_i\Big]$$

is maximized, balancing cost reduction against formal property satisfaction, where $EI_f(K)$ is the expected improvement on FLOPs and $P[\cdots]$ is the joint probability that all $g_i$ exceed their thresholds.
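
A minimal sketch of this acquisition, written in terms of the GP posterior means and standard deviations and exploiting the independence of the property GPs (so the joint feasibility probability factorizes); the numbers in the example are placeholders, not values from the paper:

```python
# Sketch of alpha(K) = EI_f(K) * P[ all g_i(K) >= theta_i ] given GP posteriors.
# Inputs are posterior means/std-devs at a candidate K; example values are made up.
import numpy as np
from scipy.stats import norm

def expected_improvement(mu_f, sigma_f, f_best):
    """Expected improvement for *minimizing* the FLOPs surrogate f."""
    sigma_f = max(sigma_f, 1e-9)
    z = (f_best - mu_f) / sigma_f
    return (f_best - mu_f) * norm.cdf(z) + sigma_f * norm.pdf(z)

def feasibility_probability(mu_g, sigma_g, theta):
    """Product over properties of P[g_i(K) >= theta_i] under independent GPs."""
    mu_g, sigma_g, theta = map(np.asarray, (mu_g, sigma_g, theta))
    return float(np.prod(norm.sf((theta - mu_g) / np.maximum(sigma_g, 1e-9))))

def acquisition(mu_f, sigma_f, f_best, mu_g, sigma_g, theta):
    return expected_improvement(mu_f, sigma_f, f_best) * \
           feasibility_probability(mu_g, sigma_g, theta)

# Candidate with predicted cost 5.0 +/- 0.8 (best feasible cost so far 6.0) and
# four predicted property robustness values, thresholds theta_i = 0.
print(acquisition(5.0, 0.8, 6.0,
                  mu_g=[0.10, 0.20, 0.05, 0.15],
                  sigma_g=[0.05, 0.10, 0.05, 0.10],
                  theta=[0.0] * 4))
```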

3. Compression Parameterization and Cost Modeling

LLMs are parameterized by layer-wise configurations:

  • For layer $l$ and compressible component $c \in C_l$:
    • Bit-width $b_{l,c} \in \{2, 3, \dots, 16\}$
    • Pruning ratio $p_{l,c} \in \{0.0, 0.1, \dots, 0.5\}$

A configuration vector $K$ concatenates all $(b_{l,c}, p_{l,c})$ pairs, forming a discrete search space $\mathcal{C}$ of exponential cardinality.
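
A small sketch of this parameterization, with assumed component names and layer count (the paper does not specify these identifiers here):

```python
# Hypothetical encoding of a configuration K: one (bit-width, pruning-ratio) pair per
# compressible component per layer. Component names and layer count are assumptions.
import random

BITS = list(range(2, 17))                       # b_{l,c} in {2, ..., 16}
PRUNE = [round(0.1 * i, 1) for i in range(6)]   # p_{l,c} in {0.0, 0.1, ..., 0.5}
COMPONENTS = ["attn", "mlp"]                    # assumed compressible components per layer
NUM_LAYERS = 12                                 # e.g., a GPT-2-small-sized model

def random_configuration(rng=random):
    """Sample one K: a mapping (layer, component) -> (bit-width, pruning ratio)."""
    return {(l, c): (rng.choice(BITS), rng.choice(PRUNE))
            for l in range(NUM_LAYERS) for c in COMPONENTS}

K = random_configuration()
space_size = (len(BITS) * len(PRUNE)) ** (NUM_LAYERS * len(COMPONENTS))
print(f"{len(K)} per-component choices; |C| ≈ {space_size:.1e}")
```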

The computational cost of a configuration is:

$$E(K) = C \sum_{l,c} (1 - p_{l,c})\, |W_{l,c}|\, S\, \frac{b_{l,c}}{b_{\mathrm{ref}}}$$

where $|W_{l,c}|$ is the parameter count of component $c$ in layer $l$, $S$ is the sequence length, $b_{\mathrm{ref}} = 16$, and $C \approx 2$ accounts for multiply-accumulate operations.
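
Continuing the sketch above, the cost estimate $E(K)$ follows directly from the configuration and per-component parameter counts; the counts and sequence length below are placeholders:

```python
# Hypothetical evaluation of E(K) = C * sum_{l,c} (1 - p_{l,c}) |W_{l,c}| S b_{l,c} / b_ref.
B_REF, C_MAC, SEQ_LEN = 16, 2, 1024   # b_ref, multiply-accumulate factor, sequence length S

def estimated_flops(config, param_counts):
    """FLOPs estimate for configuration K (dict (layer, comp) -> (bits, prune ratio))."""
    return C_MAC * sum((1.0 - p) * param_counts[key] * SEQ_LEN * (b / B_REF)
                       for key, (b, p) in config.items())

# Example with two components of one layer (parameter counts are made up):
param_counts = {(0, "attn"): 2_359_296, (0, "mlp"): 4_718_592}
config = {(0, "attn"): (8, 0.2), (0, "mlp"): (4, 0.5)}
print(f"{estimated_flops(config, param_counts):.3e} estimated FLOPs per sequence")
```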

Evaluating a constraint $g_i(K)$ involves running a forward pass for each data sample, collecting the monitored outputs, and computing the STL robustness.

4. Algorithmic Workflow and Operating Modes

TOGGLE’s pipeline consists of:

  1. Linguistic Property Specification: Encode coherence, dependency, consistency, and factuality as STL formulae $\varphi_1, \dots, \varphi_4$, each evaluated over the horizon $[1, T']$.
  2. Initial Sampling: Randomly sample $\sim 10$ configurations to fit the GP priors.
  3. BO Loop (a skeleton is sketched after this list): For each of up to $N = 200$ iterations:
    • Evaluate $\alpha(K)$ and select $K_k = \arg\max_K \alpha(K)$.
    • Evaluate $E(K_k)$ and $g_i(K_k)$ via model instantiation and inference on $D$.
    • Update the GPs with the new observations $(K_k, E(K_k))$ and $(K_k, g_i(K_k))$.
  4. Feasible Set and Pareto Front Discovery: Identify all $K$ with $g_i(K) \geq \theta_i$ for every $i$; plot FLOPs against minimum robustness.
  5. Operating Mode Selection: For a user-specified AvgPP (Average Property Preservation) target of 99%, 95%, or 85%, select the $K$ with the desired property guarantee and minimal cost; no retraining is performed.
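
Putting the pieces together, the loop in step 3 can be sketched as follows; `evaluate_cost` and `evaluate_robustness` stand in for real model instantiation plus STL monitoring, and the whole skeleton is an illustrative reconstruction rather than the paper's implementation:

```python
# Illustrative skeleton of the robustness-guided BO loop (not the paper's code).
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def toggle_bo(candidates, encode, evaluate_cost, evaluate_robustness, thresholds,
              n_init=10, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    idx = [int(i) for i in rng.choice(len(candidates), size=n_init, replace=False)]
    X = [encode(candidates[i]) for i in idx]
    y_cost = [evaluate_cost(candidates[i]) for i in idx]
    y_rob = [evaluate_robustness(candidates[i]) for i in idx]   # one rho_i^min per property

    gp_cost = GaussianProcessRegressor(normalize_y=True)
    gp_rob = [GaussianProcessRegressor(normalize_y=True) for _ in thresholds]

    for _ in range(n_iter):
        if len(idx) == len(candidates):
            break
        gp_cost.fit(X, y_cost)
        for i, gp in enumerate(gp_rob):
            gp.fit(X, [r[i] for r in y_rob])

        feasible_costs = [c for c, r in zip(y_cost, y_rob)
                          if all(ri >= t for ri, t in zip(r, thresholds))]
        f_best = min(feasible_costs) if feasible_costs else max(y_cost)

        def alpha(K):
            x = np.atleast_2d(encode(K))
            mu_f, s_f = gp_cost.predict(x, return_std=True)
            z = (f_best - mu_f) / np.maximum(s_f, 1e-9)
            ei = (f_best - mu_f) * norm.cdf(z) + s_f * norm.pdf(z)
            p_feas = 1.0
            for gp, t in zip(gp_rob, thresholds):
                mu_g, s_g = gp.predict(x, return_std=True)
                p_feas *= norm.sf((t - mu_g) / np.maximum(s_g, 1e-9))
            return float((ei * p_feas)[0])

        # Select the unseen candidate maximizing alpha(K), then evaluate and record it.
        j_next, K_next = max(((j, K) for j, K in enumerate(candidates) if j not in idx),
                             key=lambda jk: alpha(jk[1]))
        idx.append(j_next)
        X.append(encode(K_next))
        y_cost.append(evaluate_cost(K_next))
        y_rob.append(evaluate_robustness(K_next))

    # Feasible set: evaluated configurations meeting every robustness threshold.
    return [(candidates[i], c) for i, c, r in zip(idx, y_cost, y_rob)
            if all(ri >= t for ri, t in zip(r, thresholds))]
```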

5. Empirical Evaluation Across Architectures and Properties

TOGGLE was evaluated on GPT-2 (124M), DeepSeek-V2 7B, LLAMA 3 8B, and Mistral 7B. Datasets and thresholds per property:

| Property | Dataset | Metric / threshold |
|---|---|---|
| Sequential coherence | LAMBADA | JSD, $\epsilon = 0.25$ |
| Long-range dependency | WikiText-2 | cosine similarity, $\delta = 0.70$ |
| Contextual consistency | Dialogue corpus | cosine similarity, $\gamma = 0.70$ |
| Factual accuracy | TruthfulQA | probability ratio, $\tau = 0.70$ |

Representative compression results (“Optimal mode”):

| Model | Baseline GFLOPs/token | Compressed GFLOPs/token (× reduction) | Model size ↓ (%) |
|---|---|---|---|
| GPT-2 | 0.08 | 0.05 (1.6×) | 43.8 |
| DeepSeek-7B | 9.5 | 5.9 (1.6×) | 48.1 |
| LLAMA 8B | 14.6 | 8.1 (1.8×) | 40.0 |
| Mistral 7B | 9.5 | 5.4 (1.8×) | 53.1 |

“Relaxed mode” achieves up to a 3.3× FLOPs reduction and a 68.8% size reduction (Mistral 7B).

In all reported configurations, $\rho_i^{\min}(K) \geq 0$, formally certifying property satisfaction; AvgPP aligns with the 99%/95%/85% targets.

6. Baseline Comparisons and Formal Guarantees

TOGGLE was explicitly compared to:

  • Uniform 8-bit quantization and 50% pruning: failed to meet robustness thresholds for long-range or factual metrics.
  • Distillation (student models): required retraining, lacked STL-based formal satisfaction, and produced lower quality on LAMBADA and TruthfulQA.

TOGGLE uniquely provides:

  • Verifiable satisfaction ($\rho_i^{\min}(K) \geq 0$) of all specified properties.
  • Larger compression ratios for equal drops in task performance.
  • Compression with no retraining overhead (Khalil et al., 18 Dec 2025).

7. Edge Deployment and Operational Considerations

On ARM-based SoCs with ≤4GB on-chip memory and no large FP16 units, TOGGLE-compressed models exploit INT4/INT8 operations and sparsity for efficient inference. The observed 2–3.3× reduction in GFLOPs/token yields 1.5–2.5× speedups on INT-enabled NPUs and reduces memory footprint by 40–69%, enabling deployment within constrained local caches.

Deployment process:

  • Compressed weight/mask files require no retraining.
  • STL verification runs as an offline step during BO; released models carry a formal guarantee (verify-stamp).
  • Run-time quantization/pruning switches allow for “Strict” (quality) vs. “Relaxed” (energy-saving) operation without reprocessing.
  • Kernel compatibility with accelerated inference engines (e.g., TensorRT, ONNX Runtime) is achieved via bespoke mixed-precision and sparse kernel export.

By integrating formal linguistic guarantees, TOGGLE enables reliable, efficient, and formally verified LLM deployment on edge hardware without compromising critical model behaviors (Khalil et al., 18 Dec 2025).
