Temporal Logic-Guided LLM Compression (TOGGLE)

Updated 25 December 2025
  • The paper presents TOGGLE, a framework that uses Signal Temporal Logic (STL) and robustness-guided Bayesian Optimization to compress LLMs without retraining.
  • It achieves up to a 3.3× FLOPs reduction and a 68.8% model-size decrease while preserving critical linguistic properties such as coherence and factual accuracy.
  • The method optimizes layer-wise quantization and pruning via Gaussian Processes, enabling efficient model deployment on resource-constrained edge hardware.

Temporal Logic-Guided LLM Compression (TOGGLE) is a formal-methods-based framework designed to produce compressed LLMs for edge deployment. It employs Signal Temporal Logic (STL) to explicitly specify and guarantee the preservation of essential linguistic properties during the compression process. Unlike standard quantization, pruning, or knowledge distillation techniques, TOGGLE uses robustness-guided Bayesian Optimization to explore the layer-wise quantization and pruning configuration space, yielding models that obey user-defined linguistic constraints without retraining or fine-tuning. By leveraging STL-based formal specification, TOGGLE delivers compressed LLMs with up to 3.3× computational cost reduction and 68.8% model size reduction, while ensuring all specified properties are maintained, enabling deployability on resource-constrained edge hardware (Khalil et al., 18 Dec 2025).

1. Formal Foundations: Signal Temporal Logic and Linguistic Properties

TOGGLE integrates Signal Temporal Logic (STL) as its core formalism for specifying and verifying linguistic properties during compression. STL expresses temporal properties over real-valued signals, using atomic predicates $\mu ::= f(x(t)) \geq 0$ applied to time-dependent signal vectors $x(t)$. Formulae are constructed recursively:

  • $\varphi ::= \mu \mid \neg\varphi \mid \varphi_1 \land \varphi_2 \mid G_{[a,b]}\varphi \mid F_{[a,b]}\varphi \mid \varphi_1\, U_{[a,b]}\, \varphi_2$, where $G$, $F$, and $U$ denote the “always”, “eventually”, and “until” modalities over intervals $[a,b]$.

Quantitative semantics are defined by the robustness function $\rho_\varphi(x,t)$, computed inductively (a minimal code sketch follows the list below):

  • Atomic: $\rho_{f(x)\geq 0}(x,t) = f(x(t))$.
  • Negation: $\rho_{\neg\varphi}(x,t) = -\rho_\varphi(x,t)$.
  • Conjunction: $\rho_{\varphi_1\land\varphi_2}(x,t) = \min\big(\rho_{\varphi_1}(x,t),\, \rho_{\varphi_2}(x,t)\big)$.
  • Always: $\rho_{G_{[a,b]}\varphi}(x,t) = \min_{t'\in[t+a,\,t+b]} \rho_\varphi(x,t')$.
  • Eventually: $\rho_{F_{[a,b]}\varphi}(x,t) = \max_{t'\in[t+a,\,t+b]} \rho_\varphi(x,t')$.
  • Until: $\rho_{\varphi_1 U_{[a,b]}\varphi_2}(x,t) = \sup_{t'\in[t+a,\,t+b]} \min\big(\rho_{\varphi_2}(x,t'),\, \inf_{\tau\in[t,t']} \rho_{\varphi_1}(x,\tau)\big)$.
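
The pointwise min/max structure of these semantics is straightforward to implement. Below is a minimal Python sketch (illustrative only, not the paper's monitor) for the atomic, always, and eventually cases over a discrete-time signal; the example predicate is an arbitrary choice:

```python
# A minimal sketch of STL quantitative semantics over a discrete-time signal.
# Function names and the example predicate are illustrative, not taken from the paper.
import numpy as np

def rho_atomic(f, x, t):
    """rho_{f(x) >= 0}(x, t) = f(x(t))."""
    return f(x[t])

def rho_always(f, x, t, a, b):
    """rho_{G_[a,b] (f(x) >= 0)}(x, t): min of f(x(t')) over t' in [t+a, t+b]."""
    return min(f(x[tp]) for tp in range(t + a, t + b + 1))

def rho_eventually(f, x, t, a, b):
    """rho_{F_[a,b] (f(x) >= 0)}(x, t): max of f(x(t')) over t' in [t+a, t+b]."""
    return max(f(x[tp]) for tp in range(t + a, t + b + 1))

# Example: scalar signal x(t) and predicate f(x) = 0.25 - x ("x stays below 0.25").
x = np.array([0.10, 0.12, 0.30, 0.08, 0.05])
f = lambda v: 0.25 - v
print(rho_atomic(f, x, t=0))                # 0.15: predicate holds at t = 0
print(rho_always(f, x, t=0, a=0, b=4))      # -0.05: "always" violated (at t = 2)
print(rho_eventually(f, x, t=0, a=0, b=4))  #  0.20: "eventually" satisfied
```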

In TOGGLE, STL predicates are constructed for:

  • Sequential coherence: Jensen–Shannon divergence on next-token distributions,
  • Long-range dependency: cosine similarity on attention maps,
  • Contextual consistency: cosine similarity on hidden-state embeddings,
  • Factual accuracy: probability-ratio on ground-truth token probabilities.

For a model configuration $K$ and dataset $D$, property $\varphi_i$ yields the minimum robustness

$$\rho_i^{\min}(K) = \min_{d\in D}\ \min_{t=1,\dots,T'} \rho_{\varphi_i}\big(O^{d,K}, t\big)$$

where $O^{d,K}(t)$ denotes the monitored model outputs.
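
As an illustration of how $\rho_i^{\min}(K)$ might be evaluated for the coherence property (JSD on next-token distributions, with the threshold $\epsilon = 0.25$ used in Section 5), the sketch below assumes the predicate compares the compressed model's next-token distribution against the uncompressed baseline's at each step; that comparison target and all function names are assumptions, not the paper's implementation:

```python
# Hypothetical evaluation of rho_i^min(K) for a JSD-based coherence predicate
# (eps - JSD >= 0). The baseline-vs-compressed comparison is an assumption.
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def coherence_robustness(base_dists, comp_dists, threshold=0.25):
    """Per-step robustness rho(t) = threshold - JSD(P_base(t), P_comp(t))."""
    return np.array([threshold - js_divergence(p, q)
                     for p, q in zip(base_dists, comp_dists)])

def min_robustness(dataset_outputs, threshold=0.25):
    """rho_i^min(K): min over samples d in D of the min over time steps t."""
    return min(coherence_robustness(base, comp, threshold).min()
               for base, comp in dataset_outputs)

# Toy example: one sample, three time steps, vocabulary of size 4.
base = [[0.7, 0.1, 0.1, 0.1], [0.4, 0.3, 0.2, 0.1], [0.25, 0.25, 0.25, 0.25]]
comp = [[0.6, 0.2, 0.1, 0.1], [0.3, 0.4, 0.2, 0.1], [0.25, 0.25, 0.25, 0.25]]
print(min_robustness([(base, comp)]))  # positive -> coherence predicate holds
```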

2. Robustness-Guided Bayesian Optimization Procedure

Compression configurations are selected via Bayesian Optimization (BO) guided by STL robustness. The optimization problem is:

$$K^* = \arg\min_{K\in\mathcal{C}} E(K) \quad \text{s.t.}\ \forall i,\ \rho_i^{\min}(K) \geq \theta_i$$

where $E(K)$ is the estimated FLOPs cost and the $\theta_i$ are robustness thresholds (typically $0$).

Independent Gaussian Processes (GPs) model:

  • $f(K) = E(K)$ (cost surrogate)
  • $g_i(K) = \rho_i^{\min}(K)$ (property-robustness surrogates)

At each BO iteration, the acquisition function

$$\alpha(K) = EI_f(K) \times P\Big[\bigwedge_i g_i(K) \geq \theta_i\Big]$$

is maximized, balancing cost reduction against formal property satisfaction, where $EI_f(K)$ is the expected improvement on FLOPs and $P[\cdots]$ is the joint probability that all $g_i$ exceed their thresholds.
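
A minimal sketch of this acquisition, written in terms of the GP posterior means and standard deviations and exploiting the independence of the property GPs (so the joint feasibility probability factorizes); the numbers in the example are placeholders, not values from the paper:

```python
# Sketch of alpha(K) = EI_f(K) * P[ all g_i(K) >= theta_i ] given GP posteriors.
# Inputs are posterior means/std-devs at a candidate K; example values are made up.
import numpy as np
from scipy.stats import norm

def expected_improvement(mu_f, sigma_f, f_best):
    """Expected improvement for *minimizing* the FLOPs surrogate f."""
    sigma_f = max(sigma_f, 1e-9)
    z = (f_best - mu_f) / sigma_f
    return (f_best - mu_f) * norm.cdf(z) + sigma_f * norm.pdf(z)

def feasibility_probability(mu_g, sigma_g, theta):
    """Product over properties of P[g_i(K) >= theta_i] under independent GPs."""
    mu_g, sigma_g, theta = map(np.asarray, (mu_g, sigma_g, theta))
    return float(np.prod(norm.sf((theta - mu_g) / np.maximum(sigma_g, 1e-9))))

def acquisition(mu_f, sigma_f, f_best, mu_g, sigma_g, theta):
    return expected_improvement(mu_f, sigma_f, f_best) * \
           feasibility_probability(mu_g, sigma_g, theta)

# Candidate with predicted cost 5.0 +/- 0.8 (best feasible cost so far 6.0) and
# four predicted property robustness values, thresholds theta_i = 0.
print(acquisition(5.0, 0.8, 6.0,
                  mu_g=[0.10, 0.20, 0.05, 0.15],
                  sigma_g=[0.05, 0.10, 0.05, 0.10],
                  theta=[0.0] * 4))
```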

3. Compression Parameterization and Cost Modeling

LLMs are parameterized by layer-wise configurations:

  • For layer $l$ and compressible component $c \in C_l$:
    • Bit-width $b_{l,c} \in \{2, 3, \dots, 16\}$
    • Pruning ratio $p_{l,c} \in \{0.0, 0.1, \dots, 0.5\}$

A configuration vector $K$ concatenates all $(b_{l,c}, p_{l,c})$ pairs, forming a discrete search space $\mathcal{C}$ of exponential cardinality.
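
A small sketch of this parameterization, with assumed component names and layer count (the paper does not specify these identifiers here):

```python
# Hypothetical encoding of a configuration K: one (bit-width, pruning-ratio) pair per
# compressible component per layer. Component names and layer count are assumptions.
import random

BITS = list(range(2, 17))                       # b_{l,c} in {2, ..., 16}
PRUNE = [round(0.1 * i, 1) for i in range(6)]   # p_{l,c} in {0.0, 0.1, ..., 0.5}
COMPONENTS = ["attn", "mlp"]                    # assumed compressible components per layer
NUM_LAYERS = 12                                 # e.g., a GPT-2-small-sized model

def random_configuration(rng=random):
    """Sample one K: a mapping (layer, component) -> (bit-width, pruning ratio)."""
    return {(l, c): (rng.choice(BITS), rng.choice(PRUNE))
            for l in range(NUM_LAYERS) for c in COMPONENTS}

K = random_configuration()
space_size = (len(BITS) * len(PRUNE)) ** (NUM_LAYERS * len(COMPONENTS))
print(f"{len(K)} per-component choices; |C| ≈ {space_size:.1e}")
```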

The computational cost of a configuration is:

$$E(K) = C \sum_{l,c} (1 - p_{l,c})\, |W_{l,c}|\, S\, \frac{b_{l,c}}{b_{\mathrm{ref}}}$$

where $|W_{l,c}|$ is the parameter count of component $c$ in layer $l$, $S$ is the sequence length, $b_{\mathrm{ref}} = 16$, and $C \approx 2$ accounts for multiply-accumulate operations.
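
Continuing the sketch above, the cost estimate $E(K)$ follows directly from the configuration and per-component parameter counts; the counts and sequence length below are placeholders:

```python
# Hypothetical evaluation of E(K) = C * sum_{l,c} (1 - p_{l,c}) |W_{l,c}| S b_{l,c} / b_ref.
B_REF, C_MAC, SEQ_LEN = 16, 2, 1024   # b_ref, multiply-accumulate factor, sequence length S

def estimated_flops(config, param_counts):
    """FLOPs estimate for configuration K (dict (layer, comp) -> (bits, prune ratio))."""
    return C_MAC * sum((1.0 - p) * param_counts[key] * SEQ_LEN * (b / B_REF)
                       for key, (b, p) in config.items())

# Example with two components of one layer (parameter counts are made up):
param_counts = {(0, "attn"): 2_359_296, (0, "mlp"): 4_718_592}
config = {(0, "attn"): (8, 0.2), (0, "mlp"): (4, 0.5)}
print(f"{estimated_flops(config, param_counts):.3e} estimated FLOPs per sequence")
```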

Evaluating a constraint $g_i(K)$ involves running a forward pass for each data sample, collecting the monitored outputs, and computing the STL robustness.

4. Algorithmic Workflow and Operating Modes

TOGGLE’s pipeline consists of:

  1. Linguistic Property Specification: Encode coherence, dependency, consistency, and factuality as STL formulae $\varphi_1, \dots, \varphi_4$, each evaluated over the horizon $[1, T']$.
  2. Initial Sampling: Randomly sample $\sim 10$ configurations to fit the GP priors.
  3. BO Loop (a skeleton is sketched after this list): For each of up to $N = 200$ iterations:
    • Evaluate $\alpha(K)$ and select $K_k = \arg\max_K \alpha(K)$.
    • Evaluate $E(K_k)$ and $g_i(K_k)$ via model instantiation and inference on $D$.
    • Update the GPs with the new observations $(K_k, E(K_k))$ and $(K_k, g_i(K_k))$.
  4. Feasible Set and Pareto Front Discovery: Identify all $K$ with $g_i(K) \geq \theta_i$ for every $i$; plot FLOPs against minimum robustness.
  5. Operating Mode Selection: For a user-specified AvgPP (Average Property Preservation) target of 99%, 95%, or 85%, select the $K$ with the desired property guarantee and minimal cost; no retraining is performed.
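
Putting the pieces together, the loop in step 3 can be sketched as follows; `evaluate_cost` and `evaluate_robustness` stand in for real model instantiation plus STL monitoring, and the whole skeleton is an illustrative reconstruction rather than the paper's implementation:

```python
# Illustrative skeleton of the robustness-guided BO loop (not the paper's code).
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def toggle_bo(candidates, encode, evaluate_cost, evaluate_robustness, thresholds,
              n_init=10, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    idx = [int(i) for i in rng.choice(len(candidates), size=n_init, replace=False)]
    X = [encode(candidates[i]) for i in idx]
    y_cost = [evaluate_cost(candidates[i]) for i in idx]
    y_rob = [evaluate_robustness(candidates[i]) for i in idx]   # one rho_i^min per property

    gp_cost = GaussianProcessRegressor(normalize_y=True)
    gp_rob = [GaussianProcessRegressor(normalize_y=True) for _ in thresholds]

    for _ in range(n_iter):
        if len(idx) == len(candidates):
            break
        gp_cost.fit(X, y_cost)
        for i, gp in enumerate(gp_rob):
            gp.fit(X, [r[i] for r in y_rob])

        feasible_costs = [c for c, r in zip(y_cost, y_rob)
                          if all(ri >= t for ri, t in zip(r, thresholds))]
        f_best = min(feasible_costs) if feasible_costs else max(y_cost)

        def alpha(K):
            x = np.atleast_2d(encode(K))
            mu_f, s_f = gp_cost.predict(x, return_std=True)
            z = (f_best - mu_f) / np.maximum(s_f, 1e-9)
            ei = (f_best - mu_f) * norm.cdf(z) + s_f * norm.pdf(z)
            p_feas = 1.0
            for gp, t in zip(gp_rob, thresholds):
                mu_g, s_g = gp.predict(x, return_std=True)
                p_feas *= norm.sf((t - mu_g) / np.maximum(s_g, 1e-9))
            return float((ei * p_feas)[0])

        # Select the unseen candidate maximizing alpha(K), then evaluate and record it.
        j_next, K_next = max(((j, K) for j, K in enumerate(candidates) if j not in idx),
                             key=lambda jk: alpha(jk[1]))
        idx.append(j_next)
        X.append(encode(K_next))
        y_cost.append(evaluate_cost(K_next))
        y_rob.append(evaluate_robustness(K_next))

    # Feasible set: evaluated configurations meeting every robustness threshold.
    return [(candidates[i], c) for i, c, r in zip(idx, y_cost, y_rob)
            if all(ri >= t for ri, t in zip(r, thresholds))]
```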

5. Empirical Evaluation Across Architectures and Properties

TOGGLE was evaluated on GPT-2 (124M), DeepSeek-V2 7B, LLAMA 3 8B, and Mistral 7B. Datasets and thresholds per property:

| Property | Dataset | Metric / threshold |
|---|---|---|
| Sequential coherence | LAMBADA | JSD, $\epsilon = 0.25$ |
| Long-range dependency | WikiText-2 | cosine similarity, $\delta = 0.70$ |
| Contextual consistency | Dialogue corpus | cosine similarity, $\gamma = 0.70$ |
| Factual accuracy | TruthfulQA | probability ratio, $\tau = 0.70$ |

Representative compression results (“Optimal mode”):

| Model | Baseline GFLOPs/token | Compressed GFLOPs/token (× reduction) | Model size ↓ (%) |
|---|---|---|---|
| GPT-2 | 0.08 | 0.05 (1.6×) | 43.8 |
| DeepSeek-7B | 9.5 | 5.9 (1.6×) | 48.1 |
| LLAMA 8B | 14.6 | 8.1 (1.8×) | 40.0 |
| Mistral 7B | 9.5 | 5.4 (1.8×) | 53.1 |

“Relaxed mode” achieves up to a 3.3× FLOPs reduction and a 68.8% size reduction (Mistral 7B).

In all reported configurations, $\rho_i^{\min}(K) \geq 0$, formally certifying property satisfaction; AvgPP aligns with the 99%/95%/85% targets.

6. Baseline Comparisons and Formal Guarantees

TOGGLE was explicitly compared to:

  • Uniform 8-bit quantization and 50% pruning: failed to meet robustness thresholds for long-range or factual metrics.
  • Distillation (student models): required retraining, lacked STL-based formal satisfaction, and produced lower quality on LAMBADA and TruthfulQA.

TOGGLE uniquely provides:

  • Verifiable satisfaction ($\rho_i^{\min}(K) \geq 0$) of all specified properties.
  • Larger compression ratios for equal drops in task performance.
  • Compression with no retraining overhead (Khalil et al., 18 Dec 2025).

7. Edge Deployment and Operational Considerations

On ARM-based SoCs with ≤4GB on-chip memory and no large FP16 units, TOGGLE-compressed models exploit INT4/INT8 operations and sparsity for efficient inference. The observed 2–3.3× reduction in GFLOPs/token yields 1.5–2.5× speedups on INT-enabled NPUs and reduces memory footprint by 40–69%, enabling deployment within constrained local caches.

Deployment process:

  • Compressed weight/mask files require no retraining.
  • STL verification runs as an offline step during BO; released models carry a formal guarantee (verify-stamp).
  • Run-time quantization/pruning switches allow for “Strict” (quality) vs. “Relaxed” (energy-saving) operation without reprocessing.
  • Kernel compatibility with accelerated inference engines (e.g., TensorRT, ONNX Runtime) is achieved via bespoke mixed-precision and sparse kernel export.

By integrating formal linguistic guarantees, TOGGLE enables reliable, efficient, and formally verified LLM deployment on edge hardware without compromising critical model behaviors (Khalil et al., 18 Dec 2025).
