
ParetoQ: MDP & LLM Trade-off Optimization

Updated 17 May 2026
  • ParetoQ is a framework that applies Pareto optimality to balance trade-offs in multi-objective model checking and low-bit LLM quantization.
  • It uses value iteration to efficiently approximate the Pareto front in MDPs, enabling scalable, time-bounded analysis that overcomes LP-based limitations.
  • In LLM quantization, ParetoQ unifies training schedules, quantization functions, and hardware considerations to optimize the memory-accuracy trade-off.

ParetoQ refers to at least two distinct frameworks rooted in the application of Pareto optimality to high-dimensional probabilistic or quantized systems: one for multi-objective model checking of Markov decision processes (MDPs), and one for the rigorously unified evaluation of extremely low-bit quantization schemes for LLMs. Despite arising in disparate fields, both address trade-offs between competing objectives, such as size versus accuracy or multiple reward and constraint dimensions, by efficiently approximating the set of Pareto-efficient solutions under problem-appropriate constraints.

1. ParetoQ in Multi-objective Model Checking

ParetoQ, as introduced in the context of probabilistic model checking, is a value-iteration-based algorithmic framework and tool for the analysis of Markov decision processes with respect to multiple, possibly conflicting, quantitative objectives. Classical approaches to multi-objective model checking rely on linear programming (LP), but these struggle to scale to large systems and handle finite-horizon (time-bounded) properties efficiently. ParetoQ overcomes these limitations by successively approximating the Pareto front of achievable objective vectors, employing value iteration rather than LP to yield significant efficiency gains (Forejt et al., 2012).

Formal Structure

  • MDP Definition: An MDP $M = (S, s_0, \mathcal{A}, \delta)$ comprises a finite state set $S$, initial state $s_0$, finite action set $\mathcal{A}$, and transition probability function $\delta : S\times\mathcal{A}\rightarrow \text{Dist}(S)$.
  • Adversary/Policy: A possibly randomized, history-dependent adversary $\sigma$ resolves nondeterminism, producing a probability measure $\Pr_M^\sigma$ on infinite paths.
  • Objectives:
    • Reachability $[T]^{\leq k}_{\sim p}$: the probability of reaching target set $T$ within $k$ steps is at least $p$ (or at most $p$).
    • Reward: the expected cumulative reward accumulated within $k$ steps under reward structure $r$ is at least (or at most) a given threshold.
  • Pareto Query: Let $U \subseteq \mathbb{R}^n$ be the set of all achievable objective vectors for $n$ objectives. The Pareto front $P \subseteq U$ comprises those $x \in U$ such that there is no $y \in U$ with $y_i \geq x_i$ for all $i$ and $y_i > x_i$ for some $i$.
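The dominance relation in the Pareto query can be checked directly on a finite set of candidate vectors. The minimal Python sketch below assumes every objective is maximized; the helper names `dominates` and `pareto_front` and the sample vectors are hypothetical. With randomized adversaries the achievable set is convex, so this pairwise filter only applies to a finite sample of points, such as the vectors of a few fixed adversaries.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]

def dominates(y: Sequence[float], x: Sequence[float]) -> bool:
    """y dominates x: y_i >= x_i for all i and y_i > x_i for some i (all objectives maximized)."""
    return all(a >= b for a, b in zip(y, x)) and any(a > b for a, b in zip(y, x))

def pareto_front(points: List[Vector]) -> List[Vector]:
    """Return the non-dominated points in input order (O(m^2 * n) pairwise check)."""
    front: List[Vector] = []
    for x in points:
        if not any(dominates(y, x) for y in points) and x not in front:
            front.append(x)
    return front

# Hypothetical objective vectors, e.g. (success probability, expected reward) of four fixed adversaries.
points = [(0.9, 2.0), (0.7, 3.5), (0.6, 3.0), (0.9, 1.5)]
print(pareto_front(points))  # [(0.9, 2.0), (0.7, 3.5)]
```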

Successive Pareto-curve Approximation

ParetoQ iteratively constructs an $\varepsilon$-approximation to the Pareto front by maintaining a finite set of already-found Pareto-efficient points, then:

  1. Computing their downward-closed convex hull.
  2. Checking if the desired accuracy or region coverage is reached.
  3. If not, finding a weight vector $w$ that exposes an uncovered region.
  4. Solving the weighted-sum maximization $\max_\sigma w \cdot x^\sigma$ via value iteration, where $x^\sigma$ is the objective vector of adversary $\sigma$:
    • This step yields an optimal adversary $\sigma_w$ and its objective vector $x^{\sigma_w}$.
    • $x^{\sigma_w}$ is added to the working set.
  5. The process repeats until no uncovered region remains above a tolerance $\varepsilon$.
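For two objectives, the loop above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the PRISM implementation: `oracle(w)` is a hypothetical stand-in for the weighted-sum value-iteration step and must return an achievable vector maximizing $w \cdot x$; in the usage example it simply picks the best of a hypothetical list of vertices. Each candidate weight is the outward normal of the segment between two adjacent points found so far, which is how an uncovered region of the hull gets exposed.

```python
from typing import Callable, List, Set, Tuple

Point = Tuple[float, float]

def approximate_pareto_2d(oracle: Callable[[Point], Point], eps: float = 1e-9) -> List[Point]:
    """Successive weighted-sum approximation of a two-objective Pareto curve (maximization).

    oracle(w) must return an achievable point maximizing w[0]*x[0] + w[1]*x[1].
    """
    # Extreme points: optimize each objective on its own.
    found: Set[Point] = {oracle((1.0, 0.0)), oracle((0.0, 1.0))}
    verified: Set[Tuple[Point, Point]] = set()  # hull segments already known to be tight
    while True:
        pts = sorted(found)  # ascending in objective 1, descending in objective 2
        new_point = None
        for p, q in zip(pts, pts[1:]):
            if (p, q) in verified:
                continue
            # Outward normal of segment p-q; both entries are >= 0 on a Pareto curve.
            w = (p[1] - q[1], q[0] - p[0])
            r = oracle(w)
            # Gap between the best achievable weighted value and the current hull
            # (measured in units of the unnormalized weight w).
            gap = (w[0] * r[0] + w[1] * r[1]) - (w[0] * p[0] + w[1] * p[1])
            if gap > eps:
                new_point = r  # uncovered region found: refine the hull with r
                break
            verified.add((p, q))
        if new_point is None:
            return pts  # no uncovered region above the tolerance
        found.add(new_point)

# Hypothetical achievable vectors standing in for deterministic adversaries; (1.0, 1.0) is dominated.
vertices = [(0.0, 4.0), (2.0, 3.5), (3.0, 2.0), (3.5, 0.0), (1.0, 1.0)]

def vertex_oracle(w: Point) -> Point:
    # Best vertex for weight w; ties broken toward larger objective 1, then objective 2.
    return max(vertices, key=lambda v: (w[0] * v[0] + w[1] * v[1], v[0], v[1]))

print(approximate_pareto_2d(vertex_oracle))
# [(0.0, 4.0), (2.0, 3.5), (3.0, 2.0), (3.5, 0.0)]
```

Because every point returned by the oracle is a supporting point of the hull, each refinement adds a new vertex, so the loop terminates once every segment is within `eps` of the true curve.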

This approach replaces expensive global LP solves with repeated, fast weighted-sum value-iteration runs, each sweep of which costs time linear in the number of MDP transitions. For $n$ objectives, the achievable set $U \subseteq \mathbb{R}^n$ forms a convex polytope, which keeps the management of supporting hyperplanes tractable when $n$ is small (typically 2 or 3).

Support for Time-bounded Properties and Scalability

ParetoQ’s reduction of time-bounded reachability to reward computations enables seamless integration of finite-horizon objectives, which canonical LP approaches cannot handle efficiently. The implementation within PRISM’s sparse engine scales to state spaces over an order of magnitude larger than previous tools could handle. Gauss–Seidel updates and memory-efficient vector storage facilitate practical deployment.
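The reduction of a time-bounded reachability objective to a reward can be illustrated with finite-horizon backward induction on a toy MDP. In this minimal sketch the encoding `mdp[state][action] = [(prob, successor), ...]`, the helper names, and the example model are all hypothetical; target states are assumed absorbing, so a reward of 1 on the transition entering the target counts reachability exactly once. Finite-horizon optima can depend on the number of steps remaining, so the policy is stored per step. The sketch uses plain synchronous updates rather than Gauss–Seidel.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy encoding: mdp[state][action] = [(probability, successor), ...].
MDP = Dict[str, Dict[str, List[Tuple[float, str]]]]
Reward = Dict[str, Dict[str, float]]  # reward[state][action], collected per step
Policy = List[Dict[str, str]]         # policy[j - 1][state]: action with j steps remaining

def entry_reward(s: str, t: str, target: Set[str]) -> float:
    """Reachability as a reward: 1 on the transition that enters the absorbing target set."""
    return 1.0 if (t in target and s not in target) else 0.0

def weighted_vi(mdp: MDP, target: Set[str], reward: Reward, w: Tuple[float, float], k: int) -> Policy:
    """k-step backward induction maximizing w[0]*Pr(reach target) + w[1]*E[cumulative reward]."""
    value = {s: 0.0 for s in mdp}  # optimal weighted value with 0 steps remaining
    policy: Policy = []
    for _ in range(k):
        new_value: Dict[str, float] = {}
        choice: Dict[str, str] = {}
        for s, actions in mdp.items():
            best_v, best_a = float("-inf"), ""
            for a, succ in actions.items():
                v = w[1] * reward[s][a] + sum(
                    p * (w[0] * entry_reward(s, t, target) + value[t]) for p, t in succ)
                if v > best_v:  # first action wins on ties
                    best_v, best_a = v, a
            new_value[s], choice[s] = best_v, best_a
        value = new_value
        policy.append(choice)
    return policy

def evaluate(mdp: MDP, target: Set[str], reward: Reward, policy: Policy, s0: str) -> Tuple[float, float]:
    """Objective vector (reach probability, expected reward) of a time-dependent policy from s0."""
    reach = {s: 0.0 for s in mdp}
    rew = {s: 0.0 for s in mdp}
    for choice in policy:  # builds values for j = 1, 2, ..., k steps remaining
        new_reach: Dict[str, float] = {}
        new_rew: Dict[str, float] = {}
        for s in mdp:
            succ = mdp[s][choice[s]]
            new_reach[s] = sum(p * (entry_reward(s, t, target) + reach[t]) for p, t in succ)
            new_rew[s] = reward[s][choice[s]] + sum(p * rew[t] for p, t in succ)
        reach, rew = new_reach, new_rew
    return reach[s0], rew[s0]

mdp: MDP = {
    "s0": {"rush": [(0.8, "goal"), (0.2, "fail")], "wait": [(1.0, "s0")]},
    "goal": {"stay": [(1.0, "goal")]},
    "fail": {"stay": [(1.0, "fail")]},
}
reward: Reward = {"s0": {"rush": 0.0, "wait": 1.0}, "goal": {"stay": 0.0}, "fail": {"stay": 0.0}}
pol = weighted_vi(mdp, {"goal"}, reward, (1.0, 0.1), k=3)
print(evaluate(mdp, {"goal"}, reward, pol, "s0"))  # (0.8, 2.0)
```

With weights $(1, 0.1)$ the computed policy waits on the first two steps and rushes on the last, reaching the target with probability 0.8 while collecting reward 2. Evaluating each objective separately under that policy gives the point a successive-approximation loop would add to its working set.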

Applications and Empirical Performance

Key applications include:

  • Controller synthesis (e.g., scheduling on DAG job graphs), where the Pareto front directly visualizes trade-offs (e.g., a 10% speed-up may require 15% extra energy).
  • Compositional verification, supporting automated contract synthesis and failure analysis for interacting components.
  • Benchmarks show substantial speedups over LP-based tools and full support for multi-dimensional time-bounded queries.

2. ParetoQ in Extremely Low-bit LLM Quantization

ParetoQ also denotes a comprehensive framework for the joint design, quantification, and empirical assessment of quantized LLMs across 1, 1.58, 2, 3, and 4 bit settings, unifying training schedules, quantization functions, and model size–accuracy trade-offs within a single Pareto-optimality-centric methodology (Liu et al., 4 Feb 2025).

Unified Size–Accuracy Pareto Frontier

The framework establishes that, for a fixed memory budget, one can trade bit-width $b$ for parameter count $N$, yielding the effective quantized model size:

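$$\text{Size} = N \cdot b + N_{\text{emb}} \cdot b_{\text{emb}} \quad \text{(in bits)}$$

where $N$ is the number of quantized (non-embedding) parameters stored at bit width $b$, and $N_{\text{emb}}$ and $b_{\text{emb}}$ are the embedding-layer parameter count and bit width (typically 8 or 16).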
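The size formula can be inverted to see how many quantized parameters a fixed memory budget buys at each bit width. The function names and the example budget below are hypothetical, and the sketch assumes $N$ excludes the embedding layer and ignores quantization scales and bit-packing overhead.

```python
def model_size_bytes(n: float, bits: float, n_emb: float = 0.0, emb_bits: float = 16.0) -> float:
    """Size = N*b + N_emb*b_emb bits, returned in bytes (scales and packing overhead ignored)."""
    return (n * bits + n_emb * emb_bits) / 8.0

def params_for_budget(budget_bytes: float, bits: float, n_emb: float = 0.0, emb_bits: float = 16.0) -> float:
    """Largest quantized parameter count N that fits the budget at bit width `bits`."""
    return (budget_bytes * 8.0 - n_emb * emb_bits) / bits

# Hypothetical budget: 1 GB total, with a 100M-parameter embedding kept at 16 bit.
for b in (4, 3, 2, 1.58):
    n = params_for_budget(1e9, b, n_emb=1e8)
    print(f"{b} bit -> {n / 1e9:.2f}B quantized parameters")
```

Halving the bit width from 4 to 2 doubles the affordable $N$ in this example (1.6B to 3.2B); the size–accuracy frontier then asks which of these equal-memory models is most accurate.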

ParetoQ defines a joint optimization over:

  • Parameter count $N$
  • Training token count
  • Quantization bit width $b$
  • Training schedule (fraction of tokens spent in full-precision pretraining vs. QAT)
  • Quantization function

A configuration is Pareto-efficient if no other choice achieves higher accuracy for the same or smaller memory footprint.
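Checking Pareto efficiency over a list of (memory, accuracy) configurations reduces to a single sweep after sorting by size. The labels and numbers in the usage example are placeholders, not results from the paper, and the function name is hypothetical; the sketch uses standard Pareto dominance, so a larger configuration with merely equal accuracy is also discarded.

```python
from typing import List, Tuple

Config = Tuple[str, float, float]  # (label, memory footprint, accuracy)

def pareto_configs(configs: List[Config]) -> List[Config]:
    """Return configurations not dominated in (smaller memory, higher accuracy)."""
    # Smallest footprint first; at equal footprint, the most accurate comes first.
    ordered = sorted(configs, key=lambda c: (c[1], -c[2]))
    front: List[Config] = []
    best_acc = float("-inf")
    for cfg in ordered:
        # Keep cfg only if it beats every configuration that is no larger.
        if cfg[2] > best_acc:
            front.append(cfg)
            best_acc = cfg[2]
    return front

# Placeholder configurations (memory in arbitrary units, accuracy in %).
configs = [("A", 1.0, 50.0), ("B", 2.0, 55.0), ("C", 2.0, 52.0), ("D", 3.0, 54.0), ("E", 4.0, 60.0)]
print([label for label, _, _ in pareto_configs(configs)])  # ['A', 'B', 'E']
```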

Quantization Schemes and Mathematical Formulation

  • Uniform Quantization uses

$$\hat{W} = \alpha \cdot \left\lfloor \operatorname{clip}\!\left(\frac{W}{\alpha},\, Q_N,\, Q_P\right) \right\rceil$$

with a learnable scale $\alpha$, integer range $[Q_N, Q_P] = [-2^{b-1},\, 2^{b-1}-1]$, rounding to the nearest integer $\lfloor\cdot\rceil$, and straight-through estimator gradients.

  • Elastic Binarization (1 bit): $\hat{W} = \alpha \cdot \operatorname{sign}(W)$ with a learnable scale $\alpha$.
  • Stretched Elastic Quantization (SEQ) (1.58 and 2 bits): Incorporates symmetric, zero-inclusive grids for ternary (3-level, 1.58-bit) and quaternary (2-bit) quantization.
  • Learned Step-Size Quantization (LSQ) (3 and 4 bits): Standard quantizer with scale parameter learned via gradient descent.
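A forward-pass sketch of these quantizer families on plain Python lists may help fix the notation. Assumptions: the uniform quantizer uses the signed range $[-2^{b-1}, 2^{b-1}-1]$; `ternary_quant` is a plain symmetric $\{-\alpha, 0, \alpha\}$ grid standing in for SEQ, whose exact formula is not reproduced here; learning $\alpha$ (as LSQ does) is not shown, and the straight-through estimator appears only as a hand-written gradient mask. All function names are hypothetical. Python's `round` rounds halves to even.

```python
from typing import List, Tuple

def int_range(bits: int) -> Tuple[int, int]:
    """Signed integer grid [Q_N, Q_P] for a b-bit uniform quantizer."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

def uniform_quant(w: List[float], alpha: float, bits: int) -> List[float]:
    """alpha * round(clip(w / alpha, Q_N, Q_P))."""
    qn, qp = int_range(bits)
    return [alpha * round(min(max(x / alpha, qn), qp)) for x in w]

def ste_mask(w: List[float], alpha: float, bits: int, upstream: List[float]) -> List[float]:
    """Straight-through estimator: pass gradients through round(), zero them where clip saturates."""
    qn, qp = int_range(bits)
    return [g if qn <= x / alpha <= qp else 0.0 for x, g in zip(w, upstream)]

def binary_quant(w: List[float], alpha: float) -> List[float]:
    """1 bit: alpha * sign(w), with sign(0) taken as +1."""
    return [alpha if x >= 0 else -alpha for x in w]

def ternary_quant(w: List[float], alpha: float) -> List[float]:
    """Ternary: round onto the symmetric grid {-alpha, 0, +alpha}."""
    return [alpha * round(min(max(x / alpha, -1.0), 1.0)) for x in w]

w = [0.3, -1.2, 0.05, 2.0]
print(uniform_quant(w, alpha=0.5, bits=3))  # [0.5, -1.0, 0.0, 1.5]
print(ternary_quant(w, alpha=0.5))          # [0.5, -0.5, 0.0, 0.5]
print(binary_quant(w, alpha=0.5))           # [0.5, -0.5, 0.5, 0.5]
```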

Empirical Scaling Laws and Learning Regimes

  • Learning transition at 2–3 bits: an empirical drift metric, measuring how far QAT moves the weights from their full-precision values, stays small for 3 and 4 bit but is markedly larger for 1/1.58/2 bit. This signifies that sub-3-bit QAT requires weight reconstruction, not merely compensation.

  • Token allocation: across all bit widths, the best results come from spending the majority of the training-token budget on full-precision pretraining and the remainder on QAT, rather than running QAT from scratch or relying on pure PTQ.
  • Fine-tuning length: accuracy saturates after a QAT budget on the order of billions of tokens, with different saturation points for the 1–2 bit and 3–4 bit regimes.

Size–Accuracy Trade-off and Hardware Considerations

  • Memory savings: ideally (ignoring scales, embeddings, and packing overhead), 2 bit weights need one eighth of the storage of 16 bit weights, and 1.58 bit (ternary) weights roughly one tenth.
  • Throughput: 2 bit quantization yields higher tokens/s than 4 bit on Apple M1, and on NVIDIA H100 GPUs with custom CUTLASS kernels, 2 bit runs faster than FP16.
  • Accuracy: on eight zero-shot reasoning tasks and on WikiText-2 perplexity, 1.58, 2, and 3 bit quantization consistently match or exceed 4 bit at a lower memory footprint.

Recommended quantizer and bit-width are thus dictated by hardware support and application constraints:

  • 2 bit: best overall trade-off among accuracy, efficiency, and memory.
  • 1.58 bit: minimal footprint, matches 4 bit benchmarks.
  • 3 bit: highly accurate, but harder to pack efficiently for hardware.
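The packing trade-off can be made concrete. In the minimal sketch below (hypothetical helpers, idealized storage that ignores scales and embeddings, and not necessarily the layout used by ParetoQ's kernels), 2-bit codes fit exactly four per byte; ternary codes fit five per byte in base 3 because $3^5 = 243 \le 256$, i.e. 1.6 bits per weight; and 3-bit codes, although eight of them fill exactly three bytes, have two of every eight fields straddling a byte boundary, which complicates unpacking.

```python
from typing import List

def pack_2bit(codes: List[int]) -> bytes:
    """Pack 2-bit codes (0..3) four per byte; no code crosses a byte boundary."""
    out = bytearray()
    for i in range(0, len(codes), 4):
        byte = 0
        for j, c in enumerate(codes[i:i + 4]):
            byte |= (c & 0b11) << (2 * j)
        out.append(byte)
    return bytes(out)

def unpack_2bit(data: bytes, n: int) -> List[int]:
    return [(data[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(n)]

def pack_ternary(trits: List[int]) -> bytes:
    """Pack ternary codes (0..2) five per byte as base-3 digits (max byte value 242)."""
    out = bytearray()
    for i in range(0, len(trits), 5):
        byte = 0
        for t in reversed(trits[i:i + 5]):  # first trit becomes the least significant digit
            byte = byte * 3 + t
        out.append(byte)
    return bytes(out)

def unpack_ternary(data: bytes, n: int) -> List[int]:
    out: List[int] = []
    for byte in data:
        for _ in range(5):
            out.append(byte % 3)
            byte //= 3
    return out[:n]

def straddling_3bit_fields(n: int) -> int:
    """Count 3-bit fields (bits 3i..3i+2) that span two bytes in a dense bitstream."""
    return sum(1 for i in range(n) if (3 * i) // 8 != (3 * i + 2) // 8)

codes = [0, 1, 2, 3, 3, 2]
trits = [2, 0, 1, 1, 2, 0, 2]  # e.g. weights in {-1, 0, +1} shifted by +1
print(len(pack_2bit(codes)), len(pack_ternary(trits)), straddling_3bit_fields(8))  # 2 2 2
```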

3. Comparison of ParetoQ Frameworks

| Context | Objectives | Methodology | Notable achievements |
|---|---|---|---|
| Model checking (Forejt et al., 2012) | Multiple reachability/reward objectives | Successive weighted-sum value iteration | Time-bounded analysis; larger models and faster solving than LP-based tools |
| LLM quantization (Liu et al., 4 Feb 2025) | Model size vs. accuracy | Unified QAT; quantizer and training-recipe search | New frontier: 1.58/2/3 bit dominate 4 bit |

Both ParetoQ frameworks show how Pareto efficiency provides a rigorous, scalable way to navigate trade-offs in very different settings.

4. Practical Recommendations and Empirical Performance

  • For multi-objective model checking, use ParetoQ for systems whose state space or time-bounded properties make LP-based solvers intractable. Empirical evidence shows routine handling of models far larger than previous tools could analyze (Forejt et al., 2012).
  • For LLM quantization, select the bit width based on hardware support and accuracy needs. Favor 2 bit quantization if supported; otherwise, consider 1.58 bit for maximum compression or 3 bit for near-maximal accuracy. Allocate most of the training budget to full-precision pretraining, followed by QAT fine-tuning to adapt the model to quantization. ParetoQ’s empirical law indicates no advantage of 4 bit beyond implementation simplicity (Liu et al., 4 Feb 2025).

5. Foundations and Significance

Both incarnations of ParetoQ demonstrate the utility of Pareto-front exploration for surfacing optimal trade-offs in complex systems. In model checking, this enables explicit, human-readable visualizations of multi-objective performance landscapes for synthesis and verification. In quantized LLMs, it provides a rigorous, “apples-to-apples” basis for comparing quantization granularities, overturns earlier consensus on optimal bit widths, and ties optimization directly to practical system bottlenecks.

The ParetoQ paradigm thus constitutes a methodological advancement that pairs mathematical abstraction (convexity, value iteration, Pareto optimality) with efficient, scalable algorithmic instantiations and empirically validated recipes for real-world multi-criteria optimization.

