Ternary-Weight PTQ Framework
- Ternary-weight PTQ is a quantization framework that maps neural network weights to {–1, 0, +1} using optimized scaling and thresholding to achieve high compression with minimal accuracy loss.
- It employs methods like Fine-Grained Quantization (FGQ) and progressive schemes such as Random Partition Relaxation (RPR) to optimize thresholds and group-wise scaling across diverse architectures.
- The approach enables uniform multiplication-free inference and supports applications from convolutional networks to large language models in resource-constrained environments.
A Ternary-Weight PTQ (Post-Training Quantization) Framework is a set of quantization methodologies and algorithms that map full-precision neural network weights into the ternary set {–1, 0, +1} (trits), enabling highly compressed models and efficient inference in memory- and compute-constrained environments. This framework has matured to support a diverse range of architectures, including deep convolutional networks and LLMs, with systematic strategies to balance compression, computational cost, and accuracy retention.
1. Mathematical Foundations of Ternary Quantization
Post-training ternary-weight quantization typically involves minimizing the Euclidean (L2) distance between the full-precision weights and their ternary counterparts, often with a scaling factor, leading to the optimization problem:

$$
\alpha^{*},\, T^{*} \;=\; \arg\min_{\alpha > 0,\; T \in \{-1,0,+1\}^{n}} \; \lVert W - \alpha T \rVert_2^2 .
$$
Threshold-based ternarization functions are then employed, such as:

$$
T_i \;=\;
\begin{cases}
+1, & W_i > \Delta \\
0, & |W_i| \le \Delta \\
-1, & W_i < -\Delta
\end{cases}
$$

with the threshold often set empirically as a fraction of the average magnitude of the weights, for example $\Delta^{*} \approx 0.7 \cdot \mathbb{E}(|W|)$ (Li et al., 2016).
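As a concrete illustration, here is a minimal NumPy sketch of this layer-wise scheme; the function name `ternarize_layer` and the closed-form scale over the surviving weights are illustrative choices consistent with the formulation above, not a verbatim reimplementation of (Li et al., 2016).

```python
import numpy as np

def ternarize_layer(w: np.ndarray, delta_ratio: float = 0.7):
    """Layer-wise threshold ternarization with a single scaling factor.

    Uses the heuristic Delta ~= delta_ratio * E(|w|); alpha is the
    least-squares scale over the surviving (non-zero) positions.
    """
    delta = delta_ratio * np.mean(np.abs(w))      # threshold
    t = np.zeros_like(w)
    t[w > delta] = 1.0                             # large positives -> +1
    t[w < -delta] = -1.0                           # large negatives -> -1
    mask = t != 0
    # For fixed t, the optimal alpha is the mean |w| over above-threshold weights.
    alpha = np.abs(w[mask]).mean() if mask.any() else 0.0
    return alpha, t                                # w ~= alpha * t

# Example: quantize a random weight matrix and inspect scale and sparsity.
w = np.random.randn(256, 256).astype(np.float32)
alpha, t = ternarize_layer(w)
print(alpha, float(np.mean(t == 0)))
```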
To extend ternary quantization efficacy, frameworks such as Fine-Grained Quantization (FGQ) (Mellempudi et al., 2017) partition weights into groups and optimize both the threshold and scaling per group, enabling group-wise adaptation to heterogeneous weight distributions within a layer. The general quantization for a group $g$ with weights $W^{(g)}$ is:

$$
\widehat{W}^{(g)} \;=\; \alpha_g\, T^{(g)}, \qquad T^{(g)}_i \in \{-1, 0, +1\},
$$

with $\alpha_g$ and the group threshold $\Delta_g$ chosen to minimize $\lVert W^{(g)} - \alpha_g T^{(g)} \rVert_2^2$.
Where possible, asymmetric thresholds and scaling are introduced for greater expressiveness.
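A minimal group-wise sketch is shown below, assuming contiguous, equally sized groups along the flattened weight tensor (the grouping strategy and group size are illustrative assumptions; FGQ's own partitioning may differ):

```python
import numpy as np

def ternarize_groupwise(w: np.ndarray, group_size: int = 64, delta_ratio: float = 0.7):
    """Group-wise ternarization: each group gets its own threshold and scale."""
    flat = w.reshape(-1)
    pad = (-flat.size) % group_size
    flat = np.pad(flat, (0, pad))                  # pad so groups divide evenly
    groups = flat.reshape(-1, group_size)

    # Per-group threshold and ternary codes.
    deltas = delta_ratio * np.mean(np.abs(groups), axis=1, keepdims=True)
    t = np.sign(groups) * (np.abs(groups) > deltas)

    # Per-group least-squares scale over the non-zero codes.
    nonzero = np.abs(t).sum(axis=1, keepdims=True)
    alphas = np.sum(np.abs(groups) * np.abs(t), axis=1, keepdims=True) / np.maximum(nonzero, 1)

    approx = (alphas * t).reshape(-1)[: w.size].reshape(w.shape)
    return approx, alphas, t
```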
Advances such as PTQ to Trit-Planes (PTQTP) (Xiao et al., 21 Sep 2025) further decompose the weight matrix into a structured sum over multiple ternary matrices (“trit-planes”) with separate scaling vectors:

$$
W \;\approx\; \sum_{k=1}^{K} \operatorname{diag}(\alpha_k)\, T_k, \qquad T_k \in \{-1, 0, +1\}^{m \times n},
$$

where each trit-plane $T_k$ carries row-adaptive scaling coefficients $\alpha_k \in \mathbb{R}^m$ obtained by solving a local ridge regression (with $K = 2$ trit-planes in the reported configuration).
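The rough sketch below approximates a single weight row with two trit-planes by alternating a ridge-regression update of the scales with a per-coordinate exhaustive search over the 3 × 3 ternary code pairs; the initialization, iteration count, and fixed regularizer are assumptions for illustration rather than the exact PTQTP procedure.

```python
import numpy as np
from itertools import product

def trit_planes_row(w: np.ndarray, iters: int = 10, lam: float = 1e-6):
    """Approximate one weight row as w ~= a[0]*t[0] + a[1]*t[1], t_k in {-1,0,+1}^n."""
    n = w.size
    # Crude initialization: first plane from a threshold rule, second from the residual sign.
    t = np.zeros((2, n))
    t[0] = np.sign(w) * (np.abs(w) > 0.7 * np.abs(w).mean())
    t[1] = np.sign(w - np.abs(w).mean() * t[0])

    codes = np.array(list(product([-1.0, 0.0, 1.0], repeat=2)))  # 9 candidate (t1, t2) pairs

    for _ in range(iters):
        # Ridge regression for the scales with the ternary planes held fixed
        # (lam is fixed here; the paper describes adaptive regularization).
        G = t @ t.T + lam * np.eye(2)
        a = np.linalg.solve(G, t @ w)
        # Exhaustive per-coordinate refinement of the ternary codes.
        recon = codes @ a                                         # 9 possible values per coordinate
        best = np.argmin((w[None, :] - recon[:, None]) ** 2, axis=0)
        t = codes[best].T
    return a, t

# Example: relative reconstruction error on a random row.
w = np.random.randn(512)
a, t = trit_planes_row(w)
print(np.linalg.norm(w - a @ t) / np.linalg.norm(w))
```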
2. Progressive and Structured Quantization Algorithms
To avoid the pitfalls of hard, one-step ternarization, modern frameworks employ progressive and structured quantization strategies:
- Random Partition Relaxation (RPR) (Cavigelli et al., 2020) partitions weights into randomly selected “frozen” (quantized) and “relaxed” (continuously updated) subsets. Throughout training, the proportion of frozen weights increases until all are ternarized:

$$
w_i \;=\; m_i\, \alpha\, t_i \;+\; (1 - m_i)\, \tilde{w}_i, \qquad m_i \in \{0, 1\},
$$

where $m_i$ acts as the partition indicator (1 for frozen, ternarized weights; 0 for relaxed, full-precision weights); a schematic sketch appears after this list.
- PTQTP’s Progressive Approximation (Xiao et al., 21 Sep 2025): For each row $r$, the algorithm alternates between ridge regression to re-estimate the scaling coefficients and a local exhaustive search to refine the ternary masks, iterating to reduce the residual error:

$$
\min_{\alpha_r,\; t_{r,1},\, t_{r,2}} \;\Bigl\lVert w_r - \sum_{k=1}^{2} \alpha_{r,k}\, t_{r,k} \Bigr\rVert_2^2 \;+\; \lambda \lVert \alpha_r \rVert_2^2 .
$$

Adaptive regularization (via $\lambda$) ensures numerical stability.
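As referenced in the RPR item above, a minimal sketch of the random-partition idea wrapped around one training step might look as follows (the freezing schedule, threshold heuristic, and function name `rpr_step` are illustrative assumptions, not the paper's exact recipe):

```python
import numpy as np

def rpr_step(w: np.ndarray, grad: np.ndarray, frozen_frac: float,
             lr: float = 1e-3, rng=None):
    """One RPR-style update: a random fraction of the weights is frozen at its
    ternarized value; the remaining "relaxed" weights take an ordinary SGD step."""
    if rng is None:
        rng = np.random.default_rng()
    alpha = np.abs(w).mean()
    t = np.sign(w) * (np.abs(w) > 0.7 * alpha)     # ternary targets (heuristic threshold)
    frozen = rng.random(w.shape) < frozen_frac     # partition indicator m_i
    relaxed = w - lr * grad                        # SGD on the relaxed subset
    return np.where(frozen, alpha * t, relaxed)

# frozen_frac is ramped from 0 toward 1 across training until every weight
# sits on the ternary grid.
```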
Such progressive schemes avoid the large, abrupt approximation error of one-shot ternarization, maintaining consistency with the full-precision weights throughout quantization and converging efficiently toward highly compressed yet expressive representations.
3. Deployment and Hardware Considerations
Ternary-weight PTQ frameworks offer pronounced deployment advantages:
- Uniform Multiplication-Free Inference: By restricting weights to {–1, 0, +1}, inference reduces to additions and sign inversions, rendering multiplications in the weight-activation inner products obsolete (only the per-layer or per-row scales remain). This paradigm closely mirrors the hardware efficiency of binarized networks while preserving greater expressivity (Xiao et al., 21 Sep 2025, Li et al., 2016); a toy illustration appears after this list.
- Model-Agnostic Integration: The structured ternary decomposition approach is architecture-neutral; it can be applied post hoc to standard dense or transformer-based layers without the need for retraining or architectural changes (Mellempudi et al., 2017, Xiao et al., 21 Sep 2025).
- Mixed-Precision Extensions: Some frameworks apply ternarization selectively, preserving a fraction of weights in higher precision (e.g., 16-bit) based on task circuits, identified via loss gradient sensitivity and quantization impact scores (Xiao et al., 10 Apr 2025).
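As a toy illustration of the multiplication-free property noted above, the sketch below evaluates a ternary matrix-vector product using only signed accumulation plus one scale per output row; production kernels would use packed bit-plane representations rather than this Python loop.

```python
import numpy as np

def ternary_matvec(t: np.ndarray, alpha: np.ndarray, x: np.ndarray):
    """Compute y = diag(alpha) @ (T @ x) with additions and sign flips only.

    T has entries in {-1, 0, +1}; the per-row scales alpha account for the
    only remaining multiplications (one per output element)."""
    pos = (t == 1)
    neg = (t == -1)
    # Accumulate +x over +1 positions and -x over -1 positions, row by row.
    acc = np.array([x[p].sum() - x[n].sum() for p, n in zip(pos, neg)])
    return alpha * acc

# Sanity check against the dense reference computation.
t = np.random.choice([-1, 0, 1], size=(4, 8)).astype(np.int8)
alpha = np.abs(np.random.randn(4))
x = np.random.randn(8)
print(np.allclose(ternary_matvec(t, alpha, x), alpha * (t @ x)))
```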
A summary of relative memory and compute savings:
| PTQ Method | Quantization Scheme | Multiplication-Free | Typical Compression Ratio | Notes |
|---|---|---|---|---|
| Standard TWN (Li et al., 2016) | Layer-wise ternary | Yes | Up to 16× | Single scaling factor per layer |
| FGQ (Mellempudi et al., 2017) | Group-wise ternary | Yes | 9–15× | Finer groups, higher accuracy |
| PTQTP (Xiao et al., 21 Sep 2025) | Two trit-planes, row-wise scales | Yes | 20×+ | Multiplication-free, high expressivity |
| RPR (Cavigelli et al., 2020) | Progressive partitioned | Yes | Model-dependent | SGD-compatible, progressive freezing |
4. Accuracy and Performance Preservation
Empirical results across multiple studies demonstrate that ternary quantization can yield superior accuracy-to-compression trade-offs compared with binary or naïvely quantized approaches:
- TWNs on MNIST achieve 99.35% top-1 accuracy, closely matching the full-precision baseline (99.42%) and outperforming binary-weight models by 2–3% (Li et al., 2016).
- On ImageNet, FGQ (N=4) reduces computation (eliminating 75% of multiplications) with top-1 accuracy ~73.85% (ResNet-101), within 3.7% of baseline; larger groups (N=64) eliminate >99% of multiplications, with minor accuracy loss if low-bit fine-tuning is applied (Mellempudi et al., 2017).
- PTQTP achieves 82.4% mathematical reasoning retention on the Math-500 benchmark, compared to complete failures for binary PTQ (~0%), and matches or surpasses 1.58-bit QAT methods despite requiring only an hour of quantization with no retraining (Xiao et al., 21 Sep 2025).
- Mixed-precision PTQ yields further improvements, enabling >95% recovery of unquantized model performance in large LLMs at very low mean bit-widths (Xiao et al., 10 Apr 2025).
5. Task-Specific and Generalized Quantization Schemes
Recent frameworks incorporate task-conditioned and circuit-aware quantization:
- Task-Circuit Quantization (TaCQ) (Xiao et al., 10 Apr 2025): Quantization is guided by knowledge localization, leveraging both magnitude and loss sensitivity to identify and preserve weights critical to downstream task performance. Non-critical weights are ternarized, and a small subset (the task circuit) remains in higher precision.
- This approach is effective both in task-specific calibration (preserving performance on, for example, math reasoning or SQL generation tasks) and in general-purpose settings (through gradient-based saliency measures aggregated over broader data); a schematic selection sketch follows this list.
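A schematic sketch of such circuit-aware selection is given below; the saliency score (|gradient| × |quantization error|) and the `keep_frac` parameter are illustrative approximations rather than the exact TaCQ criterion.

```python
import numpy as np

def select_task_circuit(w: np.ndarray, grad: np.ndarray, keep_frac: float = 0.01):
    """Score each weight by |gradient| * |quantization error|, keep the top
    fraction in higher precision, and ternarize the rest."""
    alpha = np.abs(w).mean()
    t = np.sign(w) * (np.abs(w) > 0.7 * alpha)       # ternary candidates
    quant_err = np.abs(w - alpha * t)                # per-weight quantization impact
    saliency = np.abs(grad) * quant_err              # loss-sensitivity proxy
    k = max(1, int(keep_frac * w.size))
    thresh = np.partition(saliency.ravel(), -k)[-k]  # k-th largest score
    keep = saliency >= thresh                        # the preserved "task circuit"
    return np.where(keep, w, alpha * t), keep
```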
Such methodologies bridge the gap between aggressive quantization and robust, real-world task deployment, ensuring models can be downscaled for embedded or edge scenarios without catastrophic degradation in specialized performance domains.
6. Comparative Outlook and Future Research
Ternary-weight PTQ frameworks consistently outperform purely binary PTQ in both model expressiveness and practical downstream performance, particularly in tasks sensitive to representational collapse (e.g., mathematical reasoning). Structured two-plane ternary decompositions (as popularized by PTQTP) eliminate the need for complex compensation or mixed-precision schemes employed by earlier methods and enable efficient, uniform hardware mapping.
Open research directions include:
- Joint quantization of activations with ternary-weight PTQ to further reduce energy and bandwidth requirements (Xiao et al., 10 Apr 2025).
- Optimization of the ternary decomposition algorithm for parallel and hardware-efficient implementations, including intelligent grouping for cache-friendly access (Xiao et al., 21 Sep 2025).
- Adaptive calibration and dynamic quantization level assignment, potentially evolving the “task circuit” set during runtime in accordance with input difficulty or task context (Xiao et al., 10 Apr 2025).
- Theoretical analysis of quantization-induced “learning transitions” to inform token budget allocation in LLM quantization (Liu et al., 4 Feb 2025).
The robust performance, hardware suitability, and methodological flexibility of ternary-weight PTQ frameworks establish them as a foundational tool for efficient neural network deployment, spanning classical vision, representation learning, and modern generative language modeling.