Dual Learnable Ternarization for LLMs
- Dual Learnable Ternarization (DLT) is a quantization method that maps LLM weights to ternary values using learnable scale and shift parameters to address asymmetry and non-zero means.
- By partitioning weight groups and adapting thresholds based on the mean absolute value, DLT reduces clamp and rounding errors, thereby improving model compression and performance.
- Empirical evidence shows that DLT, especially when combined with Outlier-Friendly Feature Knowledge Distillation, lowers perplexity and increases accuracy on NLP benchmarks.
Dual Learnable Ternarization (DLT) is a quantization technique developed for LLMs, enabling extreme weight compression by mapping weights to ternary values while adaptively correcting both magnitude and mean per quantization group. Motivated by the presence of asymmetric outliers and non-zero means in LLM weights, DLT builds upon classic ternary weight networks by introducing a learnable scale and shift per group, achieving stronger alignment with real-world LLM weight distributions. Empirical evidence demonstrates that DLT, especially when combined with Outlier-Friendly Feature Knowledge Distillation (OFF), improves performance across standard NLP benchmarks relative to previous low-bit quantization methods (Chen et al., 11 Jun 2024).
1. Mathematical Formulation
DLT operates on partitioned groups of floating-point weights $W_g = \{w_1, \dots, w_n\}$, typically corresponding to rows, per-channel blocks, or other groupings in LLMs. For each group:
- The threshold is computed from the mean absolute value (following the TWN convention):
$$\Delta = 0.7 \cdot \frac{1}{n} \sum_{i=1}^{n} |w_i|$$
- Ternary codes are assigned:
$$t_i = \begin{cases} +1, & w_i > \Delta \\ 0, & |w_i| \le \Delta \\ -1, & w_i < -\Delta \end{cases}$$
- Each group is then equipped with two trainable parameters:
  - Scale $\alpha$
  - Shift $\beta$
- Quantized weights are given by:
$$\hat{w}_i = \alpha \, t_i + \beta$$
This dual-parameter scheme enables the quantized ternary palette to represent both the correct groupwise scale and the groupwise mean, directly addressing the non-zero-mean phenomenon often observed in LLM weight groups. A minimal code sketch of the groupwise procedure is given below.
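The following PyTorch sketch illustrates the mapping described above under stated assumptions: the 0.7 threshold constant follows the TWN convention, and the function name `dlt_quantize` and the zero initialization of the shift are illustrative choices rather than details taken verbatim from the source.

```python
import torch

def dlt_quantize(w_group: torch.Tensor, alpha: torch.Tensor, beta: torch.Tensor) -> torch.Tensor:
    """Ternarize one weight group with a learnable scale (alpha) and shift (beta)."""
    delta = 0.7 * w_group.abs().mean()      # groupwise threshold from the mean |w| (TWN convention)
    t = torch.zeros_like(w_group)
    t[w_group > delta] = 1.0                # large positive weights -> +1
    t[w_group < -delta] = -1.0              # large negative weights -> -1
    return alpha * t + beta                 # reconstructed ("dequantized") weights

# Example: an asymmetric, non-zero-mean weight group
w = torch.randn(128) * 0.02 + 0.01
delta0 = 0.7 * w.abs().mean()
alpha0 = w[w.abs() > delta0].abs().mean()   # TWN-style closed-form initialization of the scale
beta0 = torch.tensor(0.0)                   # shift initialized at zero (assumed)
w_hat = dlt_quantize(w, alpha0, beta0)
```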
2. Training Objectives and Parameter Learning
DLT integrates quantization into the training process via quantization-aware fine-tuning, optimizing $\alpha$ and $\beta$ along with the base model parameters. The total loss function is composed of three terms:
- Label loss ($\mathcal{L}_{\text{label}}$): cross-entropy between the student logits and the one-hot labels $y$.
- Logits distillation ($\mathcal{L}_{\text{logits}}$): cross-entropy between the student and full-precision teacher logits, downscaled by a weighting coefficient $\lambda_1$.
- Outlier-Friendly Feature Distillation ($\mathcal{L}_{\text{OFF}}$): the negated sum of pairwise cosine similarities between corresponding student and teacher hidden states at each layer $l$ and token position $t$,
$$\mathcal{L}_{\text{OFF}} = -\sum_{l}\sum_{t} \cos\!\big(h^{S}_{l,t},\, h^{T}_{l,t}\big)$$
This term, weighted by a coefficient $\lambda_2$, is designed to be insensitive to outliers, as substantiated by Theorem 1 in the source.
The combined objective is
$$\mathcal{L} = \mathcal{L}_{\text{label}} + \lambda_1\,\mathcal{L}_{\text{logits}} + \lambda_2\,\mathcal{L}_{\text{OFF}}.$$
The scale $\alpha$ is initialized by a TWN-style closed-form solution,
$$\alpha_0 = \frac{1}{|I_\Delta|}\sum_{i \in I_\Delta} |w_i|, \qquad I_\Delta = \{\, i : |w_i| > \Delta \,\},$$
and the shift is initialized at zero, matching the shift-free TWN baseline. Both parameters are trained using AdamW with zero weight decay and a learning rate set separately from that of the main network weights.
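A minimal sketch of the three-term objective, assuming PyTorch; the coefficients `lam1`/`lam2` and the soft-target cross-entropy form of the logits distillation are illustrative placeholders, not the paper's exact hyperparameters:

```python
import torch
import torch.nn.functional as F

def off_loss(student_feats, teacher_feats):
    """Negated sum of cosine similarities over layers and token positions (OFF term)."""
    loss = 0.0
    for h_s, h_t in zip(student_feats, teacher_feats):        # per layer: (tokens, hidden_dim)
        loss = loss - F.cosine_similarity(h_s, h_t, dim=-1).sum()
    return loss

def total_loss(student_logits, teacher_logits, labels,
               student_feats, teacher_feats, lam1=1.0, lam2=1.0):
    l_label = F.cross_entropy(student_logits, labels)                       # CE against one-hot labels
    l_logits = F.cross_entropy(student_logits, teacher_logits.softmax(-1))  # CE against teacher soft targets
    l_off = off_loss(student_feats, teacher_feats)                          # outlier-friendly feature term
    return l_label + lam1 * l_logits + lam2 * l_off
```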
3. Addressing Asymmetry and Non-Zero Means
Conventional ternarization, such as the Ternary Weight Network (TWN), employs a single scale factor and assigns zero to elements whose absolute value falls beneath a fixed threshold, implicitly assuming symmetric, zero-mean weight distributions within groups. In contrast, LLM weight groups often exhibit asymmetric outliers and non-zero means. DLT keeps the groupwise threshold proportional to the mean absolute value ($\Delta \propto \mathbb{E}[|w|]$), so that only small-magnitude weights are set to zero. By learning both $\alpha$ (scale) and $\beta$ (shift), DLT dynamically corrects the groupwise magnitude and offset, substantially reducing clamp error (for the zero bin) and rounding error (for the $\pm 1$ tails), and thus mitigating the bias introduced by asymmetric distributions.
A plausible implication is that DLT generalizes more robustly to the heterogeneity found in modern LLM architectures compared to strictly symmetric ternarization schemes.
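A toy experiment, entirely illustrative, makes the point concrete: on a deliberately non-zero-mean weight group, a scale-plus-shift reconstruction (here fit in closed form by least squares, which differs from the paper's gradient-based learning of $\alpha$ and $\beta$) achieves lower mean-squared error than the symmetric TWN scale alone.

```python
import torch

torch.manual_seed(0)
w = torch.randn(1024) * 0.02 + 0.015            # asymmetric group: mean shifted above zero
delta = 0.7 * w.abs().mean()
t = torch.sign(w) * (w.abs() > delta)           # ternary codes in {-1, 0, +1}

alpha_twn = w[w.abs() > delta].abs().mean()
err_twn = (w - alpha_twn * t).pow(2).mean()     # TWN: scale only, zero-mean assumption

# Scale + shift fit by least squares for this group (illustrative only)
A = torch.stack([t, torch.ones_like(t)], dim=1)
sol = torch.linalg.lstsq(A, w.unsqueeze(1)).solution.squeeze()
err_dlt = (w - (sol[0] * t + sol[1])).pow(2).mean()
print(f"TWN MSE {err_twn:.6f}  vs  scale+shift MSE {err_dlt:.6f}")
```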
4. Gradient Propagation and Optimization
Gradient computation for the DLT parameters follows from the chain rule applied to the quantized representation $\hat{w}_i = \alpha\, t_i + \beta$:
$$\frac{\partial \mathcal{L}}{\partial \alpha} = \sum_i \frac{\partial \mathcal{L}}{\partial \hat{w}_i}\, t_i, \qquad \frac{\partial \mathcal{L}}{\partial \beta} = \sum_i \frac{\partial \mathcal{L}}{\partial \hat{w}_i}.$$
Gradients through the non-differentiable ternary assignment are propagated using the Straight-Through Estimator (STE), which treats the assignment as the identity in the backward pass ($\partial t_i / \partial w_i \approx 1$), giving
$$\frac{\partial \mathcal{L}}{\partial w_i} \approx \alpha\, \frac{\partial \mathcal{L}}{\partial \hat{w}_i}.$$
All parameters, including $W$, $\alpha$, and $\beta$, are updated via AdamW using the specified learning rates.
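The sketch below shows one common way to realize this gradient flow in PyTorch via the detach trick; the module name `DLTLinear`, the per-row grouping, and the 0.7 threshold constant are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DLTLinear(nn.Module):
    def __init__(self, out_features, in_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.alpha = nn.Parameter(torch.ones(out_features, 1))    # per-row (group) scale
        self.beta = nn.Parameter(torch.zeros(out_features, 1))    # per-row (group) shift

    def forward(self, x):
        delta = 0.7 * self.weight.abs().mean(dim=1, keepdim=True)
        t = torch.sign(self.weight) * (self.weight.abs() > delta)
        # STE: forward uses the ternary codes, backward routes gradients straight to self.weight;
        # alpha and beta receive exact gradients since w_hat is linear in them.
        t = self.weight + (t - self.weight).detach()
        w_hat = self.alpha * t + self.beta
        return x @ w_hat.t()
```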
5. Empirical Performance and Ablations
DLT has demonstrated substantial empirical gains over prior quantization-aware schemes. Key results include:
| Model / Metric | DLT + OFF (W1.58A16) | Prior Art (2-bit weights) | Gain |
|---|---|---|---|
| LLaMA-3, C4 PPL | 13.4 | 19.2 (DB-LLM, W2A16) | -5.8 (lower is better) |
| LLaMA-3, avg. zero-shot acc. | 60.0% | 51.8% (DB-LLM, W2A16) | +8.2% (absolute) |
| OPT-1.3B, C4 PPL | 18.01 | 27.34 (AWQ, 2-bit) | -9.33 |
| OPT-1.3B, C4 PPL | 18.01 | 31.31 (GPTQ, 2-bit) | -13.30 |
When replacing TWN with DLT in ablations:
- On OPT-1.3B, PPL is reduced from 22.32 to 20.83 (−1.49)
- On LLaMA-1-7B, PPL decreases from 10.10 to 9.21 (−0.89)
Further reduction is observed when combining DLT with OFF, which yields an additional 0.77 PPL reduction relative to logits-only distillation.
6. Integration with Outlier-Friendly Feature Knowledge Distillation
DLT is synergistically paired with OFF, which leverages cosine similarity between hidden-state features of student (ternarized) and teacher (full-precision) models. This approach is robust to outliers, allowing semantic and distributional information to be transferred despite aggressive ternarization. The addition of OFF to DLT further improves both perplexity and downstream task accuracy.
In summary, Dual Learnable Ternarization introduces a minimal but powerful extension of groupwise ternary quantization by learning both scale and shift per group, directly addressing real-world LLM weight asymmetries. This results in lower quantization error, superior empirical performance across LLM families, and compatibility with quantization-aware fine-tuning protocols utilizing STE and knowledge distillation (Chen et al., 11 Jun 2024).