
TernaryLLM: Low-Bit Language Models

Updated 12 December 2025
  • TernaryLLM is a large language model with weights quantized to {-1, 0, +1}, drastically reducing memory footprint and eliminating most floating-point multiplications.
  • It leverages advanced post-training and quantization-aware training methods along with innovative packing schemes and hardware accelerators for efficient inference.
  • Empirical benchmarks show TernaryLLMs retain over 90% of baseline accuracy while achieving significant speedups and energy efficiency gains across diverse platforms.

A TernaryLLM is an LLM in which the majority of weights are quantized to a ternary alphabet, typically $\{-1, 0, +1\}$, and encoded using dense sub-2-bit representations (e.g., 1.6 or 2 bits/weight). These models achieve a drastic reduction in memory footprint and remove most floating-point multiplications from inference, while preserving a high degree of model expressiveness and accuracy. TernaryLLMs exploit advances in post-training quantization, quantization-aware training, hardware design (CPU, GPU, FPGA, ASIC), and information-theoretically motivated schemes to realize LLM inference at orders-of-magnitude lower computational cost than full-precision or even 4-bit models.

1. Mathematical Foundations of Ternary Quantization

The core operation in TernaryLLMs is the quantization of neural weights to the ternary set $\{-1, 0, +1\}$. The forward path of a linear or projection layer with floating-point weights $W \in \mathbb{R}^{n \times d}$ is approximated as:

$W \approx \widetilde{W} = \alpha \cdot T$

where $T \in \{-1, 0, 1\}^{n \times d}$ and $\alpha$ is a learnable or derived scaling factor (often applied per-row, per-column, or per-group) (Chen et al., 11 Jun 2024, Xiao et al., 21 Sep 2025, Vaidhya et al., 28 Jun 2025).

Several quantization procedures are in use:

  • Hard thresholding: $T_{ij} = \operatorname{sign}(W_{ij})$ if $|W_{ij}| > \Delta$, $T_{ij} = 0$ otherwise. The scale $\alpha$ is set to minimize $\|W - \alpha T\|_F^2$ (Qiao et al., 22 Apr 2025, Yin et al., 23 Feb 2025); a minimal sketch follows this list.
  • Dual Learnable Ternarization (DLT): Both scale $\alpha$ and shift $\gamma$ parameters are learned for each group, allowing the quantized-and-reconstructed weight to be $D_i = \alpha T_i + \gamma$ (Chen et al., 11 Jun 2024).
  • Structured Trit-Plane Decomposition: Advanced schemes such as PTQTP represent every row of $W$ as a sum of two ternary planes weighted by learned scales:

$W_i \approx \alpha_i^{(1)} T^{(1)}_i + \alpha_i^{(2)} T^{(2)}_i$

yielding an effective storage of $2 \cdot 1.585 \approx 3.17$ bits/weight, or 1.585 bits per plane (Xiao et al., 21 Sep 2025); an illustrative two-plane sketch appears at the end of this section.

  • Signed-Zero Ternary (SZT): Encodes four states (using two bits), allowing additional sign information for sub-threshold weights, improving gradient flow and information density (Uhlmann, 8 Aug 2025).
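
For fixed $T$, the Frobenius-optimal scale has a closed form: $\alpha = \big(\sum_{T_{ij} \neq 0} |W_{ij}|\big) / \|T\|_0$. Below is a minimal NumPy sketch of row-wise hard thresholding using the heuristic $\Delta = 0.75 \cdot \mathrm{mean}|W|$; the threshold choice is an illustrative assumption, not necessarily the one used in the cited works.

```python
import numpy as np

def ternary_quantize(W: np.ndarray, delta_ratio: float = 0.75):
    """Row-wise hard-threshold ternarization: W ≈ alpha * T with T in {-1, 0, +1}.

    delta_ratio is an illustrative heuristic (Delta = ratio * mean|W| per row);
    the cited works may derive Delta differently.
    """
    absW = np.abs(W)
    delta = delta_ratio * absW.mean(axis=1, keepdims=True)       # per-row threshold
    T = np.where(absW > delta, np.sign(W), 0.0)                  # ternary codes
    nnz = np.maximum(np.abs(T).sum(axis=1, keepdims=True), 1.0)  # avoid divide-by-zero
    alpha = (absW * np.abs(T)).sum(axis=1, keepdims=True) / nnz  # argmin ||W - alpha*T||_F^2
    return T, alpha

W = np.random.randn(4, 16).astype(np.float32)
T, alpha = ternary_quantize(W)
W_hat = alpha * T   # dequantized approximation of W
```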

Activations are usually left in higher precision (e.g., FP16 or INT8), as quantizing activations to ternary remains an outstanding challenge due to heavy-tailed distributions and significant dynamic range (Chen et al., 11 Jun 2024, Xiao et al., 21 Sep 2025).
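
The trit-plane decomposition above can be illustrated by greedily fitting a second ternary plane to the residual left by the first. The sketch below reuses ternary_quantize from the previous block and is only a rough stand-in for PTQTP's alternating ridge-regression/exhaustive-search procedure.

```python
def two_plane_quantize(W: np.ndarray):
    """Greedy two-plane fit W ≈ alpha1*T1 + alpha2*T2 (illustrative only, not PTQTP itself)."""
    T1, a1 = ternary_quantize(W)
    R = W - a1 * T1                  # residual left by the first plane
    T2, a2 = ternary_quantize(R)     # fit the second plane to the residual
    return (T1, a1), (T2, a2)

(T1, a1), (T2, a2) = two_plane_quantize(W)
err_one = np.linalg.norm(W - a1 * T1)
err_two = np.linalg.norm(W - a1 * T1 - a2 * T2)   # never larger than err_one
```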

2. Quantization Methodologies: Post-Training and Quantization-Aware Training

Two principal quantization strategies are prominent:

  • Post-Training Quantization (PTQ): Applies quantization to a pretrained LLM (e.g., LLaMA, Qwen) without further gradient updates. Algorithms such as PTQTP use a monotonic, globally consistent, group-wise progressive approximation loop: alternating ridge regression updates for scale and exhaustive search for ternary assignments per group, with convergence guarantees (Xiao et al., 21 Sep 2025).
  • Quantization-Aware Training (QAT): Modifies the forward pass to simulate ternary weights and employs straight-through estimators for the backward pass, learning to compensate for quantization error during training (Vaidhya et al., 28 Jun 2025, Chen et al., 11 Jun 2024). DLT augments this process with learnable shifts to better fit asymmetric weight distributions, while Outlier-Friendly Feature Distillation (OFF) guides the quantized student toward teacher representations using cosine similarity, addressing information loss due to extreme quantization (Chen et al., 11 Jun 2024).
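
A minimal PyTorch sketch of the QAT forward pass with a straight-through estimator, using the same threshold heuristic as above; the cited methods layer learnable scales/shifts (DLT) and feature distillation (OFF) on top of this skeleton.

```python
import torch

def ternary_ste(w: torch.Tensor, delta_ratio: float = 0.75) -> torch.Tensor:
    """Forward: ternarize w with a closed-form per-tensor scale.
    Backward: straight-through estimator (gradients flow to w as if quantization were identity)."""
    delta = delta_ratio * w.abs().mean()
    t = torch.where(w.abs() > delta, torch.sign(w), torch.zeros_like(w))
    alpha = (w.abs() * t.abs()).sum() / t.abs().sum().clamp(min=1.0)
    w_q = alpha * t
    return w + (w_q - w).detach()    # value of w_q in the forward pass, identity gradient to w

# Typical use inside a linear layer's forward pass:
# y = torch.nn.functional.linear(x, ternary_ste(self.weight), self.bias)
```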

Beyond distillation, fine-tuning techniques such as LoTA-QAF employ low-rank trainable adapters in the ternary domain, supporting lossless merging and integer-only inference (Chen et al., 24 May 2025).

3. Packing Schemes and Hardware Implementation

Efficiently storing and operating over ternary weights is critical for realizing the theoretical savings. Key approaches:

  • Bit-packing: Blocks of 5 ternary values ($3^5 = 243$) are packed into a single 8-bit byte, yielding 1.6 bits/weight (the "TQ1" scheme); using two bits per value ("TQ2") reaches 2 bits/weight. These methods are implemented on both CPUs and GPUs for fast unpacking and high memory-bandwidth utilization (Vaidhya et al., 28 Jun 2025, Huang et al., 17 Sep 2025); a packing sketch follows the table below.
  • Matrix multiplication kernels (GEMM/GEMV): Inference kernels are redesigned to exploit the ternary structure, replacing weight multiplications with additions, subtractions, and table lookups.
  • Indexing algorithms: For fixed ternary weight matrices, block-indexed GEMV algorithms achieve $O(n^2/\log n)$ time and memory by precomputing permutation and segmentation indices, with up to $29\times$ speedup and $6\times$ memory reduction in software-only settings (Dehghankar et al., 10 Nov 2024).
Packing Method      Bits/Weight   Packing Unit   Main Platform
2-bit ("TQ2")       2             k=256          CPU, GPU
1.6-bit ("TQ1")     1.6           k=5            CPU, FPGA, ASIC
PTQTP Trit-Plane    3.17          group=128      GPU, FPGA, ASIC
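
A minimal sketch of the base-3 idea behind the 1.6-bit "TQ1" packing: five trits fit into one byte because $3^5 = 243 \le 256$. Production kernels add block scales and SIMD-friendly layouts, which are omitted here.

```python
import numpy as np

def pack_tq1(trits: np.ndarray) -> np.ndarray:
    """Pack ternary values in {-1, 0, +1} into bytes, five values per byte (base-3)."""
    assert trits.size % 5 == 0
    digits = (trits.reshape(-1, 5) + 1).astype(np.uint8)     # map {-1,0,+1} -> {0,1,2}
    weights = np.array([1, 3, 9, 27, 81], dtype=np.uint8)    # 3^0 .. 3^4
    return (digits * weights).sum(axis=1).astype(np.uint8)   # each byte holds a value in [0, 242]

def unpack_tq1(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_tq1: recover the five trits stored in each byte."""
    digits = packed.astype(np.int16)[:, None] // np.array([1, 3, 9, 27, 81]) % 3
    return (digits - 1).astype(np.int8).reshape(-1)          # back to {-1, 0, +1}

trits = np.random.randint(-1, 2, size=40).astype(np.int8)
assert np.array_equal(unpack_tq1(pack_tq1(trits)), trits)
```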

4. Empirical Scaling Laws and Model Behavior

Recent empirical analysis reveals that ternary models exhibit distinctly different scaling behavior compared to their full-precision counterparts. For ternary LLMs (TriLMs) (Vaidhya et al., 28 Jun 2025):

$\text{Loss}(N, D) \approx 2.19 + 4.73\,N^{-0.32} + 5.18\,D^{-0.81}$

where $N$ is the parameter count (in millions) and $D$ the number of pretraining tokens (in billions). The data exponent ($\beta = 0.81$) dominates the parameter exponent ($\alpha = 0.32$), implying that expanding the dataset, rather than the model size, yields greater returns for ternary LLMs at fixed FLOPs.

For FloatLMs, the exponents are nearly matched ($\alpha = 0.56$, $\beta = 0.53$).

A practical implication is that TernaryLLMs should allocate training computation towards increasing data rather than model width/depth, diverging from established scaling rules for float-precision models.
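
A small numerical illustration of the fitted law, evaluating it on a grid of model sizes and token counts (purely a readout of the quoted coefficients, not new measurements):

```python
def trilm_loss(N_millions: float, D_billions: float) -> float:
    """Fitted TriLM scaling law quoted above (Vaidhya et al., 28 Jun 2025)."""
    return 2.19 + 4.73 * N_millions ** -0.32 + 5.18 * D_billions ** -0.81

# Read off the fit on a small grid; the irreducible 2.19 term dominates at large N and D.
for n in (100, 1000, 10000):       # millions of parameters
    for d in (10, 100, 1000):      # billions of pretraining tokens
        print(f"N={n:>6}M  D={d:>5}B  loss ≈ {trilm_loss(n, d):.3f}")
```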

5. Accuracy-Complexity Tradeoffs and Experimental Results

Comprehensive benchmarks show that TernaryLLMs typically retain $>90\%$ of baseline FP16 accuracy at 1.58 bits/weight, and dramatically outperform earlier binary or poorly compensated ternary/PTQ methods.

6. Hardware Integration and Edge Deployment

TernaryLLMs are highly amenable to deployment on resource-constrained hardware due to their uniform, low-bit arithmetic and multiplication-free weight operations (a reference GEMV sketch illustrating this follows the list below):

  • Edge FPGAs: Engines such as TeLLMe and TerEffic store weights on-chip or in HBM, implement pipelined table-lookup matmul, and achieve $>16\times$ the throughput and $>8\times$ the efficiency of Jetson-class SoCs at equivalent or lower power (Qiao et al., 22 Apr 2025, Qiao et al., 3 Oct 2025, Yin et al., 23 Feb 2025).
  • ASICs: TENET-ASIC deploys a heterogeneous architecture (Sparse Ternary LUT arrays plus FP16 attention blocks), reaching $2.7\times$ end-to-end inference speedup and $21\times$ higher energy efficiency than the NVIDIA A100 GPU, supported by custom 1.6-bit packing and decompression (Huang et al., 17 Sep 2025).
  • CPUs/GPUs: Dedicated CPU kernels and the TriRun CUDA kernel unlock prompt and decode speedups of $1.5$–$7.9\times$, with dense or sparse storage for ternary weights (Lipshitz et al., 8 Oct 2025, Vaidhya et al., 28 Jun 2025).
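
The multiplication-free property is easy to see in a reference GEMV: with ternary codes, each output element is a scaled difference between two sums of activations. A NumPy sketch (real kernels replace this with packed lookups, SIMD, or LUT hardware):

```python
import numpy as np

def ternary_gemv(T: np.ndarray, alpha: np.ndarray, x: np.ndarray) -> np.ndarray:
    """y = (alpha * T) @ x with no weight multiplications: per row, sum the activations
    selected by +1 codes, subtract those selected by -1 codes, then apply one scale."""
    pos = (T == 1) @ x      # sum of x over columns where the code is +1
    neg = (T == -1) @ x     # sum of x over columns where the code is -1
    return alpha.ravel() * (pos - neg)

T = np.random.randint(-1, 2, size=(8, 64)).astype(np.int8)
alpha = np.full((8, 1), 0.05, dtype=np.float32)
x = np.random.randn(64).astype(np.float32)
assert np.allclose(ternary_gemv(T, alpha, x), (alpha * T) @ x, atol=1e-5)
```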

7. Information-Theoretic and Theoretical Advances

TernaryLLM quantization is increasingly positioned as an information-theoretically optimal representation under resource constraints.

  • Entropy: A uniform ternary symbol carries $\log_2 3 \approx 1.585$ bits, which is realized asymptotically by 1.6-bit packing (Vaidhya et al., 28 Jun 2025, Uhlmann, 8 Aug 2025); see the short calculation after this list.
  • SZT encoding adds “signed-zero” states, recovering additional redundancy available in the unused 2-bit codeword, greatly enhancing gradient feedback for sub-threshold weights, reducing mean-squared-error in the STE, and tightening PAC–Bayes bounds (Uhlmann, 8 Aug 2025).
  • Convergence dynamics: Progressive trit-plane and DLT-decompositions are theoretically guaranteed to converge monotonically, with bounded scaling parameters (Xiao et al., 21 Sep 2025).
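
A one-line check of the coding efficiencies implied above (pure arithmetic):

```python
import math

ideal = math.log2(3)   # ≈ 1.585 bits of information per trit
tq1 = 8 / 5            # 1.6 bits/trit: five trits per byte
tq2 = 2.0              # 2 bits/trit: one trit per 2-bit codeword
print(ideal, tq1 / ideal, tq2 / ideal)   # TQ1 is within ~1% of the entropy bound; TQ2 ~26% above
```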

TernaryLLMs thus represent not only an engineering compromise for edge or memory-bounded deployments, but also a theoretically motivated, rigorously analyzed quantization regime.
