TernaryLLM: Low-Bit Language Models
- TernaryLLM is a large language model with weights quantized to {-1, 0, +1}, drastically reducing memory footprint and eliminating most floating-point multiplications.
- It leverages advanced post-training and quantization-aware training methods along with innovative packing schemes and hardware accelerators for efficient inference.
- Empirical benchmarks show TernaryLLMs retain over 90% of baseline accuracy while achieving significant speedups and energy efficiency gains across diverse platforms.
A TernaryLLM is an LLM in which the majority of weights are quantized to a ternary alphabet, typically $\{-1, 0, +1\}$, and encoded using dense sub-2-bit representations (e.g., 1.6 or 2 bits/weight). These models achieve a drastic reduction in memory footprint and remove most floating-point multiplications from inference, while preserving a high degree of model expressiveness and accuracy. TernaryLLMs exploit advances in post-training quantization, quantization-aware training, hardware design (CPU, GPU, FPGA, ASIC), and information-theoretically motivated schemes to realize LLM inference at orders-of-magnitude lower computational cost than full-precision or even 4-bit models.
1. Mathematical Foundations of Ternary Quantization
The core operation in TernaryLLMs is the quantization of neural weights to the ternary set $\{-1, 0, +1\}$. The forward path of a linear or projection layer with floating-point weights $W \in \mathbb{R}^{m \times n}$ is approximated as:

$$y = W x \approx \alpha \, (\widehat{W} x),$$

where $\widehat{W} \in \{-1, 0, +1\}^{m \times n}$ and $\alpha$ is a learnable or derived scaling factor (often applied per-row, per-column, or per-group) (Chen et al., 11 Jun 2024, Xiao et al., 21 Sep 2025, Vaidhya et al., 28 Jun 2025).
Several quantization procedures are in use:
- Hard thresholding: $\widehat{W}_{ij} = \operatorname{sign}(W_{ij})$ if $|W_{ij}| > \Delta$, and $\widehat{W}_{ij} = 0$ otherwise. The scale $\alpha$ is set to minimize $\lVert W - \alpha \widehat{W} \rVert_F^2$ (Qiao et al., 22 Apr 2025, Yin et al., 23 Feb 2025); a minimal sketch of this rule appears at the end of this section.
- Dual Learnable Ternarization (DLT): Both scale and shift parameters are learned for each group, allowing the quantized-and-reconstructed weight to be $\widetilde{W} = \alpha \widehat{W} + \beta$ (Chen et al., 11 Jun 2024).
- Structured Trit-Plane Decomposition: Advanced schemes such as PTQTP represent every row of $W$ as a sum of two ternary planes weighted by learned scales, $W_{i,:} \approx \alpha_i^{(1)} T^{(1)}_{i,:} + \alpha_i^{(2)} T^{(2)}_{i,:}$ with $T^{(1)}, T^{(2)} \in \{-1, 0, +1\}^{m \times n}$, yielding an effective storage of $2\log_2 3 \approx 3.17$ bits/weight, or $\log_2 3 \approx 1.585$ bits per plane (Xiao et al., 21 Sep 2025).
- Signed-Zero Ternary (SZT): Encodes four states, $\{-1, -0, +0, +1\}$, in two bits, allowing additional sign information for sub-threshold weights and improving gradient flow and information density (Uhlmann, 8 Aug 2025).
Activations are usually left in higher precision (e.g., FP16 or INT8), as quantizing activations to ternary remains an outstanding challenge due to heavy-tailed distributions and significant dynamic range (Chen et al., 11 Jun 2024, Xiao et al., 21 Sep 2025).
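As a concrete illustration of the hard-thresholding rule and its closed-form scale, the following is a minimal NumPy sketch of per-group ternarization; the threshold heuristic $\Delta = 0.7\,\mathrm{mean}(|W|)$ is an illustrative choice, not a prescription from the cited works.

```python
import numpy as np

def ternarize(W: np.ndarray, delta_factor: float = 0.7):
    """Hard-threshold ternarization of one weight group.

    delta_factor is an illustrative heuristic; the cited methods derive
    or learn the threshold and scale instead.
    """
    delta = delta_factor * np.mean(np.abs(W))              # threshold Delta
    W_hat = np.where(np.abs(W) > delta, np.sign(W), 0.0)   # values in {-1, 0, +1}
    mask = W_hat != 0
    # Closed-form scale minimizing ||W - alpha * W_hat||_F^2:
    # alpha = sum of |W| over retained entries / number of retained entries.
    alpha = np.abs(W[mask]).sum() / max(mask.sum(), 1)
    return W_hat.astype(np.int8), float(alpha)

# Example: quantize a random weight group and check the reconstruction error.
W = np.random.randn(4, 8).astype(np.float32)
W_hat, alpha = ternarize(W)
W_rec = alpha * W_hat                                      # dequantized approximation
print(np.linalg.norm(W - W_rec) / np.linalg.norm(W))       # relative error
```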
2. Quantization Methodologies: Post-Training and Quantization-Aware Training
Two principal quantization strategies are in use:
- Post-Training Quantization (PTQ): Applies quantization to a pretrained LLM (e.g., LLaMA, Qwen) without further gradient updates. Algorithms such as PTQTP use a monotonic, globally consistent, group-wise progressive approximation loop: alternating ridge regression updates for scale and exhaustive search for ternary assignments per group, with convergence guarantees (Xiao et al., 21 Sep 2025).
- Quantization-Aware Training (QAT): Modifies the forward pass to simulate ternary weights and employs straight-through estimators for the backward pass, learning to compensate for quantization error during training (Vaidhya et al., 28 Jun 2025, Chen et al., 11 Jun 2024). DLT augments this process with learnable shifts to better fit asymmetric weight distributions, while Outlier-Friendly Feature Distillation (OFF) guides the quantized student toward teacher representations using cosine similarity, addressing information loss due to extreme quantization (Chen et al., 11 Jun 2024).
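As an illustration of the QAT forward/backward pattern just described, the sketch below implements a ternary linear layer with a straight-through estimator in PyTorch; it is a generic sketch (illustrative threshold heuristic included), not the DLT/OFF training code of Chen et al.

```python
import torch
import torch.nn as nn

class TernaryLinear(nn.Module):
    """Linear layer whose weights are ternarized in the forward pass.

    Gradients reach the latent full-precision weights via the
    straight-through estimator (STE); this is a generic QAT sketch.
    """
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        delta = 0.7 * w.abs().mean()                      # illustrative threshold
        w_hat = torch.where(w.abs() > delta, torch.sign(w), torch.zeros_like(w))
        mask = w_hat != 0
        alpha = (w.abs() * mask).sum() / mask.sum().clamp(min=1)
        w_q = alpha * w_hat
        # STE: forward uses w_q, backward treats quantization as identity.
        w_ste = w + (w_q - w).detach()
        return nn.functional.linear(x, w_ste)

layer = TernaryLinear(16, 8)
loss = layer(torch.randn(4, 16)).pow(2).mean()
loss.backward()                                           # gradients reach layer.weight
```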
Knowledge distillation and fine-tuning techniques (e.g., LoTA-QAF) employ low-rank trainable adapters in the ternary domain, supporting lossless merging and integer-only inference (Chen et al., 24 May 2025).
3. Packing Schemes and Hardware Implementation
Efficiently storing and operating over ternary weights is critical for realizing the theoretical savings. Key approaches:
- Bit-packing: Blocks of 5 ternary values ($3^5 = 243 \le 2^8 = 256$) are packed into a single 8-bit byte, yielding 1.6 bits/weight ("TQ1" scheme); using two bits per value ("TQ2") reaches 2 bits/weight. These methods are implemented on both CPUs and GPUs for fast unpacking and high memory-bandwidth utilization; a packing sketch follows the table below (Vaidhya et al., 28 Jun 2025, Huang et al., 17 Sep 2025).
- Matrix multiplication kernels (GEMM/GEMV): Inference kernels are redesigned to exploit the ternary structure:
- On CPUs (e.g., Apple Silicon), custom sparse GEMM kernels using blocked, interleaved storage, loop unrolling, and NEON vectorization deliver 5–6× speedup over default libraries (Lipshitz et al., 8 Oct 2025).
- On FPGAs/ASICs, accelerators such as TENET and TeLLMe use table-lookup engines and LUT-centric ternary matmuls, slashing the need for multipliers and reducing DRAM access via specialized weight packing (Huang et al., 17 Sep 2025, Qiao et al., 22 Apr 2025, Qiao et al., 3 Oct 2025). Dynamic N:M activation sparsity further reduces compute (Huang et al., 17 Sep 2025).
- On GPUs, TriRun offers a mixed-precision CUDA kernel (FP16 activations × INT2 weights) leveraging shared memory and pipeline parallelism, achieving up to 4.9× end-to-end throughput gains (Vaidhya et al., 28 Jun 2025).
- Indexing algorithms: For fixed ternary weight matrices, block-indexed GEMV algorithms reduce both time and memory by precomputing permutation and segmentation indices, achieving substantial speedups and memory reductions in software-only settings (Dehghankar et al., 10 Nov 2024).
| Packing Method | Bits/Weight | Packing Unit | Main Platform |
|---|---|---|---|
| 2-bit ("TQ2") | 2 | k=256 | CPU, GPU |
| 1.6-bit ("TQ1") | 1.6 | k=5 | CPU, FPGA, ASIC |
| PTQTP Trit-Plane | 3.17 | group=128 | GPU, FPGA, ASIC |
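To make the packing arithmetic concrete, the sketch below packs groups of five trits into one byte (TQ1-style, 1.6 bits/weight) and runs a multiplication-free matrix–vector product over the unpacked values; it is an illustrative NumPy reimplementation, not the interleaved kernel layouts of the cited CPU, GPU, or FPGA implementations.

```python
import numpy as np

def pack_tq1(trits: np.ndarray) -> np.ndarray:
    """Pack trits in {-1, 0, +1} into bytes, 5 trits per byte (3**5 = 243 <= 256)."""
    t = (trits.reshape(-1, 5) + 1).astype(np.uint16)      # map to base-3 digits {0, 1, 2}
    powers = 3 ** np.arange(5, dtype=np.uint16)
    return (t @ powers).astype(np.uint8)

def unpack_tq1(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_tq1."""
    digits = packed.astype(np.int16)[:, None] // (3 ** np.arange(5)) % 3
    return (digits - 1).reshape(-1).astype(np.int8)       # back to {-1, 0, +1}

def ternary_matvec(W_hat: np.ndarray, x: np.ndarray, alpha: float) -> np.ndarray:
    """Multiplication-free GEMV: only additions/subtractions of activations."""
    y = np.zeros(W_hat.shape[0], dtype=x.dtype)
    for i, row in enumerate(W_hat):
        y[i] = x[row == 1].sum() - x[row == -1].sum()      # no per-weight multiply
    return alpha * y                                       # one scale per row/group

# Round-trip and matvec check (row length must be a multiple of 5 here).
rng = np.random.default_rng(0)
W_hat = rng.integers(-1, 2, size=(4, 20)).astype(np.int8)
packed = pack_tq1(W_hat.reshape(-1))
assert np.array_equal(unpack_tq1(packed).reshape(4, 20), W_hat)
x = rng.standard_normal(20).astype(np.float32)
print(ternary_matvec(W_hat, x, alpha=0.05), 0.05 * (W_hat @ x))
```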
4. Empirical Scaling Laws and Model Behavior
Recent empirical analysis reveals that ternary models exhibit distinctly different scaling behavior from their full-precision counterparts. For ternary LLMs (TriLMs), validation loss is well described by a Chinchilla-style power law of the form

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},$$

where $N$ is the parameter count (in millions) and $D$ the number of pretraining tokens (in billions) (Vaidhya et al., 28 Jun 2025). The data exponent ($\beta$) dominates the parameter exponent ($\alpha$), implying that expanding the dataset, rather than the model size, yields greater returns for ternary LLMs at fixed FLOPs.
For FloatLMs, the two exponents are nearly matched ($\alpha \approx \beta$).
A practical implication is that TernaryLLMs should allocate training computation towards increasing data rather than model width/depth, diverging from established scaling rules for float-precision models.
5. Accuracy-Complexity Tradeoffs and Experimental Results
Comprehensive benchmarks show that TernaryLLMs typically retain over 90% of baseline FP16 accuracy at 1.58 bits/weight, and dramatically outperform earlier binary or poorly compensated ternary/PTQ methods.
- On Qwen3-14B, PTQTP retains substantially more mathematical-reasoning test accuracy relative to FP16 than baseline 3-bit GPTQ under the same conditions (Xiao et al., 21 Sep 2025).
- LLaMA-3-8B with QAT (DLT+OFF) matches or outperforms 2-bit quantization methods, reaching higher zero-shot accuracy than the best 2-bit baseline (Chen et al., 11 Jun 2024).
- Language modeling perplexity increases by less than 0.5 PPL under BitNet b1.58-style ternary quantization (Qiao et al., 22 Apr 2025, Qiao et al., 3 Oct 2025).
- For quantization-aware fine-tuning, LoTA-QAF enables lossless merging of ternary adapters, recovering or surpassing 16-bit LoRA accuracy on downstream tasks (Chen et al., 24 May 2025).
- FPGA and ASIC accelerators using optimized ternary GEMM consistently deliver end-to-end speedups and energy-efficiency gains over A100-class GPUs (Huang et al., 17 Sep 2025, Qiao et al., 22 Apr 2025, Qiao et al., 3 Oct 2025, Yin et al., 23 Feb 2025).
6. Hardware Integration and Edge Deployment
TernaryLLMs are highly amenable to deployment on resource-constrained hardware due to their uniform, low-bit arithmetic and multiplication-free operations:
- Edge FPGAs: Engines such as TeLLMe and TerEffic store weights on-chip or in HBM, implement pipelined table-lookup matmul, and achieve higher throughput and energy efficiency than Jetson-class SoCs at equivalent or lower power (Qiao et al., 22 Apr 2025, Qiao et al., 3 Oct 2025, Yin et al., 23 Feb 2025).
- ASICs: TENET-ASIC deploys a heterogeneous architecture (Sparse Ternary LUT arrays plus FP16 attention blocks), reaching end-to-end inference speedups and higher energy efficiency than the NVIDIA A100 GPU, supported by custom 1.6-bit packing and decompression (Huang et al., 17 Sep 2025).
- CPUs/GPUs: Dedicated CPU kernels and the TriRun CUDA kernel unlock prompt and decode speedups of 1.5–7.9×, with dense or sparse storage for ternary weights (Lipshitz et al., 8 Oct 2025, Vaidhya et al., 28 Jun 2025).
7. Information-Theoretic and Theoretical Advances
TernaryLLM quantization is increasingly positioned as an information-theoretically optimal representation under resource constraints.
- Entropy: A uniformly distributed ternary symbol carries $\log_2 3 \approx 1.585$ bits of information per trit, a bound approached asymptotically by 1.6-bit packing (see the arithmetic check after this list) (Vaidhya et al., 28 Jun 2025, Uhlmann, 8 Aug 2025).
- SZT encoding splits zero into signed-zero states, exploiting the otherwise unused fourth codeword of a 2-bit encoding; this improves gradient feedback for sub-threshold weights, reduces mean-squared error in the straight-through estimator, and tightens PAC–Bayes bounds (Uhlmann, 8 Aug 2025).
- Convergence dynamics: Progressive trit-plane and DLT-decompositions are theoretically guaranteed to converge monotonically, with bounded scaling parameters (Xiao et al., 21 Sep 2025).
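A quick arithmetic check of the entropy and packing-overhead figures above, as a minimal standard-library sketch:

```python
import math

# Entropy of a uniformly distributed ternary symbol (bits per trit).
bits_per_trit = math.log2(3)                  # ~1.585 bits

# TQ1-style packing: 5 trits per 8-bit byte.
tq1_bits = 8 / 5                              # 1.6 bits per weight
tq1_overhead = tq1_bits / bits_per_trit - 1   # ~0.95% above the entropy bound

# TQ2-style packing: 2 bits per trit.
tq2_overhead = 2 / bits_per_trit - 1          # ~26% above the entropy bound

print(f"log2(3)      = {bits_per_trit:.4f} bits/trit")
print(f"TQ1 overhead = {tq1_overhead:.2%}")
print(f"TQ2 overhead = {tq2_overhead:.2%}")
```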
TernaryLLMs thus represent not only an engineering compromise for edge or memory-bounded deployments, but also a theoretically motivated, rigorously analyzed quantization regime.
References
- PTQTP: Post-Training Quantization to Trit-Planes for LLMs (Xiao et al., 21 Sep 2025)
- TernaryLLM: Ternarized LLM (Chen et al., 11 Jun 2024)
- Accelerating Sparse Ternary GEMM for Quantized LLM inference on Apple Silicon (Lipshitz et al., 8 Oct 2025)
- The Fourth State: Signed-Zero Ternary for Stable LLM Quantization (and More) (Uhlmann, 8 Aug 2025)
- LoTA-QAF: Lossless Ternary Adaptation for Quantization-Aware Fine-Tuning (Chen et al., 24 May 2025)
- TENET: An Efficient Sparsity-Aware LUT-Centric Architecture for Ternary LLM Inference On Edge (Huang et al., 17 Sep 2025)
- TeLLMe: An Energy-Efficient Ternary LLM Accelerator for Prefilling and Decoding on Edge FPGAs (Qiao et al., 22 Apr 2025)
- TeLLMe v2: An Efficient End-to-End Ternary LLM Prefill and Decode Accelerator with Table-Lookup Matmul on Edge FPGAs (Qiao et al., 3 Oct 2025)
- TerEffic: Highly Efficient Ternary LLM Inference on FPGA (Yin et al., 23 Feb 2025)
- An Efficient Matrix Multiplication Algorithm for Accelerating Inference in Binary and Ternary Neural Networks (Dehghankar et al., 10 Nov 2024)
- Spectra 1.1: Scaling Laws and Efficient Inference for Ternary LLMs (Vaidhya et al., 28 Jun 2025)