TernaryLLM: Ternarized LLM
The paper "TernaryLLM: Ternarized LLM" presents a detailed paper on quantization techniques applied to LLMs with the goal of enhancing computational efficiency and reducing memory footprint. The primary focus of this research is on ternarization, an extreme form of quantization that represents weights with three discrete levels. The challenges associated with applying ternarization to LLMs are carefully examined and addressed through novel methodologies.
Key Contributions
- Dual Learnable Ternarization (DLT): The authors observe that LLM weights exhibit asymmetric outliers and a non-zero mean, which traditional symmetric ternarization methods cannot handle well. DLT addresses this by making both the quantization scales and shifts learnable, so the ternary grid can adapt to each layer's weight distribution and reduce approximation error (see the first sketch after this list).
- Outlier-Friendly Feature Knowledge Distillation (OFF): Ternarization can cause substantial information loss, especially in pretrained LLMs. To mitigate this, the authors propose OFF, which recovers lost information by maximizing the mutual information between features of the ternarized and floating-point models. The objective is built on cosine similarity, which is insensitive to outlier magnitudes and therefore stabilizes training (a sketch of such a loss also follows this list).
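
To make the DLT idea concrete, here is a minimal PyTorch-style sketch of a linear layer whose weights are ternarized around a learnable per-output-channel scale and shift. The module name, the per-channel granularity, the 0.7-threshold heuristic, and the straight-through estimator are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class LearnableTernaryLinear(nn.Module):
    """Sketch: linear layer with ternary weights plus a learnable scale and shift."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        # Learnable scale and shift, one per output channel (an assumption;
        # the paper's exact parameterization may differ).
        self.scale = nn.Parameter(self.weight.detach().abs().mean(dim=1, keepdim=True))
        self.shift = nn.Parameter(self.weight.detach().mean(dim=1, keepdim=True))

    def ternarize(self) -> torch.Tensor:
        # Remove the (non-zero) mean before quantizing, so the ternary grid
        # is centered on the shifted weight distribution.
        centered = self.weight - self.shift
        # Threshold deciding which weights snap to zero (heuristic borrowed
        # from ternary-weight-network literature, assumed here).
        delta = 0.7 * centered.abs().mean(dim=1, keepdim=True)
        codes = torch.sign(centered) * (centered.abs() > delta).float()
        # Straight-through estimator for the non-differentiable rounding step:
        # forward sees the ternary codes, backward treats them as identity.
        codes = centered + (codes - centered).detach()
        return self.scale * codes + self.shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return nn.functional.linear(x, self.ternarize())


# Usage: drop-in replacement for nn.Linear during quantization-aware training.
layer = LearnableTernaryLinear(16, 8)
out = layer(torch.randn(4, 16))
```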
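Likewise, a hedged sketch of an OFF-style feature distillation loss: it compares the floating-point teacher's and the ternarized student's intermediate features via cosine similarity, so outlier magnitudes do not dominate the objective. The function name, layer selection, and loss weighting are assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def off_feature_loss(student_feats, teacher_feats):
    """Encourage each ternarized-model feature to point in the same direction
    as the corresponding full-precision feature. Cosine similarity ignores
    magnitude, so activation outliers do not dominate the loss
    (illustrative sketch, not the paper's exact objective)."""
    losses = []
    for s, t in zip(student_feats, teacher_feats):
        # Flatten to (tokens, hidden) and compare per-token directions.
        s = s.reshape(-1, s.shape[-1])
        t = t.reshape(-1, t.shape[-1])
        cos = F.cosine_similarity(s, t.detach(), dim=-1)
        losses.append((1.0 - cos).mean())
    return torch.stack(losses).mean()


# Usage with per-layer hidden states collected from both models (lambda_off
# is a hypothetical weighting coefficient):
# loss = task_loss + lambda_off * off_feature_loss(student_hidden, teacher_hidden)
```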
Experimental Results
The effectiveness of TernaryLLM is validated through extensive experiments on various standard NLP benchmarks, including text generation and zero-shot tasks using models from the OPT and LLaMA families. Key results include:
- Perplexity Improvement: For the LLaMA-3 model, the proposed approach (W1.58A16) surpasses the previous state-of-the-art method (W2A16), achieving an average perplexity 5.8 points lower on the C4 dataset.
- Zero-Shot Accuracy: On zero-shot tasks, the same model shows an improvement of 8.2% in average accuracy compared to previous methods.
Detailed comparisons with prior weight-only quantization techniques such as RTN, GPTQ, AWQ, OmniQuant, PB-LLM, and DB-LLM demonstrate TernaryLLM's superior performance. Notably, even at 1.58 bits per weight (three levels carry log2(3) ≈ 1.58 bits), the method outperforms existing 2-bit quantization-aware training methods in preserving model accuracy.
Implications and Future Directions
The practical implications of this research are significant. Deploying LLMs with reduced computational and memory requirements allows real-world applications to achieve higher throughput and lower latency at reduced operational cost. In particular, because ternary weights take only the values -1, 0, and +1, matrix multiplications largely reduce to energy-efficient floating-point additions and subtractions, promising substantial reductions in power consumption, a crucial factor for large-scale deployments (see the sketch below).
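
To see why ternary weights trade multiplications for additions, consider a matrix-vector product whose weight codes are restricted to -1, 0, and +1; only a single per-row scale multiplication remains. The following plain-Python sketch is for exposition only, not an optimized kernel.

```python
def ternary_matvec(codes, scales, x):
    """y[i] = scales[i] * sum_j codes[i][j] * x[j], with codes[i][j] in {-1, 0, +1}.
    Because the codes are ternary, the inner loop needs only additions and
    subtractions; the one multiplication per row applies the scale."""
    y = []
    for row, scale in zip(codes, scales):
        acc = 0.0
        for c, xj in zip(row, x):
            if c == 1:
                acc += xj        # +1: add the activation
            elif c == -1:
                acc -= xj        # -1: subtract the activation
            # 0: skip entirely, the weight contributes nothing
        y.append(scale * acc)
    return y


# Example: a 2x3 ternary weight matrix acting on a length-3 input.
print(ternary_matvec([[1, 0, -1], [-1, 1, 1]], [0.5, 0.5], [2.0, 3.0, 4.0]))
# -> [-1.0, 2.5]
```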
Theoretically, the novel approaches of DLT and OFF redefine the strategies for extreme quantization in neural networks. By allowing learnable parameters in ternarized models and emphasizing the recovery of semantic information, this research opens new avenues for improving the accuracy and robustness of quantized LLMs.
Future developments may include further fine-tuning of the ternarization process, exploring new architectures that are inherently more amenable to low-bit quantization, and developing hardware accelerators optimized for ternarized LLMs. Moreover, the strategies outlined in this paper could be extended to other types of neural networks beyond transformers, such as convolutional neural networks or recurrent neural networks, potentially broadening the impact of this research across various domains of artificial intelligence.
In conclusion, "TernaryLLM: Ternarized LLM" provides a significant contribution to the field of neural network quantization. The proposed methodologies facilitate the deployment of efficient, high-performance LLMs, paving the way for more scalable and sustainable AI applications.