
FBI-LLM: Scaling Up Fully Binarized LLMs from Scratch via Autoregressive Distillation (2407.07093v1)

Published 9 Jul 2024 in cs.CL, cs.AI, and cs.LG

Abstract: This work presents a Fully BInarized LLM (FBI-LLM), demonstrating for the first time how to train a large-scale binary LLM from scratch (not the partial binary or ternary LLM like BitNet b1.58) to match the performance of its full-precision counterparts (e.g., FP16 or BF16) in transformer-based LLMs. It achieves this by employing an autoregressive distillation (AD) loss while maintaining equivalent model dimensions (130M, 1.3B, 7B) and training data volume as regular LLM pretraining, and it delivers competitive results in terms of perplexity and task-specific effectiveness. Intriguingly, by analyzing the training trajectory, we find that the pretrained weight is not necessary for training binarized LLMs from scratch. This research encourages a new computational framework and may facilitate the future design of specialized hardware tailored for fully 1-bit LLMs. We make all models, code, and training dataset fully accessible and transparent to support further research (Code: https://github.com/LiqunMa/FBI-LLM. Model: https://huggingface.co/LiqunMa/).

An Analytical Overview of "FBI-LLM: Scaling Up Fully Binarized LLMs from Scratch via Autoregressive Distillation"

The paper "FBI-LLM: Scaling Up Fully Binarized LLMs from Scratch via Autoregressive Distillation" authored by Liqun Ma, Mingjie Sun, and Zhiqiang Shen, introduces the concept of Fully Binarized LLMs (FBI-LLMs), offering an avant-garde methodology for training LLMs with binary weights directly from scratch. This work departs from conventional quantization strategies by focusing purely on 1-bit binarization, thus promising notable enhancements in storage and inference efficiency while maintaining performance parity with full-precision models.

Methodological Innovations and Training Architecture

FBI-LLM presents a streamlined methodology for training binarized LLMs. The architecture replaces standard linear layers with FBI-Linear modules, which pair sign-binarized weights with learnable scaling parameters (α and β) to compensate for binarization error. The embedding, LayerNorm, and output head layers are kept in full precision to preserve semantic integrity and numerical stability.
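As a rough illustration, a minimal PyTorch sketch of such a layer might look as follows, assuming a per-output-row scale α and shift β applied to the sign of latent full-precision weights, with a straight-through estimator in the backward pass; the class and variable names here are illustrative, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class _SignSTE(torch.autograd.Function):
    """Sign binarization with a straight-through estimator for the backward pass."""

    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        # Map latent weights to {-1, +1} (zeros go to +1 to keep the set strictly binary).
        return torch.where(w >= 0, torch.ones_like(w), -torch.ones_like(w))

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        # Pass gradients through where the latent weight lies in [-1, 1], block elsewhere.
        return grad_out * (w.abs() <= 1).to(grad_out.dtype)


class FBILinear(nn.Module):
    """Linear layer with 1-bit weights plus learnable per-row scale and shift."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.normal_(self.weight, std=0.02)
        self.alpha = nn.Parameter(torch.ones(out_features, 1))   # scale per output row
        self.beta = nn.Parameter(torch.zeros(out_features, 1))   # shift per output row

    def forward(self, x):
        w_bin = _SignSTE.apply(self.weight)        # {-1, +1} weights
        w_eff = self.alpha * w_bin + self.beta     # rescaled binary weights
        return F.linear(x, w_eff)
```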

A pivotal innovation is the autoregressive distillation (AD) loss used during training. Instead of minimizing cross-entropy against one-hot next-token labels, as in standard LLM pretraining, AD matches the binarized student's next-token distribution to that of a full-precision pretrained teacher at every position. This distillation objective simplifies the training pipeline and guides the optimization trajectory, yielding performance highly competitive with full-precision counterparts.
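A minimal sketch of such a distillation objective, assuming plain cross-entropy against the teacher's softmax at every position (temperature scaling and other details are implementation choices not asserted here):

```python
import torch
import torch.nn.functional as F


def autoregressive_distillation_loss(student_logits, teacher_logits):
    """Cross-entropy between the teacher's next-token distribution (soft targets)
    and the student's predicted distribution, averaged over batch and positions.

    Both tensors have shape (batch, seq_len, vocab_size).
    """
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    per_token_loss = -(teacher_probs * student_log_probs).sum(dim=-1)
    return per_token_loss.mean()


# Typical usage: the full-precision teacher runs without gradients,
# while the binarized student is trained on the teacher's distributions.
# with torch.no_grad():
#     teacher_logits = teacher_model(input_ids).logits
# student_logits = student_model(input_ids).logits
# loss = autoregressive_distillation_loss(student_logits, teacher_logits)
```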

Empirical Results and Performance Evaluation

The empirical evaluation of FBI-LLM spans model configurations of 130M, 1.3B, and 7B parameters. Across these scales, the binarized models remain close to their full-precision analogs in both perplexity and downstream task accuracy. Notably, the FBI-LLM 1.3B model recovers up to roughly 87% of the performance of full-precision models of similar scale, supporting the feasibility of binary-weight LLMs for practical applications.

Compared with other low-bit baselines, FBI-LLMs also deliver lower perplexity and higher downstream accuracy. For instance, the 1.3B-scale FBI-LLM surpasses binary and even some ternary quantized models such as BitNet b1.58 on tasks like BoolQ and OpenBookQA.

Analysis and Theoretical Implications

A closer analysis of training stability and model behavior yields critical insights. The authors find that training from scratch is no less effective than continuing from a pretrained full-precision LLM, and they attribute the instability of the latter approach to the mismatch between the parameter-space patterns of binarized and full-precision models. This finding simplifies the training recipe, supporting direct training from scratch as a viable and stable approach for binarized LLMs.

The paper employs metrics such as the flip-flop ratio and gradient norms to track training dynamics. The results underscore the relative stability of training binarized models from scratch, with occasional but manageable instability spikes, validating the robustness of the proposed methodology.
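One plausible way to compute such a flip-flop ratio, understood here as the fraction of binarized weights whose sign changes between two measurement points (the paper's exact definition and logging interval may differ), is sketched below:

```python
import torch


@torch.no_grad()
def flip_flop_ratio(prev_signs, weight):
    """Fraction of binarized weights whose sign changed since the last measurement.

    prev_signs: {-1, +1} tensor recorded at the previous step.
    weight:     current latent full-precision weight tensor.
    Returns the ratio and the current signs (to reuse at the next step).
    """
    curr_signs = torch.where(weight >= 0, torch.ones_like(weight), -torch.ones_like(weight))
    ratio = (curr_signs != prev_signs).float().mean().item()
    return ratio, curr_signs
```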

Practical Implications and Future Directions

FBI-LLM paves the way for future LLM implementations that are more storage and computation-efficient. The hardware implications are substantial, considering the drastic reduction in model size and computational load, potentially influencing the future design of specialized hardware for 1-bit LLM inference.

Future research could explore the binarization of intermediate activations and further hardware optimization to realize full-speed benefits. Additionally, scalable training methods that harness larger training datasets efficiently could refine FBI-LLMs' capabilities, pushing them closer to full-precision performance levels.

In summary, the FBI-LLM framework represents a significant methodological advancement for training binarized LLMs, potentially setting a precedent for future AI model design and deployment. The work's empirical success and analytical insights offer a compelling perspective on the viability of extreme quantization, pointing toward efficient, high-performance 1-bit LLMs.

Authors (3)
  1. Liqun Ma (8 papers)
  2. Mingjie Sun (29 papers)
  3. Zhiqiang Shen (172 papers)
Citations (2)