Beyond Next Token Prediction: Patch-Level Training for Large Language Models

Published 17 Jul 2024 in cs.CL, cs.AI, and cs.LG (arXiv:2407.12665v3)

Abstract: The prohibitive training costs of LLMs have emerged as a significant bottleneck in the development of next-generation LLMs. In this paper, we show that it is possible to significantly reduce the training costs of LLMs without sacrificing their performance. Specifically, we introduce patch-level training for LLMs, in which multiple tokens are aggregated into a unit of higher information density, referred to as a `patch', to serve as the fundamental text unit for training LLMs. During patch-level training, we feed the LLM shorter sequences of patches and train it to predict the next patch, thereby processing the majority of the training data at a significantly reduced cost. Following this, the model continues token-level training on the remaining training data to align with the inference mode. Experiments on a diverse range of models (370M-2.7B parameters) demonstrate that patch-level training can reduce the overall training costs to 0.5$\times$, without compromising the model performance compared to token-level training. Source code: https://github.com/shaochenze/PatchTrain.


Summary

  • The paper introduces patch-level training that compresses tokens into patches, reducing computational costs by nearly 50% while preserving model performance.
  • The two-stage method first optimizes patch-level prediction then refines parameters via token-level training for effective knowledge transfer.
  • Experimental results highlight enhanced neuron activation and swift scaling with more data, demonstrating the method's efficiency and practical impact.

Patch-Level Training for LLMs: An Efficient Approach

The paper "Beyond Next Token Prediction: Patch-Level Training for Large Language Models" by Chenze Shao et al. tackles a critical concern in the field of LLMs: the computational cost of training. It presents an innovative training strategy, termed "patch-level training," aimed at improving training efficiency without compromising model performance. The core idea is to compress multiple tokens into patches during an initial training stage, significantly reducing computational load by processing shorter sequences.

Methodology

Patch-Level Training Mechanism

The conventional token-level training of LLMs requires each token to be processed individually, leading to considerable computational expenses due to the extensive token sequences involved. Shao et al. propose an alternative training paradigm where sequences of tokens are compressed into patches. This patch-level training involves constructing shorter sequences of patches and training the model to predict the subsequent patch.

This training method is divided into two stages:

  1. Patch-Level Training: The model is fed shorter sequences of patches, compressing groups of K consecutive tokens into single patches, and it learns to predict the next patch.
  2. Token-Level Training: Subsequently, a smaller subset of data is used to continue with token-level training. This step aligns with the model's inference mode, refining the parameters initialized during the patch-level training to adapt to token-level inputs.

The trained parameters from the patch-level model (PLM) are harnessed to initialize the token-level model (TLM), ensuring a seamless knowledge transfer.
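The data preparation for the patch-level stage can be sketched as follows. This is a minimal sketch assuming each patch embedding is the average of its K token embeddings and that each patch position is trained to predict all K tokens of the next patch; the helper names are illustrative, not the paper's API:

```python
import numpy as np

def make_patches(token_ids, embed, K):
    """Aggregate every K consecutive token embeddings into one patch
    embedding by averaging (an assumed aggregation scheme), and pair
    each patch with the K tokens of the *next* patch as its target."""
    T = (len(token_ids) // K) * K                    # drop a ragged tail
    toks = np.asarray(token_ids[:T])
    E = embed[toks]                                  # (T, d) token embeddings
    patches = E.reshape(T // K, K, -1).mean(axis=1)  # (T/K, d) patch embeddings
    targets = toks.reshape(T // K, K)[1:]            # (T/K - 1, K) next-patch tokens
    return patches[:-1], targets                     # align inputs with targets

rng = np.random.default_rng(0)
vocab, d, K = 100, 8, 4
embed = rng.standard_normal((vocab, d))
token_ids = rng.integers(0, vocab, size=21)          # 21 tokens -> 20 usable
inputs, targets = make_patches(token_ids, embed, K)
print(inputs.shape, targets.shape)                   # (4, 8) (4, 4)
```

The sequence fed to the transformer is K times shorter than the original token sequence, which is where the compute savings come from.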

Experimental Validation

Training Efficiency

The substantial reduction in computational requirements through patch-level training is illustrated with LLMs of varying sizes (ranging from 370M to 2.7B parameters). By setting the patch size K = 4 and applying this method to a fraction λ = 2/3 of the training data, the paper demonstrates that the computational cost can be halved without any significant detriment to performance. Empirically, this approach maintained comparable perplexity scores and even improved performance on zero-shot NLP benchmarks.
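The halving follows from simple arithmetic on the compute budget. A minimal sketch, assuming a patch-level step costs 1/K of a token-level step per unit of data (the sequence is K times shorter while per-position cost stays roughly constant):

```python
def relative_cost(K, lam):
    """Total compute as a fraction of pure token-level training:
    a fraction lam of the data is processed in patch mode at 1/K
    the cost, and the remaining 1 - lam at full token-level cost."""
    return lam / K + (1 - lam)

# The paper's setting: K = 4 and lambda = 2/3 give half the compute.
print(round(relative_cost(4, 2 / 3), 6))  # 0.5
```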

Performance Across Different Settings

The experiments show that patch-level training has favorable scaling properties, particularly when ample training data is available: performance improves more quickly with additional data than under traditional token-level training. The gains from increasing model size, however, were comparatively modest.

Detailed Analysis

Hyperparameters: Patch Size and Data Fraction

  1. Patch Size (K): The findings suggest that a patch size of K = 4 achieves a balance between efficiency and performance. Larger patch sizes (e.g., K = 16) exhibit slight performance degradation, possibly due to the increased challenge of transferring knowledge to the token-level model.
  2. Data Fraction (λ): The hyperparameter λ determines the proportion of training data processed at the patch level versus the token level. Under fixed computational budgets, an optimal λ of around 2/3 was identified, providing a beneficial trade-off between training cost and model performance.

Neuron Activation Perspective

To elucidate why patch-level training yields better learning efficiency, the paper presents an analysis of neuron activation. Training with patches leads to increased neuron activation rates, particularly in earlier transformer layers, indicating a denser and potentially more effective utilization of model parameters during the knowledge acquisition phase.
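One simple way to quantify such density is the fraction of non-zero (post-ReLU) feed-forward activations per layer. The sketch below uses synthetic activations and a hypothetical `activation_rate` helper; the paper's exact measurement may differ:

```python
import numpy as np

def activation_rate(hidden, threshold=0.0):
    """Fraction of neuron activations above threshold, a crude proxy
    for how densely a layer's FFN neurons are being used."""
    return float((hidden > threshold).mean())

rng = np.random.default_rng(1)
# Synthetic post-ReLU activations: one densely and one sparsely firing layer.
dense = np.maximum(rng.standard_normal((32, 512)), 0)
sparse = np.maximum(rng.standard_normal((32, 512)) - 1.0, 0)
print(activation_rate(dense) > activation_rate(sparse))  # True
```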

Implications and Future Directions

Patch-level training marks a significant advance in the efficiency of training large-scale LLMs. Practically, the approach can halve training costs without sacrificing model performance, making LLM development more accessible and sustainable. Theoretically, it opens avenues for exploring neuron activation patterns and their impact on efficient learning.

Future research could focus on validating the scalability of patch-level training with model and dataset sizes akin to state-of-the-art LLMs. Understanding empirical scaling laws that incorporate varying K and λ values could further optimize training costs. Additionally, investigating the applicability of patch-level training to other domains, such as image or speech processing, would expand the method's utility.

In conclusion, the paper by Shao et al. presents a substantial contribution to efficient LLM training methodologies. By innovatively restructuring the training process with patch-level sequences, they achieve a balance of reduced computational burdens and maintained model performance, laying the groundwork for future efficiency-focused advancements in machine learning.
