- The paper introduces NDP as a new paradigm that replaces one-hot targets with n-gram distributions to address limitations of traditional next-token prediction.
- The method builds separate n-gram distributions from supervised and causal language modeling data and combines them as training targets, yielding average improvements of up to +0.61 points.
- Experimental results confirm NDP’s effectiveness in general, translation, and domain adaptation tasks, enabling more efficient and robust model training.
Next Distribution Prediction: A Broader Objective for LLM Training
The paper "NDP: Next Distribution Prediction as a More Broad Target" by Junhao Ruan et al. proposes a novel paradigm for training LLMs to address inherent limitations in the traditional next-token prediction (NTP) approach. This work is centered around the introduction of Next Distribution Prediction (NDP), which leverages n-gram distributions as a target for training, replacing the conventional one-hot targets to enhance model performance without additional online training time. The authors claim that NDP provides significant improvements across various tasks and domains when compared to NTP.
Introduction
The existing NTP paradigm, while powerful, is criticized for its inability to handle tasks that require advanced planning and for error propagation during inference. Moreover, NTP trains models against a narrow objective: approximating a sub-optimal one-hot distribution in which a single token is treated as the only "correct" successor. This diverges from human cognitive processes, which weigh multiple potential successor tokens. Ruan et al. argue that the ideal learning target is a non-one-hot distribution that reflects the statistical reality of a comprehensive world dataset.
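To make the contrast concrete, consider a toy corpus in which the context "on the" is followed by several different words. The following minimal Python sketch (an illustration, not code from the paper) builds the kind of successor distribution NDP treats as a target, where NTP would instead use a separate one-hot target for each occurrence:

```python
from collections import Counter, defaultdict

# Toy corpus: the context ("on", "the") is followed by three different words.
corpus = ("the cat sat on the mat the dog sat on the floor "
          "the cat sat on the sofa").split()

# Count successors of every bigram context.
successors = defaultdict(Counter)
for a, b, c in zip(corpus, corpus[1:], corpus[2:]):
    successors[(a, b)][c] += 1

# Normalize the counts for one context into a distribution.
counts = successors[("on", "the")]
total = sum(counts.values())
ngram_target = {tok: n / total for tok, n in counts.items()}
print(ngram_target)  # {'mat': 0.33..., 'floor': 0.33..., 'sofa': 0.33...}
```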
Methodology
To validate their hypothesis, the authors conducted an initial experiment comparing the alignment of n-gram and one-hot distributions with the output distribution of powerful LLMs. They observed that n-gram distributions aligned more closely with the LLM outputs, leading to the development of the NDP approach.
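The paper's exact alignment metric is not reproduced in this summary; the sketch below uses KL divergence with made-up numbers purely to illustrate the comparison, namely that an n-gram target can sit much closer to a strong LLM's output distribution than a one-hot target does:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p || q) between two distributions over the same vocabulary."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

# Hypothetical next-token distribution from a strong LLM for some prefix.
llm_probs = np.array([0.55, 0.25, 0.15, 0.05])

one_hot = np.array([1.0, 0.0, 0.0, 0.0])     # NTP target: all mass on one token
ngram   = np.array([0.50, 0.30, 0.15, 0.05])  # target from corpus n-gram counts

print(kl(one_hot, llm_probs))  # ~0.60: one-hot is far from the LLM's output
print(kl(ngram, llm_probs))    # ~0.007: n-gram target aligns far more closely
```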
NDP replaces one-hot targets with n-gram distributions, constructing separate distributions for the supervised and causal language modeling (CLM) components, which are then combined during training. This allows the model to efficiently incorporate statistical regularities from larger datasets, improving learning accuracy.
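The paper's exact fusion scheme is not detailed in this summary; below is a minimal PyTorch sketch of the idea, assuming the two n-gram targets are mixed with a fixed weight `alpha` (a hypothetical hyperparameter) and trained against with a soft cross-entropy:

```python
import torch
import torch.nn.functional as F

def ndp_loss(logits, ngram_sup, ngram_clm, alpha=0.5):
    """Soft cross-entropy against a target mixing two n-gram distributions.

    logits:    (batch, seq, vocab) raw model outputs
    ngram_sup: (batch, seq, vocab) n-gram targets from supervised data
    ngram_clm: (batch, seq, vocab) n-gram targets from unsupervised (CLM) data
    alpha:     mixing weight; a fixed scalar here for illustration only
    """
    target = alpha * ngram_sup + (1.0 - alpha) * ngram_clm  # combined soft target
    log_probs = F.log_softmax(logits, dim=-1)
    return -(target * log_probs).sum(dim=-1).mean()
```

Note that because the target is a full distribution rather than a token index, this loss reduces to standard NTP cross-entropy in the special case where the target collapses to one-hot.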
Experimental Setup
The authors performed extensive experiments across different model sizes and domains, including:
- General tasks with LLMs (e.g., Gemma-2B, LLaMA3-8B)
- Translation tasks with T5 models (small, base, large)
- Domain adaptation tasks using medical data for Qwen2-7B and LLaMA3-8B
- Unifying continued pre-training (CPT) and instruction fine-tuning (IFT) processes
Results
The experiments demonstrated substantial improvements using NDP:
- General Tasks: NDP consistently outperformed NTP on all evaluated benchmarks, with improvements of up to +0.61 points on average.
- Translation Tasks: NDP showed significant gains in both BLEU and COMET scores, particularly for out-of-domain generalization.
- Domain Adaptation: When applied to the medical domain, NDP outperformed NTP significantly, particularly in models without prior domain-specific training.
- Unified CPT and IFT: NDP enabled simultaneous use of supervised and unsupervised data, enhancing model adaptability and performance with reduced training overhead.
Implications and Future Directions
NDP provides a promising alternative to NTP by addressing its candidate narrowing problem and allowing models to learn from more realistic statistical distributions. This approach could have far-reaching implications, including improved efficiency and performance in fine-tuning processes, better generalization across domains, and more robust training frameworks for LLMs.
Future research could explore the integration of more complex n-gram models, adaptive fusion techniques for combining supervised and CLM distributions, and further analyses of NDP's convergence properties in various settings. Additional studies might focus on optimizing hyperparameters for NDP, testing the approach in different languages and tasks, and evaluating its performance in real-world applications.
Conclusion
The paper by Ruan et al. presents a notable advance in LLM training paradigms through the introduction of NDP. By leveraging n-gram distributions, NDP addresses inherent limitations of NTP, yielding substantial improvements across a range of tasks and domains. This work paves the way for future developments in model training strategies, underscoring the value of broader and more nuanced learning objectives.