
NDP: Next Distribution Prediction as a More Broad Target (2408.17377v1)

Published 30 Aug 2024 in cs.CL and cs.AI

Abstract: LLMs trained on next-token prediction (NTP) paradigm have demonstrated powerful capabilities. However, the existing NTP paradigm contains several limitations, particularly related to planned task complications and error propagation during inference. In our work, we extend the critique of NTP, highlighting its limitation also due to training with a narrow objective: the prediction of a sub-optimal one-hot distribution. To support this critique, we conducted a pre-experiment treating the output distribution from powerful LLMs as efficient world data compression. By evaluating the similarity between the $n$-gram distribution and the one-hot distribution with LLMs, we observed that the $n$-gram distributions align more closely with the output distribution of LLMs. Based on this insight, we introduce Next Distribution Prediction (NDP), which uses $n$-gram distributions to replace the one-hot targets, enhancing learning without extra online training time. We conducted experiments across translation, general task, language transfer, and medical domain adaptation. Compared to NTP, NDP can achieve up to +2.97 COMET improvement in translation tasks, +0.61 average improvement in general tasks, and incredible +10.75 average improvement in the medical domain. This demonstrates the concrete benefits of addressing the target narrowing problem, pointing to a new direction for future work on improving NTP.

Summary

  • The paper introduces NDP as a new paradigm that replaces one-hot targets with n-gram distributions to address limitations of traditional next-token prediction.
  • The methodology combines supervised and causal language modeling by deriving soft n-gram target distributions, adding no extra online training time.
  • Experimental results confirm NDP’s effectiveness in general, translation, and domain adaptation tasks, enabling more efficient and robust model training.

Next Distribution Prediction: A Broader Objective for LLM Training

The paper "NDP: Next Distribution Prediction as a More Broad Target" by Junhao Ruan et al. proposes a novel paradigm for training LLMs to address inherent limitations in the traditional next-token prediction (NTP) approach. This work is centered around the introduction of Next Distribution Prediction (NDP), which leverages n-gram distributions as a target for training, replacing the conventional one-hot targets to enhance model performance without additional online training time. The authors claim that NDP provides significant improvements across various tasks and domains when compared to NTP.

Introduction

The existing NTP paradigm, while powerful, is criticized for its inability to handle tasks that require advanced planning and for suffering from error propagation during inference. Furthermore, NTP trains models using a narrow objective—approximating a sub-optimal one-hot distribution, where a single token is treated as the "correct" successor. This diverges from human cognitive processes, which consider multiple potential successor tokens. Ruan et al. argue that the ideal target for model learning should be a non-one-hot distribution, representing the statistical reality of a comprehensive world dataset.
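The contrast between a one-hot target and a distributional target can be made concrete with a small sketch. The vocabulary, successor counts, and model probabilities below are invented for illustration; the point is that the soft target spreads probability mass over every plausible successor rather than a single "correct" token:

```python
import math

def cross_entropy(target, predicted):
    """H(target, predicted) = -sum_w target(w) * log predicted(w)."""
    return -sum(p * math.log(predicted[w]) for w, p in target.items() if p > 0)

# Model's predicted distribution over next tokens after "the cat sat on the".
predicted = {"mat": 0.5, "rug": 0.3, "sofa": 0.15, "moon": 0.05}

# NTP target: the one observed token receives all the probability mass.
one_hot = {"mat": 1.0}

# NDP-style target: empirical n-gram successor counts, normalized.
counts = {"mat": 6, "rug": 3, "sofa": 1}
total = sum(counts.values())
ngram_target = {w: c / total for w, c in counts.items()}

print(round(cross_entropy(one_hot, predicted), 3))       # 0.693
print(round(cross_entropy(ngram_target, predicted), 3))  # 0.967
```

Training against the soft target penalizes the model for neglecting any frequent successor, which is the "statistical reality" the authors argue one-hot labels discard.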

Methodology

To validate their hypothesis, the authors conducted an initial experiment comparing the alignment of n-gram and one-hot distributions with the output distribution of powerful LLMs. They observed that n-gram distributions aligned more closely with the LLM outputs, leading to the development of the NDP approach.
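The flavor of this pre-experiment can be sketched with a divergence measure. The distributions below are invented stand-ins (the paper queries real LLMs), and KL divergence is used here as one reasonable similarity measure, not necessarily the paper's exact choice:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) over the union vocabulary, with smoothing for zeros."""
    vocab = set(p) | set(q)
    return sum(p.get(w, 0.0) * math.log((p.get(w, 0.0) + eps) / (q.get(w, 0.0) + eps))
               for w in vocab if p.get(w, 0.0) > 0)

# Invented output distribution of a strong LLM for one context.
llm_output = {"mat": 0.55, "rug": 0.25, "sofa": 0.15, "moon": 0.05}
one_hot    = {"mat": 1.0}
ngram      = {"mat": 0.6, "rug": 0.3, "sofa": 0.1}

print(kl_divergence(one_hot, llm_output))  # larger: one-hot ignores alternatives
print(kl_divergence(ngram, llm_output))    # smaller: n-gram tracks the LLM
```

The qualitative outcome mirrors the paper's observation: the n-gram distribution sits closer to the LLM's output distribution than the one-hot label does.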

NDP replaces the one-hot targets with n-gram distributions, generating separate distributions for the supervised and causal language modeling (CLM) components, which are then combined during training. This method allows the model to efficiently incorporate statistical realities from larger datasets, improving learning accuracy.
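A minimal sketch of this target construction (not the paper's implementation) might build n-gram successor distributions from each data source and blend them. The toy corpora, the bigram setting, and the mixing weight `alpha` are all assumptions for illustration:

```python
from collections import Counter, defaultdict

def successor_distributions(corpus, n=2):
    """Map each (n-1)-token context to its normalized next-token distribution."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            counts[context][tokens[i + n - 1]] += 1
    return {ctx: {w: c / sum(ctr.values()) for w, c in ctr.items()}
            for ctx, ctr in counts.items()}

def mix(supervised, clm, alpha=0.5):
    """Blend the supervised and CLM soft targets; the result is still a distribution."""
    vocab = set(supervised) | set(clm)
    return {w: alpha * supervised.get(w, 0.0) + (1 - alpha) * clm.get(w, 0.0)
            for w in vocab}

# Successor distributions for the context ("the",) from each toy data source.
sup = successor_distributions(["the cat sat on the mat",
                               "the cat sat on the rug"])[("the",)]
clm = successor_distributions(["the cat sat on the sofa"])[("the",)]
target = mix(sup, clm, alpha=0.5)
print(target)
```

The blended `target` then stands in for the one-hot label in the cross-entropy loss; because the counting happens offline, no extra online training time is incurred.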

Experimental Setup

The authors performed extensive experiments across different model sizes and domains, including:

  • General tasks with LLMs (e.g., Gemma-2B, LLaMA3-8B)
  • Translation tasks with T5 models (small, base, large)
  • Domain adaptation tasks using medical data for Qwen2-7B and LLaMA3-8B
  • Unifying continued pre-training (CPT) and instruction fine-tuning (IFT) processes

Results

The experiments demonstrated substantial improvements using NDP:

  1. General Tasks: NDP consistently outperformed NTP on all evaluated benchmarks, with improvements of up to +0.61 points on average.
  2. Translation Tasks: NDP showed significant gains in both BLEU and COMET scores, particularly for out-of-domain generalization.
  3. Domain Adaptation: When applied to the medical domain, NDP outperformed NTP significantly, particularly in models without prior domain-specific training.
  4. Unified CPT and IFT: NDP enabled simultaneous use of supervised and unsupervised data, enhancing model adaptability and performance with reduced training overhead.

Implications and Future Directions

NDP provides a promising alternative to NTP by addressing its target narrowing problem and allowing models to learn from more realistic statistical distributions. This approach could have far-reaching implications, including improved efficiency and performance in fine-tuning processes, better generalization across domains, and more robust training frameworks for LLMs.

Future research could explore the integration of more complex n-gram models, adaptive fusion techniques for combining supervised and CLM distributions, and further analyses of NDP's convergence properties in various settings. Additional studies might focus on optimizing hyperparameters for NDP, testing the approach in different languages and tasks, and evaluating its performance in real-world applications.

Conclusion

The paper by Ruan et al. presents a critical advancement in LLM training paradigms by introducing NDP. By leveraging n-gram distributions, NDP addresses the inherent limitations of NTP, leading to substantial improvements across various tasks and domains. This work paves the way for future developments in model training strategies, emphasizing the importance of broader and more nuanced learning objectives.
