Semiparametric Token-Sequence Co-Supervision (2403.09024v1)

Published 14 Mar 2024 in cs.CL and cs.AI

Abstract: In this work, we introduce a semiparametric token-sequence co-supervision training method. It trains an LLM by simultaneously leveraging supervision from the traditional next token prediction loss, calculated over the parametric token embedding space, and the next sequence prediction loss, calculated over the nonparametric sequence embedding space. The nonparametric sequence embedding space is constructed by a separate LLM tasked to condense an input text into a single representative embedding. Our experiments demonstrate that a model trained via both supervisions consistently surpasses models trained via each supervision independently. Analysis suggests that this co-supervision encourages a broader generalization capability across the model. In particular, the robustness of the parametric token space, established during pretraining, tends to effectively enhance the stability of the nonparametric sequence embedding space, a new space established by another LLM.

Summary

  • The paper demonstrates that integrating parametric token (NTP) and nonparametric sequence (NSP) supervision significantly enhances language model performance.
  • The STSC method combines the two forms of supervision during training to improve generalization across both in-domain and out-of-domain datasets, with a 14.2% average gain.
  • The approach exhibits a synergistic effect between the parametric and nonparametric embedding spaces, yielding more robust and accurate language model training.

Semiparametric Token-Sequence Co-Supervision Enhances LLM Performance

Introduction to Semiparametric Token-Sequence Co-Supervision

The Semiparametric Token-Sequence Co-Supervision (STSC) method is a training approach for LLMs that combines supervision over the parametric token embedding space (Next Token Prediction, NTP) with supervision over a nonparametric sequence embedding space (Next Sequence Prediction, NSP). The nonparametric sequence embedding space is constructed by a separate LLM whose role is to condense an input text into a single representative embedding. This dual-supervision training paradigm has been shown to outperform training with either form of supervision in isolation.
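
As a rough illustration of how such a nonparametric sequence embedding space might be built, the sketch below uses a separate causal LM to condense each input text into a single vector. The model name and the last-token pooling strategy are assumptions made for illustration, not details taken from the paper.

```python
# Illustrative sketch: condensing a text into one sequence embedding with a
# separate LM. Last-token pooling and the backbone choice are assumptions,
# not the authors' exact setup.
import torch
from transformers import AutoModel, AutoTokenizer

EMB_MODEL = "mistralai/Mistral-7B-v0.1"  # hypothetical embedder backbone

tokenizer = AutoTokenizer.from_pretrained(EMB_MODEL)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # causal LMs often lack a pad token
tokenizer.padding_side = "right"
embedder = AutoModel.from_pretrained(EMB_MODEL)

@torch.no_grad()
def sequence_embedding(texts: list[str]) -> torch.Tensor:
    """Map each input text to a single representative embedding."""
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    hidden = embedder(**batch).last_hidden_state           # (B, T, H)
    last = batch["attention_mask"].sum(dim=1) - 1           # last non-padding position
    emb = hidden[torch.arange(hidden.size(0)), last]        # (B, H)
    return torch.nn.functional.normalize(emb, dim=-1)
```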

Methodological Insights

STSC integrates supervision from both a parametric and a nonparametric perspective, capitalizing on the strengths of each. The underlying hypothesis is that an LLM can not only accommodate an additional embedding space but can also draw on the robustness of its parametric token embedding space, established during pretraining, to stabilize the newly introduced nonparametric space.

  • Next Token Prediction (NTP): Follows traditional LLM training by predicting the next token in a sequence.
  • Next Sequence Prediction (NSP): Extends the prediction to entire sequences, leveraging nonparametric sequence embeddings derived from another LLM.
  • Co-Supervision: Combines the NTP and NSP objectives in a multi-task training framework, encouraging the model to exploit token-level and sequence-level embeddings concurrently; a minimal sketch of the combined objective follows this list.
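
To make the combined objective concrete, here is a minimal sketch under stated assumptions: NTP is standard next-token cross-entropy over the parametric vocabulary, and NSP is approximated as an in-batch contrastive (InfoNCE-style) loss that aligns the generator's hidden states with the target sequence embeddings produced by the separate embedder. Names such as `co_supervision_loss` and `nsp_weight` are illustrative and not taken from the paper.

```python
# Minimal sketch of a token-sequence co-supervision objective (our own
# assumptions, not the authors' exact formulation).
import torch
import torch.nn.functional as F

def co_supervision_loss(
    token_logits: torch.Tensor,   # (B, T, V) generator logits over the vocabulary
    token_labels: torch.Tensor,   # (B, T) next-token targets, -100 = ignore
    query_states: torch.Tensor,   # (B, H) generator states where a sequence is predicted
    seq_embeddings: torch.Tensor, # (B, H) target embeddings from the separate embedder
    nsp_weight: float = 1.0,
    temperature: float = 0.05,
) -> torch.Tensor:
    # Parametric supervision: standard next-token prediction (NTP).
    ntp = F.cross_entropy(
        token_logits.reshape(-1, token_logits.size(-1)),
        token_labels.reshape(-1),
        ignore_index=-100,
    )
    # Nonparametric supervision (NSP): the matching sequence embedding is the
    # positive; the rest of the batch serves as in-batch negatives.
    q = F.normalize(query_states, dim=-1)
    k = F.normalize(seq_embeddings, dim=-1)
    sims = q @ k.t() / temperature                  # (B, B) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)
    nsp = F.cross_entropy(sims, targets)
    return ntp + nsp_weight * nsp
```

In this reading, the sequence embeddings play the role that the output vocabulary plays in NTP: the model predicts the next sequence by scoring candidates in the nonparametric space rather than tokens in the parametric one.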

Experimental evidence supports the broader generalization enabled by this co-supervision, with consistent improvements across a range of information-seeking datasets. In particular, the co-supervised models perform better in both in-domain and out-of-domain scenarios.

Empirical Validation and Analysis

Models trained with the STSC method were evaluated against models trained with each form of supervision alone. Across 10 diverse information-seeking benchmarks, the STSC approach consistently yielded superior performance, with an average improvement of 14.2%.

  • Robustness of Nonparametric Space: The integration of NSP supervision alongside traditional NTP supervision contributes to the stability and robustness of the nonparametric sequence embedding space.
  • Enhanced Generalization: The co-supervised models exhibit better generalization across varying input distributions, indicating a more versatile understanding of language.
  • High Interaction Between Embedding Spaces: Evidence suggests a synergistic effect between parametric and nonparametric embedding spaces, with the co-supervised model utilizing knowledge from both to generate responses.

Theoretical and Practical Implications

The STSC method points to a different way of training LLMs to interact with both parametric and nonparametric embeddings. The ability to bridge these two spaces opens new avenues for constructing systems that generate text with greater accuracy and contextual awareness.

Future Outlook

The STSC framework marks a shift in LLM training paradigms. While its immediate advantages are clear, the full extent of its potential and adaptability remains open for exploration. Future research could examine how STSC applies to other combinations of parametric and nonparametric embeddings, extending beyond token-sequence co-supervision to other forms of representation.

Conclusion

Semiparametric Token-Sequence Co-Supervision represents a significant advancement in LLM training, merging the strengths of both parametric and nonparametric approaches to push the boundaries of model performance. This method not only enhances the robustness and generalization capabilities of LLMs but also provides an insightful framework for future research in generative AI.