Semiparametric Token-Sequence Co-Supervision (2403.09024v1)
Abstract: In this work, we introduce a semiparametric token-sequence co-supervision training method. It trains an LLM by simultaneously leveraging supervision from the traditional next-token prediction loss, computed over the parametric token embedding space, and a next-sequence prediction loss, computed over a nonparametric sequence embedding space. The nonparametric sequence embedding space is constructed by a separate LLM tasked with condensing an input text into a single representative embedding. Our experiments demonstrate that a model trained with both supervisions consistently surpasses models trained with either supervision alone. Analysis suggests that this co-supervision encourages broader generalization across the model. In particular, the robustness of the parametric token space, established during pretraining, tends to enhance the stability of the nonparametric sequence embedding space, a new space established by another LLM.
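To make the two supervisions concrete, the following PyTorch-style sketch shows one way the combined objective could be computed. It is an illustration, not the authors' implementation: the `generator` (a Hugging Face-style causal LM exposing `.loss` and `.hidden_states`), the `seq_encoder` (the separate LM that condenses each candidate text into one embedding), the in-batch-negative setup, and the weighting factor `alpha` are all assumptions.

```python
import torch
import torch.nn.functional as F

def co_supervision_loss(generator, seq_encoder, token_ids, token_labels,
                        cand_seq_ids, alpha=1.0):
    """Hypothetical combination of token- and sequence-level supervision.

    generator    : causal LM being trained (assumed HF-style interface).
    seq_encoder  : separate LM condensing a text into a single embedding.
    token_ids    : input token ids of the training example, shape (B, T).
    token_labels : labels for next-token prediction, shape (B, T).
    cand_seq_ids : token ids of candidate sequences; the gold sequence for
                   batch item i is assumed to sit at index i (in-batch negatives).
    """
    # 1) Parametric supervision: standard next-token prediction loss,
    #    computed over the generator's token embedding space.
    out = generator(input_ids=token_ids, labels=token_labels,
                    output_hidden_states=True)
    ntp_loss = out.loss

    # 2) Nonparametric supervision: the generator's last hidden state at the
    #    sequence boundary acts as a query against sequence embeddings
    #    produced by the separate encoder LM.
    query = out.hidden_states[-1][:, -1, :]       # (B, d) query vectors
    cand_emb = seq_encoder(cand_seq_ids)          # (N, d), one embedding per candidate (assumed interface)
    scores = query @ cand_emb.T                   # (B, N) similarity logits
    gold_idx = torch.arange(query.size(0), device=scores.device)
    nsp_loss = F.cross_entropy(scores, gold_idx)  # contrastive next-sequence prediction loss

    # Co-supervision is simply a weighted sum of the two losses, both of
    # which backpropagate through the generator.
    return ntp_loss + alpha * nsp_loss
```

Under this framing, the next-sequence loss is a contrastive objective in the encoder's embedding space, while the next-token loss anchors the generator in its pretrained token space, consistent with the abstract's observation that the parametric space stabilizes the newly introduced nonparametric one.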