Retrieval-Pretrained Transformer: Long-range Language Modeling with Self-retrieval (2306.13421v2)
Abstract: Retrieval-augmented language models (LMs) have received much attention recently. However, the retriever is typically not trained jointly as a native component of the LM, but added post-hoc to an already-pretrained LM, which limits the ability of the LM and the retriever to adapt to one another. In this work, we propose the Retrieval-Pretrained Transformer (RPT), an architecture and training procedure for jointly training a retrieval-augmented LM from scratch, which we apply to the task of modeling long texts. Given a recently generated text chunk in a long document, the LM computes query representations, which are then used to retrieve earlier chunks in the document, located potentially tens of thousands of tokens before. Information from retrieved chunks is fused into the LM representations to predict the next target chunk. We train the retriever component with a semantic objective, where the goal is to retrieve chunks that increase the probability of the next chunk according to a reference LM. We evaluate RPT on four long-range language modeling tasks, spanning books, code, and mathematical writing, and demonstrate that RPT improves retrieval quality and, subsequently, perplexity across the board compared to strong baselines.
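The abstract describes three concrete mechanisms: chunking a long document, computing query representations that retrieve *earlier* chunks, and supervising the retriever by how much a candidate chunk raises a reference LM's probability of the next chunk. The sketch below illustrates those mechanisms under stated assumptions; the chunk length, mean-pooling query encoder, dot-product scoring, and the `ref_lm` interface are all illustrative choices, not the authors' exact architecture.

```python
# Minimal sketch of the self-retrieval idea from the abstract. All sizes,
# pooling choices, and the reference-LM interface below are assumptions made
# for clarity; they are not taken from the paper's implementation.

import torch
import torch.nn.functional as F

CHUNK_LEN = 64   # tokens per chunk (assumed)
TOP_K = 2        # retrieved neighbours per query chunk (assumed)


def chunk_tokens(token_ids: torch.Tensor, chunk_len: int = CHUNK_LEN) -> torch.Tensor:
    """Split a 1-D token sequence into [num_chunks, chunk_len], dropping the remainder."""
    n = token_ids.numel() // chunk_len
    return token_ids[: n * chunk_len].view(n, chunk_len)


def chunk_queries(hidden: torch.Tensor, proj: torch.nn.Linear) -> torch.Tensor:
    """Mean-pool each chunk's hidden states and project to a query vector.
    hidden: [num_chunks, chunk_len, d_model] -> [num_chunks, d_query]."""
    return proj(hidden.mean(dim=1))


def retrieve_earlier_chunks(queries: torch.Tensor, keys: torch.Tensor, top_k: int = TOP_K):
    """For chunk i, score only chunks j < i (causality) and return top-k indices.
    Chunk 0 has no valid neighbours; its row is all -inf and should be ignored."""
    scores = queries @ keys.t()                                   # [C, C] similarity
    causal = torch.tril(torch.ones_like(scores), diagonal=-1).bool()
    scores = scores.masked_fill(~causal, float("-inf"))
    k = min(top_k, queries.size(0) - 1)
    return scores.topk(k, dim=-1).indices if k > 0 else None


@torch.no_grad()
def reference_lm_gain(ref_lm, candidate: torch.Tensor, query: torch.Tensor,
                      target: torch.Tensor) -> float:
    """Semantic supervision signal for the retriever, as described in the abstract:
    how much does prepending a candidate chunk raise the reference LM's
    log-likelihood of the target (next) chunk? `ref_lm` is assumed to be any
    causal LM mapping token ids [1, T] to logits [1, T, vocab]."""
    def nll(context: torch.Tensor) -> torch.Tensor:
        ids = torch.cat([context, target]).unsqueeze(0)           # [1, T]
        logits = ref_lm(ids)[0, :-1]                              # row i predicts token i+1
        return F.cross_entropy(logits[-target.numel():], target)  # NLL of target tokens
    # Positive gain = the candidate makes the next chunk more likely.
    return (nll(query) - nll(torch.cat([candidate, query]))).item()
```

The sketch stops at retrieval and the retriever's training signal; how the retrieved chunks' representations are fused into the LM (the abstract only says information is "fused into the LM representations") is left out, since the exact fusion mechanism is not specified there.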
Authors: Ohad Rubin, Jonathan Berant