CorpusBrain++: A Continual Generative Pre-Training Framework for Knowledge-Intensive Language Tasks (2402.16767v1)
Abstract: Knowledge-intensive language tasks (KILTs) typically require retrieving relevant documents from trustworthy corpora, e.g., Wikipedia, to produce specific answers. Recently, CorpusBrain, a pre-trained generative retrieval model for KILTs, was proposed and achieved new state-of-the-art retrieval performance. However, most existing research on KILTs, including CorpusBrain, has focused on a static document collection, overlooking the dynamic nature of real-world scenarios, where new documents are continuously incorporated into the source corpus. To address this gap, it is crucial to examine how well retrieval models handle the dynamic retrieval scenario inherent in KILTs. In this work, we first introduce the continual document learning (CDL) task for KILTs and build a novel benchmark dataset, named KILT++, based on the original KILT dataset for evaluation. We then conduct a comprehensive study of pre-trained CorpusBrain on KILT++. In contrast to its promising results in the stationary scenario, CorpusBrain is prone to catastrophic forgetting in the dynamic scenario, which hampers retrieval performance. To alleviate this issue, we propose CorpusBrain++, a continual generative pre-training framework. Empirical results demonstrate that CorpusBrain++ is both significantly more effective and markedly more efficient than traditional and generative IR methods.
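The abstract describes the continual document learning (CDL) task only at a high level. The minimal Python sketch below illustrates the evaluation protocol it implies: documents arrive in sessions, the retriever is updated after each session, and effectiveness is re-measured on all sessions seen so far, so a drop on earlier sessions quantifies catastrophic forgetting. All names here (`Session`, `GenerativeRetriever`, `run_cdl`) are illustrative assumptions, not the paper's actual interfaces, and the update step is a stub where CorpusBrain++'s continual pre-training would plug in.

```python
# Hypothetical sketch of the CDL setup: sessions of new documents arrive
# over time, and a generative retriever must index each session without
# losing effectiveness on earlier ones.
from dataclasses import dataclass


@dataclass
class Session:
    """One batch of newly arrived documents plus held-out test queries."""
    doc_ids: list[str]
    test_queries: list[tuple[str, str]]  # (query, relevant doc id)


class GenerativeRetriever:
    """Stand-in for a seq2seq model mapping queries to document identifiers."""

    def __init__(self) -> None:
        self.known_docs: set[str] = set()

    def continually_pretrain(self, session: Session) -> None:
        # Stub: this is where a continual generative pre-training method
        # (e.g., CorpusBrain++) would update the model on the new session.
        self.known_docs.update(session.doc_ids)

    def retrieve(self, query: str) -> str:
        # Placeholder: a real model would autoregressively decode a
        # document identifier conditioned on the query.
        return next(iter(self.known_docs), "")


def session_accuracies(model: GenerativeRetriever,
                       sessions: list[Session]) -> list[float]:
    """Per-session retrieval accuracy; low scores on early sessions after
    later updates are the catastrophic forgetting the paper studies."""
    scores = []
    for s in sessions:
        hits = sum(model.retrieve(q) == d for q, d in s.test_queries)
        scores.append(hits / max(len(s.test_queries), 1))
    return scores


def run_cdl(model: GenerativeRetriever, stream: list[Session]) -> None:
    seen: list[Session] = []
    for t, session in enumerate(stream):
        model.continually_pretrain(session)
        seen.append(session)
        # Re-evaluate on *all* sessions seen so far to expose forgetting.
        print(f"after session {t}: {session_accuracies(model, seen)}")
```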
Authors: Jiafeng Guo, Changjiang Zhou, Ruqing Zhang, Jiangui Chen, Maarten de Rijke, Yixing Fan, Xueqi Cheng