Retrieve-Cluster-Summarize: An Alternative to End-to-End Training for Query-specific Article Generation (2310.12361v1)
Abstract: Query-specific article generation is the task of generating, for a given search query, a single article that gives an overview of the topic. We envision such articles as an alternative to presenting a ranking of search results. While generative LLMs like ChatGPT also address this task, they are known to hallucinate new information, and their models are secret and hard to analyze and control. Some generative LLMs provide supporting references, yet these are often unrelated to the generated content. As an alternative, we propose to study article generation systems that integrate document retrieval, query-specific clustering, and summarization. By design, such models can provide actual citations as provenance for their generated text. In particular, we contribute an evaluation framework that allows each of these three components to be trained and evaluated separately before they are combined into one system. We experimentally demonstrate that a system composed of the best-performing individual components also obtains the best F-1 overall system quality.
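The three-stage design described in the abstract (retrieval, query-specific clustering, summarization with citations) can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in rather than the paper's method: `Passage`, `retrieve`, `cluster`, `summarize`, and `generate_article` are invented names, retrieval is plain term overlap, clustering is a greedy Jaccard grouping that ignores query terms, and the "summary" is simply the lead sentence of each cluster followed by the IDs of the passages it draws on.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    pid: str   # hypothetical passage identifier used as the citation
    text: str


def tokens(text):
    """Lowercased word set with surrounding punctuation stripped."""
    return {w.strip(".,;:!?").lower() for w in text.split() if w.strip(".,;:!?")}


def retrieve(query, corpus, k=4):
    """Toy retrieval: rank passages by number of shared query terms."""
    q = tokens(query)
    scored = [(len(q & tokens(p.text)), p) for p in corpus]
    scored = [(s, p) for s, p in scored if s > 0]
    scored.sort(key=lambda sp: -sp[0])  # stable sort keeps corpus order on ties
    return [p for _, p in scored[:k]]


def cluster(passages, query, threshold=0.2):
    """Toy query-specific clustering: greedy grouping by Jaccard overlap,
    computed after removing the query terms every retrieved passage shares."""
    q = tokens(query)
    clusters = []
    for p in passages:
        words = tokens(p.text) - q
        for c in clusters:
            rep = tokens(c[0].text) - q  # first passage represents the cluster
            union = words | rep
            if union and len(words & rep) / len(union) >= threshold:
                c.append(p)
                break
        else:
            clusters.append([p])
    return clusters


def summarize(cluster_passages):
    """Toy extractive summary: lead sentence plus citations of all members."""
    lead = cluster_passages[0].text.split(". ")[0].rstrip(".")
    cites = ", ".join(p.pid for p in cluster_passages)
    return f"{lead}. [{cites}]"


def generate_article(query, corpus):
    """Chain the three independently replaceable stages into one system."""
    retrieved = retrieve(query, corpus)
    return "\n".join(summarize(c) for c in cluster(retrieved, query))
```

Because each stage is a separate function with a plain input and output, any one of them could be swapped or evaluated on its own before the full pipeline is assembled, which is the property the abstract's evaluation framework relies on.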