MDCure: A Scalable Pipeline for Multi-Document Instruction-Following (2410.23463v3)

Published 30 Oct 2024 in cs.CL and cs.LG

Abstract: Multi-document (MD) processing is crucial for LLMs to handle real-world tasks such as summarization and question-answering across large sets of documents. While LLMs have improved at processing long inputs, MD contexts still present unique difficulties, including management of inter-document dependencies, redundancy, and incoherent structures. To address this challenge, we introduce MDCure, a scalable and effective instruction data generation framework to enhance the MD capabilities of LLMs without the computational cost of pre-training or reliance on human-annotated data. MDCure generates high-quality synthetic MD instruction data over sets of articles via targeted prompts. We also introduce MDCureRM, a cost-effective, MD-specific reward model to score and filter generated data based on their training utility for MD settings. MDCure is compatible with open- and closed-source models in addition to policy optimization methods such as PPO, enabling even small open-source models to surpass proprietary LLMs as strong generators of high-quality MD instruction data without further data filtering. With MDCure, we fine-tune a wide variety of LLMs up to 70B parameters in size from the FlanT5, Qwen2, and LLAMA3.1 model families. Extensive evaluations on a wide range of MD and long-context benchmarks spanning various tasks and domains show MDCure consistently improves performance over pre-trained baselines and base models by up to 75.1%. Our code, datasets, and models are available at https://github.com/yale-nlp/MDCure.
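
The abstract describes a two-stage recipe: prompt an LLM with targeted instructions over a set of related articles to generate synthetic multi-document instruction data, then score and filter the candidates with an MD-specific reward model (MDCureRM) based on their training utility. The sketch below is a minimal, hedged illustration of that generate-then-filter loop only; the prompt wording, the GeneratorFn/RewardFn interfaces, the Candidate class, and the keep_top threshold are illustrative assumptions, not the released implementation (see https://github.com/yale-nlp/MDCure for the actual code).

```python
"""Minimal sketch of a generate-then-filter pipeline in the spirit of MDCure.
All names, prompts, and thresholds here are illustrative assumptions."""
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    instruction: str
    answer: str
    score: float = 0.0


# Hypothetical stand-ins: plug in any instruction-generating LLM and any
# MD-specific reward model (e.g., an MDCureRM-like scorer).
GeneratorFn = Callable[[str], str]                  # prompt -> generated text
RewardFn = Callable[[List[str], str, str], float]   # (docs, instruction, answer) -> utility score


def generate_candidates(docs: List[str], generate: GeneratorFn, n: int = 4) -> List[Candidate]:
    """Prompt the generator with the full document set to elicit cross-document Q/A pairs."""
    context = "\n\n---\n\n".join(docs)
    prompt = (
        "Write one question that requires combining information from at least two of the "
        f"documents below, followed by its answer, in the form 'Q: ... A: ...'.\n\n{context}"
    )
    candidates: List[Candidate] = []
    for _ in range(n):
        text = generate(prompt)
        if "A:" in text:
            q, a = text.split("A:", 1)
            candidates.append(Candidate(q.replace("Q:", "").strip(), a.strip()))
    return candidates


def filter_by_reward(docs: List[str], candidates: List[Candidate],
                     reward: RewardFn, keep_top: float = 0.5) -> List[Candidate]:
    """Score each candidate with the reward model and keep the highest-utility fraction."""
    for c in candidates:
        c.score = reward(docs, c.instruction, c.answer)
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_top))]
```

In practice the reward stage would fold several fine-grained criteria (e.g., relevance to the documents and reliance on cross-document information) into the single utility score used for ranking, which is what lets the pipeline keep only data that is actually useful for multi-document fine-tuning.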
