What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices (2409.01893v1)
Abstract: Recent advancements in LLMs with extended context windows have significantly improved tasks such as information extraction, question answering, and complex planning. To succeed on long-context tasks, a large body of work enhances models' long-context capabilities with synthetic data, typically generated via the Self-Instruct framework for instruction tuning. However, our preliminary experiments indicate that less than 35% of generated samples are multi-hop, and more than 40% exhibit poor quality, limiting comprehensive understanding and further research. To improve the quality of synthetic data, we propose the Multi-agent Interactive Multi-hop Generation (MIMG) framework, which incorporates a Quality Verification Agent, a Single-hop Question Generation Agent, a Multiple Question Sampling Strategy, and a Multi-hop Question Merger Agent. This framework improves data quality, with the proportion of high-quality, multi-hop, and diverse data exceeding 85%. Furthermore, we systematically investigate strategies for document selection, question merging, and validation techniques through extensive experiments across various models. Our findings show that our synthetic high-quality long-context instruction data significantly enhances model performance, even surpassing models trained on larger amounts of human-annotated data. Our code is available at: https://github.com/WowCZ/LongMIT.
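The abstract only names the four MIMG components; as a rough illustration of how they might compose into a data-generation pipeline, here is a minimal Python sketch. All class names, function names, and prompt strings below are hypothetical placeholders, not the authors' implementation (the official code is at https://github.com/WowCZ/LongMIT).

```python
# Minimal sketch of a MIMG-style pipeline (illustrative only; names and prompts
# are assumptions, not taken from the LongMIT repository).
from dataclasses import dataclass
from typing import Callable, List
import random

LLM = Callable[[str], str]  # any text-in/text-out model client


@dataclass
class SingleHopQA:
    document: str
    question: str
    answer: str


def generate_single_hop(llm: LLM, document: str) -> SingleHopQA:
    """Single-hop Question Generation Agent: one question grounded in one document."""
    q = llm(f"Write one question answerable from this document only:\n{document}")
    a = llm(f"Document:\n{document}\nQuestion: {q}\nAnswer concisely:")
    return SingleHopQA(document, q, a)


def verify_quality(llm: LLM, qa: SingleHopQA) -> bool:
    """Quality Verification Agent: keep only pairs the document actually supports."""
    verdict = llm(
        "Does the document fully support this question/answer pair? Reply yes or no.\n"
        f"Document:\n{qa.document}\nQ: {qa.question}\nA: {qa.answer}"
    )
    return verdict.strip().lower().startswith("yes")


def sample_questions(pool: List[SingleHopQA], k: int = 3) -> List[SingleHopQA]:
    """Multiple Question Sampling Strategy: pick k verified single-hop questions
    (uniformly at random here; the paper compares several sampling strategies)."""
    return random.sample(pool, min(k, len(pool)))


def merge_multi_hop(llm: LLM, group: List[SingleHopQA]) -> str:
    """Multi-hop Question Merger Agent: compose the sampled single-hop questions
    into one multi-hop question spanning their source documents."""
    listing = "\n".join(f"- Q: {qa.question} | A: {qa.answer}" for qa in group)
    return llm(
        "Merge the following single-hop questions into a single multi-hop question "
        f"whose answer requires all of them:\n{listing}"
    )


def build_multi_hop_sample(llm: LLM, documents: List[str]) -> str:
    """End-to-end: generate, verify, sample, and merge."""
    pool = [qa for qa in (generate_single_hop(llm, d) for d in documents)
            if verify_quality(llm, qa)]
    return merge_multi_hop(llm, sample_questions(pool))
```

With a real model client plugged in for `llm`, `build_multi_hop_sample` would return one merged multi-hop question over the supplied documents; the paper's full pipeline additionally studies document selection and answer validation strategies, which this sketch omits.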
- GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Anthropic. Model card and evaluations for Claude models. 2023. URL https://www-cdn.anthropic.com/bd2a28d2535bfb0494cc8e2a3bf135d2e7523226/Model-Card-Claude-2.pdf.
- LongAlign: A recipe for long context alignment of large language models, 2024a. URL https://arxiv.org/abs/2401.18058.
- LongBench: A bilingual, multitask benchmark for long context understanding, 2024b. URL https://arxiv.org/abs/2308.14508.
- InternLM2 technical report. arXiv preprint arXiv:2403.17297, 2024.
- M3CoT: A novel benchmark for multi-domain multi-step multi-modal chain-of-thought. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 8199–8221, Bangkok, Thailand, August 2024a. Association for Computational Linguistics. URL https://aclanthology.org/2024.acl-long.446.
- LongLoRA: Efficient fine-tuning of long-context large language models. In The Twelfth International Conference on Learning Representations, 2024b. URL https://openreview.net/forum?id=6PmJoRfdaK.
- Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=mZn2Xyh9Ec.
- FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems, volume 35, pp. 16344–16359. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/67d57c32e20fd0a7a302cb81d36e40d5-Paper-Conference.pdf.
- LongRoPE: Extending LLM context window beyond 2 million tokens, 2024. URL https://arxiv.org/abs/2402.13753.
- The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- GPTScore: Evaluate as you desire. In Kevin Duh, Helena Gomez, and Steven Bethard (eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 6556–6576, Mexico City, Mexico, June 2024a. Association for Computational Linguistics. doi: 10.18653/v1/2024.naacl-long.365. URL https://aclanthology.org/2024.naacl-long.365.
- Data engineering for scaling language models to 128k context. In Proc. of ICML, 2024b. URL https://openreview.net/forum?id=TaAqeo7lUh.
- Quest: Query-centric data synthesis approach for long-context scaling of large language model. arXiv preprint arXiv:2405.19846, 2024.
- Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Donia Scott, Nuria Bel, and Chengqing Zong (eds.), Proc. of COLING, pp. 6609–6625, December 2020. doi: 10.18653/v1/2020.coling-main.580. URL https://aclanthology.org/2020.coling-main.580.
- Online learning for latent Dirichlet allocation. Advances in Neural Information Processing Systems, 23, 2010.
- RULER: What’s the real context size of your long-context language models?, 2024. URL https://arxiv.org/abs/2404.06654.
- Tree-Planner: Efficient close-loop task planning with large language models, 2023. URL https://arxiv.org/abs/2310.08582.
- HiAgent: Hierarchical working memory management for solving long-horizon agent tasks with large language model. arXiv preprint arXiv:2408.09559, 2024.
- Jerry Huang. How well can a long sequence model model long sequences? Comparing architectural inductive biases on long-context abilities. arXiv preprint arXiv:2407.08112, 2024.
- LLM maybe LongLM: Self-extend LLM context window without tuning, 2024. URL https://arxiv.org/abs/2401.01325.
- The CoT collection: Improving zero-shot and few-shot learning of language models via chain-of-thought fine-tuning. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 12685–12708, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.782. URL https://aclanthology.org/2023.emnlp-main.782.
- Natural Questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019.
- Making long-context language models better multi-hop reasoners. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proc. of ACL, pp. 2462–2475, August 2024. URL https://aclanthology.org/2024.acl-long.135.
- Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024a.
- ChatQA: Building GPT-4 level conversational QA models. arXiv preprint arXiv:2401.10225, 2024b.
- YaRN: Efficient context window extension of large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=wHBfxhZu1u.
- Cross-lingual prompting: Improving zero-shot chain-of-thought reasoning across languages. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 2695–2709, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.163. URL https://aclanthology.org/2023.emnlp-main.163.
- Large language models meet NLP: A survey. arXiv preprint arXiv:2405.12819, 2024.
- Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
- The probabilistic relevance framework: BM25 and beyond. Foundations and Trends® in Information Retrieval, 3(4):333–389, 2009.
- Okapi at TREC-3. NIST Special Publication SP, 109:109, 1995.
- In-context pretraining: Language modeling beyond document boundaries. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=LXVswInHOo.
- MuSiQue: Multihop Questions via Single-hop Question Composition. Transactions of the Association for Computational Linguistics, 10:539–554, 05 2022. ISSN 2307-387X. doi: 10.1162/tacl_a_00475. URL https://doi.org/10.1162/tacl_a_00475.
- Is ChatGPT a good NLG evaluator? A preliminary study. In Yue Dong, Wen Xiao, Lu Wang, Fei Liu, and Giuseppe Carenini (eds.), Proceedings of the 4th New Frontiers in Summarization Workshop, pp. 1–11, Singapore, December 2023a. Association for Computational Linguistics. doi: 10.18653/v1/2023.newsum-1.1. URL https://aclanthology.org/2023.newsum-1.1.
- Self-instruct: Aligning language models with self-generated instructions. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 13484–13508, Toronto, Canada, July 2023b. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.754. URL https://aclanthology.org/2023.acl-long.754.
- Chain-of-thought prompting elicits reasoning in large language models. In Proc. of NeurIPS, 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf.
- C-Pack: Packaged resources to advance general Chinese embedding, 2023.
- Effective long-context scaling of foundation models. In Proc. of NAACL, June 2024. URL https://arxiv.org/abs/2309.16039.
- ChatQA 2: Bridging the gap to proprietary LLMs in long context and RAG capabilities. arXiv preprint arXiv:2407.14482, 2024a.
- Concise and precise context compression for tool-using language models. arXiv preprint arXiv:2407.02043, 2024b.
- Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024.
- HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii (eds.), Proc. of EMNLP, pp. 2369–2380, October-November 2018. doi: 10.18653/v1/D18-1259. URL https://aclanthology.org/D18-1259.
- Extending LLMs’ context window with 100 samples, 2024. URL https://arxiv.org/abs/2401.07004.
Authors: Zhi Chen, Qiguang Chen, Libo Qin, Qipeng Guo, Haijun Lv, Yicheng Zou, Wanxiang Che, Hang Yan, Kai Chen, Dahua Lin