Marathon: A Race Through the Realm of Long Context with Large Language Models
Abstract: With the advancement of LLMs and the expansion of their context windows, existing long-context benchmarks fall short in effectively evaluating the models' comprehension and reasoning abilities in extended texts. Moreover, conventional benchmarks relying on F1 metrics often inaccurately score responses: they may undervalue correct answers that differ from the reference responses and overvalue incorrect ones that resemble the reference texts. In response to these limitations, we introduce Marathon, a novel evaluation benchmark that adopts a multiple-choice question format. It is specifically designed to overcome the constraints of previous benchmarks and provide a rapid, precise, and unbiased appraisal of the long-context comprehension skills of LLMs. We conducted comprehensive evaluations on the Marathon benchmark with a range of state-of-the-art LLMs and assessed the effectiveness of various optimization strategies tailored for long-context generation. We anticipate that the Marathon benchmark and its associated leaderboard will enable a more precise and equitable evaluation of LLMs' capabilities in understanding and reasoning over extended contexts. Marathon is available at https://github.com/Hambaobao/Marathon.
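The F1-scoring pitfall described above can be illustrated with a small sketch. The snippet below is not from the Marathon paper; it is a minimal, SQuAD-style token-overlap F1 (a common choice in QA benchmarks) applied to a hypothetical reference answer, showing how a correct but differently-worded answer can score far below an incorrect answer that happens to share surface tokens with the reference:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """SQuAD-style token-level F1 between a prediction and a reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

reference = "the treaty was signed in 1920"
correct_paraphrase = "1920"                         # right answer, tersely worded
wrong_lookalike = "the treaty was signed in 1918"   # wrong answer, similar wording

print(round(token_f1(correct_paraphrase, reference), 3))  # 0.286
print(round(token_f1(wrong_lookalike, reference), 3))     # 0.833
```

Under this metric the wrong answer outscores the right one, which is exactly the bias a multiple-choice format sidesteps: an answer is either the designated option or it is not.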