Marathon: A Race Through the Realm of Long Context with Large Language Models

Published 15 Dec 2023 in cs.CL (arXiv:2312.09542v2)

Abstract: With the advancement of LLMs and the expansion of their context windows, existing long-context benchmarks fall short in effectively evaluating the models' comprehension and reasoning abilities in extended texts. Moreover, conventional benchmarks relying on F1 metrics often inaccurately score responses: they may undervalue correct answers that differ from the reference responses and overvalue incorrect ones that resemble the reference texts. In response to these limitations, we introduce Marathon, a novel evaluation benchmark that adopts a multiple-choice question format. It is specifically designed to overcome the constraints of previous benchmarks and provide a rapid, precise, and unbiased appraisal of the long-context comprehension skills of LLMs. We conducted comprehensive evaluations on the Marathon benchmark with a range of state-of-the-art LLMs and assessed the effectiveness of various optimization strategies tailored for long-context generation. We anticipate that the Marathon benchmark and its associated leaderboard will enable a more precise and equitable evaluation of LLMs' capabilities in understanding and reasoning over extended contexts. Marathon is available at https://github.com/Hambaobao/Marathon.
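The F1 failure mode the abstract describes can be made concrete with a short sketch (a hypothetical illustration, not code from the paper): token-level F1, as used by SQuAD-style QA benchmarks, scores only lexical overlap with the reference, so a correct paraphrase can score low while a wrong answer that copies most of the reference scores high.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# A correct paraphrase is undervalued (only "may" overlaps):
low = token_f1("It happened on the fifth of May", "May 5th")

# An incorrect near-copy is overvalued (wrong year, high overlap):
high = token_f1("the treaty was signed in 1915",
                "the treaty was signed in 1815")
```

Here `low` comes out around 0.22 and `high` around 0.83, even though only the first prediction is correct. A multiple-choice format sidesteps this by reducing scoring to exact-match accuracy over option labels.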
