LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs (2409.02076v7)

Published 3 Sep 2024 in cs.CL

Abstract: Current benchmarks like Needle-in-a-Haystack (NIAH), Ruler, and Needlebench focus on models' ability to understand long-context input sequences but fail to capture a critical dimension: the generation of high-quality long-form text. Applications such as design proposals, technical documentation, and creative writing rely on coherent, instruction-following outputs over extended sequences, a challenge that existing benchmarks do not adequately address. To fill this gap, we introduce LongGenBench, a novel benchmark designed to rigorously evaluate the ability of LLMs to generate long text while adhering to complex instructions. Through tasks requiring specific events or constraints within the generated text, LongGenBench evaluates model performance across four distinct scenarios, three instruction types, and two generation lengths (16K and 32K tokens). Our evaluation of ten state-of-the-art LLMs reveals that, despite strong results on Ruler, all models struggled with long text generation on LongGenBench, particularly as text length increased. This suggests that current LLMs are not yet equipped to meet the demands of real-world, long-form text generation.
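The constraint-checking idea in the abstract — verifying that specific required events appear in the right parts of a long generated text — can be sketched in a few lines. This is an illustrative sketch only, not the paper's actual evaluation code: the "Week N:" section pattern, the label/keyword format, and the simple hit-rate scoring are all assumptions made for the example.

```python
import re

def score_long_generation(text, required_events):
    """Score a long generated text against per-section constraints.

    `text` is split into sections by headings like "Week 3:"; each entry
    in `required_events` maps a section label to a keyword that must
    appear within that section. Returns the fraction of constraints
    satisfied (a crude instruction-following completion rate).
    """
    # Split on headings; the capturing group keeps the labels in the result.
    parts = re.split(r"^(Week \d+):", text, flags=re.MULTILINE)
    sections = {}
    for i in range(1, len(parts) - 1, 2):
        sections[parts[i]] = parts[i + 1]
    # Count constraints whose keyword appears in the matching section.
    hits = sum(
        1
        for label, keyword in required_events.items()
        if keyword.lower() in sections.get(label, "").lower()
    )
    return hits / len(required_events)

# Usage: a model asked to write a diary covering many weeks, with one
# required event per week, is scored on how many events it actually hit.
diary = "Week 1:\nWent hiking in the hills.\nWeek 2:\nAttended the conference.\n"
print(score_long_generation(diary, {"Week 1": "hiking", "Week 2": "conference"}))
```

A checker of this shape makes the benchmark's finding easy to operationalize: as the target length grows, models increasingly drop or misplace required events, so the completion rate falls even when each individual section reads fluently.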

References (66)
  1. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  2. L-eval: Instituting standardized evaluation for long context language models. In ICLR, 2024a.
  3. Make your llm fully utilize the context. arXiv preprint arXiv:2404.16811, 2024b.
  4. Anthropic. Introducing claude 2.1, 2024a. URL https://www.anthropic.com/index/claude-2-1. Accessed: 2024-01-23.
  5. Anthropic. Introducing the next generation of claude, 2024b. URL https://www.anthropic.com/news/claude-3-family. Accessed: 2024-03-27.
  6. Longbench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508, 2023a.
  7. Longwriter: Unleashing 10,000+ word generation from long context llms. arXiv preprint arXiv:2408.07055, 2024.
  8. Yushi Bai et al. LongBench: A bilingual, multitask benchmark for long context understanding. arXiv:2308.14508, 2023b.
  9. Introducing GoodAI LTM benchmark. Blog, 2024. URL https://www.goodai.com/introducing-goodai-ltm-benchmark/.
  10. Chateval: Towards better llm-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201, 2023.
  11. Can large language models be an alternative to human evaluations? In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pp.  15607–15631. Association for Computational Linguistics, 2023. doi: 10.18653/V1/2023.ACL-LONG.870. URL https://doi.org/10.18653/v1/2023.acl-long.870.
  12. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
  13. A dataset of information-seeking questions and answers anchored in research papers. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tür, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou (eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, pp.  4599–4610. Association for Computational Linguistics, 2021. doi: 10.18653/V1/2021.NAACL-MAIN.365. URL https://doi.org/10.18653/v1/2021.naacl-main.365.
  14. Bamboo: A comprehensive benchmark for evaluating long text modeling capacities of large language models. arXiv:2309.13345, 2023.
  15. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
  16. Hierarchical neural story generation. In Iryna Gurevych and Yusuke Miyao (eds.), Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  889–898, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-1082. URL https://aclanthology.org/P18-1082.
  17. ELI5: long form question answering. In Anna Korhonen, David R. Traum, and Lluís Màrquez (eds.), Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pp.  3558–3567. Association for Computational Linguistics, 2019a. doi: 10.18653/V1/P19-1346. URL https://doi.org/10.18653/v1/p19-1346.
  18. Strategies for structuring story generation. In Anna Korhonen, David R. Traum, and Lluís Màrquez (eds.), Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pp.  2650–2660. Association for Computational Linguistics, 2019b. doi: 10.18653/V1/P19-1254. URL https://doi.org/10.18653/v1/p19-1254.
  19. Non-expert evaluation of summarization systems is risky. In Chris Callison-Burch and Mark Dredze (eds.), Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, pp.  148–151, Los Angeles, June 2010. Association for Computational Linguistics. URL https://aclanthology.org/W10-0722.
  20. Ruler: What’s the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024.
  21. PLANET: dynamic content planning in autoregressive transformers for long-form text generation. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pp.  2288–2305. Association for Computational Linguistics, 2022. doi: 10.18653/V1/2022.ACL-LONG.163. URL https://doi.org/10.18653/v1/2022.acl-long.163.
  22. Xinyu Hua and Lu Wang. PAIR: planning and iterative refinement in pre-trained transformers for long text generation. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pp.  781–793. Association for Computational Linguistics, 2020. doi: 10.18653/V1/2020.EMNLP-MAIN.57. URL https://doi.org/10.18653/v1/2020.emnlp-main.57.
  23. Best practices for crowd-based evaluation of German summarization: Comparing crowd, expert and automatic evaluation. In Steffen Eger, Yang Gao, Maxime Peyrard, Wei Zhao, and Eduard Hovy (eds.), Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems, pp.  164–175, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.eval4nlp-1.16. URL https://aclanthology.org/2020.eval4nlp-1.16.
  24. Albert Q Jiang et al. Mixtral of experts. arXiv:2401.04088, 2024.
  25. Gregory Kamradt. Needle In A Haystack - pressure testing LLMs. Github, 2023. URL https://github.com/gkamradt/LLMTest_NeedleInAHaystack/tree/main.
  26. Plan ahead: Self-supervised text planning for paragraph completion task. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pp.  6533–6543. Association for Computational Linguistics, 2020. doi: 10.18653/V1/2020.EMNLP-MAIN.529. URL https://doi.org/10.18653/v1/2020.emnlp-main.529.
  27. Longlamp: A benchmark for personalized long-form text generation. arXiv preprint arXiv:2407.11016, 2024.
  28. Woosuk Kwon et al. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th ACM SIGOPS Symposium on Operating Systems Principles, 2023.
  29. QASA: advanced question answering on scientific articles. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pp.  19036–19052. PMLR, 2023. URL https://proceedings.mlr.press/v202/lee23n.html.
  30. How long can open-source LLMs truly promise on context length?, 2023a. URL https://lmsys.org/blog/2023-06-29-longchat.
  31. Loogle: Can long-context language models understand long contexts? arXiv:2311.04939, 2023b.
  32. Needlebench: Can llms do retrieval and reasoning in 1 million context window? arXiv preprint arXiv:2407.11963, 2024.
  33. Prd: Peer rank and discussion improve large language model based evaluations. arXiv preprint arXiv:2307.02762, 2023c.
  34. Lost in the middle: How language models use long contexts. Transactions of the ACL, 12:157–173, 2024a.
  35. G-eval: NLG evaluation using gpt-4 with better human alignment. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pp.  2511–2522. Association for Computational Linguistics, 2023. URL https://aclanthology.org/2023.emnlp-main.153.
  36. Aligning with human judgement: The role of pairwise preference in large language model evaluators. arXiv preprint arXiv:2403.16950, 2024b.
  37. Mistral.AI. La plateforme, 2023. URL https://mistral.ai/news/la-plateforme/.
  38. Landmark attention: Random-access infinite context length for Transformers. In Workshop on Efficient Systems for Foundation Models @ ICML, 2023.
  39. OpenAI. ChatGPT, 2022. URL https://chat.openai.com.
  40. OpenAI. GPT-4 Technical Report. CoRR, abs/2303.08774, 2023. doi: 10.48550/arXiv.2303.08774. URL https://doi.org/10.48550/arXiv.2303.08774.
  41. OpenAI. Gpt-4o mini: Advancing cost-efficient intelligence, 2024a. URL https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/. Accessed: 2024-08-31.
  42. OpenAI. Hello gpt-4o, 2024b. URL https://openai.com/index/hello-gpt-4o/. Accessed: 2024-08-31.
  43. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.
  44. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp.  311–318, 2002.
  45. Large language models sensitivity to the order of options in multiple-choice questions. arXiv preprint arXiv:2308.11483, 2023.
  46. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024.
  47. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
  48. ZeroSCROLLS: A zero-shot benchmark for long text understanding. In EMNLP, 2023.
  49. Large language models are not yet human-level evaluators for abstractive summarization. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, pp.  4215–4233. Association for Computational Linguistics, 2023. URL https://aclanthology.org/2023.findings-emnlp.278.
  50. ASQA: factoid questions meet long-form answers. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pp.  8273–8288. Association for Computational Linguistics, 2022. doi: 10.18653/V1/2022.EMNLP-MAIN.566. URL https://doi.org/10.18653/v1/2022.emnlp-main.566.
  51. ChapterBreak: A challenge dataset for long-range language models. In Proc. of the 2022 Conference of the North American Chapter of the ACL: Human Language Technologies, 2022.
  52. Proxyqa: An alternative framework for evaluating long-form text generation with large language models. arXiv preprint arXiv:2401.15042, 2024.
  53. A benchmark for learning to translate a new language from one grammar book. In ICLR, 2024.
  54. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
  55. Yi Tay et al. Long Range Arena: A benchmark for efficient Transformers. In ICLR, 2021.
  56. Large language models are not fair evaluators. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  9440–9450, Bangkok, Thailand, August 2024. Association for Computational Linguistics. URL https://aclanthology.org/2024.acl-long.511.
  57. Thomas Wolf et al. Huggingface’s Transformers: State-of-the-art natural language processing. arXiv:1910.03771, 2019.
  58. SentiStream: A co-training framework for adaptive online sentiment analysis in evolving data streams. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.  6198–6212, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.380. URL https://aclanthology.org/2023.emnlp-main.380.
  59. Effective long-context scaling of foundation models. arXiv preprint arXiv:2309.16039, 2023. URL https://arxiv.org/abs/2309.16039.
  60. Beyond goldfish memory: Long-term open-domain conversation. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pp.  5180–5197. Association for Computational Linguistics, 2022. doi: 10.18653/V1/2022.ACL-LONG.356. URL https://doi.org/10.18653/v1/2022.acl-long.356.
  61. MEGATRON-CNTRL: controllable story generation with external knowledge using large-scale language models. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pp.  2831–2845. Association for Computational Linguistics, 2020. doi: 10.18653/V1/2020.EMNLP-MAIN.226. URL https://doi.org/10.18653/v1/2020.emnlp-main.226.
  62. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024.
  63. Wider and deeper llm networks are fairer llm evaluators. arXiv preprint arXiv:2308.01862, 2023.
  64. ∞Bench: Extending long context evaluation beyond 100k tokens. arXiv:2402.13718, 2024.
  65. Judging LLM-as-a-judge with MT-bench and chatbot arena. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=uccHPGDlao.
  66. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36, 2024.
