
CLongEval: A Chinese Benchmark for Evaluating Long-Context Large Language Models (2403.03514v2)

Published 6 Mar 2024 in cs.CL

Abstract: Developing LLMs with robust long-context capabilities has been a recent research focus, resulting in the emergence of long-context LLMs proficient in Chinese. However, the evaluation of these models remains underdeveloped due to a lack of benchmarks. To address this gap, we present CLongEval, a comprehensive Chinese benchmark for evaluating long-context LLMs. CLongEval is characterized by three key features: (1) Sufficient data volume, comprising 7 distinct tasks and 7,267 examples; (2) Broad applicability, accommodating models with context window sizes from 1K to 100K; (3) High quality, with over 2,000 manually annotated question-answer pairs in addition to the automatically constructed labels. With CLongEval, we undertake a comprehensive assessment of 6 open-source long-context LLMs and 2 leading commercial counterparts that feature both long-context abilities and proficiency in Chinese. We also provide an in-depth analysis based on the empirical results, aiming to shed light on the critical capabilities that pose challenges in long-context settings. The dataset, evaluation scripts, and model outputs are released.
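
The abstract describes the benchmark only at a high level. Below is a minimal sketch of how one of its QA-style tasks could be scored; the file name, the JSON fields (`context`, `question`, `answer`), and the `generate(prompt)` callable are illustrative assumptions, not the paper's released evaluation scripts.

```python
# Minimal evaluation sketch for a CLongEval-style Chinese QA task.
# Assumptions (not from the released scripts): a JSONL file with
# "context"/"question"/"answer" fields and a model callable `generate(prompt)`.
import json
import re
from collections import Counter


def char_f1(prediction: str, reference: str) -> float:
    """Character-level F1, a common metric for Chinese extractive QA."""
    pred_chars = Counter(re.sub(r"\s+", "", prediction))
    ref_chars = Counter(re.sub(r"\s+", "", reference))
    overlap = sum((pred_chars & ref_chars).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_chars.values())
    recall = overlap / sum(ref_chars.values())
    return 2 * precision * recall / (precision + recall)


def evaluate(generate, path="clongeval_qa.jsonl", max_chars=100_000):
    """Run a model callable over one task file and return mean character F1.

    `generate(prompt)` is any function returning the model's answer string.
    Contexts longer than `max_chars` are truncated from the middle so that
    both the beginning and the end of the document are preserved.
    """
    scores = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            example = json.loads(line)
            context = example["context"]
            if len(context) > max_chars:
                half = max_chars // 2
                context = context[:half] + context[-half:]
            prompt = f"{context}\n\n问题：{example['question']}\n答案："
            scores.append(char_f1(generate(prompt), example["answer"]))
    return sum(scores) / len(scores) if scores else 0.0
```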

Authors (6)
  1. Zexuan Qiu (8 papers)
  2. Jingjing Li (98 papers)
  3. Shijue Huang (14 papers)
  4. Wanjun Zhong (49 papers)
  5. Irwin King (170 papers)
  6. Xiaoqi Jiao (8 papers)
Citations (1)