CLongEval: A Chinese Benchmark for Evaluating Long-Context Large Language Models (2403.03514v2)
Abstract: Developing LLMs with robust long-context capabilities has been a recent research focus, resulting in the emergence of long-context LLMs proficient in Chinese. However, the evaluation of these models remains underdeveloped due to a lack of benchmarks. To address this gap, we present CLongEval, a comprehensive Chinese benchmark for evaluating long-context LLMs. CLongEval is characterized by three key features: (1) Sufficient data volume, comprising 7 distinct tasks and 7,267 examples; (2) Broad applicability, accommodating models with context window sizes from 1K to 100K; (3) High quality, with over 2,000 manually annotated question-answer pairs in addition to the automatically constructed labels. With CLongEval, we undertake a comprehensive assessment of 6 open-source long-context LLMs and 2 leading commercial counterparts that feature both long-context abilities and proficiency in Chinese. We also provide an in-depth analysis of the empirical results, aiming to shed light on the critical capabilities that remain challenging in long-context settings. The dataset, evaluation scripts, and model outputs are released.
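The released evaluation scripts are the authoritative way to score models on CLongEval; the sketch below is only a rough illustration of how a benchmark of this shape is typically consumed. It assumes a hypothetical JSONL task file with fields `context`, `question`, and `answer` (these names are assumptions, not the paper's schema), takes a user-supplied `generate(prompt) -> str` callable for the model under test, and scores predictions with a character-level F1, a common but here merely illustrative choice for Chinese QA.

```python
import json
from collections import Counter
from typing import Callable


def char_f1(prediction: str, reference: str) -> float:
    """Bag-of-characters F1, an illustrative metric for Chinese QA."""
    overlap = sum((Counter(prediction) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(prediction)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def evaluate_task(jsonl_path: str, generate: Callable[[str], str]) -> float:
    """Run a model over one task file and return the average char-level F1.

    Assumed (hypothetical) per-line JSON fields: "context", "question", "answer".
    """
    scores = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            example = json.loads(line)
            # Long context goes first, followed by the question, mirroring the
            # usual long-context QA prompt layout; the exact template is an assumption.
            prompt = f"{example['context']}\n\n问题：{example['question']}\n回答："
            prediction = generate(prompt)
            scores.append(char_f1(prediction, example["answer"]))
    return sum(scores) / len(scores) if scores else 0.0
```

In practice one would plug in an inference backend (e.g., an API client or a local model wrapper) as `generate`, and loop `evaluate_task` over the benchmark's task files, reporting per-task averages; the official scripts should be consulted for the actual prompts and metrics.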
Authors: Zexuan Qiu, Jingjing Li, Shijue Huang, Wanjun Zhong, Irwin King, Xiaoqi Jiao