ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models (2403.20262v2)
Abstract: Research on LLMs has recently witnessed an increasing interest in extending models' context size to better capture dependencies within long documents. While benchmarks have been proposed to assess long-range abilities, existing efforts primarily considered generic tasks that are not necessarily aligned with real-world applications. In contrast, our work proposes a new benchmark for long-context LLMs focused on a practical meeting assistant scenario. In this scenario, the long contexts consist of transcripts obtained by automatic speech recognition, presenting unique challenges for LLMs due to the inherent noisiness and oral nature of such data. Our benchmark, named ELITR-Bench, augments the existing ELITR corpus' transcripts with 271 manually crafted questions and their ground-truth answers. Our experiments with recent long-context LLMs on ELITR-Bench highlight a gap between open-source and proprietary models, especially when questions are asked sequentially within a conversation. We also provide a thorough analysis of our GPT-4-based evaluation method, encompassing insights from a crowdsourcing study. Our findings suggest that while GPT-4's evaluation scores are correlated with human judges', its ability to differentiate among more than three score levels may be limited.
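The GPT-4-based evaluation mentioned in the abstract can be pictured with a short sketch. The snippet below is a minimal illustration of an LLM-as-a-judge scoring loop and of checking its agreement with human judges; the prompt wording, the 0-10 scale, the `gpt-4` model name, and the use of the `openai` and `scipy` libraries are assumptions made for this example, not the authors' exact pipeline.

```python
# Minimal sketch of an LLM-as-a-judge evaluation loop in the spirit of the
# GPT-4-based scoring described above. Prompt wording, the 0-10 scale, and
# the client usage are illustrative assumptions, not the paper's exact setup.
import re
from openai import OpenAI
from scipy.stats import spearmanr

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "You are grading a meeting-assistant answer.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "Rate the candidate from 0 (wrong) to 10 (fully correct). "
    "Reply with the number only."
)

def judge_score(question: str, reference: str, candidate: str) -> int:
    """Ask the judge model for a score and parse the first integer it returns."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, reference=reference, candidate=candidate
            ),
        }],
        temperature=0,
    )
    match = re.search(r"\d+", resp.choices[0].message.content)
    return int(match.group()) if match else 0

def agreement(judge_scores: list[int], human_scores: list[int]) -> float:
    """Rank correlation between automatic and human scores, one simple way to
    check whether the judge model tracks human judgments."""
    rho, _ = spearmanr(judge_scores, human_scores)
    return rho
```

In a setup like this, each of the benchmark's questions would be paired with its ground-truth answer and a candidate answer produced by the evaluated model from the meeting transcript; the abstract's finding that GPT-4 struggles to differentiate among more than three score levels suggests that coarse-grained scales (or bucketed scores) may be the more reliable way to read such judge outputs.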
Authors: Thibaut Thonet, Jos Rozen, Laurent Besacier