Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models (2402.14848v2)
Abstract: This paper explores the impact of extending input lengths on the capabilities of Large Language Models (LLMs). Despite recent advancements in LLMs, the consistency of their performance across different input lengths is not well understood. We investigate this aspect by introducing a novel QA reasoning framework specifically designed to assess the impact of input length. We isolate the effect of input length using multiple versions of the same sample, each extended with padding of different lengths, types, and locations. Our findings show a notable degradation in LLMs' reasoning performance at input lengths much shorter than their technical maximum. We show that this degradation trend appears in every version of our dataset, although at different intensities. Additionally, our study reveals that the traditional metric of next-word prediction correlates negatively with LLMs' performance on our reasoning dataset. We analyse our results and identify failure modes that can serve as useful guides for future research, potentially informing strategies to address the limitations observed in LLMs.
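To make the length-isolation idea in the abstract concrete, here is a minimal Python sketch of how one might generate multiple padded versions of the same QA sample, varying the padding's length and location. The function name, parameters, and distractor text are illustrative assumptions, not the paper's actual implementation.

```python
import random

def build_variant(relevant_text: str, padding_units: list[str],
                  target_words: int, location: str = "suffix") -> str:
    """Pad `relevant_text` with irrelevant text until the full input
    reaches roughly `target_words` words, placing the padding according
    to `location` ("prefix", "suffix", or "interleaved").
    NOTE: a hypothetical sketch; names and logic are assumptions."""
    padding, count = [], len(relevant_text.split())
    while count < target_words and padding_units:
        unit = random.choice(padding_units)  # sample distractors with replacement
        padding.append(unit)
        count += len(unit.split())
    pad_text = " ".join(padding)
    if location == "prefix":
        return f"{pad_text}\n{relevant_text}"
    if location == "suffix":
        return f"{relevant_text}\n{pad_text}"
    # "interleaved": split the padding around the relevant span
    half = len(padding) // 2
    return "\n".join([" ".join(padding[:half]),
                      relevant_text,
                      " ".join(padding[half:])])

# Example: length-controlled variants of one sample, padding appended.
sample = "Fact A. Fact B. Question: does A imply B?"
distractors = ["Unrelated sentence about the weather.",
               "Trivia about rivers and lakes."]
for n in (250, 500, 1000):
    variant = build_variant(sample, distractors, n, location="suffix")
    print(n, len(variant.split()))  # verify each variant hits its target length
```

Because the relevant facts and question are held fixed while only the padding changes, any accuracy difference across the variants can be attributed to input length (and padding type/location) rather than to task difficulty.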