
Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models (2402.14848v2)

Published 19 Feb 2024 in cs.CL and cs.AI

Abstract: This paper explores the impact of extending input lengths on the capabilities of LLMs. Despite LLMs' advancements in recent times, their performance consistency across different input lengths is not well understood. We investigate this aspect by introducing a novel QA reasoning framework, specifically designed to assess the impact of input length. We isolate the effect of input length using multiple versions of the same sample, each being extended with padding of different lengths, types and locations. Our findings show a notable degradation in LLMs' reasoning performance at much shorter input lengths than their technical maximum. We show that the degradation trend appears in every version of our dataset, although at different intensities. Additionally, our study reveals that the traditional metric of next word prediction correlates negatively with the performance of LLMs on our reasoning dataset. We analyse our results and identify failure modes that can serve as useful guides for future research, potentially informing strategies to address the limitations observed in LLMs.


Summary

  • The paper demonstrates that increasing input length significantly degrades LLM reasoning performance well before inputs approach the models' maximum context length.
  • It employs the FLenQA dataset with controlled padding variations to isolate the effect of input length from content relevance.
  • Findings reveal that traditional perplexity metrics misalign with reasoning outcomes, with chain-of-thought prompting offering variable mitigation.

Impact of Input Length on LLMs' Reasoning Performance

Introduction

Recent work has left an inconsistent picture of how large language models (LLMs) perform on tasks with variable input lengths. To address this gap, the paper introduces a novel QA reasoning framework, the Flexible LENgth Question Answering dataset (FLenQA), designed to examine LLMs' reasoning capabilities across different input lengths. Using this dataset, the paper rigorously explores how padding of different lengths, types, and locations within the same sample affects LLM reasoning performance. The findings reveal a significant impact of input length: reasoning performance degrades at input lengths far shorter than the models' technical maximums, and traditional perplexity metrics do not align with reasoning performance on long-input tasks.

Data and Methodology

FLenQA encompasses three reasoning tasks, each designed to meet specific requirements: ensuring models reason over the input, isolating the length factor, and maintaining natural-looking inputs. The dataset contains 100 base instances per task, with variations created by embedding key paragraphs within longer, irrelevant texts. The resulting dataset allows for the controlled examination of model performance as a function of input length, while keeping other factors constant.

The tasks are designed to be straightforward yet challenging enough to require careful reasoning. They include Monotone Relations (MonoRel), People In Rooms (PIR), and a simplified version of Ruletaker. Each base instance is extended to various input lengths, with background text acting as padding to isolate the effect of length from other variable factors.
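
To make the construction concrete, the following is a minimal sketch of how a base instance might be extended with padding of a chosen length and placement. The function name `pad_instance` and its arguments are illustrative assumptions, not the authors' released code.

```python
import random

def pad_instance(key_paragraphs, padding_paragraphs, target_tokens, location="random"):
    """Extend a base instance to roughly target_tokens words by adding padding
    around the key paragraphs. location controls where the key facts end up:
    'start', 'middle', 'end', or 'random' (dispersed)."""
    padding = []
    def total():
        return sum(len(p.split()) for p in key_paragraphs + padding)
    # Keep adding background paragraphs until the target length is reached.
    for p in padding_paragraphs:
        if total() >= target_tokens:
            break
        padding.append(p)

    if location == "start":
        parts = list(key_paragraphs) + padding
    elif location == "end":
        parts = padding + list(key_paragraphs)
    elif location == "middle":
        half = len(padding) // 2
        parts = padding[:half] + list(key_paragraphs) + padding[half:]
    else:  # 'random': scatter the key paragraphs among the padding
        parts = padding[:]
        for para in key_paragraphs:
            parts.insert(random.randrange(len(parts) + 1), para)
    return "\n\n".join(parts)
```

Generating several target lengths and placements from the same base instance is what lets length be varied while the reasoning problem itself stays fixed.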

Results

Impact of Length and Location

The results reveal a marked drop in reasoning performance as input length increases, and the degradation persists regardless of the padding's nature: it appears even when the padding duplicates the relevant key paragraphs rather than introducing irrelevant material. Further analyses varied the position of the key paragraphs within the input and the type of irrelevant padding, and found significant performance drops across all conditions.
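
One way to reproduce this kind of analysis is to bucket per-instance results by input length and compute accuracy per bucket. The helper below is a hypothetical sketch of that bookkeeping, not the paper's evaluation code.

```python
from collections import defaultdict

def accuracy_by_length(results, bucket_size=500):
    """results: iterable of (input_token_count, is_correct) pairs.
    Returns accuracy per length bucket, exposing length-dependent degradation."""
    buckets = defaultdict(list)
    for n_tokens, correct in results:
        buckets[(n_tokens // bucket_size) * bucket_size].append(correct)
    return {start: sum(v) / len(v) for start, v in sorted(buckets.items())}
```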

Dissonance with Next Word Prediction Performance

An intriguing finding is the negative correlation between next-word prediction performance and reasoning accuracy on long inputs, challenging the use of perplexity as a proxy for model capability in reasoning tasks. This discrepancy underscores the distinct challenges posed by reasoning over long inputs compared to traditional next-word prediction tasks.
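
As a rough illustration of how such a mismatch can be checked (with invented numbers, not the paper's measurements), one can correlate a next-word-prediction score with reasoning accuracy across length conditions:

```python
from scipy.stats import spearmanr

# Illustrative, invented values: next-word-prediction loss improves (drops)
# as inputs get longer, while reasoning accuracy deteriorates.
lengths  = [250, 500, 1000, 2000, 3000]      # input lengths in tokens
nwp_loss = [2.41, 2.30, 2.18, 2.05, 1.97]    # lower = better prediction
accuracy = [0.92, 0.88, 0.79, 0.68, 0.61]    # reasoning accuracy

nwp_score = [-loss for loss in nwp_loss]     # higher = better prediction
rho, p = spearmanr(nwp_score, accuracy)
print(f"Spearman rho = {rho:.2f}")           # strongly negative in this toy example
```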

Chain of Thought Prompting

Investigations into the efficacy of Chain of Thought (CoT) prompting revealed it as a potent but not universally applicable technique for mitigating the performance degradation caused by longer inputs. While CoT prompting generally improved performance, its effectiveness varied across models, indicating that it does not, on its own, resolve the challenges of reasoning over long inputs.
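
A simple way to compare the two prompting regimes is to wrap each FLenQA-style instance in either a direct or a step-by-step prompt. The wording below is illustrative rather than the paper's exact prompt.

```python
def build_prompt(context, question, chain_of_thought=False):
    """Build either a direct or a chain-of-thought prompt for a True/False instance."""
    if chain_of_thought:
        instruction = ("Think step by step: quote the relevant facts, "
                       "then answer True or False.")
    else:
        instruction = "Answer True or False."
    return f"{context}\n\nQuestion: {question}\n{instruction}"
```

Running the same padded instances through both prompt variants is what allows the mitigation effect of CoT to be measured per model and per input length.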

Identified Failure Modes

The analysis identified several failure modes correlated with incorrect responses. These include the tendency to bypass the question, bias towards specific answers (notably "false"), and a failure to follow the structured approach suggested by CoT prompting. Each of these failures highlights specific weaknesses in current LLMs' processing of long inputs.
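
The snippet below sketches one crude, hypothetical way to surface such failure modes automatically over a batch of responses; the patterns are illustrative heuristics, not the paper's annotation protocol.

```python
import re

def failure_mode_stats(responses, gold_labels):
    """Aggregate rough failure-mode counts over (response, gold) pairs.
    gold_labels are booleans for the True/False answer."""
    stats = {"no_answer": 0, "answered_false": 0, "incorrect": 0, "total": len(responses)}
    for resp, gold in zip(responses, gold_labels):
        match = re.search(r"\b(true|false)\b", resp.lower())
        if match is None:
            stats["no_answer"] += 1                # response bypassed the question
            continue
        pred = match.group(1) == "true"
        stats["answered_false"] += int(not pred)   # a high rate may indicate a 'false' bias
        stats["incorrect"] += int(pred != gold)
    return stats
```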

Implications and Future Directions

The paper brings critical insights into the limitations of current LLMs in handling long input texts for reasoning tasks. By isolating the impact of input length, the research provides a clear demonstration of performance degradation that occurs significantly below models' maximum input lengths. These findings have crucial implications for both theoretical and practical applications of LLMs, suggesting a need for more nuanced evaluations and tailored strategies to improve performance over long inputs.

Given the observed limitations, future research directions may include the development of models and methodologies explicitly designed to handle longer inputs without compromising reasoning capabilities. Additionally, further exploration into failure modes could inform targeted improvements in model design and training processes, potentially enhancing the versatility and reliability of LLMs across a broader range of tasks and input lengths.
