
Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models (2402.14848v2)

Published 19 Feb 2024 in cs.CL and cs.AI

Abstract: This paper explores the impact of extending input lengths on the capabilities of LLMs. Despite LLMs' advancements in recent times, their performance consistency across different input lengths is not well understood. We investigate this aspect by introducing a novel QA reasoning framework, specifically designed to assess the impact of input length. We isolate the effect of input length using multiple versions of the same sample, each being extended with padding of different lengths, types and locations. Our findings show a notable degradation in LLMs' reasoning performance at much shorter input lengths than their technical maximum. We show that the degradation trend appears in every version of our dataset, although at different intensities. Additionally, our study reveals that the traditional metric of next word prediction correlates negatively with the performance of LLMs on our reasoning dataset. We analyse our results and identify failure modes that can serve as useful guides for future research, potentially informing strategies to address the limitations observed in LLMs.


Summary

  • The paper demonstrates that increasing input length significantly degrades LLM reasoning performance well before inputs approach the models' maximum context length.
  • It employs the FLenQA dataset with controlled padding variations to isolate the effect of input length from content relevance.
  • Findings reveal that traditional perplexity metrics misalign with reasoning outcomes, with chain-of-thought prompting offering variable mitigation.

Impact of Input Length on LLMs' Reasoning Performance

Introduction

Recent work has left an inconsistent picture of how large language models (LLMs) perform on tasks with variable input lengths. To address this gap, the paper introduces a novel QA reasoning framework, the Flexible LENgth Question Answering dataset (FLenQA), designed to examine LLMs' reasoning capabilities across different input lengths. Using this dataset, the paper rigorously explores how padding of different lengths, types, and locations within the same sample affects LLM reasoning performance. The findings reveal a significant impact of input length: reasoning performance degrades at input lengths far shorter than the models' technical maximums, and traditional perplexity metrics do not align with reasoning performance on long-input tasks.

Data and Methodology

FLenQA encompasses three reasoning tasks, each designed to meet specific requirements: ensuring models reason over the input, isolating the length factor, and maintaining natural-looking inputs. The dataset contains 100 base instances per task, with variations created by embedding key paragraphs within longer, irrelevant texts. The resulting dataset allows for the controlled examination of model performance as a function of input length, while keeping other factors constant.

The tasks are designed to be straightforward yet challenging enough to require careful reasoning. They include Monotone Relations (MonoRel), People In Rooms (PIR), and a simplified version of Ruletaker. Each base instance is extended to various input lengths, with background text acting as padding to isolate the effect of length from other variable factors.
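
To make the construction concrete, the following is a minimal sketch of how a base instance might be extended with padding of a chosen length and placement. The function name `pad_instance` and its arguments are illustrative assumptions, not the authors' released code.

```python
import random

def pad_instance(key_paragraphs, padding_paragraphs, target_tokens, location="random"):
    """Extend a base instance to roughly target_tokens words by adding padding
    around the key paragraphs. location controls where the key facts end up:
    'start', 'middle', 'end', or 'random' (dispersed)."""
    padding = []
    def total():
        return sum(len(p.split()) for p in key_paragraphs + padding)
    # Keep adding background paragraphs until the target length is reached.
    for p in padding_paragraphs:
        if total() >= target_tokens:
            break
        padding.append(p)

    if location == "start":
        parts = list(key_paragraphs) + padding
    elif location == "end":
        parts = padding + list(key_paragraphs)
    elif location == "middle":
        half = len(padding) // 2
        parts = padding[:half] + list(key_paragraphs) + padding[half:]
    else:  # 'random': scatter the key paragraphs among the padding
        parts = padding[:]
        for para in key_paragraphs:
            parts.insert(random.randrange(len(parts) + 1), para)
    return "\n\n".join(parts)
```

Generating several target lengths and placements from the same base instance is what lets length be varied while the reasoning problem itself stays fixed.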

Results

Impact of Length and Location

The results reveal a marked drop in reasoning performance as input length increases, and the degradation persists regardless of the padding's nature: it appears even when the padding duplicates the relevant key paragraphs rather than introducing irrelevant material. Further analyses varied the position of the key paragraphs within the input and the type of irrelevant padding, and found significant performance drops across all conditions.
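
One way to reproduce this kind of analysis is to bucket per-instance results by input length and compute accuracy per bucket. The helper below is a hypothetical sketch of that bookkeeping, not the paper's evaluation code.

```python
from collections import defaultdict

def accuracy_by_length(results, bucket_size=500):
    """results: iterable of (input_token_count, is_correct) pairs.
    Returns accuracy per length bucket, exposing length-dependent degradation."""
    buckets = defaultdict(list)
    for n_tokens, correct in results:
        buckets[(n_tokens // bucket_size) * bucket_size].append(correct)
    return {start: sum(v) / len(v) for start, v in sorted(buckets.items())}
```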

Dissonance with Next Word Prediction Performance

An intriguing finding is the negative correlation between next-word prediction performance and reasoning accuracy on long inputs, challenging the use of perplexity as a proxy for model capability in reasoning tasks. This discrepancy underscores the distinct challenges posed by reasoning over long inputs compared to traditional next-word prediction tasks.
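
As a rough illustration of how such a mismatch can be checked (with invented numbers, not the paper's measurements), one can correlate a next-word-prediction score with reasoning accuracy across length conditions:

```python
from scipy.stats import spearmanr

# Illustrative, invented values: next-word-prediction loss improves (drops)
# as inputs get longer, while reasoning accuracy deteriorates.
lengths  = [250, 500, 1000, 2000, 3000]      # input lengths in tokens
nwp_loss = [2.41, 2.30, 2.18, 2.05, 1.97]    # lower = better prediction
accuracy = [0.92, 0.88, 0.79, 0.68, 0.61]    # reasoning accuracy

nwp_score = [-loss for loss in nwp_loss]     # higher = better prediction
rho, p = spearmanr(nwp_score, accuracy)
print(f"Spearman rho = {rho:.2f}")           # strongly negative in this toy example
```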

Chain of Thought Prompting

Investigations into the efficacy of Chain of Thought (CoT) prompting revealed it as a potent but not universally applicable technique for mitigating the performance degradation caused by longer inputs. While CoT prompting generally improved performance, its effectiveness varied across models, indicating that it does not, on its own, resolve the challenges of reasoning over long inputs.
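
A simple way to compare the two prompting regimes is to wrap each FLenQA-style instance in either a direct or a step-by-step prompt. The wording below is illustrative rather than the paper's exact prompt.

```python
def build_prompt(context, question, chain_of_thought=False):
    """Build either a direct or a chain-of-thought prompt for a True/False instance."""
    if chain_of_thought:
        instruction = ("Think step by step: quote the relevant facts, "
                       "then answer True or False.")
    else:
        instruction = "Answer True or False."
    return f"{context}\n\nQuestion: {question}\n{instruction}"
```

Running the same padded instances through both prompt variants is what allows the mitigation effect of CoT to be measured per model and per input length.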

Identified Failure Modes

The analysis identified several failure modes correlated with incorrect responses. These include the tendency to bypass the question, bias towards specific answers (notably "false"), and a failure to follow the structured approach suggested by CoT prompting. Each of these failures highlights specific weaknesses in current LLMs' processing of long inputs.
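
The snippet below sketches one crude, hypothetical way to surface such failure modes automatically over a batch of responses; the patterns are illustrative heuristics, not the paper's annotation protocol.

```python
import re

def failure_mode_stats(responses, gold_labels):
    """Aggregate rough failure-mode counts over (response, gold) pairs.
    gold_labels are booleans for the True/False answer."""
    stats = {"no_answer": 0, "answered_false": 0, "incorrect": 0, "total": len(responses)}
    for resp, gold in zip(responses, gold_labels):
        match = re.search(r"\b(true|false)\b", resp.lower())
        if match is None:
            stats["no_answer"] += 1                # response bypassed the question
            continue
        pred = match.group(1) == "true"
        stats["answered_false"] += int(not pred)   # a high rate may indicate a 'false' bias
        stats["incorrect"] += int(pred != gold)
    return stats
```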

Implications and Future Directions

The paper brings critical insights into the limitations of current LLMs in handling long input texts for reasoning tasks. By isolating the impact of input length, the research provides a clear demonstration of performance degradation that occurs significantly below models' maximum input lengths. These findings have crucial implications for both theoretical and practical applications of LLMs, suggesting a need for more nuanced evaluations and tailored strategies to improve performance over long inputs.

Given the observed limitations, future research directions may include the development of models and methodologies explicitly designed to handle longer inputs without compromising reasoning capabilities. Additionally, further exploration into failure modes could inform targeted improvements in model design and training processes, potentially enhancing the versatility and reliability of LLMs across a broader range of tasks and input lengths.
