PERK: Long-Context Reasoning as Parameter-Efficient Test-Time Learning (2507.06415v1)

Published 8 Jul 2025 in cs.CL and cs.LG

Abstract: Long-context reasoning requires accurately identifying relevant information in extensive, noisy input contexts. Previous research shows that using test-time learning to encode context directly into model parameters can effectively enable reasoning over noisy information. However, meta-learning methods for enabling test-time learning are prohibitively memory-intensive, preventing their application to long context settings. In this work, we propose PERK (Parameter Efficient Reasoning over Knowledge), a scalable approach for learning to encode long input contexts using gradient updates to a lightweight model adapter at test time. Specifically, PERK employs two nested optimization loops in a meta-training phase. The inner loop rapidly encodes contexts into a low-rank adapter (LoRA) that serves as a parameter-efficient memory module for the base model. Concurrently, the outer loop learns to use the updated adapter to accurately recall and reason over relevant information from the encoded long context. Our evaluations on several long-context reasoning tasks show that PERK significantly outperforms the standard prompt-based long-context baseline, achieving average absolute performance gains of up to 90% for smaller models (GPT-2) and up to 27% for our largest evaluated model, Qwen-2.5-0.5B. In general, PERK is more robust to reasoning complexity, length extrapolation, and the locations of relevant information in contexts. Finally, we show that while PERK is memory-intensive during training, it scales more efficiently at inference time than prompt-based long-context inference.

Summary

  • The paper introduces PERK, a meta-learning framework that reframes long-context reasoning as a test-time learning problem using lightweight, gradient-updated adapters.
  • It employs a low-rank adapter (LoRA) whose parameters are the only ones updated during meta-training and test-time adaptation; the base model stays frozen, significantly reducing memory and compute overhead compared to full-model finetuning.
  • Experimental results show up to a 60% absolute improvement on needle-in-a-haystack tasks and only a 5% accuracy drop when extrapolating from 8K-token training contexts to 32K-token test contexts, demonstrating robust performance on lengthy inputs.

PERK: Parameter-Efficient Test-Time Learning for Long-Context Reasoning

The paper introduces PERK, a meta-learning framework for long-context reasoning that leverages parameter-efficient test-time adaptation. The approach addresses a persistent challenge for LLMs: robust reasoning over extensive, noisy contexts in which relevant information may be sparsely distributed and obscured by distractors. PERK reframes long-context reasoning as a test-time learning problem, encoding the context into a lightweight adapter via gradient-based updates rather than relying solely on prompt-based in-context learning or full-model finetuning.

Methodology

PERK builds on the test-time learning paradigm, where a model adapts its parameters at inference by processing the input context, thus internalizing relevant information for subsequent reasoning. The key innovation is the use of a low-rank adapter (LoRA) as a parameter-efficient memory module. Only the adapter parameters are updated during both meta-training and test-time adaptation, while the base model remains frozen. This design significantly reduces the memory and computational overhead associated with meta-learning approaches such as MAML, which are otherwise prohibitive for long-context and large-model settings.
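
For intuition, here is a minimal sketch of this idea, assuming a single linear layer in place of the full transformer; the class name and initialization below are illustrative, not the paper's code.

```python
# Minimal toy sketch (not the paper's code): a linear layer with a frozen base
# weight plus a trainable low-rank (LoRA) delta. Only lora_A and lora_B receive
# gradients, so they act as the parameter-efficient memory module.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in: int, d_out: int, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)             # frozen base-model weight
        self.lora_A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(d_out, rank))  # delta starts at zero
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base output plus the low-rank correction that stores the encoded context.
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)
```

At test time, only the two low-rank matrices are optimized while the base weight stays fixed, which is what makes the adapter a cheap, writable memory.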

The meta-training procedure consists of two nested optimization loops:

  • Inner Loop: The model encodes batches of context segments into the LoRA adapter via gradient descent, simulating the test-time adaptation process.
  • Outer Loop: The meta-parameters of the adapter are optimized to minimize the reasoning loss on downstream tasks, backpropagating through the inner loop. To further reduce memory usage, PERK employs truncated gradient unrolling, retaining the computation graph only for the final few inner-loop steps.

This approach enables PERK to scale to longer contexts and larger models than previous test-time learning methods, while maintaining parameter efficiency.
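
A minimal functional sketch of the nested loops, assuming a toy linear model with squared-error stand-ins for the language-modeling and reasoning losses (ranks, learning rates, and step counts are illustrative), might look as follows. The truncated early steps are handled here with a first-order shortcut, one common way to realize truncated unrolling and not necessarily the authors' exact scheme.

```python
# Sketch of PERK-style meta-training with truncated gradient unrolling.
# Toy stand-ins: a linear "base model" with a LoRA delta and squared-error
# losses; shapes, ranks, learning rates, and step counts are illustrative.
import torch

torch.manual_seed(0)
D, R = 32, 4                                    # toy hidden size and LoRA rank
W_base = torch.randn(D, D) / D ** 0.5           # frozen base weight
adapter = {"A": (torch.randn(R, D) * 0.01).requires_grad_(True),  # meta-parameters
           "B": torch.zeros(D, R, requires_grad=True)}

def forward(params, x):
    """Frozen base weight plus the low-rank adapter delta."""
    return x @ (W_base + params["B"] @ params["A"]).T

def encode_loss(params, segment):
    """Inner-loop objective: stand-in for the LM loss on one context segment."""
    x, y = segment
    return ((forward(params, x) - y) ** 2).mean()

def reason_loss(params, query, answer):
    """Outer-loop objective: stand-in for the reasoning loss on the query."""
    return ((forward(params, query) - answer) ** 2).mean()

def inner_adapt(meta_params, segments, lr=0.1, n_steps=8, unroll_last=2):
    """Encode segments into the adapter; keep the graph only for the last steps."""
    params = dict(meta_params)
    for step in range(n_steps - unroll_last):           # truncated: no graph kept
        g = torch.autograd.grad(encode_loss(params, segments[step % len(segments)]),
                                list(params.values()))
        params = {k: (p - lr * gi).detach().requires_grad_(True)
                  for (k, p), gi in zip(params.items(), g)}
    leaves = dict(params)                                # truncation point
    for step in range(n_steps - unroll_last, n_steps):   # unrolled: graph retained
        g = torch.autograd.grad(encode_loss(params, segments[step % len(segments)]),
                                list(params.values()), create_graph=True)
        params = {k: p - lr * gi for (k, p), gi in zip(params.items(), g)}
    return params, leaves

def meta_step(meta_params, meta_opt, segments, query, answer):
    """Outer loop: adapt on the context, then update the adapter initialization."""
    adapted, leaves = inner_adapt(meta_params, segments)
    loss = reason_loss(adapted, query, answer)
    grads = torch.autograd.grad(loss, list(leaves.values()))
    meta_opt.zero_grad()
    for p, g in zip(meta_params.values(), grads):   # first-order shortcut across
        p.grad = g                                  # the truncated early steps
    meta_opt.step()
    return loss.item()

# Toy usage: one meta-training step on random "context" segments and a query.
segments = [(torch.randn(8, D), torch.randn(8, D)) for _ in range(4)]
query, answer = torch.randn(2, D), torch.randn(2, D)
meta_opt = torch.optim.Adam(adapter.values(), lr=1e-2)
print(meta_step(adapter, meta_opt, segments, query, answer))
```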

Experimental Results

PERK is evaluated on several long-context reasoning benchmarks, including:

  • Needle-in-a-Haystack (NIAH): Tasks where relevant facts are embedded among large amounts of distractor text.
  • Drops-in-the-Ocean (DIO): A more challenging setting introduced in this work, where relevant and distractor facts are distributionally similar, simulating real-world scenarios in which relevant information is not easily distinguishable.
  • API Retrieval: Tasks requiring retrieval of relevant information from long, structured documents.

Key empirical findings:

  • Substantial Performance Gains: On NIAH, PERK achieves up to 60% absolute improvement over finetuned in-context reasoning (FT-ICR) with GPT-2, and up to 27% with Qwen-2.5-0.5B. On DIO, PERK consistently outperforms all baselines, including larger models.
  • Robustness to Reasoning Complexity and Context Length: PERK maintains high accuracy as the number of reasoning hops and context length increase, with only minor degradation even at 8K tokens and beyond.
  • Length Extrapolation: PERK generalizes to contexts up to 128K tokens, far exceeding the training context length, with significantly less performance drop than FT-ICR. For example, when trained on 8K tokens and tested on 32K, PERK's accuracy drops by only 5%, compared to 32% for FT-ICR.
  • Positional Robustness: Unlike FT-ICR, which is highly sensitive to the position of relevant information, PERK generalizes across contexts where relevant facts appear at arbitrary positions.
  • Efficiency: PERK scales more efficiently in both memory and runtime at inference, especially for extremely long contexts. Through gradient accumulation, memory usage can be traded for runtime, enabling practical deployment on hardware with limited resources.
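
As an illustration of the memory-for-runtime trade, a hypothetical test-time encoding loop could accumulate gradients over micro-batches of context segments before each adapter update; the interface below is an assumption for illustration, not the authors' implementation.

```python
def encode_with_accumulation(optimizer, segments, segment_loss,
                             micro_batch=1, n_steps=4):
    """Test-time context encoding with gradient accumulation (hypothetical helper).

    `optimizer` wraps only the LoRA adapter tensors (the base model is frozen);
    `segment_loss(segment)` returns a scalar loss for one context segment.
    Processing a few segments at a time keeps peak memory flat at the cost of
    more sequential work, trading memory for runtime.
    """
    for _ in range(n_steps):
        optimizer.zero_grad()
        for i in range(0, len(segments), micro_batch):
            loss = sum(segment_loss(s) for s in segments[i:i + micro_batch])
            (loss / len(segments)).backward()   # gradients accumulate across chunks
        optimizer.step()                        # one adapter update per full pass
```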

Implementation Considerations

Adapter Design: PERK uses LoRA adapters with a tunable rank (e.g., 16 or 256), balancing parameter efficiency against performance. Ablations show that even at small ranks, PERK outperforms full-model finetuning baselines.
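
For reference, an adapter with a tunable rank could be attached to a frozen base model with the Hugging Face peft library roughly as follows; the checkpoint, target modules, and alpha are illustrative assumptions rather than the paper's reported configuration.

```python
# Illustrative adapter setup with Hugging Face peft; hyperparameters are assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
lora_cfg = LoraConfig(
    r=16,                                   # adapter rank (e.g., 16 or 256)
    lora_alpha=32,                          # scaling factor (assumed value)
    target_modules=["q_proj", "v_proj"],    # assumed injection points
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)      # base weights frozen, LoRA trainable
model.print_trainable_parameters()          # confirms the small trainable fraction
```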

Inner Loop Steps: The number of inner loop adaptation steps is a critical hyperparameter. More steps generally improve performance, but increase memory and compute requirements. Truncated gradient unrolling allows for practical training with longer adaptation trajectories.

Batching and Context Segmentation: Long contexts are split into batches of shorter segments for encoding, enabling processing of sequences far beyond the model's native context window.
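
A hypothetical segmentation helper sketches the idea; the segment length and interface are assumptions, not details from the paper.

```python
# Hypothetical helper: split a tokenized long context into fixed-length segments
# that are encoded into the adapter batch by batch, so the effective context can
# exceed the base model's native window.
def segment_context(token_ids: list[int], segment_len: int = 1024) -> list[list[int]]:
    return [token_ids[i:i + segment_len]
            for i in range(0, len(token_ids), segment_len)]
```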

Hardware Requirements: Training PERK is memory-intensive, but inference is highly efficient. The method is validated on a single Nvidia H100 GPU, but the parameter-efficient design makes it amenable to deployment on more modest hardware, especially at inference.

Theoretical and Practical Implications

PERK demonstrates that parameter-efficient test-time learning is a viable and scalable alternative to prompt-based or full-model adaptation for long-context reasoning. By encoding context into a lightweight adapter, the model can robustly retrieve and reason over relevant information, even in the presence of substantial noise and distributional similarity between relevant and irrelevant facts.

The approach challenges the prevailing reliance on ever-larger context windows and attention mechanisms, suggesting that parametric memory—when efficiently and adaptively updated—can yield superior long-context reasoning capabilities. The robustness to positional shifts and length extrapolation further indicates that PERK's adaptation mechanism internalizes context in a manner less susceptible to the limitations of positional encoding and attention span.

Future Directions

Several avenues for further research and application are suggested:

  • Adapter Architectures: Exploring alternative or more expressive adapter designs could further improve performance or efficiency.
  • Task Generalization: Extending PERK to a broader range of reasoning tasks, including those requiring multi-modal or structured input.
  • Continual and Lifelong Learning: Integrating PERK with continual learning frameworks to enable models that can adapt to evolving knowledge bases over time.
  • Deployment in Resource-Constrained Environments: Leveraging PERK's inference efficiency for on-device or edge deployment in applications requiring long-context understanding (e.g., document analysis, legal or medical records processing).

Conclusion

PERK establishes a new paradigm for long-context reasoning in LLMs, combining meta-learning with parameter-efficient test-time adaptation. The method achieves strong empirical results across multiple benchmarks, demonstrating superior performance, generalization, and efficiency compared to established baselines. Its design principles and empirical findings have significant implications for the development of scalable, robust, and efficient long-context LLMs.
