
Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding (2403.04797v1)

Published 5 Mar 2024 in cs.CL and cs.LG

Abstract: This paper aims to overcome the "lost-in-the-middle" challenge of LLMs. While recent advancements have successfully enabled LLMs to perform stable language modeling with up to 4 million tokens, the persistent difficulty faced by most LLMs in identifying relevant information situated in the middle of the context has not been adequately tackled. To address this problem, this paper introduces Multi-scale Positional Encoding (Ms-PoE), a simple yet effective plug-and-play approach to enhance the capacity of LLMs to handle relevant information located in the middle of the context, without fine-tuning or introducing any additional overhead. Ms-PoE leverages position index rescaling to relieve the long-term decay effect introduced by RoPE, while carefully assigning distinct scaling ratios to different attention heads to preserve essential knowledge learned during pre-training, forming a multi-scale context fusion from short to long distance. Extensive experiments with a wide range of LLMs demonstrate the efficacy of our approach. Notably, Ms-PoE achieves an average accuracy gain of up to 3.8 on the Zero-SCROLLS benchmark over the original LLMs. Code is available at https://github.com/VITA-Group/Ms-PoE.
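The core idea can be pictured as computing RoPE angles on rescaled position indices, with a different scaling ratio assigned to each attention head. Below is a minimal PyTorch sketch of that idea; the function names, the linear spread of ratios between `min_ratio` and `max_ratio`, and the assignment of ratios to heads in index order are illustrative assumptions, not the authors' implementation (the paper assigns ratios per head to preserve pre-trained behavior; see the repository for the exact procedure).

```python
import torch

def rope_angles(positions, head_dim, base=10000.0):
    # Standard RoPE rotation angles: theta_j = pos / base^(2j / head_dim)
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    return torch.outer(positions, inv_freq)          # [seq_len, head_dim // 2]

def ms_poe_angles(seq_len, num_heads, head_dim, min_ratio=1.2, max_ratio=1.8):
    # Hypothetical multi-scale variant: each head sees position indices divided
    # by its own ratio, so different heads attend at different context "scales".
    ratios = torch.linspace(min_ratio, max_ratio, num_heads)
    positions = torch.arange(seq_len, dtype=torch.float32)
    per_head = [rope_angles(positions / r, head_dim) for r in ratios]
    return torch.stack(per_head)                     # [num_heads, seq_len, head_dim // 2]

angles = ms_poe_angles(seq_len=16, num_heads=4, head_dim=8)
print(angles.shape)  # torch.Size([4, 16, 4])
```

In an actual model, these per-head angles would replace the shared RoPE angles when rotating queries and keys, which is what makes the approach plug-and-play: no weights are changed, only the position indices fed to RoPE at inference time.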

Authors (8)
  1. Zhenyu Zhang
  2. Runjin Chen
  3. Shiwei Liu
  4. Zhewei Yao
  5. Olatunji Ruwase
  6. Beidi Chen
  7. Xiaoxia Wu
  8. Zhangyang Wang