Long Context Alignment with Short Instructions and Synthesized Positions (2405.03939v1)

Published 7 May 2024 in cs.CL

Abstract: Effectively handling instructions with extremely long context remains a challenge for LLMs, typically necessitating high-quality long data and substantial computational resources. This paper introduces Step-Skipping Alignment (SkipAlign), a new technique designed to enhance the long-context capabilities of LLMs in the alignment phase, without requiring any effort beyond training at the original data length. SkipAlign is built on the premise that long-range dependencies are fundamental to an LLM's capacity for long context. Rather than merely expanding the length of input samples, SkipAlign synthesizes long-range dependencies from the perspective of position indices. This is achieved by strategically inserting skipped positions within instruction-following samples, which exploits the semantic structure of the data to effectively expand the context. Through extensive experiments on base models with a variety of context window sizes, SkipAlign demonstrates its effectiveness across a spectrum of long-context tasks. Notably, with a careful selection of the base model and alignment datasets, SkipAlign with only 6B parameters achieves its best performance and is comparable with strong baselines like GPT-3.5-Turbo-16K on LongBench.
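The core mechanism is easy to illustrate. Below is a minimal, hypothetical sketch of position-index skipping: a short instruction-response pair keeps its original tokens, but the response's position ids are shifted past a randomly sized gap so that the pair spans a much wider context window. The function name, the single-gap placement, and the uniform skip sampling are illustrative assumptions; the paper's SkipAlign uses the semantic structure of the samples to decide where and how positions are skipped.

```python
import random
from typing import List, Optional


def synthesize_skipped_positions(
    instruction_len: int,
    response_len: int,
    target_context: int = 16384,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Assign position ids to a short instruction-following sample so that
    they span a much larger context window, by inserting a random skip
    (a jump in the position indices) between the instruction and the
    response. The tokens themselves are unchanged; only the positions fed
    to the positional embedding are stretched.

    This is an illustrative sketch of position-index skipping, not the
    paper's exact sampling scheme.
    """
    rng = rng or random.Random()
    total_tokens = instruction_len + response_len

    # Budget of "unused" positions we can skip over while still fitting
    # every assigned position inside the target context window.
    max_skip = max(target_context - total_tokens, 0)
    skip = rng.randint(0, max_skip)

    # The instruction keeps consecutive positions starting at 0 ...
    instruction_positions = list(range(instruction_len))
    # ... and the response is placed after the skipped gap, forcing the
    # model to attend across a long-range dependency during alignment.
    response_start = instruction_len + skip
    response_positions = list(range(response_start, response_start + response_len))

    return instruction_positions + response_positions


# Example: a 300-token instruction and 200-token answer whose positions
# are spread across a 16K window even though only 500 tokens are trained on.
pos_ids = synthesize_skipped_positions(300, 200, target_context=16384,
                                       rng=random.Random(0))
print(pos_ids[:3], "...", pos_ids[-3:])
```

Because only the position ids change, the training cost stays that of the original short sequences while the model still observes position distances typical of genuinely long inputs.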

Authors (6)
  1. Wenhao Wu (71 papers)
  2. Yizhong Wang (42 papers)
  3. Yao Fu (83 papers)
  4. Xiang Yue (72 papers)
  5. Dawei Zhu (46 papers)
  6. Sujian Li (82 papers)
Citations (11)