LOGO -- Long cOntext aliGnment via efficient preference Optimization (2410.18533v1)
Abstract: Long-context models (LCMs) have shown great potential in processing long input sequences (even more than 100M tokens) conveniently and effectively. With significant progress, recent research has pointed out that LCMs can accurately locate token-level salient information within the context. Yet the generation performance of these LCMs is far from satisfactory and can produce misaligned responses, such as hallucinations. To enhance the generation capability of LCMs, existing works have investigated the effects of data size and quality for both pre-training and instruction tuning. Though achieving meaningful improvements, previous methods fall short in either effectiveness or efficiency. In this paper, we introduce LOGO (Long cOntext aliGnment via efficient preference Optimization), the first training strategy to introduce preference optimization for long-context alignment. To overcome the GPU memory bottleneck caused by long sequences, LOGO employs a reference-free preference optimization strategy and adopts a position synthesis method to construct the training data. By training with only 0.3B tokens of data on a single 8$\times$A800 GPU machine for 16 hours, LOGO allows the Llama-3-8B-Instruct-80K model to achieve performance comparable to GPT-4 on real-world long-context tasks while preserving the model's original capabilities on other tasks, e.g., language modeling and MMLU. Moreover, LOGO can extend the model's context window size while enhancing its generation performance.
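To make the abstract's two core ideas concrete, the sketch below illustrates (i) a reference-free preference loss in the spirit of SimPO/ORPO, which avoids keeping a frozen reference model in GPU memory, and (ii) PoSE-style positional index synthesis, which maps short training sequences onto a longer target position window. This is a minimal sketch under assumed PyTorch conventions; the function names, the scaling factor `beta`, the margin `gamma`, and the chunking scheme are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch (not LOGO's exact objective): a reference-free preference loss
# plus PoSE-style position synthesis, as hinted at in the abstract.
import torch
import torch.nn.functional as F


def avg_logp(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Length-normalized log-likelihood of `labels`; -100 marks ignored
    positions (e.g., the long context itself)."""
    logp = F.log_softmax(logits, dim=-1)                # (batch, seq, vocab)
    mask = labels.ne(-100)
    token_logp = logp.gather(-1, labels.clamp_min(0).unsqueeze(-1)).squeeze(-1)
    return (token_logp * mask).sum(-1) / mask.sum(-1).clamp_min(1)


def reference_free_preference_loss(chosen_logp: torch.Tensor,
                                   rejected_logp: torch.Tensor,
                                   beta: float = 2.0,
                                   gamma: float = 0.5) -> torch.Tensor:
    """Prefer the chosen response over the rejected one by a margin `gamma`,
    with no frozen reference model (only one policy has to fit in memory)."""
    return -F.logsigmoid(beta * (chosen_logp - rejected_logp) - gamma).mean()


def synthesize_positions(short_len: int, target_len: int,
                         num_chunks: int = 4) -> torch.Tensor:
    """Map a short sequence onto position ids drawn from a longer window by
    inserting random skips between chunks, so the model sees long-range
    relative distances without attending over a truly long input.
    Assumes target_len >= short_len."""
    positions = torch.arange(short_len)
    bounds = [round(i * short_len / num_chunks) for i in range(num_chunks + 1)]
    # Monotonically increasing skips bounded by the extra position budget.
    skips, _ = torch.sort(
        torch.randint(0, target_len - short_len + 1, (num_chunks - 1,)))
    for i in range(1, num_chunks):
        positions[bounds[i]:bounds[i + 1]] += skips[i - 1]
    return positions  # maximum value stays below target_len
```

In practice, `chosen_logp` and `rejected_logp` would come from `avg_logp` evaluated on a preference pair sharing the same position-synthesized context; the exact margin, normalization, and data-construction choices are where LOGO's actual design may differ from this sketch.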
- Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024.
- Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- AI@Meta. Llama 3.1 model card. Blog, 2024a. URL https://ai.meta.com/blog/meta-llama-3-1/.
- AI@Meta. Llama 3 model card. Blog, 2024b. URL https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
- Deepspeed-inference: enabling efficient inference of transformer models at unprecedented scale. In SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–15. IEEE, 2022.
- Anthropic. Claude 3.5 Sonnet model card. Blog, 2024. URL https://www.anthropic.com/news/claude-3-5-sonnet.
- Longbench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508, 2023.
- Longalign: A recipe for long context alignment of large language models. arXiv preprint arXiv:2401.18058, 2024.
- Luna: An evaluation foundation model to catch language model hallucinations with high accuracy and low cost. arXiv preprint arXiv:2406.00975, 2024.
- Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595, 2023a.
- Longlora: Efficient fine-tuning of long-context large language models. arXiv preprint arXiv:2309.12307, 2023b.
- Long alpaca: Long-context instruction-following models. https://github.com/dvlab-research/LongLoRA, 2023c.
- Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.
- Together Computer. Redpajama: an open dataset for training large language models, 2023. URL https://github.com/togethercomputer/RedPajama-Data.
- Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023.
- The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- Chain-of-thought hub: A continuous effort to measure large language models’ reasoning performance. arXiv preprint arXiv:2305.17306, 2023.
- Data engineering for scaling language models to 128k context. arXiv preprint arXiv:2402.10171, 2024.
- gkamradt. LLMTest-NeedleInAHaystack. https://github.com/gkamradt/LLMTest_NeedleInAHaystack, 2023.
- Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
- Reference-free monolithic preference optimization with odds ratio. arXiv preprint arXiv:2403.07691, 2024.
- Ruler: What’s the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024.
- Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
- Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
- Llm maybe longlm: Self-extend llm context window without tuning. arXiv preprint arXiv:2401.01325, 2024.
- Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958, 2021.
- Simpo: Simple preference optimization with a reference-free reward. arXiv preprint arXiv:2405.14734, 2024.
- Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.
- Yarn: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071, 2023.
- Train short, test long: Attention with linear biases enables input length extrapolation. arXiv preprint arXiv:2108.12409, 2021.
- Compressive transformers for long-range sequence modelling. arXiv preprint arXiv:1911.05507, 2019.
- Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024.
- Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020.
- On context utilization in summarization with large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2764–2781, 2024.
- Ragchecker: A fine-grained framework for diagnosing retrieval-augmented generation. arXiv preprint arXiv:2408.08067, 2024.
- Randomized positional encodings boost length generalization of transformers. arXiv preprint arXiv:2305.16843, 2023.
- Triple preference optimization: Achieving better alignment with less data in a single step optimization. arXiv preprint arXiv:2405.16681, 2024.
- Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Large language models can be easily distracted by irrelevant context. In International Conference on Machine Learning, pp. 31210–31227. PMLR, 2023.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- Focused transformer: Contrastive training for context scaling. Advances in Neural Information Processing Systems, 36, 2024.
- Long context alignment with short instructions and synthesized positions. arXiv preprint arXiv:2405.03939, 2024a.
- Retrieval head mechanistically explains long-context factuality. arXiv preprint arXiv:2404.15574, 2024b.
- Effective long-context scaling of foundation models. arXiv preprint arXiv:2309.16039, 2023.
- Contrastive preference optimization: Pushing the boundaries of llm performance in machine translation. arXiv preprint arXiv:2401.08417, 2024.
- Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024.
- Gated linear attention transformers with hardware-efficient training. arXiv preprint arXiv:2312.06635, 2023.
- Rrhf: Rank responses to align language models with human feedback. Advances in Neural Information Processing Systems, 36, 2024.
- Hengyu Zhang. Sinklora: Enhanced efficiency and chat capabilities for long-context large language models. arXiv preprint arXiv:2406.05678, 2024.
- Longcite: Enabling llms to generate fine-grained citations in long-context qa. arXiv preprint arXiv:2409.02897, 2024a.
- Extending llama-3’s context ten-fold overnight. arXiv preprint arXiv:2404.19553, 2024b.
- Pose: Efficient context window extension of llms via positional skip-wise training. arXiv preprint arXiv:2309.10400, 2023.
- Deepseek-coder-v2: Breaking the barrier of closed-source models in code intelligence. arXiv preprint arXiv:2406.11931, 2024.
Authors: Zecheng Tang, Zechen Sun, Juntao Li, Qiaoming Zhu, Min Zhang