LOGO -- Long cOntext aliGnment via efficient preference Optimization (2410.18533v1)

Published 24 Oct 2024 in cs.CL and cs.AI

Abstract: Long-context models (LCMs) have shown great potential in processing long input sequences (even more than 100M tokens) conveniently and effectively. With significant progress, recent research has pointed out that LCMs can accurately locate token-level salient information within the context. Yet, the generation performance of these LCMs is far from satisfactory and may result in misaligned responses, such as hallucinations. To enhance the generation capability of LCMs, existing works have investigated the effects of data size and quality for both pre-training and instruction tuning. Though achieving meaningful improvement, previous methods fall short in either effectiveness or efficiency. In this paper, we introduce LOGO (Long cOntext aliGnment via efficient preference Optimization), a training strategy that first introduces preference optimization for long-context alignment. To overcome the GPU memory-bound issue caused by long sequences, LOGO employs a reference-free preference optimization strategy and adopts a position synthesis method to construct the training data. By training with only 0.3B tokens of data on a single 8×A800 GPU machine for 16 hours, LOGO allows the Llama-3-8B-Instruct-80K model to achieve performance comparable with GPT-4 on real-world long-context tasks while preserving the model's original capabilities on other tasks, e.g., language modeling and MMLU. Moreover, LOGO can extend the model's context window size while enhancing its generation performance.

Authors (5)
  1. Zecheng Tang (19 papers)
  2. Zechen Sun (2 papers)
  3. Juntao Li (89 papers)
  4. Qiaoming Zhu (15 papers)
  5. Min Zhang (630 papers)

Summary

LOGO — Long Context Alignment via Efficient Preference Optimization

In this paper, the authors introduce a training strategy called LOGO (Long cOntext aliGnment via efficient preference Optimization), aimed at improving the ability of long-context models (LCMs) to generate aligned responses for extensive input sequences. Despite their capability to locate salient information within lengthy contexts, LCMs often falter at generating coherent and accurate outputs, producing misaligned responses such as hallucinations or failures to follow instructions. This research proposes a systematic approach to enhancing the generative capabilities of LCMs without compromising their existing strengths.

Contribution and Methodology

The primary contribution of the paper is a method that incorporates an efficient preference optimization strategy into the training of LLMs for better long-context alignment. The authors highlight two main challenges of learning with long contexts: (1) the context dominates the prediction portion during training, diluting the effectiveness of the cross-entropy (CE) loss as a signal for optimizing generation capability; (2) processing such extensive inputs requires substantial GPU memory.
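As a back-of-the-envelope illustration of the first challenge (a sketch with made-up lengths, not the paper's analysis): when the CE loss is averaged uniformly over all tokens of a sample, the response tokens' share of the training signal shrinks toward zero as the context grows.

```python
def response_loss_share(context_len: int, response_len: int) -> float:
    """Fraction of a uniformly token-averaged CE loss that falls on the response."""
    return response_len / (context_len + response_len)

# Hypothetical lengths: a 200-token response appended to growing contexts.
for ctx in (2_000, 32_000, 80_000):
    print(f"context={ctx:>6}: response share of loss = "
          f"{response_loss_share(ctx, 200):.4f}")
```

At an 80K-token context, the response contributes well under 1% of the averaged loss, which is why a generation-focused objective is attractive.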

To address these, LOGO introduces the following:

  1. Preference Optimization Strategy: LOGO employs a preference optimization strategy that requires no reference model, maximizing the model's likelihood of preferred responses over dis-preferred (misaligned) ones by adjusting reward-based scores during learning.
  2. Modified Training Objective: The training objective emphasizes distinguishing correct outputs from misaligned responses such as hallucinations, and uses multiple dis-preference instances to refine the model further.
  3. Data Construction and Positional Index Synthesis: The data pipeline constructs training samples by chunking long input contexts to keep memory usage manageable. It then applies positional index synthesis, which lets the model simulate the effects of longer inputs without actually lengthening the training sequences, thereby conserving memory.
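A minimal sketch of what a reference-free, multi-negative preference loss can look like. It uses a length-normalized sequence log-likelihood as the reward (in the style of SimPO's reference-free reward); the function names, the `beta` value, and the exact form of the objective are illustrative assumptions, not LOGO's published formulation.

```python
import math

def avg_logprob(token_logprobs):
    """Length-normalized sequence log-likelihood, used as a reference-free reward."""
    return sum(token_logprobs) / len(token_logprobs)

def preference_loss(preferred, dispreferred_list, beta=2.0):
    """Sketch of a reference-free preference loss with multiple negatives.

    preferred: per-token log-probs of the aligned response.
    dispreferred_list: per-token log-prob lists for misaligned responses.
    Minimizing this pushes the preferred reward above every dis-preferred one.
    """
    r_w = avg_logprob(preferred)
    loss = 0.0
    for neg in dispreferred_list:
        r_l = avg_logprob(neg)
        # -log(sigmoid(beta * (r_w - r_l))), written via log1p for stability
        loss += math.log1p(math.exp(-beta * (r_w - r_l)))
    return loss / len(dispreferred_list)

good = [-0.1, -0.2, -0.15]       # aligned answer: high average log-prob
bad1 = [-1.0, -0.9, -1.2, -1.1]  # e.g., a hallucinated answer
bad2 = [-0.8, -1.5]              # e.g., an off-instruction answer
print(preference_loss(good, [bad1, bad2]))
```

Because no reference model's log-probs appear in the loss, only one model needs to be held in GPU memory, which is the point of going reference-free at long sequence lengths.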

Experimental Results and Evaluation

Extensive evaluation of LOGO across various tasks demonstrates its efficacy. Key findings include:

  • Performance Parity with Closed-source Models: The LOGO-trained Llama-3-8B-Instruct-80K model achieved performance comparable to proprietary models such as GPT-4 on real-world long-context tasks, a substantial achievement for open-source initiatives.
  • Efficient Resource Utilization: Training required only 0.3B tokens over 16 hours on a single 8×A800 GPU machine, demonstrating LOGO's efficiency compared to approaches that demand significantly more resources.
  • Improvement across a Variety of Tasks: Alongside long-context comprehension and generation, LOGO maintained performance on language modeling and knowledge benchmarks (e.g., MMLU) without hindering short-context tasks, avoiding the alignment tax often imposed by long-context training.
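The positional index synthesis from the data-construction step above is what makes this training budget possible: short chunks of real text are assigned scattered position ids so the model sees position ranges of a long window without the memory cost of a long sequence. The sketch below is a PoSE-style skip-wise scheme; the function name and the exact gap-sampling rule are illustrative assumptions, not LOGO's exact recipe.

```python
import random

def synthesize_positions(chunk_lens, target_len, rng):
    """Assign position ids to short chunks so they emulate a long context.

    Each chunk keeps contiguous positions internally, but random gaps are
    skipped between chunks so that ids fall anywhere in [0, target_len).
    """
    total = sum(chunk_lens)
    assert total <= target_len, "chunks must fit inside the target window"
    slack = target_len - total
    # Sorted random offsets decide how much slack is consumed before each chunk.
    offsets = sorted(rng.randint(0, slack) for _ in chunk_lens)
    positions, cursor, prev = [], 0, 0
    for length, off in zip(chunk_lens, offsets):
        cursor += off - prev          # skip ahead by this chunk's gap
        prev = off
        positions.append(list(range(cursor, cursor + length)))
        cursor += length
    return positions

# Three 4-token chunks pretend to span a 64-position context window.
print(synthesize_positions([4, 4, 4], 64, random.Random(0)))
```

Only 12 tokens are attended to in this toy example, yet the model is exposed to position ids drawn from a window roughly five times longer.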

Broader Implications and Future Work

LOGO sets a new paradigm for training LCMs by mitigating misalignment through its training methodology rather than by merely scaling context length or increasing the volume and quality of training data. The implications for AI research are significant: it points to training strategies that do not rely heavily on computational and data resources.

Future work may further refine models' error-pattern recognition and context comprehension, or extend LOGO-like strategies to other domains and architectures. Understanding and optimizing long-context processing could unlock new capabilities for LLMs, such as more advanced summarization and processing of large-scale inputs in domains like bioinformatics or legal document analysis.

In conclusion, this paper presents a substantive advance in aligning LCMs for long-context tasks, offering insights and methodology relevant to both academic exploration and practical deployment of LLMs.