
LongAttn: Selecting Long-context Training Data via Token-level Attention (2502.16860v2)

Published 24 Feb 2025 in cs.CL

Abstract: With the development of LLMs, there has been an increasing need for significant advancements in handling long contexts. To enhance long-context capabilities, constructing high-quality training data with long-range dependencies is crucial. Existing methods for selecting long-context data often rely on sentence-level analysis, leaving considerable room for improvement in both performance and efficiency. In this paper, we propose LongAttn, a novel token-level framework that leverages the self-attention mechanism of LLMs to measure long-range dependencies in the data. By calculating token-level dependency strength and the distribution uniformity of token scores, LongAttn effectively quantifies long-range dependencies, enabling more accurate and efficient data selection. We filter LongABC-32K from open-source long-context datasets (ArXiv, Book, and Code). Through comprehensive experiments, LongAttn demonstrates strong effectiveness, scalability, and efficiency. To facilitate future research on long-context data, we release our code and the high-quality long-context training data LongABC-32K.
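The abstract describes the scoring idea only at a high level, so the sketch below is a hypothetical illustration of how a token-level long-range dependency score could be computed from an LLM attention matrix. The distance threshold, the entropy-based uniformity measure, the way strength and uniformity are combined, and all function and parameter names are assumptions for illustration, not the paper's actual formulation.

```python
import torch

def longrange_score(attn, long_range_threshold=1024, eps=1e-9):
    """Hypothetical token-level long-range dependency score.

    attn: [num_heads, seq_len, seq_len] attention weights from one layer
          (rows = queries, columns = keys).
    Returns a scalar: higher means stronger and more uniformly
    distributed long-range attention for this sequence.
    """
    # Average attention over heads: [seq_len, seq_len].
    attn = attn.mean(dim=0)
    seq_len = attn.size(-1)

    # Keep only "long-range" query-key pairs (distance >= threshold).
    q_idx = torch.arange(seq_len).unsqueeze(1)
    k_idx = torch.arange(seq_len).unsqueeze(0)
    long_mask = (q_idx - k_idx) >= long_range_threshold
    long_attn = attn * long_mask

    # Dependency strength: attention mass each query places on distant tokens.
    strength = long_attn.sum(dim=-1)  # [seq_len]

    # Uniformity: normalized entropy of each token's long-range attention
    # distribution; penalizes mass concentrated on a few distant tokens.
    probs = long_attn / (strength.unsqueeze(-1) + eps)
    entropy = -(probs * (probs + eps).log()).sum(dim=-1)
    uniformity = entropy / torch.log(torch.tensor(float(seq_len)))

    # Combine per-token scores, averaging only over queries that can
    # actually attend at least `long_range_threshold` tokens back.
    valid = q_idx.squeeze(1) >= long_range_threshold
    token_scores = strength * uniformity
    return token_scores[valid].mean()
```

Under this reading, sequences whose tokens both attend strongly and spread that attention across many distant positions would rank highest and be retained for long-context training; the actual selection criterion used to build LongABC-32K is defined in the paper and released code.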

Authors (8)
  1. Longyun Wu (1 paper)
  2. Dawei Zhu (46 papers)
  3. Guangxiang Zhao (17 papers)
  4. Zhuocheng Yu (1 paper)
  5. Junfeng Ran (3 papers)
  6. Xiangyu Wong (1 paper)
  7. Lin Sun (65 papers)
  8. Sujian Li (83 papers)
