
LongAttn: Selecting Long-context Training Data via Token-level Attention (2502.16860v2)

Published 24 Feb 2025 in cs.CL

Abstract: With the development of LLMs, there has been an increasing need for significant advancements in handling long contexts. To enhance long-context capabilities, constructing high-quality training data with long-range dependencies is crucial. Existing methods for selecting long-context data often rely on sentence-level analysis, leaving considerable room for improvement in both performance and efficiency. In this paper, we propose LongAttn, a novel token-level framework that leverages the self-attention mechanism of LLMs to measure long-range dependencies in the data. By calculating token-level dependency strength and the distribution uniformity of token scores, LongAttn effectively quantifies long-range dependencies, enabling more accurate and efficient data selection. We filter LongABC-32K from open-source long-context datasets (ArXiv, Book, and Code). Through comprehensive experiments, LongAttn demonstrates strong effectiveness, scalability, and efficiency. To facilitate future research on long-context data, we release our code and the high-quality long-context training data LongABC-32K.
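The abstract describes the scoring idea only at a high level, so the sketch below is a hypothetical illustration of how a token-level long-range dependency score could be computed from an LLM attention matrix. The distance threshold, the entropy-based uniformity measure, the way strength and uniformity are combined, and all function and parameter names are assumptions for illustration, not the paper's actual formulation.

```python
import torch

def longrange_score(attn, long_range_threshold=1024, eps=1e-9):
    """Hypothetical token-level long-range dependency score.

    attn: [num_heads, seq_len, seq_len] attention weights from one layer
          (rows = queries, columns = keys).
    Returns a scalar: higher means stronger and more uniformly
    distributed long-range attention for this sequence.
    """
    # Average attention over heads: [seq_len, seq_len].
    attn = attn.mean(dim=0)
    seq_len = attn.size(-1)

    # Keep only "long-range" query-key pairs (distance >= threshold).
    q_idx = torch.arange(seq_len).unsqueeze(1)
    k_idx = torch.arange(seq_len).unsqueeze(0)
    long_mask = (q_idx - k_idx) >= long_range_threshold
    long_attn = attn * long_mask

    # Dependency strength: attention mass each query places on distant tokens.
    strength = long_attn.sum(dim=-1)  # [seq_len]

    # Uniformity: normalized entropy of each token's long-range attention
    # distribution; penalizes mass concentrated on a few distant tokens.
    probs = long_attn / (strength.unsqueeze(-1) + eps)
    entropy = -(probs * (probs + eps).log()).sum(dim=-1)
    uniformity = entropy / torch.log(torch.tensor(float(seq_len)))

    # Combine per-token scores, averaging only over queries that can
    # actually attend at least `long_range_threshold` tokens back.
    valid = q_idx.squeeze(1) >= long_range_threshold
    token_scores = strength * uniformity
    return token_scores[valid].mean()
```

Under this reading, sequences whose tokens both attend strongly and spread that attention across many distant positions would rank highest and be retained for long-context training; the actual selection criterion used to build LongABC-32K is defined in the paper and released code.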

Authors (8)
  1. Longyun Wu (1 paper)
  2. Dawei Zhu (46 papers)
  3. Guangxiang Zhao (17 papers)
  4. Zhuocheng Yu (1 paper)
  5. Junfeng Ran (3 papers)
  6. Xiangyu Wong (1 paper)
  7. Lin Sun (65 papers)
  8. Sujian Li (83 papers)
