Improving Continual Pre-training Through Seamless Data Packing (2505.22018v2)

Published 28 May 2025 in cs.CL

Abstract: Continual pre-training has demonstrated significant potential in enhancing model performance, particularly in domain-specific scenarios. The most common approach for packing data before continual pre-training involves concatenating input texts and splitting them into fixed-length sequences. While straightforward and efficient, this method often leads to excessive truncation and context discontinuity, which can hinder model performance. To address these issues, we explore the potential of data engineering to enhance continual pre-training, particularly its impact on model performance and efficiency. We propose Seamless Packing (SP), a novel data packing strategy aimed at preserving contextual information more effectively and enhancing model performance. Our approach employs a sliding window technique in the first stage that synchronizes overlapping tokens across consecutive sequences, ensuring better continuity and contextual coherence. In the second stage, we adopt a First-Fit-Decreasing algorithm to pack shorter texts into bins slightly larger than the target sequence length, thereby minimizing padding and truncation. Empirical evaluations across various model architectures and corpus domains demonstrate the effectiveness of our method, outperforming baseline method in 99% of all settings. Code is available at https://github.com/Infernus-WIND/Seamless-Packing.

Summary

  • The paper introduces Seamless Packing, a two-stage strategy that minimizes truncation and padding to enhance contextual coherence and model performance.
  • It employs a dynamic sliding window and First-Fit-Decreasing heuristic to optimize data utilization, achieving up to fourfold improvement over baselines.
  • Comprehensive experiments across diverse domains, including multilingual settings, confirm the method’s robust generalization and reduced hallucination.

Seamless Data Packing for Enhanced Continual Pre-training

The paper "Improving Continual Pre-training Through Seamless Data Packing" (2505.22018) addresses the critical yet often overlooked aspect of data packing in continual pre-training of LLMs. It posits that conventional data packing methods, which rely on simple concatenation and truncation, can disrupt contextual continuity and introduce inefficiencies due to padding. To mitigate these issues, the authors propose Seamless Packing (SP), a novel two-stage data packing strategy designed to optimize contextual coherence while minimizing truncation and padding. The paper's central claim is that SP enhances model performance and generalization across diverse domains and tasks.

Seamless Packing Methodology

The proposed SP method comprises two sequential stages: Sliding Window and Packing with Dropping (Figure 1).

Figure 1: An illustration of the Seamless Packing method.

The first stage, Sliding Window, processes long texts that meet a specific length criterion using a dynamic sliding window technique. This technique maximizes contextual continuity by dynamically adjusting the overlap between consecutive sequences based on a predefined maximum repetition ratio ($r_{max}$). The sliding window is applied when $L_{original} + L_{max\_overlap} \ge (n+1) \times L_{seq}$, where $L_{original}$ is the length of the original text, $L_{max\_overlap}$ is the maximum allowed overlap, $n$ is the number of full sequences, and $L_{seq}$ is the target sequence length. Texts that do not meet this condition are divided into chunks of length $L_{seq}$, with incomplete chunks deferred to the second stage.

The second stage, Packing with Dropping, addresses the remaining shorter texts using the First-Fit-Decreasing (FFD) algorithm, an approximation heuristic for the NP-hard bin packing problem. The bin capacity is slightly larger than $L_{seq}$, controlled by the parameter $C_{extra}$, and tokens exceeding $L_{seq}$ after packing are discarded. This strategy minimizes both padding and truncation, ensuring efficient sequence utilization.
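
To make the two stages concrete, the sketch below outlines one way they could be implemented in Python. It is a minimal approximation for illustration only: the function names, the even spacing of the overlapping windows, and the exact drop rule are assumptions, not the authors' released code (see the linked repository for that).

```python
# Illustrative sketch of the two Seamless Packing stages. Function names, the
# even-overlap scheme in stage one, and the drop rule in stage two are
# assumptions made for clarity, not the authors' released implementation.

def sliding_window_stage(token_ids, seq_len, r_max):
    """Stage 1: cut a tokenized text into full-length sequences, reusing up to
    r_max * seq_len overlapping tokens so one extra sequence can be formed."""
    max_overlap = int(r_max * seq_len)
    n_full = len(token_ids) // seq_len
    sequences, leftover = [], []
    # Apply the sliding window only when the allowed overlap is enough to fill
    # one additional sequence: L_original + L_max_overlap >= (n + 1) * L_seq.
    if n_full >= 1 and len(token_ids) + max_overlap >= (n_full + 1) * seq_len:
        step = (len(token_ids) - seq_len) // n_full  # spreads overlap evenly
        starts = [i * step for i in range(n_full)] + [len(token_ids) - seq_len]
        for start in starts:
            sequences.append(token_ids[start:start + seq_len])
    else:
        # Plain chunking; an incomplete tail is deferred to stage 2.
        for start in range(0, n_full * seq_len, seq_len):
            sequences.append(token_ids[start:start + seq_len])
        leftover = token_ids[n_full * seq_len:]
    return sequences, leftover


def ffd_packing_stage(short_texts, seq_len, c_extra):
    """Stage 2: First-Fit-Decreasing into bins of capacity seq_len + c_extra,
    then truncate each packed bin back to seq_len, dropping the excess."""
    capacity = seq_len + c_extra
    bins = []
    for text in sorted(short_texts, key=len, reverse=True):  # decreasing order
        for b in bins:
            if sum(len(t) for t in b) + len(text) <= capacity:
                b.append(text)  # first bin with enough room
                break
        else:
            bins.append([text])  # open a new bin
    packed = []
    for b in bins:
        tokens = [tok for text in b for tok in text]
        packed.append(tokens[:seq_len])  # tokens beyond L_seq are discarded
    return packed
```

Sorting the short texts in decreasing length order before first-fit placement is what makes the heuristic First-Fit-Decreasing rather than plain first-fit, and the slightly enlarged bin capacity $L_{seq} + C_{extra}$ is what allows a small number of excess tokens to be dropped instead of padding the sequence.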

Theoretical Analysis of SP

The paper presents a theoretical analysis of the proportion of texts processed by each stage of SP as a function of $r_{max}$. The analysis leverages the observation that text-length distributions typically decay as sequence length increases. Combining the condition for applying the sliding window with the definition of $n$, the authors derive an expression for the number of texts processed by the sliding window, $N_{sw}$:

$$N_{sw} = \sum_{k=1}^{\lfloor \frac{1}{r_{max}} \rfloor} T_k + \left\lceil \frac{1}{r_{max}} \right\rceil T_{\lceil \frac{1}{r_{max}} \rceil}$$

where $T_k$ denotes the number of texts with tokenized length in the interval $(k L_{seq}, (k+1) L_{seq}]$. The analysis also provides an estimate of the total number of tokens in the shorter chunks processed in the second stage, $N_{token\_short}$:

$$N_{token\_short} = \sum_{k=1}^{\lceil \frac{1}{r_{max}} \rceil} (1 - k r_{max}) T_k \times \frac{(1 - k r_{max}) L_{seq}}{4}$$
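
Given a histogram of tokenized text lengths, both estimates can be computed directly. The snippet below is a straightforward transcription of the two expressions above; the bucket counts in the example are invented for illustration, and the helper name is not from the paper.

```python
import math

def estimate_stage_loads(T, r_max, seq_len):
    """Evaluate N_sw and N_token_short from the expressions above.
    T maps the 1-based bucket index k to the number of texts whose tokenized
    length falls in (k * seq_len, (k + 1) * seq_len]."""
    k_floor = math.floor(1 / r_max)
    k_ceil = math.ceil(1 / r_max)

    # N_sw: every text up to the floor(1/r_max) bucket, plus the weighted
    # ceiling-bucket term.
    n_sw = sum(T.get(k, 0) for k in range(1, k_floor + 1)) + k_ceil * T.get(k_ceil, 0)

    # N_token_short: expected token mass left over as short chunks.
    n_token_short = sum(
        (1 - k * r_max) * T.get(k, 0) * (1 - k * r_max) * seq_len / 4
        for k in range(1, k_ceil + 1)
    )
    return n_sw, n_token_short


# Invented histogram: 1000 texts in bucket 1, 400 in bucket 2, 150 in bucket 3.
print(estimate_stage_loads({1: 1000, 2: 400, 3: 150}, r_max=0.3, seq_len=1024))
```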

Experimental Evaluation and Results

The efficacy of SP is evaluated through extensive experiments across diverse domains and models, including GPT-2, LLaMA-3, and Qwen2.5. The datasets used for continual pre-training include BBC News, financial news articles, and PubMed articles. The paper compares SP against several baselines, including the original model (OM), concatenation and truncation (CT), and Best-Fit-Decreasing (BFD).

The results demonstrate that SP consistently outperforms conventional data packing methods, achieving superior performance and generalization; its reported improvement is roughly fourfold that of the baseline methods (0.96% vs. 0.24%). The evaluation spans perplexity as well as full-parameter and LoRA fine-tuning, and SP achieves the best results in the majority of cases, demonstrating its robustness across different domains and task types.

Furthermore, the paper presents a generalization analysis, evaluating SP under mixed-domain and general-domain settings. The results confirm that SP generalizes effectively across both domain-specific and general-domain continual pre-training scenarios. The cross-lingual applicability of SP is also examined, with experiments on a French dataset demonstrating its effectiveness in multilingual settings.

Ablation Studies and Hyperparameter Analysis

Ablation studies are conducted to evaluate the contributions of the two key stages in SP and to compare BFD and FFD. The results highlight the effectiveness of the sliding window technique and suggest that slightly increasing bin capacity enhances overall model performance. The paper also analyzes the impact of the hyperparameters $r_{max}$ and $C_{extra}$, finding that a moderate choice of $r_{max} = 0.3$ achieves a balance between maintaining contextual continuity and preventing excessive redundancy (Figure 2a). Similarly, setting $C_{extra} = 50$ achieves the best trade-off between preserving essential context and effectively utilizing the additional capacity (Figure 2b).

Figure 2: Influence of $r_{max}$ and $C_{extra}$ on model performance.
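
As a concrete end-to-end illustration, the toy driver below wires the reported settings ($r_{max} = 0.3$, $C_{extra} = 50$) into the hypothetical functions from the earlier sketch; the whitespace tokenizer and three-document corpus are stand-ins for a real tokenizer and corpus.

```python
# Toy driver using the reported hyperparameters r_max = 0.3 and C_extra = 50
# with the hypothetical sliding_window_stage / ffd_packing_stage sketched above.
SEQ_LEN, R_MAX, C_EXTRA = 1024, 0.3, 50

corpus = ["long document " * 900, "short note " * 30, "another brief text " * 20]

def tokenize(text):
    return text.split()  # placeholder for a real tokenizer

sequences, shorts = [], []
for document in corpus:
    token_ids = tokenize(document)
    full_seqs, leftover = sliding_window_stage(token_ids, SEQ_LEN, R_MAX)
    sequences.extend(full_seqs)
    if leftover:
        shorts.append(leftover)  # incomplete chunks go to stage 2

sequences.extend(ffd_packing_stage(shorts, SEQ_LEN, C_EXTRA))
print(len(sequences), "packed sequences of length at most", SEQ_LEN)
```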

To further demonstrate the importance of contextual continuity, a case study is presented, showing that SP effectively reduces hallucination and improves factual consistency in downstream tasks. The paper also includes an empirical analysis of the trade-off between dropping and padding, as well as a comparison of the computational efficiency of BFD and FFD.

Conclusion

The paper makes a compelling case for the importance of data engineering in continual pre-training. By optimizing both segment placement and overlap strategy, SP preserves contextual continuity while reducing truncation and padding. The empirical results and theoretical analysis provide valuable insights into the design and implementation of data packing strategies for LLMs. The authors acknowledge limitations, including the need for a comprehensive theoretical framework explaining token dropping and padding dynamics, as well as further investigation into the generalizability of SP to other domains and pre-training settings.
