- The paper proposes SkyLadder, a novel context window scheduling approach that dynamically increases the window size during pretraining to improve efficiency and performance.
- SkyLadder achieves significant results, including up to 22% faster training speeds and up to 3.7% accuracy gains on standard benchmarks like ARC-Easy, MMLU, and HellaSwag.
- The method is practical to implement, requiring only modifications to the attention masking in the pretraining loop, and it scales effectively with model size while also reducing attention entropy.
Introduction
The SkyLadder approach introduces a novel context window scheduling mechanism that challenges the prevailing assumption that longer context windows during pretraining invariably lead to superior performance. By leveraging a short-to-long schedule, the methodology capitalizes on the empirically observed benefits of training with shorter context windows under a fixed token budget, thus enabling both efficiency gains and enhanced performance on standard benchmarks as well as long-context tasks.
Methodology
SkyLadder implements a dynamic schedule whereby the effective context window length is increased linearly over the course of training. Mathematically, the context window at training step t is described by:
w(t) = min(w_e, w_s + ⌊α·t⌋)
where:
- w_s is the initial (shorter) context window,
- w_e is the target (longer) context window,
- α is the per-step expansion rate.
This strategy is integrated via masking mechanisms that initially constrain attention to a local window and then gradually allow tokens to attend to increasingly distant positions. Compared with training on full-length sequences from the start, this incremental expansion produces more stable and concentrated gradients, which translates into faster convergence and improved generalization on both standard and long-context benchmarks.
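To make the schedule concrete, here is a minimal sketch in PyTorch of one plausible realization: a linearly growing window w(t) paired with a causal mask restricted to that window. The helper names (`window_size_at`, `build_window_causal_mask`) are illustrative and not taken from the SkyLadder repository; the released implementation may realize the restriction differently (for example, with block-wise masks over packed sequences).

```python
# Minimal sketch (not the authors' code): a linear context-window schedule and the
# corresponding windowed causal attention mask.
import torch


def window_size_at(step: int, w_start: int, w_end: int, alpha: float) -> int:
    """w(t) = min(w_e, w_s + floor(alpha * t))."""
    return min(w_end, w_start + int(alpha * step))


def build_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask where True marks positions a query may attend to:
    causal (j <= i) and within the current window (i - j < window)."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions, shape (L, 1)
    j = torch.arange(seq_len).unsqueeze(0)  # key positions, shape (1, L)
    return (j <= i) & (i - j < window)


# Example: at step 0 the mask is purely local; once alpha * t reaches
# w_e - w_s, it becomes an ordinary full causal mask.
mask = build_window_causal_mask(
    seq_len=8,
    window=window_size_at(step=0, w_start=4, w_end=8, alpha=0.5),
)
```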
Experimental Setup and Quantitative Results
The empirical evaluation of SkyLadder was conducted using two primary model configurations:
- 1B-parameter models pretrained with up to a 32K context window,
- 3B-parameter models pretrained with up to an 8K context window.
Both configurations were trained on a total of 100B tokens, providing extensive data to support the following quantitative observations:
- Performance Gains on Standard Benchmarks:
- ARC-Easy: +7.4%,
- MMLU: +2.5%,
- HellaSwag: +4%.
These results indicate that the controlled increase in context window size does not compromise, and can in fact enhance, performance on tasks that do not explicitly require long-context capabilities.
- Long-Context Task Performance:
On long-context tasks such as Multi-Document QA (MDQA) and RULER, SkyLadder consistently matched or exceeded baseline methods. It retained the ability to process long sequences effectively while benefiting from the training efficiency of the initial short-context phase.
- Training Efficiency:
The efficiency gains reached up to 22% faster training with the largest (32K) context window configuration, and approximately 13% for the 8K configuration. These savings stem from the early training phases, in which shorter attention windows reduce per-step attention compute and memory.
- Attention Pattern Analysis:
SkyLadder was shown to reduce attention entropy and to slow the emergence of attention sinks. These characteristics indicate a more controlled evolution of attention distributions across layers, which in practice correlates with more robust downstream task performance.
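For readers who want to reproduce this kind of analysis, the sketch below shows one standard way to estimate attention entropy from attention weight matrices; it illustrates the metric itself, not the paper's exact measurement protocol.

```python
# Illustrative only: mean entropy of attention distributions (rows of the
# attention matrix). Lower values indicate more concentrated attention.
import torch


def attention_entropy(attn: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    """attn: (..., num_heads, seq_len, seq_len), rows summing to 1.
    Returns the mean entropy over heads and query positions (in nats)."""
    entropy = -(attn * (attn + eps).log()).sum(dim=-1)  # entropy per query row
    return entropy.mean()


# Example: uniform attention over 8 keys has entropy log(8) ≈ 2.08 nats.
uniform = torch.full((1, 1, 8, 8), 1.0 / 8)
print(attention_entropy(uniform))  # ~2.079
```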
Practical Implications for Implementation
For practitioners considering the integration of SkyLadder into pretraining pipelines, the following points are of particular note:
- Architectural Modifications:
Implementing SkyLadder requires modifications to the pretraining loop to accommodate dynamic masking. The adaptation involves adjusting the attention mask based on the current training step as dictated by the schedule w(t).
- Computational Considerations:
The strategy offers significant advantages in computational efficiency. The initial phase with shorter context windows reduces per-step memory and compute requirements, allowing for potential scaling benefits in large-scale distributed training frameworks.
Key hyperparameters include the starting context window w_s, the ending context window w_e, and the expansion rate α. These can be tuned to the target application: workloads that demand strong long-context capabilities call for a larger w_e, while workloads dominated by local dependencies may benefit from a slower expansion schedule (see the sketch after this list).
- Integration with Existing Codebases:
The method has been implemented in the TinyLlama codebase, making it accessible for researchers who wish to test the concept on both pre-trained and custom models. The GitHub repository provided (https://github.com/sail-sg/SkyLadder) includes detailed instructions and sample implementations.
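As referenced above, the following sketch shows one way the schedule hyperparameters could be wired together: α is derived from w_s, w_e, and the number of steps over which the expansion should complete. The concrete values (100K steps, expansion over the first 80% of training, a 256-to-32K window) are assumptions for illustration, not settings taken from the paper.

```python
# Illustrative sketch: derive the expansion rate alpha so that w(t) reaches w_e
# after `expansion_steps` steps. All numeric values below are assumed examples.
def expansion_rate(w_start: int, w_end: int, expansion_steps: int) -> float:
    return (w_end - w_start) / expansion_steps


total_steps = 100_000                      # assumed number of optimizer steps
expansion_steps = int(0.8 * total_steps)   # assume expansion finishes at 80% of training
alpha = expansion_rate(w_start=256, w_end=32_768, expansion_steps=expansion_steps)
# With these assumed numbers, alpha ≈ 0.406 extra tokens of context per step, so
# w(t) grows from 256 to 32K over the first 80% of training and stays at 32K after.
```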
Experimental results demonstrate that the improvements persist when scaling to larger model sizes. For industrial applications requiring models with extended context lengths, SkyLadder provides a viable and efficient alternative to traditional long-context pretraining methods.
Conclusion
SkyLadder offers a technically robust, empirically validated strategy for reconciling the trade-off between pretraining efficiency and long-context capability. By expanding the context window in a controlled, stepwise fashion over the course of training, the method achieves significant improvements in both training speed (up to 22% faster) and benchmark performance (up to 3.7% accuracy gains), alongside robust results on challenging long-context tasks. This provides a pragmatic path for pretraining large-scale LLMs in settings where computational resources are a critical bottleneck, without sacrificing performance on end tasks.