Horizon-Length Prediction in Sequential Models
- Horizon-Length Prediction (HLP) is a method that estimates the remaining steps or tokens in a sequence, enabling more coherent long-context planning in generative tasks.
- It is implemented as an auxiliary training objective that predicts a normalized token fraction, adding minimal computational overhead while improving context alignment.
- Empirical results demonstrate significant performance gains in code infilling and reasoning tasks, with improvements up to 24% in key accuracy metrics.
Horizon-Length Prediction (HLP) refers to the explicit modeling, estimation, or learning of the number of steps, tokens, or states remaining until a particular target, event, or boundary is reached in a sequential process. This concept has become foundational in tasks where standard next-step models lack an inductive bias to plan over long contexts, especially when open-ended, variable, or task-specific horizons arise. HLP has seen systematic development and evaluation in code generation with fill-in-the-middle (FIM) architectures, event forecasting, high-dimensional control, and beyond.
1. Motivation and Problem Setting
The classical fill-in-the-middle (FIM) paradigm in code LLMs demands generating a missing code span (“middle”) given both left and right contexts. In conventional FIM, transformer-based models are trained using next-token prediction (NTP) over sequences reordered as prefix/suffix/middle (<pre> ... <suf> ... <mid> ... <eoi>). However, this approach leaves models agnostic to the number of middle tokens that must be produced before emission of the end-of-insertion (<eoi>) marker, leading to uncertain or incoherent boundaries between generated code and the provided suffix.
Previous work relied on brittle, dataset-specific post-processing (e.g., forcing output span to match the ground-truth line count), but such methods fail in open-domain code infilling. NTP alone lacks an explicit planning signal for “how much further” to generate, which is essential for seamless integration of infilled content with arbitrary contexts (Ding et al., 2024).
2. Horizon-Length Prediction Objective and Training
Horizon-Length Prediction is introduced as an auxiliary training objective that operationalizes lookahead planning for FIM tasks. The principle is to induce the model to predict, at each decoding step within the “middle,” the normalized remaining fraction of tokens before <eoi> should be emitted.
For a middle span of length $L$ and current within-span index $i$ (with $0 \le i < L$), the horizon label is
$$y_i = \frac{L - i}{L}.$$
A dedicated linear projection (hlp_head) is attached to the transformer's final hidden state $h_i$ at each position $i$, producing a predicted horizon $\hat{y}_i$ trained via a regression loss:
$$\mathcal{L}_{\text{HLP}} = \frac{1}{L} \sum_{i=0}^{L-1} \left( \hat{y}_i - y_i \right)^2.$$
The total loss is
$$\mathcal{L} = \mathcal{L}_{\text{NTP}} + \alpha \, \mathcal{L}_{\text{HLP}},$$
where $\mathcal{L}_{\text{NTP}}$ is the standard next-token cross-entropy and the weight $\alpha$ is set to balance the two contributions. The normalized label formulation allows natural generalization across variable-length infills and window sizes, requiring no binning or discretization.
Crucially, the hlp_head adds only a negligible number of extra parameters and is discarded after training, so HLP imposes no inference-time latency or memory cost (Ding et al., 2024).
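The label and loss formulation above can be sketched concretely. This is a minimal NumPy illustration of the normalized-horizon labels and a mean-squared-error regression loss, assuming the $(L - i)/L$ labeling described in the text; the function names are illustrative, not from the paper's implementation:

```python
import numpy as np

def hlp_labels(L: int) -> np.ndarray:
    """Normalized remaining-horizon labels y_i = (L - i) / L for i = 0..L-1."""
    i = np.arange(L)
    return (L - i) / L

def hlp_loss(pred: np.ndarray, L: int) -> float:
    """Mean-squared-error regression loss against the horizon labels."""
    return float(np.mean((pred - hlp_labels(L)) ** 2))

labels = hlp_labels(4)         # array([1.  , 0.75, 0.5 , 0.25])
perfect = hlp_loss(labels, 4)  # 0.0 for perfect predictions
```

Note that the label at the first middle position is 1 (the entire span remains) and decays linearly toward 0 as the model approaches `<eoi>`, which is what lets one label scheme cover infills of any length.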
3. Architectural and Implementation Details
HLP is instantiated by modifying a standard FIM-trained code LLM (e.g., DeepSeek-Coder, StarCoder2) operating with PSM ordering. During continual pre-training:
- Each code sample is split at a random point into prefix, middle, and suffix, reordered to PSM format.
- Best-Fit Packing combines multiple files per sequence while masking cross-file attention for context diversity.
- Half of the data undergoes FIM reordering with HLP; the rest is used as standard L2R next-token prediction.
- Training uses AdamW optimizer, a cosine LR schedule, and batch size 512 across model families (model sizes 1.3B/6.7B/3B/7B).
No rule-based truncation or external information about ground-truth middle length is provided at inference or evaluation.
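The data-preparation step above (random split into prefix/middle/suffix, then PSM reordering) can be sketched as follows. The sentinel strings and split logic are assumptions for illustration, not the paper's exact tokenizer-level implementation:

```python
import random

# Hypothetical sentinel tokens matching the notation used in this article.
PRE, SUF, MID, EOI = "<pre>", "<suf>", "<mid>", "<eoi>"

def make_psm_example(code: str, rng: random.Random):
    """Split a code sample at two random points and reorder to PSM format."""
    a, b = sorted(rng.sample(range(len(code) + 1), 2))
    prefix, middle, suffix = code[:a], code[a:b], code[b:]
    # PSM ordering: the model conditions on prefix and suffix, then
    # generates the middle followed by the end-of-insertion sentinel.
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}{EOI}", len(middle)

text, middle_len = make_psm_example(
    "def add(a, b):\n    return a + b\n", random.Random(0))
```

The returned middle length is exactly what the HLP labels are normalized by during training.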
4. Empirical Evaluation and Quantitative Results
HLP’s impact is systematically measured across a diverse set of FIM and code-reasoning benchmarks:
- SAFIM: Pass@1 for execution-based syntactic infill tasks over 4 languages, up to 17,720 examples.
- CrossCodeEval/CrossCodeLongEval: File- and repository-level Python infills, with Exact Match (EM) and Edit Similarity (ES) metrics.
- RepoEval: Multi-task Python infills across 32 repositories.
- Defects4J: Java code repair tasks; metric: number of plausible patches that pass tests.
- CRUXEval: Input/output code reasoning (CRUXEval-I/O; ~800 Python functions).
Table: Representative HLP Improvements

| Task / Metric                | Baseline NTP | HLP-augmented | Relative Gain |
|------------------------------|:------------:|:-------------:|:-------------:|
| SAFIM pass@1 (DeepSeek-1.3B) | 47.7%        | 50.0%         | +5%           |
| CrossCodeLongEval EM         | 15.2%        | 19.0%         | +24%          |
| Defects4J repairs            | 41           | 47            | +18%          |
| CRUXEval accuracy            | —            | up to +6%     | up to +6%     |
Relative gains up to 24% EM and 9% ES are observed for line-level repository infilling. HLP consistently boosts both coding (infill) and reasoning performance—without any reliance on externally supplied target lengths (Ding et al., 2024).
5. Analysis of Planning Capability and Ablation
To probe whether traditional NTP imparts any implicit horizon awareness, linear regressors were fit to predict the true normalized horizon from hidden states across 7.8M infill tokens:
- NTP-only: the probe recovers little of the variance in the true horizon, indicating that hidden states trained with NTP alone encode almost no horizon information.
- NTP+HLP: the probe predicts the horizon accurately, on both training and held-out data.
This demonstrates that explicit horizon supervision is necessary for planning. A qualitative example reveals that, in the absence of HLP, the model terminates infill prematurely (e.g., emitting a function call too early), losing synchronicity with the right context. HLP-trained models align insertion precisely, “bridging” into the suffix without dataset-specific logic (Ding et al., 2024).
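The probing methodology can be illustrated with a self-contained toy version: fit an ordinary-least-squares regressor from hidden states to the horizon target and report $R^2$. The synthetic data below stands in for real transformer activations and is constructed so the horizon is linearly decodable, mirroring the NTP+HLP case; the paper instead probes actual hidden states over millions of infill tokens:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 16
h = rng.normal(size=(n, d))        # stand-in "hidden states"
w_true = rng.normal(size=d)
y = h @ w_true                     # horizon signal linearly present in h

X = np.hstack([h, np.ones((n, 1))])          # add a bias column
w, *_ = np.linalg.lstsq(X, y, rcond=None)    # ordinary least squares probe
y_hat = X @ w
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)  # ~1.0 here
```

With NTP-only representations, the analogous probe would yield a low $R^2$, since the horizon signal is simply not encoded in the hidden states.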
6. Computational Efficiency and Scalability
Adding HLP incurs only negligible computational overhead at training time (a tiny parameter increase and minimal per-batch backpropagation cost through the hlp_head). There is zero runtime penalty at inference, as the planning head is discarded. All gains are obtained without any degradation of inference speed or memory, ensuring real-world deployability for large-window generative code models.
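A back-of-the-envelope estimate makes the "negligible overhead" claim concrete, assuming the head is a single linear projection from the final hidden state to one scalar. The hidden size and model size below are illustrative placeholders, not the paper's exact configurations:

```python
# Parameter overhead of a linear hlp_head (hidden_size -> 1 scalar).
hidden_size = 4096                          # illustrative hidden dimension
model_params = 6.7e9                        # e.g. a ~6.7B-parameter backbone
hlp_head_params = hidden_size + 1           # weight vector plus bias
overhead = hlp_head_params / model_params   # roughly 6e-7, i.e. negligible
print(f"extra parameters: {hlp_head_params} ({overhead:.2e} of the model)")
```

Because this head exists only during training and is dropped before deployment, even this vanishing fraction never reaches the serving path.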
HLP generalizes to arbitrary context window sizes due to the normalized label construction and is compatible with dynamic batching and packing strategies relevant for scalable code LLMs (Ding et al., 2024).
7. Broader Significance and Implications
HLP constitutes a general mechanism for endowing transformer-based sequence models with horizon-awareness, crucial for long-horizon planning and “lookahead” in generative modeling. It eliminates the need for brittle dataset-specific post-processing and aligns model behavior with open-domain infilling demands, where the number and structure of missing tokens are intrinsically unknown a priori.
By explicitly encoding remaining-horizon signals, HLP enables code LLMs to plan coherent insertions aligned with both left and right contexts—a fundamental advance in supporting programmable, context-sensitive generation. The utility of HLP is attested by consistent and substantial improvements across multiple model families, dataset granularities, and infilling tasks, with no trade-off in inference performance or generality (Ding et al., 2024).