
Future Summary Prediction for LLMs

Updated 17 October 2025
  • Future Summary Prediction (FSP) is an auxiliary training paradigm that predicts a compressed summary of upcoming tokens to capture long-range dependencies and global context.
  • It leverages either a handcrafted bag-of-words or a learned summary via a reverse language model to improve planning, coherence, and parameter efficiency.
  • Empirical studies show that FSP boosts performance across reasoning, coding, and planning tasks by 2-5% compared to traditional next-token and multi-token prediction methods.

Future Summary Prediction (FSP) is an auxiliary training paradigm for LLMs designed to overcome the limitations of standard next-token prediction (NTP) and multi-token prediction (MTP) objectives by enabling models to anticipate and encode a compact representation of long-horizon future content. FSP shifts the training problem from local, teacher-forced sequence modeling toward globally attentive, summary-based supervision that preserves information relevant to the future structure of generated sequences, including long-term dependencies and global context. This approach is motivated by the observation that NTP and MTP inherently foster short-range correlations, limiting the ability of models to reason over extended contexts, plan, and maintain coherence in long-form generative tasks (Mahajan et al., 16 Oct 2025).

1. Motivation and Theoretical Framework

NTP has served as the foundational objective underlying most modern LLM pretraining: models are updated via teacher-forced supervision to predict $x_{t+1}$ conditioned on the tokens $x_{\leq t}$. While effective for learning local grammatical consistency, this strategy introduces "exposure bias" and encourages shortcut solutions that privilege immediate patterns, hindering performance on generation that requires coherence across long segments, multi-step reasoning, or planning. MTP partially addresses these issues by predicting blocks of future tokens, but its improvements are limited: it typically decomposes the joint future prediction into independent predictions over a small window, failing to elicit awareness of global structure.

FSP reconceptualizes the prediction target: instead of attempting to generate each immediate token or even a local block, the model is required, via an auxiliary head, to predict a summary vector that compresses a long segment of the future sequence. Formally, at position $t$, FSP seeks to predict a summary representation $a(t, \tau)$ capturing salient attributes of $x_{t+2}, \ldots, x_{t+\tau}$, where $\tau$ is a fixed or variable horizon. This summary can be a handcrafted bag-of-words representation or a learned, dense embedding produced by a reverse LLM (RevLM) operating right-to-left over the future sequence. By training to minimize the loss $\ell_a(A_\varphi(x_{\leq t}), a(t,\tau))$ alongside the standard token-level loss, the model is explicitly incentivized to encode long-range future characteristics in its hidden states.

This auxiliary task provides global supervision, introduces an information bottleneck that forces abstraction, and ensures the model retains future-relevant features that are not reducible to token-local dependencies.

2. Implementation Strategies

Two concrete variants of FSP are examined:

a. Handcrafted (Bag-of-Words) Summary (FSP-BCE)

  • For a vocabulary of size $V$, a multi-hot vector $a(t, \tau) \in \{0,1\}^V$ is constructed, where $a_i(t, \tau) = 1$ iff token $i$ appears in $\{x_{t+2}, \ldots, x_{t+\tau}\}$.
  • The model attaches an auxiliary head $A_\varphi$ after the backbone transformer, parameterized independently of the NTP head.
  • A reweighted binary cross-entropy loss (with per-token weights $w(i)$, e.g., tf-idf) is used to encourage learning of informative tokens and avoid dominance by high-frequency but uninformative entries:

$$\ell_a(A_\varphi(x_{\leq t}), a(t, \tau)) = -\sum_{i=1}^{V} w(i)\big[a_i \log \sigma(z_i) + (1 - a_i)\log (1 - \sigma(z_i))\big]$$

where $z_i$ is the output logit for token $i$ and $\sigma$ is the sigmoid function.
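
A minimal PyTorch sketch of this objective follows; the helper names (`build_multihot_targets`, `fsp_bce_loss`) and the choice of tf-idf weights are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def build_multihot_targets(tokens: torch.Tensor, tau: int, vocab_size: int) -> torch.Tensor:
    """Multi-hot bag-of-words targets a(t, tau) for every position t.

    tokens: (batch, seq_len) integer token ids.
    Returns (batch, seq_len, vocab_size) with entry i set to 1 iff token i
    appears in the future window x_{t+2..t+tau}.
    """
    batch, seq_len = tokens.shape
    targets = torch.zeros(batch, seq_len, vocab_size)
    for t in range(seq_len):
        window = tokens[:, t + 2 : t + tau + 1]      # future window x_{t+2..t+tau}
        if window.numel() > 0:
            targets[:, t].scatter_(1, window, 1.0)   # mark tokens present in the window
    return targets

def fsp_bce_loss(aux_logits: torch.Tensor, targets: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Reweighted binary cross-entropy over the vocabulary.

    aux_logits: (batch, seq_len, vocab_size) outputs z_i of the auxiliary head A_phi.
    w: (vocab_size,) per-token weights, e.g. tf-idf, to down-weight frequent tokens.
    """
    per_token = F.binary_cross_entropy_with_logits(aux_logits, targets, reduction="none")
    return (w * per_token).sum(dim=-1).mean()
```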

b. Learned Summary via Reverse LLM (FSP-RevLM)

  • An independent RevLM is trained right-to-left on the future tokens $x_{t+2}, \ldots, x_T$ to produce a hidden state (summary vector) $a(t, T-t) = g_h \circ g_s(x_{\geq t+2})$.
  • The FSP head $A_\varphi(x_{\leq t})$ is trained to minimize the squared L2 distance to this learned future summary:

$$\ell_a(A_\varphi(x_{\leq t}), a(t, T-t)) = \|A_\varphi(x_{\leq t}) - a(t, T-t)\|_2^2$$

  • This approach enables adaptive, information-efficient summarization of the future, as the RevLM can prioritize non-redundant, globally relevant features.
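
A hedged sketch of this learned-summary variant, assuming the RevLM exposes its per-position hidden states and that its targets are detached from the gradient (the alignment helper below is illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

def revlm_future_summaries(revlm, tokens: torch.Tensor) -> torch.Tensor:
    """Hidden states of a right-to-left LM, aligned so position t holds a summary of x_{t+2..T}."""
    rev_hidden = revlm(tokens.flip(dims=[1]))        # read the sequence right-to-left
    rev_hidden = rev_hidden.flip(dims=[1])           # re-align to left-to-right positions
    # Shift left by 2 so position t sees the summary of x_{t+2..T}.
    # (The last two positions wrap around and should be masked out in practice.)
    return torch.roll(rev_hidden, shifts=-2, dims=1)

def fsp_revlm_loss(aux_summary: torch.Tensor, revlm_summary: torch.Tensor) -> torch.Tensor:
    """L2 loss between the FSP head output A_phi(x_{<=t}) and the RevLM future summary.

    aux_summary:   (batch, seq_len, d) predictions from the auxiliary head.
    revlm_summary: (batch, seq_len, d) summaries from revlm_future_summaries, used as fixed targets.
    """
    return F.mse_loss(aux_summary, revlm_summary.detach())
```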

The overall training objective for both approaches is:

$$L_{\text{FSP}}(X, P_\theta) = L_{\text{NTP}}(X, P_\theta) + \mathbb{E}_{x \sim \mathcal{P}_x}\big[\ell_a(A_\varphi(x_{\leq t}), a(t,\tau))\big]$$

where $L_{\text{NTP}}$ is the traditional next-token log-likelihood loss.

Notably, unlike MTP (which typically requires a separate decoder head for each predicted future token), FSP requires only a single auxiliary head, scaling efficiently to large $\tau$.
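
A schematic training step under the combined objective might look as follows; the `model` interface returning both NTP logits and an auxiliary output, and the `aux_loss_fn` hook (either `fsp_bce_loss` or `fsp_revlm_loss` from the sketches above), are assumptions for illustration.

```python
import torch.nn.functional as F

def fsp_training_step(model, tokens, aux_targets, aux_loss_fn):
    """One step of the combined objective L_FSP = L_NTP + E[l_a]."""
    ntp_logits, aux_out = model(tokens[:, :-1])      # shared backbone, two heads
    # Standard next-token cross-entropy under teacher forcing.
    ntp_loss = F.cross_entropy(
        ntp_logits.reshape(-1, ntp_logits.size(-1)),
        tokens[:, 1:].reshape(-1),
    )
    # Auxiliary summary loss against the precomputed targets (multi-hot or RevLM summaries).
    aux_loss = aux_loss_fn(aux_out, aux_targets)
    return ntp_loss + aux_loss
```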

3. Empirical Findings and Benchmark Results

Large-scale experiments at the 3B and 8B parameter scales demonstrate robust, consistent improvements of FSP over NTP and MTP across diverse domains:

  • On general reasoning (ARC-Easy), FSP-RevLM achieves 76.6% accuracy at 8B scale (NTP: 71.8%, MTP: 73.6%).
  • In coding (MBPP, HumanEval+), math reasoning (GSM8K, Math-500), and synthetic, long-horizon planning tasks, FSP delivers accuracy gains in the 2-5% range.
  • FSP-BCE (bag-of-words) achieves perfect generalization on synthetic graph reasoning tasks where both NTP and MTP fail, indicating its effectiveness in long-range planning and information abstraction scenarios.
  • FSP-RevLM shows enhanced output diversity and faster convergence, particularly as it is able to focus summary prediction on the most informative future features, filtering out irrelevant tokens.

These results validate the theoretical expectation that FSP improves a model’s ability to encode, plan, and reason over extended contexts, leading to enhanced performance on benchmarks sensitive to global dependency modeling.

4. Architectural Characteristics

The underlying architecture in each case retains a standard transformer backbone with the NTP head:

$$P_\theta(x_{t+1} \mid x_{\leq t}) = \mathrm{softmax}\big(f_u \circ f_h \circ f_s(x_{\leq t})\big)$$

FSP augments this with an auxiliary head:

$$A_\varphi(x_{\leq t}) = f'_{h(a)} \circ f_s(x_{\leq t})$$

For the handcrafted summary, this head outputs logits over the vocabulary that are matched to the multi-hot target; for the learned summary, it outputs a dense vector embedding that is matched to the RevLM future summary.

Compared to MTP, which requires multiple auxiliary heads to cover each position in a future block, FSP's single summary head is more parameter-efficient when scaling to large future windows and provides more supervision on long-range dependencies per unit of compute.
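
A minimal module sketch of this arrangement is shown below; collapsing $f_u \circ f_h$ into a single linear layer and using a single linear projection for $A_\varphi$ are simplifying assumptions, and the paper's exact head parameterization may differ.

```python
import torch
import torch.nn as nn

class FSPTransformer(nn.Module):
    """Shared backbone f_s with an NTP head and a single auxiliary FSP head A_phi."""

    def __init__(self, vocab_size: int, d_model: int, summary_dim: int, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone                          # f_s: any causal transformer stack
        self.ntp_head = nn.Linear(d_model, vocab_size)    # stands in for f_u . f_h
        # One head regardless of the horizon tau; summary_dim is the vocabulary size
        # for FSP-BCE, or the RevLM hidden size for FSP-RevLM.
        self.fsp_head = nn.Linear(d_model, summary_dim)   # stands in for f'_{h(a)}

    def forward(self, tokens: torch.Tensor):
        h = self.backbone(tokens)                         # (batch, seq_len, d_model)
        return self.ntp_head(h), self.fsp_head(h)
```

Because the summary head is a single projection shared across the horizon, its parameter count does not grow with $\tau$, in contrast to MTP's per-position heads.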

5. Applications and Implications

FSP is designed to enhance long-horizon generative abilities in LLMs, targeting applications where global structure is critical:

  • Creative writing and storytelling: By enforcing future-aware encoding, FSP facilitates narrative arc coherence, planning, and referential consistency across long-form text.
  • Long-form and technical document generation: FSP-trained models can better maintain structure, avoid contradiction or topical drift, and enforce planning across sections.
  • Mathematical and logical reasoning: Multi-step deduction, chain-of-thought solutions, and tasks requiring intermediate planning are improved by global future supervision.
  • Program synthesis and code completion: Hierarchical and modular structure in code benefits from models that can anticipate requirements and maintain cross-block relationships, reducing compounding errors from autoregressive sampling.

The summary-based auxiliary task also displays robust generalization beyond standard generative settings, yielding improvements in synthetic reasoning and planning tasks that challenge the local dependency modeling of NTP and MTP.

6. Limitations and Future Research Directions

The paper identifies several areas for further development:

  • Refinement of summary encoders, such as revising RevLM architectures or training methods to further emphasize abstract, higher-level dependencies while suppressing noisy or redundant future content.
  • Exploration of dynamic summary windows or adaptive loss functions that tailor the summarization horizon to context, sequence type, or downstream objectives.
  • Testing at model scales larger than 8B to investigate scaling trends.
  • Broadened applications to dialogue, interactive planning agents, and fine-grained, structured outputs.
  • Integration of FSP objectives with curriculum learning or further auxiliary objectives to enhance the model’s supervision signal and environmental adaptability.

By shifting the pretraining regime from local token-level supervision toward auxiliary summary-based objectives, FSP represents a principled advance in the statistical and practical modeling of global sequence dependencies, planning, and abstraction within LLMs (Mahajan et al., 16 Oct 2025).
