
End-of-Text Prediction in Language Models

Updated 30 September 2025
  • End-of-Text Prediction (EoTP) is a mechanism that defines when generative systems, such as language or ASR models, should stop output using explicit tokens like EOS and EOP.
  • Techniques including EOS token weighting and joint ASR endpointing improve output quality by reducing perplexity and latency while maintaining sequence integrity.
  • Research on EoTP balances accurate stopping with challenges like length extrapolation and 'length attractors', guiding innovations in both model training and inference.

End-of-Text Prediction (EoTP) refers to the problem of determining or modeling the boundary at which a generative system, such as a large language model (LLM) or an automatic speech recognition (ASR) system, should stop producing output. This encompasses both explicit mechanisms (like special “end-of-sequence” or EOS tokens) and implicit approaches for demarcating termination in generated text, paragraphs, utterances, or queries. The challenge is central to natural language generation, summarization, dialog systems, and speech applications, where both the content and structure of output are constrained by end-of-text behavior.

1. Modeling End-of-Text: Tokens and Probabilistic Frameworks

In auto-regressive LLMs, EoTP is articulated through token-based mechanisms. The general probabilistic factorization for a sequence $w_{1:T}$, given a context $W_0$, is

$$P(w_{1:T} \mid W_0) = \prod_{t=1}^{T} P(w_t \mid w_{1:t-1}, W_0)$$

Typically, the end-of-sequence (EOS) token determines when generation halts: decoding stops once the model assigns a sufficiently high value to $P(w_T = \text{EOS} \mid w_{1:T-1}, W_0)$ (Bai et al., 2020). For multi-paragraph content, end-of-paragraph (EOP) tokens further structure the text, giving models explicit cues to break and conclude sub-segments.
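
To make the stopping rule concrete, the sketch below implements a minimal greedy decoding loop that halts as soon as the EOS token becomes the most probable continuation. The toy logit function, token ids, vocabulary size, and step cap are illustrative placeholders, not any particular model's API.

```python
import random

EOS_ID = 0            # hypothetical id of the end-of-sequence token
VOCAB_SIZE = 100      # toy vocabulary size
MAX_STEPS = 256       # hard cap so decoding always terminates

def toy_next_token_logits(prefix):
    """Stand-in for the model's next-token scores P(w_t | w_{1:t-1}, W_0):
    a toy 'LM' whose EOS logit grows with sequence length."""
    logits = [random.gauss(0.0, 1.0) for _ in range(VOCAB_SIZE)]
    logits[EOS_ID] = 0.5 * len(prefix) - 5.0   # EOS becomes likely as the prefix grows
    return logits

def greedy_generate(context_ids, next_token_logits=toy_next_token_logits):
    """Greedy decoding that halts when EOS is the argmax of the next-token distribution."""
    seq = list(context_ids)
    for _ in range(MAX_STEPS):
        logits = next_token_logits(seq)
        next_id = max(range(len(logits)), key=logits.__getitem__)
        if next_id == EOS_ID:      # the model has signalled end-of-text
            break
        seq.append(next_id)
    return seq[len(context_ids):]

print(greedy_generate([42, 7, 13]))   # generated token ids, EOS excluded
```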

In encoder-decoder ASR systems, EoTP can be realized via a dedicated EOS token, attention-based termination, or frame-level endpointing signals. These systems may blend linguistic and acoustic streams to determine when the speaker's contribution is over (Bijwadia et al., 2022, Zink et al., 30 Sep 2024).

2. Effects of EOS and EOP Tokenization on Generation and Metrics

Explicit inclusion of EOS and EOP tokens significantly impacts the semantic and formal quality of generated text. In GPT-2-style models, EOP tokens yield improvements across metrics such as BLEU, truncated BLEU (T-BLEU), token-level perplexity, and EOS perplexity. For example, when Chinese-GPT2 is fine-tuned with both EOS and EOP tokens, the proportion of generated texts ending with the correct terminal EOS increases from 76.41% (EOS only) to 93.07%, and EOS perplexity drops from 22.15 to 2.74. BLEU and distinctiveness metrics also improve in tandem (Bai et al., 2020).

For English models pre-trained with newlines serving as rough paragraph dividers, adding explicit EOP tokens during fine-tuning further reduces generation perplexity and boosts BLEU scores. Truncating generations to ground-truth length (T-BLEU) confirms these gains are robust and not merely artifacts of output length.

Configuration | EOS Ending % | EOS Perplexity | BLEU Score ↑
EOS only      | 76.41        | 22.15          | Baseline
EOS + EOP     | 93.07        | 2.74           | Improved
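
To illustrate how such boundary tokens can be introduced, the sketch below inserts paragraph-level and sequence-level markers into raw text before tokenization. The `<EOP>`/`<EOS>` strings, the blank-line paragraph convention, and the function name are assumptions for illustration, not the preprocessing used in the cited work.

```python
def mark_boundaries(document, eop_token="<EOP>", eos_token="<EOS>"):
    """Insert explicit paragraph and sequence boundary markers before tokenization.

    Paragraphs are assumed to be separated by blank lines; each paragraph is
    closed with an EOP marker and the whole document with a single EOS marker,
    mirroring the EOS+EOP fine-tuning setup described above.
    """
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    return " ".join(f"{p} {eop_token}" for p in paragraphs) + f" {eos_token}"

example = "First paragraph of a story.\n\nSecond and final paragraph."
print(mark_boundaries(example))
# First paragraph of a story. <EOP> Second and final paragraph. <EOP> <EOS>
```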

3. Inductive Effects: Length Extrapolation, Manifolds, and Attractors

Prediction of EOS tokens introduces inductive biases. Models forced to predict explicit EOS tokens develop “length manifolds”: their hidden states stratify by absolute position, impeding generalization to sequences much longer (or shorter) than those seen during training. These models also tend towards “length attractors,” with hidden state trajectories clustering once the EOS token probability peaks, thereby halting meaningful continuation (Newman et al., 2020).

By contrast, models that do not predict EOS tokens (-EOS models) avoid these artifacts and extrapolate better to longer outputs. In bracket closing (Dyck) and SCAN generalization tasks, -EOS models outperform +EOS models by up to 40% in exact match accuracy and maintain performance for sequences up to 10× training length. These findings point to a trade-off: explicit EOS tokens facilitate probabilistically correct stopping, but may impair generalization when output lengths diverge from those encountered in training data.
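
A minimal sketch of the data-construction difference between +EOS and -EOS training follows; the toy bracket sequence and the oracle-length stopping assumption for the -EOS case are illustrative, not the exact experimental setup of the cited study.

```python
def make_targets(token_seq, use_eos, eos_token="<EOS>"):
    """Build next-token prediction targets with or without an explicit EOS symbol.

    +EOS variant: the model must also predict <EOS> at the final position.
    -EOS variant: the target stops at the last content token; at inference the
    sequence is cut off by an external criterion (e.g. oracle length or a
    task-specific stopping rule) instead of a learned stop token.
    """
    inputs = list(token_seq)
    targets = list(token_seq[1:])
    if use_eos:
        targets.append(eos_token)
    else:
        inputs = inputs[:-1]          # last input has no target without EOS
    return inputs, targets

seq = ["(", "(", ")", ")"]            # toy Dyck-style bracket sequence
print(make_targets(seq, use_eos=True))   # (['(', '(', ')', ')'], ['(', ')', ')', '<EOS>'])
print(make_targets(seq, use_eos=False))  # (['(', '(', ')'], ['(', ')', ')'])
```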

4. End-of-Text Prediction in Speech Systems

In ASR and dialog systems, EoTP is closely tied to endpointing, where models must rapidly and accurately determine when a user’s utterance or query has ended. Unified end-to-end approaches train ASR and endpointing jointly, often sharing encoder representations. Frame-level endpointing models classify each frame as speech, silence, or query end, optimizing a multitask loss

$$\mathcal{L}_{\text{multi}} = \lambda\,\mathcal{L}_{\text{ASR}} + (1-\lambda)\,\mathcal{L}_{\text{EP}}$$

with “switch” connections that allow the endpointing head to condition on either raw audio frames or latent ASR features. This design yields faster and more accurate end-of-query detections (e.g., 30.8% and 23.0% reductions in median and 90th-percentile latency, respectively) without degrading WER (Bijwadia et al., 2022).
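
The sketch below shows one way such a weighted multitask objective could be assembled; the use of CTC for the ASR branch, per-frame cross-entropy for the endpointer, and the tensor shapes are assumptions for illustration rather than the cited system's exact losses.

```python
import torch
import torch.nn.functional as F

def joint_asr_endpoint_loss(asr_log_probs, asr_targets, asr_input_lens, asr_target_lens,
                            ep_logits, ep_frame_labels, lam=0.8):
    """Weighted multitask objective L_multi = lam * L_ASR + (1 - lam) * L_EP.

    asr_log_probs: (T, B, V) log-probabilities for the ASR branch (CTC assumed here).
    ep_logits: (B, T, C) frame-level logits over {speech, silence, query-end}.
    """
    # ASR branch: CTC loss over the recognition hypothesis.
    l_asr = F.ctc_loss(asr_log_probs, asr_targets, asr_input_lens, asr_target_lens)
    # Endpointer branch: per-frame classification loss.
    l_ep = F.cross_entropy(ep_logits.reshape(-1, ep_logits.size(-1)),
                           ep_frame_labels.reshape(-1))
    return lam * l_asr + (1.0 - lam) * l_ep
```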

For predictive dialog systems, masking future segments during training encourages the decoder to fill in upcoming words and anticipate the end of utterance (EOU). Cross-attention mechanisms localize the probable EOU using attention-maximum thresholds. Models trained with such anticipatory objectives reduce EOU error by more than half compared to baselines and can make confident end-of-utterance predictions up to 300 ms before the true endpoint (Zink et al., 30 Sep 2024).
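
As a rough illustration of attention-based EOU localization, the sketch below thresholds the maximum of a cross-attention distribution over audio frames; the 40 ms frame shift, the threshold value, and the function interface are hypothetical.

```python
import numpy as np

def locate_eou(attention_weights, frame_shift_ms=40.0, threshold=0.5):
    """Localize a probable end-of-utterance from decoder cross-attention weights.

    attention_weights: 1-D array of attention over audio frames for the
    predicted EOU position. Returns (estimated EOU time in ms, confidence),
    or (None, confidence) if the attention peak is below the threshold.
    """
    peak = int(np.argmax(attention_weights))
    confidence = float(attention_weights[peak])
    if confidence < threshold:
        return None, confidence            # not yet confident the utterance is ending
    return peak * frame_shift_ms, confidence

att = np.array([0.01, 0.02, 0.05, 0.12, 0.62, 0.18])
print(locate_eou(att))   # (160.0, 0.62)
```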

5. Methods for Controlling Generation Length and Termination

Length control in generation is critical for summarization, story generation, and dialogue. One simple, architecture-agnostic approach is to increase the loss weight assigned to the EOS token during training:

$$L_2 = -\frac{R}{N}\sum_{n=1}^{N} w_{y_n}\log\left(\frac{\exp\bigl(x_n^{(y_n)}\bigr)}{\sum_v \exp\bigl(x_n^{(v)}\bigr)}\right)$$

with weighting

$$w_{y_n} = \begin{cases} W & \text{if } y_n = \text{[EOS]} \\ 1 & \text{otherwise} \end{cases}$$

and normalization $R = N / (N + W - 1)$ (Belligoli et al., 5 Jun 2025).

Increasing $W$ tunes the model’s sensitivity to correct stopping: higher values of $W$ cause generation to halt earlier, reducing the likelihood of output exceeding length constraints. This method is compatible with both decoder-only and encoder-decoder LMs and is orthogonal to other decoding-time length-control techniques. Empirically, switching from $W=1$ to $W=10$ substantially reduced the percentage of overly long summaries for T5-base and Llama-2 7B models, with minimal change in ROUGE-2 or BERTScore.
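
A minimal sketch of this weighted objective, assuming PyTorch-style `(N, V)` logits and one EOS target per sequence, is shown below; it follows the $L_2$ formula above but is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def eos_weighted_ce(logits, targets, eos_id, W=10.0):
    """Cross-entropy with an up-weighted EOS target, following L_2 above.

    logits: (N, V) per-token scores; targets: (N,) gold token ids.
    The per-token weight is W at EOS positions and 1 elsewhere, and the loss
    is rescaled by R = N / (N + W - 1) so its magnitude stays comparable to
    unweighted cross-entropy (a sketch of the weighting scheme only).
    """
    n = targets.numel()
    per_token = F.cross_entropy(logits, targets, reduction="none")   # -log p(y_n)
    weights = torch.where(targets == eos_id,
                          torch.full_like(per_token, W),
                          torch.ones_like(per_token))
    R = n / (n + W - 1.0)
    return (R / n) * (weights * per_token).sum()
```

Setting `W=1.0` recovers standard cross-entropy, so the weight can be tuned without changing the rest of the training loop.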

6. Practical Implications, Limitations, and Open Directions

EoTP mechanisms, including EOS/EOP tokenization and their associated weighting or masking strategies, directly affect both the form and content quality of generated sequences. Incorporating paragraph-level boundary tokens fosters improved semantic modularity and more natural endings, as shown by decreases in perplexity and increases in BLEU/T-BLEU metrics (Bai et al., 2020).

However, over-reliance on explicit position tracking for EOS can degrade generalization to out-of-domain lengths (Newman et al., 2020). To mitigate this, integrating complementary strategies such as masking, alternate termination signals, or regularization to break “length attractor” dynamics may be warranted.

In ASR, joint models that combine frame-level, acoustic-based predictions with decoder-driven EoTP improve both latency and recognition accuracy (Bijwadia et al., 2022). Predictive models that anticipate future text or speech not only reduce system response time but also serve as a theoretical bridge to text-based anticipatory completion systems (Zink et al., 30 Sep 2024).

Still, choices around EOS/EOP tokenization, weighting, and architecture carry trade-offs between probabilistic correctness, generalization capacity, latency, and compatibility with pre-trained systems. Selecting or combining these approaches depends on task requirements (e.g., constrained summarization, open-ended generation, real-time dialog, multilingual generation), resource profiles, and desired system behaviors.

7. Summary Table: Selected EoTP Methods and Their Effects

Methodology | Benefits | Caveats / Trade-offs
EOS/EOP token addition (Bai et al., 2020) | ↑ BLEU, ↓ perplexity, improved structure | Dependent on tokenization/marking in pre-training
EOS token weighting (Belligoli et al., 5 Jun 2025) | Controllable output length, architecture-agnostic | High weights may suppress content; minor impact on quality
No-EOS for extrapolation (Newman et al., 2020) | Length generalization, avoids attractors | May lose probabilistic “properness” for stopping
Joint ASR/endpointing (Bijwadia et al., 2022) | ↓ endpoint latency, robust EoTP | Complexity, reliance on shared layers
Predictive masking + cross-attention (Zink et al., 30 Sep 2024) | Anticipatory outputs, early EOU prediction | Complexity in training, application scope

The design of End-of-Text Prediction thus remains a multi-dimensional task, balancing structural, metric, and operational priorities, with ongoing research into optimal mechanisms for different domains and modalities.
