Next-Sentence Prediction in NLP

Updated 2 June 2026

Next-Sentence Prediction (NSP) is a natural language processing task that predicts the subsequent sentence given a textual context.
It employs diverse modeling architectures—such as Transformers, MLPs, and latent-space models—to score and rank candidate continuations efficiently.
NSP applications include pre-training, dialogue evaluation, summarization, and cross-lingual comprehension, driving advances in discourse and coherence modeling.

Next-Sentence Prediction (NSP) is a family of tasks and modeling paradigms in natural language processing that require a system to predict or discriminate the next sentence given a preceding textual context. NSP has evolved to serve as both a self-supervised pre-training objective and as a practical evaluation protocol for coherence, discourse understanding, and long-range planning in machine reading, story generation, dialogue, summarization, and cross-lingual comprehension. NSP formulations and modeling architectures have diverged significantly since their introduction, adapting to advances in contextual representations, efficiency, and end-task requirements.

1. Formal Task Definitions and Variants

The fundamental NSP objective is to estimate, given a context sequence $S = (s_1, ..., s_t)$ , which of a set of candidate sentences $C = \{c_1, ..., c_N\}$ is the true next sentence $s_{t+1}$ . In early forms—exemplified by BERT pre-training—this is a binary classification: given $(A,B)$ , predict if $B$ immediately follows $A$ (IsNext) or is a randomly sampled sentence (NotNext) (Aroca-Ouellette et al., 2020, Sun et al., 2021). The input is passed through a Transformer, with supervision at the $[\text{CLS}]$ token via softmax cross-entropy:

$L_{\mathrm{NSP}} = -\sum_{i} \log p(y_i \mid A_i, B_i)$

where $y_i \in \{\text{IsNext}, \text{NotNext}\}$ is the gold label of the $i$-th sentence pair.
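The binary objective can be illustrated with a minimal NumPy sketch. It assumes the classifier head emits two logits per sentence pair in the column order [IsNext, NotNext]; the function name, the column order, and the example values are illustrative choices, not taken from any cited implementation.

```python
import numpy as np

def binary_nsp_loss(logits, labels):
    """Summed cross-entropy for binary NSP.

    logits: (batch, 2) scores for [IsNext, NotNext] (assumed column order).
    labels: (batch,) integer array, 0 = IsNext, 1 = NotNext.
    """
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    # Log of the two-class softmax normalizer, computed stably.
    log_norm = np.logaddexp(logits[:, 0], logits[:, 1])
    # Logit assigned to the gold label of each pair.
    gold_logit = logits[np.arange(len(labels)), labels]
    # -log p(y_i | A_i, B_i) = log_norm - gold_logit, summed over pairs.
    return float(np.sum(log_norm - gold_logit))

logits = np.array([[2.0, 0.0],   # model favors IsNext
                   [0.0, 0.0]])  # model is undecided
labels = np.array([0, 1])
print(binary_nsp_loss(logits, labels))
```

The undecided pair contributes $\log 2$ to the loss, while the confidently correct pair contributes much less.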

Multi-way NSP generalizes this to more candidates. For instance, in sentence ranking over stories, the model scores all candidate continuations by

$\mathrm{score}(c_i) = e_{c_i}^{\top} h$

using embeddings $e_{c_i}$ for $c_i$ and context vector $h$, inducing a softmax over $C$ (Ippolito et al., 2020).
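As a rough illustration of multi-way scoring, the sketch below ranks candidates by the dot product between a context vector and candidate embeddings, then normalizes the scores with a softmax. The function name and the toy two-dimensional vectors are assumptions made for the example; real systems would use learned embeddings of much higher dimension.

```python
import numpy as np

def rank_candidates(context_vec, cand_embs):
    """Rank candidate sentences for one context.

    context_vec: (d,) context vector.
    cand_embs:   (N, d) candidate sentence embeddings.
    Returns candidate indices from best to worst, and softmax probabilities.
    """
    scores = cand_embs @ context_vec            # (N,) dot-product scores
    shifted = scores - scores.max()             # subtract max for stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    order = np.argsort(-scores)                 # highest score first
    return order, probs

h = np.array([1.0, 0.0])
E = np.array([[0.2, 0.9],
              [0.9, 0.1],
              [-0.5, 0.3]])
order, probs = rank_candidates(h, E)
print(order, probs)
```

Here the second candidate has the largest dot product with the context vector and is therefore ranked first.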

Complex variants include multi-class settings with “Previous Sentence Prediction” (PSP) and document-level distractors (Xu et al., 2019), four-way multiple-choice with trigram constraints (Popov et al., 2019), or autoregressive, latent-space NSP where the next latent embedding is predicted rather than discrete text (Hwang et al., 28 May 2025, An et al., 2024).

In evaluation protocols (e.g., cross-lingual, dialog, summarization), NSP is instantiated as a selection/ranking problem among $N$ fluent options, probing discourse comprehension and narrative or logical continuity (Chavan et al., 29 Oct 2025, Zhao et al., 2023, Liu et al., 2019).

2. Modeling Architectures and Sentence-Level Representations

NSP models extensively leverage pre-trained sentence or contextual embeddings to encode the context and candidates.

Sentence Embedding Based NSP: Context and candidate sentences are mapped to fixed-size embeddings using BERT (mean of penultimate layer outputs), ELMo, or FastText (Ippolito et al., 2020, Popov et al., 2019). The embeddings of the context sentences ($s_1, ..., s_t$) are concatenated into a single input vector and passed to either a multi-layer perceptron (MLP) or residual MLP (resMLP) to compute a context vector $h$. Candidates are scored by the dot product $e_{c_i}^{\top} h$ and ranked via softmax (Ippolito et al., 2020).
Transformer and Latent Models: For latent-space sentence NSP, GPT-2 or other decoder-only models produce sentence embeddings; the downstream model autoregressively predicts future sentence embeddings, optionally training with InfoNCE contrastive terms to align contextual and semantic representations (Hwang et al., 28 May 2025). Generation can operate in discretized (decode to text) or continuous (reason in the latent embedding space) regimes.
Variational Methods: SentenceVAE compresses each sentence via a self-attention encoder into a single vector, reconstructs the sentence with a sentence-level decoder, and feeds these vectors into a sentence-level LLM for NSP (An et al., 2024).
Classification Heads and Prompting: Binary or multi-way NSP classifiers typically use a single-layer or shallow MLP on top of $[\text{CLS}]$ representations (Aroca-Ouellette et al., 2020), or prompt engineering for few/zero-shot downstream adaptation (e.g., NSP-BERT) (Sun et al., 2021).

Architectural choices impact efficiency: pre-computation of sentence embeddings and dynamic batched distractor sampling allow efficient scoring over tens of thousands of alternatives per batch on NSP tasks, in sharp contrast to the computational load of word-level sequence models (Ippolito et al., 2020, An et al., 2024).
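The efficiency argument can be made concrete with a small sketch: once candidate embeddings are precomputed, scoring a batch of contexts against a large pool is a single matrix product, and the best candidates can be extracted without fully sorting the pool. The pool size, dimensionality, and function name below are hypothetical, and random unit vectors stand in for real sentence embeddings.

```python
import numpy as np

def top_k_candidates(context_vecs, pool_embs, k):
    """Return the indices of the k best-scoring pool sentences per context.

    context_vecs: (B, d) context vectors.
    pool_embs:    (P, d) precomputed candidate embeddings, with k <= P.
    """
    scores = context_vecs @ pool_embs.T                     # (B, P) in one matmul
    # argpartition selects the k largest scores per row without a full sort.
    top = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    rows = np.arange(scores.shape[0])[:, None]
    # Sort only the k selected candidates by descending score.
    order = np.argsort(-scores[rows, top], axis=1)
    return top[rows, order]

rng = np.random.default_rng(0)
pool = rng.normal(size=(50_000, 64))
pool /= np.linalg.norm(pool, axis=1, keepdims=True)         # unit-normalize
contexts = pool[[5, 42]]                                    # toy contexts
print(top_k_candidates(contexts, pool, k=3))
```

Because the pool is unit-normalized and each toy context is itself a pool vector, the first index returned for each row is that vector's own position in the pool.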

3. Training Objectives and Negative Sampling

Most NSP systems are trained using cross-entropy loss over the candidate set $C$, maximizing likelihood for the correct next sentence. For $N$-way NSP, the loss for a batch is:

$L = -\sum_{b} \log p(s_{t+1}^{(b)} \mid S^{(b)})$

where $p(c_i \mid S) = \exp(\mathrm{score}(c_i)) / \sum_{j=1}^{N} \exp(\mathrm{score}(c_j))$ is determined by normalized exponentiated compatibility scores.
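A minimal sketch of this multi-way loss is given below, assuming the compatibility scores for each context are already available as a matrix with one row per context and one column per candidate. The function name and example scores are illustrative only.

```python
import numpy as np

def multiway_nsp_loss(scores, gold_idx):
    """Cross-entropy over candidate sets, summed over the batch.

    scores:   (B, N) compatibility scores, one row per context.
    gold_idx: (B,) column index of the true next sentence in each row.
    """
    scores = np.asarray(scores, dtype=float)
    gold_idx = np.asarray(gold_idx)
    # Log-softmax over the N candidates of each row, computed stably.
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Negative log-probability of the gold candidate, summed over contexts.
    return float(-log_probs[np.arange(len(gold_idx)), gold_idx].sum())

scores = np.array([[2.0, 1.0, 0.0],    # gold candidate scored highest
                   [0.0, 0.0, 0.0]])   # all candidates tied
print(multiway_nsp_loss(scores, np.array([0, 1])))
```

With tied scores the per-context loss equals $\log N$, the value obtained by guessing uniformly among the candidates.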

Negative sampling is critical:

BERT NSP: 50% next-sentence (intra-doc), 50% random (inter-doc) negatives (Aroca-Ouellette et al., 2020, Sun et al., 2021).
Advanced Sampling: Contextual negatives include in-adjacent sentences, semantically similar distractors, or hard negatives with topic overlap (Xu et al., 2019, Ippolito et al., 2020). CSLoss penalizes a model for scoring context sentences highly as next, mitigating degenerate behaviors (Ippolito et al., 2020).
Contrastive Losses: InfoNCE is employed to align predicted and gold sentence embeddings in latent autoregressive NSP (Hwang et al., 28 May 2025).
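The BERT-style 50/50 scheme in the list above can be sketched in a few lines. This is a simplified illustration that assumes documents are already split into sentences and that at least two non-empty documents are available; the function name and the toy corpus are hypothetical, and real pipelines operate on token spans rather than single sentences.

```python
import random

def make_nsp_pairs(documents, rng):
    """Build (A, B, label) training pairs with roughly 50% positives.

    documents: list of documents, each a list of sentence strings.
    rng:       a random.Random instance, passed in for reproducibility.
    """
    pairs = []
    for d, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                # Positive: the sentence that actually follows in the same document.
                pairs.append((doc[i], doc[i + 1], "IsNext"))
            else:
                # Negative: a random sentence drawn from a different document.
                others = [j for j in range(len(documents)) if j != d]
                other_doc = documents[rng.choice(others)]
                pairs.append((doc[i], rng.choice(other_doc), "NotNext"))
    return pairs

docs = [["a1", "a2", "a3"], ["b1", "b2"]]
for pair in make_nsp_pairs(docs, random.Random(0)):
    print(pair)
```

Harder negatives, such as same-document or topically similar sentences, would replace the random inter-document draw in the `else` branch.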

Auxiliary losses and multi-task training—such as adjacent sentence prediction, ordering, term frequency, or mutual information objectives—can supersede or supplement NSP, delivering improved end-task transferability (Aroca-Ouellette et al., 2020, Zhao et al., 2023).

4. Applications and Empirical Performance

NSP has been central to a diverse set of NLP and NLU domains:

Pre-training and Representation Transfer: NSP was initially used for self-supervised pre-training (BERT), but its effectiveness depends on downstream needs. Mainstream pre-training has since shifted away from it, with alternatives (masked LM only, or more semantically detailed tasks) outperforming it on average on downstream GLUE metrics (Aroca-Ouellette et al., 2020).
Story and Narrative Coherence: Sentence-level NSP models that reason over story contexts achieve state-of-the-art accuracy on the unsupervised Story Cloze test (73.0% MLP+CSLoss vs. GPT-2 55.4%), and perform large-scale ranking over 98k candidates (P@10 up to 10.3%) (Ippolito et al., 2020).
Extractive Summarization: Nextsum predicts summary sentences autoregressively, leveraging document structure, redundancy, coverage, and content importance features. The resulting summaries achieve ROUGE-2 scores superior to strong baselines, and the predicted summary lengths correlate with those of human summaries (Liu et al., 2019).
Multilingual and Cross-Lingual Comprehension: In large-scale benchmarks spanning English, Swahili, and Hausa, LLMs excel at NSP in English but suffer sharp drops in low-resource languages. Chain-of-Thought prompting yields non-monotonic effects, benefiting weaker LLMs but “overcomplicating” decision-making in stronger ones (Chavan et al., 29 Oct 2025).
Open-Domain Dialogue Evaluation: NSP-augmented CVAEs improve discrimination among one-to-many plausible next responses, regularizing latent space for coherent continuation beyond mere reconstruction (Zhao et al., 2023).
Computational Efficiency: Sentence-level NSP models and SLLMs demonstrate 2–4$\times$ inference speedups, memory footprint reductions of 86–91%, and perplexity improvements (e.g., from 26.75 to 14.32 with SLLM-125M) by compressing sequences to sentences, as opposed to token-level LLMs (An et al., 2024).
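Ranking results such as accuracy and P@k can be computed from a score matrix as sketched below. The sketch assumes that P@k denotes the fraction of contexts whose true next sentence appears among the k highest-scoring candidates; the cited work may define the metric differently, and the function name and toy scores are illustrative.

```python
import numpy as np

def precision_at_k(scores, gold_idx, k):
    """Fraction of contexts whose gold sentence is within the top k candidates.

    scores:   (B, N) candidate scores, one row per context.
    gold_idx: (B,) column index of the true next sentence in each row.
    """
    scores = np.asarray(scores, dtype=float)
    gold_idx = np.asarray(gold_idx)
    top_k = np.argsort(-scores, axis=1)[:, :k]          # best k columns per row
    hits = (top_k == gold_idx[:, None]).any(axis=1)     # gold among them?
    return float(hits.mean())

scores = np.array([[0.1, 0.9, 0.3],
                   [0.8, 0.2, 0.5],
                   [0.3, 0.2, 0.1]])
gold = np.array([1, 2, 2])
print(precision_at_k(scores, gold, k=1), precision_at_k(scores, gold, k=2))
```

Setting k=1 recovers plain selection accuracy, as used in multiple-choice settings such as Story Cloze.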

5. Limitations and Critiques

Empirical studies highlight critical weaknesses of classical binary NSP and its descendants:

Shallow Semantic Signal: NSP can be solved to a large extent by superficial cues (overlap, topic) rather than deep inference. Context splitting in 50/50 pre-training artificially truncates sequences, inhibiting modeling of long-range dependencies (Aroca-Ouellette et al., 2020).
Task Mismatch: For pure language understanding and reasoning, auxiliary objectives such as sentence ordering, adjacent prediction, or soft semantic clustering are empirically more effective (Aroca-Ouellette et al., 2020, Xu et al., 2019).
Capacity Bottlenecks: Sentence-level models may struggle with extremely long contexts if capacity is limited to shallow MLP architectures (Ippolito et al., 2020).
Dependence on Embedding Pre-training: Limitations in the “upstream” encoder (e.g., BERT’s pre-trained representations omitting world or character-specific information) can propagate infelicities to the NSP predictions (Ippolito et al., 2020, Popov et al., 2019).
Negative Transfer in Prompting: In multi-lingual NSP tasks, advanced prompt strategies such as Chain-of-Thought can degrade the performance of high-capacity LLMs, highlighting non-trivial interactions between prompting, model strength, and data resource levels (Chavan et al., 29 Oct 2025).

6. Innovations and Extensions

Recent work explores multiple orthogonal axes for advancing NSP:

Sentence Embedding and Decoding: Latent-space NSP and sentence-level autoencoders facilitate fast, scalable reasoning over abstracted discourse units. Continuous inference regimes allow full reasoning in embedding space without textual decoding, significantly increasing efficiency (Hwang et al., 28 May 2025, An et al., 2024).
Symmetric and Document-Level NSP: Augmenting NSP with previous-sentence prediction and in-adjacent negatives fosters order-invariant, robust representations. Label smoothing enables the inclusion of noisy, closely-spaced negatives (Xu et al., 2019).
Prompt-Based and Few-Shot NSP: NSP-repurposed prompt templates allow variable-length label handling, entity linking, and unified zero-shot adaptation, provided NSP pre-training is available (Sun et al., 2021).
Evaluation and Visualization Tools: Intermediate decoding (“SentenceLens”) enables tracking and inspection of latent NSP states for interpretability in structured reasoning (Hwang et al., 28 May 2025).
Multimodal and Hierarchical Extensions: Potential future developments include NSP over multimodal units or hierarchical units (paragraphs, document chunks) (An et al., 2024).
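The label smoothing mentioned above for symmetric, multi-class NSP can be illustrated with a generic sketch. It implements standard label smoothing over an assumed three-way label set (next, previous, neither); the class layout, the smoothing weight `eps`, and the function name are assumptions, and the exact scheme in Xu et al. (2019) may differ.

```python
import numpy as np

def smoothed_cross_entropy(logits, labels, eps=0.1):
    """Mean cross-entropy against label-smoothed targets.

    logits: (B, K) class scores, e.g. K = 3 for next / previous / neither.
    labels: (B,) integer class indices.
    eps:    probability mass moved from the labeled class to the others.
    """
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    n_classes = logits.shape[1]
    # Soft targets: 1 - eps on the labeled class, eps shared by the rest.
    targets = np.full(logits.shape, eps / (n_classes - 1))
    targets[np.arange(len(labels)), labels] = 1.0 - eps
    # Stable log-softmax over classes.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-(targets * log_probs).sum(axis=1).mean())

logits = np.array([[10.0, 0.0, 0.0]])
print(smoothed_cross_entropy(logits, np.array([0]), eps=0.0))
print(smoothed_cross_entropy(logits, np.array([0]), eps=0.1))
```

With smoothing, an over-confident prediction is penalized even when it is correct, which softens the effect of noisy negatives that sit close to the true neighbor.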

7. Tables: Representative NSP Modeling Approaches

Approach	Model/Embedding	Candidate Set Size	Context Length	Target Tasks
BERT-NSP	Transformer ([CLS] head)	2	1–2	Pre-training, NLU benchmarks
Sentence-level MLP (Ippolito et al., 2020)	BERT embedding + MLP	up to 100k	4–8	Story Cloze, large-scale NSP
NSP-BERT (Sun et al., 2021)	Sentence-pair prompt	2–20+	Flexible	Zero/few-shot classification
SentenceVAE-SLLM (An et al., 2024)	Self-attn encoder, LLM	10–100k	up to 64	Fast inference, long-context
Latent NSP (Hwang et al., 28 May 2025)	GPT-2-derived encoder	N/A (gen.)	Variable	Math, logic, commonsense plans

Employing these architectures and objectives, NSP continues to be an active research area as a diagnostic, efficiency-enhancing, and coherence-preserving framework in modern LLM systems.