SegNSP: Neural Text Segmentation
- SegNSP is a neural approach that reframes linear text segmentation as a next sentence prediction task to detect topic transitions without explicit labels.
- It leverages a BERT-based encoder and a segmentation-aware loss that integrates focal loss, confidence penalty, and boundary loss to address class imbalance and boundary sparsity.
- SegNSP achieves superior Boundary F1 scores on public datasets, enhancing downstream tasks like summarization, information retrieval, and question answering.
SegNSP is a neural approach to linear text segmentation in NLP that formulates the segmentation task as a next sentence prediction (NSP) problem. It leverages input representations and learning objectives specifically tailored to identifying segment boundaries, such as topic transitions, without the need for explicit topic labels or taxonomies. SegNSP achieves state-of-the-art results on public English and Portuguese segmentation benchmarks, demonstrating significant improvements over classical and neural baselines and offering robust, label-agnostic performance for segmenting continuous text into coherent, semantically meaningful units (Isidro et al., 7 Jan 2026).
1. Linear Text Segmentation as Next Sentence Prediction
SegNSP approaches linear text segmentation by explicitly modeling sentence-to-sentence continuity using the NSP formalism. Given a document split into sentences $s_1, \dots, s_n$ and a segmentation consisting of contiguous segments, a segment boundary is defined to exist between sentences $s_i$ and $s_{i+1}$ if they belong to different segments. For each adjacent sentence pair $(s_i, s_{i+1})$, the model constructs the input representation $[\mathrm{CLS}]\; s_i \;[\mathrm{SEP}]\; s_{i+1} \;[\mathrm{SEP}]$ and encodes it with a pretrained BERT model to obtain a pooled pair representation $h_i$.
A linear classification head projects $h_i$ to two logits, and a softmax yields the probability distribution $p_i$ over {continuation, boundary}, where $p_i^{\text{bnd}}$ denotes the predicted boundary probability. During inference, a boundary is predicted at position $i$ if $p_i^{\text{bnd}} \geq \tau$, with the threshold $\tau$ tuned on validation data (Isidro et al., 7 Jan 2026).
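To illustrate the decision rule, the following sketch (all names hypothetical) converts per-pair boundary probabilities, standing in for the softmax outputs of the BERT head, into contiguous segment spans using a threshold `tau`:

```python
# Sketch of SegNSP-style inference: given one boundary probability per
# adjacent sentence pair, threshold them and emit segment spans.

def predict_segments(boundary_probs, num_sentences, tau=0.5):
    """Return segments as (start, end) sentence-index spans (end exclusive).

    boundary_probs[i] is the predicted probability of a boundary between
    sentence i and sentence i+1, so it has length num_sentences - 1.
    """
    boundaries = [i for i, p in enumerate(boundary_probs) if p >= tau]
    segments, start = [], 0
    for b in boundaries:
        segments.append((start, b + 1))  # segment ends after sentence b
        start = b + 1
    segments.append((start, num_sentences))  # trailing segment
    return segments

# Example: 5 sentences, boundary probabilities for the 4 adjacent pairs.
print(predict_segments([0.1, 0.8, 0.2, 0.9], 5, tau=0.5))
# -> [(0, 2), (2, 4), (4, 5)]
```

Note that the threshold trades boundary precision against recall, which is why the paper tunes $\tau$ on validation data rather than fixing it at 0.5.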
2. Label-Agnostic NSP Formulation and Segmentation-Aware Loss
SegNSP uses a label-agnostic variant of next sentence prediction. Each sentence pair receives a positive label ($y_i = 1$) if the next sentence continues the same topic, and a negative label ($y_i = 0$) at a topic boundary. No explicit topic labels or external taxonomies are required—only binary continuation/boundary information.
The segmentation-aware loss combines three components:
- Focal loss, to address the class imbalance between frequent continuation pairs and rare boundary pairs: $\mathcal{L}_{\text{focal}} = -\alpha (1 - p_t)^{\gamma} \log p_t$, where $p_t$ is the probability the model assigns to the true class and $\alpha$, $\gamma$ are the standard focal-loss parameters.
- Confidence penalty, to penalize overconfident predictions: in its standard form, the negative entropy of the output distribution $p_i$ is added to the loss, discouraging low-entropy (overconfident) outputs.
- Boundary loss, to up-weight errors near true boundaries: the per-pair cross-entropy is scaled by a weight greater than one for pairs at or adjacent to gold boundaries.
The total loss combines these terms as $\mathcal{L} = \mathcal{L}_{\text{focal}} + \lambda_{\text{conf}} \mathcal{L}_{\text{conf}} + \lambda_{\text{bnd}} \mathcal{L}_{\text{bnd}}$, with mixing weights $\lambda_{\text{conf}}$ and $\lambda_{\text{bnd}}$. This design targets both the sparsity and the difficulty of boundary events, capturing the local discourse phenomena crucial for accurate segmentation (Isidro et al., 7 Jan 2026).
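A minimal NumPy sketch of the three components, assuming the standard textbook forms of focal loss, entropy-based confidence penalty, and weighted cross-entropy; the default parameter values below are illustrative, not the paper's:

```python
import numpy as np

def seg_aware_loss(p_cont, y, alpha=0.25, gamma=2.0,
                   lam_conf=0.1, lam_bnd=1.0, boundary_weight=2.0):
    """Sketch of the three-part segmentation-aware loss.

    y = 1: same-topic continuation, y = 0: segment boundary;
    p_cont = predicted continuation probability per pair.
    """
    p_cont = np.asarray(p_cont, dtype=float)
    y = np.asarray(y, dtype=float)
    eps = 1e-12
    p_t = np.where(y == 1, p_cont, 1.0 - p_cont)  # prob. of the true class
    # Focal loss: down-weights easy, confidently correct pairs.
    focal = -alpha * (1.0 - p_t) ** gamma * np.log(p_t + eps)
    # Confidence penalty: negative entropy of the Bernoulli output,
    # discouraging overconfident (low-entropy) predictions.
    entropy = -(p_cont * np.log(p_cont + eps)
                + (1.0 - p_cont) * np.log(1.0 - p_cont + eps))
    conf_pen = -entropy
    # Boundary loss: cross-entropy up-weighted on true boundary pairs.
    ce = -np.log(p_t + eps)
    bnd = np.where(y == 0, boundary_weight, 1.0) * ce
    return float(np.mean(focal + lam_conf * conf_pen + lam_bnd * bnd))
```

As a sanity check, a confident wrong prediction at a boundary (`p_cont = 0.9`, `y = 0`) incurs a much larger loss than a confident correct one (`p_cont = 0.1`, `y = 0`), since both the focal term and the up-weighted cross-entropy grow as $p_t$ shrinks.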
3. Hard Negative Sampling
SegNSP mitigates the sparsity of true segment boundaries using an augmentation strategy that introduces challenging negative samples during training. Each mini-batch includes:
- 70% positive (intra-segment) adjacent pairs,
- 30% negative (inter-segment) adjacent pairs,
- up to 10 "hard negatives" per document, drawn from the set of non-adjacent sentence pairs.
When enough negative candidates are available, the per-document negative quota is split between hard negatives, drawn from the set of non-adjacent pairs, and adjacent true negatives. This strategy targets discourse cues and topic discontinuities beyond immediate adjacency, increasing robustness to complex topic transitions (Isidro et al., 7 Jan 2026).
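The sampling scheme above can be sketched as follows, under the assumption that non-adjacent pairs are labeled as negatives; all function and parameter names are illustrative:

```python
import random

def build_training_pairs(doc_segments, n_pos=7, n_neg=3, max_hard=10, seed=0):
    """Sample NSP training pairs for one document.

    doc_segments: list of gold segments, each a list of sentence indices.
    Returns (i, j, label) triples with label 1 for a same-topic continuation
    and 0 for a boundary / negative pair.
    """
    rng = random.Random(seed)
    sents = [s for seg in doc_segments for s in seg]
    seg_of = {s: k for k, seg in enumerate(doc_segments) for s in seg}
    pos, adj_neg = [], []
    for i, j in zip(sents, sents[1:]):  # adjacent pairs
        if seg_of[i] == seg_of[j]:
            pos.append((i, j, 1))       # intra-segment positive
        else:
            adj_neg.append((i, j, 0))   # true adjacent negative (boundary)
    # Hard negatives: non-adjacent pairs, treated as negatives.
    candidates = [(i, j, 0) for i in sents for j in sents if j > i + 1]
    hard_neg = rng.sample(candidates, min(max_hard, len(candidates)))
    return (rng.sample(pos, min(n_pos, len(pos)))
            + rng.sample(adj_neg, min(n_neg, len(adj_neg)))
            + hard_neg)
```

The 7:3 default split mirrors the 70%/30% positive/negative batch ratio described above; a real implementation would enforce the ratio at the mini-batch level rather than per document.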
4. Model Architecture, Optimization, and Hyperparameters
SegNSP employs a BERT-base encoder (Portuguese cased for CitiLink-Minutes, English uncased for WikiSection), followed by a single linear layer mapping the pooled representation to two logits and a softmax for classification. The entire model is fine-tuned with the segmentation-aware loss, using early stopping based on the validation boundary F1 (B-F1) score.
Key hyperparameters include:
- Learning rate:
- Batch size: 8
- Focal loss:
- Confidence penalty:
- Boundary loss:
- Maximum epochs: 12 (with early stopping)
- Boundary decision threshold: $\tau$, tuned on validation data (Isidro et al., 7 Jan 2026)
5. Evaluation Benchmarks and the Boundary F1 Metric
Performance is evaluated on two datasets:
- WikiSection_en_city: 19,539 English Wikipedia city articles, with 133,642 annotated segments. Preprocessing involves standard sentence tokenization and selection of the en_city partition.
- CitiLink-Minutes: 120 Portuguese city council minutes from six municipalities, grouping headings and their textual spans as segments, then sentence-tokenizing the result.
Segmentation accuracy is assessed via the Boundary F1 (B-F1) metric. Defining $B$ as the set of true boundary positions and $\hat{B}$ as the set of predicted boundaries, precision and recall are $P = |B \cap \hat{B}| / |\hat{B}|$ and $R = |B \cap \hat{B}| / |B|$, and B-F1 is their harmonic mean, $2PR/(P+R)$.
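The metric can be computed directly from the two boundary sets; a minimal sketch (function name hypothetical):

```python
def boundary_f1(true_boundaries, pred_boundaries):
    """Boundary precision, recall, and F1 over sets of boundary positions."""
    B, B_hat = set(true_boundaries), set(pred_boundaries)
    tp = len(B & B_hat)                      # exactly matched boundaries
    precision = tp / len(B_hat) if B_hat else 0.0
    recall = tp / len(B) if B else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Two of three predicted boundaries match two of three true ones:
print(boundary_f1({2, 5, 9}, {2, 5, 7}))  # P = R = F1 = 2/3
```

Unlike window-based metrics such as Pk or WindowDiff, this strict formulation gives no credit for near-miss boundaries, which makes B-F1 a demanding target.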
6. Experimental Results and Comparative Analysis
SegNSP demonstrates substantial improvements over both classical and neural segmentation baselines. The following table summarizes B-F1 scores:
| Model | CitiLink-Min. B-F1 | WikiSection B-F1 |
|---|---|---|
| TextTiling | 0.15 | 0.09 |
| Att+CNN | 0.34 | 0.14 |
| TopSeg | 0.42 | 0.48 |
| LumberChunker (LLM) | 0.10 | 0.42 |
| SegNSP | 0.79 | 0.65 |
- On CitiLink-Minutes, SegNSP achieves B-F1 = 0.79, outperforming TopSeg by +0.37.
- On WikiSection, SegNSP achieves B-F1 = 0.65, outperforming TopSeg by +0.17.
- Additional metrics: for CitiLink-Minutes, WD = 0.10 and B = 0.59; for WikiSection, WD = 0.18 and B = 0.47.
- Statistical significance against TopSeg is established with a paired bootstrap test on both datasets.
- Cross-municipality generalization (CitiLink leave-one-out) yields B-F1 between 0.24 and 0.77 depending on locality, indicating both robustness and some sensitivity to stylistic variance (Isidro et al., 7 Jan 2026).
7. Implications for Downstream NLP Tasks
SegNSP enhances downstream task performance through high-precision segment boundary induction:
- Summarization: Precise boundaries yield coherent segments, reducing topic drift and facilitating passage-level abstraction.
- Information Retrieval: Segment-level retrieval units allow for finer indexing, improving passage recall in retrieval-augmented generation pipelines.
- Question Answering: Segmented contexts decrease noise in retrieval and generation, leading to more accurate response extraction.
Overall, SegNSP provides a lightweight, label-agnostic, and cross-domain segmentation mechanism suited for diverse NLP pipelines and tasks requiring structured document representations (Isidro et al., 7 Jan 2026).