SegNSP: Neural Text Segmentation
- SegNSP is a neural approach that reframes linear text segmentation as a next sentence prediction task to detect topic transitions without explicit labels.
- It leverages a BERT-based encoder and a segmentation-aware loss that integrates focal loss, confidence penalty, and boundary loss to address class imbalance and boundary sparsity.
- SegNSP achieves superior Boundary F1 scores on public datasets, enhancing downstream tasks like summarization, information retrieval, and question answering.
SegNSP is a neural approach to linear text segmentation in NLP that formulates the segmentation task as a next sentence prediction (NSP) problem. It leverages input representations and learning objectives specifically tailored to identifying segment boundaries, such as topic transitions, without the need for explicit topic labels or taxonomies. SegNSP achieves state-of-the-art results on public English and Portuguese segmentation benchmarks, demonstrating significant improvements over classical and neural baselines and offering robust, label-agnostic performance for segmenting continuous text into coherent, semantically meaningful units (Isidro et al., 7 Jan 2026).
1. Linear Text Segmentation as Next Sentence Prediction
SegNSP approaches linear text segmentation by explicitly modeling sentence-to-sentence continuity using the NSP formalism. Given a document split into sentences $s_1, \dots, s_n$ and a segmentation consisting of contiguous segments, a segment boundary is defined to exist between sentences $s_i$ and $s_{i+1}$ if they belong to different segments. For each adjacent sentence pair $(s_i, s_{i+1})$, the model constructs the input representation $[\mathrm{CLS}]\; s_i \;[\mathrm{SEP}]\; s_{i+1} \;[\mathrm{SEP}]$ and encodes it with a pretrained BERT model to obtain a pooled pair representation $h_i$.
A linear classification head projects $h_i$ to two logits, and a softmax yields the probability distribution $p_i$ over {continuation, boundary}, where $p_i^{\text{bnd}}$ denotes the predicted boundary probability. During inference, a boundary is predicted at position $i$ if $p_i^{\text{bnd}} \geq \tau$, with the threshold $\tau$ tuned on validation data (Isidro et al., 7 Jan 2026).
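To illustrate the decision rule, the following sketch (all names hypothetical) converts per-pair boundary probabilities, standing in for the softmax outputs of the BERT head, into contiguous segment spans using a threshold `tau`:

```python
# Sketch of SegNSP-style inference: given one boundary probability per
# adjacent sentence pair, threshold them and emit segment spans.

def predict_segments(boundary_probs, num_sentences, tau=0.5):
    """Return segments as (start, end) sentence-index spans (end exclusive).

    boundary_probs[i] is the predicted probability of a boundary between
    sentence i and sentence i+1, so it has length num_sentences - 1.
    """
    boundaries = [i for i, p in enumerate(boundary_probs) if p >= tau]
    segments, start = [], 0
    for b in boundaries:
        segments.append((start, b + 1))  # segment ends after sentence b
        start = b + 1
    segments.append((start, num_sentences))  # trailing segment
    return segments

# Example: 5 sentences, boundary probabilities for the 4 adjacent pairs.
print(predict_segments([0.1, 0.8, 0.2, 0.9], 5, tau=0.5))
# -> [(0, 2), (2, 4), (4, 5)]
```

Note that the threshold trades boundary precision against recall, which is why the paper tunes $\tau$ on validation data rather than fixing it at 0.5.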
2. Label-Agnostic NSP Formulation and Segmentation-Aware Loss
SegNSP uses a label-agnostic variant of next sentence prediction. Each sentence pair receives a positive label ($y_i = 1$) if the next sentence continues the same topic, and a negative label ($y_i = 0$) at a topic boundary. No explicit topic labels or external taxonomies are required—only binary continuation/boundary information.
The segmentation-aware loss combines three components:
- Focal loss, to address the class imbalance between frequent continuation pairs and rare boundary pairs: $\mathcal{L}_{\text{focal}} = -\alpha (1 - p_t)^{\gamma} \log p_t$, where $p_t$ is the probability the model assigns to the true class and $\alpha$, $\gamma$ are the standard focal-loss parameters.
- Confidence penalty, to penalize overconfident predictions: in its standard form, the negative entropy of the output distribution $p_i$ is added to the loss, discouraging low-entropy (overconfident) outputs.
- Boundary loss, to up-weight errors near true boundaries: the per-pair cross-entropy is scaled by a weight greater than one for pairs at or adjacent to gold boundaries.
The total loss combines these terms as $\mathcal{L} = \mathcal{L}_{\text{focal}} + \lambda_{\text{conf}} \mathcal{L}_{\text{conf}} + \lambda_{\text{bnd}} \mathcal{L}_{\text{bnd}}$, with mixing weights $\lambda_{\text{conf}}$ and $\lambda_{\text{bnd}}$. This design targets both the sparsity and the difficulty of boundary events, capturing the local discourse phenomena crucial for accurate segmentation (Isidro et al., 7 Jan 2026).
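A minimal NumPy sketch of the three components, assuming the standard textbook forms of focal loss, entropy-based confidence penalty, and weighted cross-entropy; the default parameter values below are illustrative, not the paper's:

```python
import numpy as np

def seg_aware_loss(p_cont, y, alpha=0.25, gamma=2.0,
                   lam_conf=0.1, lam_bnd=1.0, boundary_weight=2.0):
    """Sketch of the three-part segmentation-aware loss.

    y = 1: same-topic continuation, y = 0: segment boundary;
    p_cont = predicted continuation probability per pair.
    """
    p_cont = np.asarray(p_cont, dtype=float)
    y = np.asarray(y, dtype=float)
    eps = 1e-12
    p_t = np.where(y == 1, p_cont, 1.0 - p_cont)  # prob. of the true class
    # Focal loss: down-weights easy, confidently correct pairs.
    focal = -alpha * (1.0 - p_t) ** gamma * np.log(p_t + eps)
    # Confidence penalty: negative entropy of the Bernoulli output,
    # discouraging overconfident (low-entropy) predictions.
    entropy = -(p_cont * np.log(p_cont + eps)
                + (1.0 - p_cont) * np.log(1.0 - p_cont + eps))
    conf_pen = -entropy
    # Boundary loss: cross-entropy up-weighted on true boundary pairs.
    ce = -np.log(p_t + eps)
    bnd = np.where(y == 0, boundary_weight, 1.0) * ce
    return float(np.mean(focal + lam_conf * conf_pen + lam_bnd * bnd))
```

As a sanity check, a confident wrong prediction at a boundary (`p_cont = 0.9`, `y = 0`) incurs a much larger loss than a confident correct one (`p_cont = 0.1`, `y = 0`), since both the focal term and the up-weighted cross-entropy grow as $p_t$ shrinks.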
3. Hard Negative Sampling
SegNSP mitigates the sparsity of true segment boundaries using an augmentation strategy that introduces challenging negative samples during training. Each mini-batch includes:
- 70% positive (intra-segment) adjacent pairs,
- 30% negative (inter-segment) adjacent pairs,
- up to 10 "hard negatives" per document, drawn from the set of non-adjacent sentence pairs.
When enough negative candidates are available, the per-document negative quota is split between hard negatives, drawn from the set of non-adjacent pairs, and adjacent true negatives. This strategy targets discourse cues and topic discontinuities beyond immediate adjacency, increasing robustness to complex topic transitions (Isidro et al., 7 Jan 2026).
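The sampling scheme above can be sketched as follows, under the assumption that non-adjacent pairs are labeled as negatives; all function and parameter names are illustrative:

```python
import random

def build_training_pairs(doc_segments, n_pos=7, n_neg=3, max_hard=10, seed=0):
    """Sample NSP training pairs for one document.

    doc_segments: list of gold segments, each a list of sentence indices.
    Returns (i, j, label) triples with label 1 for a same-topic continuation
    and 0 for a boundary / negative pair.
    """
    rng = random.Random(seed)
    sents = [s for seg in doc_segments for s in seg]
    seg_of = {s: k for k, seg in enumerate(doc_segments) for s in seg}
    pos, adj_neg = [], []
    for i, j in zip(sents, sents[1:]):  # adjacent pairs
        if seg_of[i] == seg_of[j]:
            pos.append((i, j, 1))       # intra-segment positive
        else:
            adj_neg.append((i, j, 0))   # true adjacent negative (boundary)
    # Hard negatives: non-adjacent pairs, treated as negatives.
    candidates = [(i, j, 0) for i in sents for j in sents if j > i + 1]
    hard_neg = rng.sample(candidates, min(max_hard, len(candidates)))
    return (rng.sample(pos, min(n_pos, len(pos)))
            + rng.sample(adj_neg, min(n_neg, len(adj_neg)))
            + hard_neg)
```

The 7:3 default split mirrors the 70%/30% positive/negative batch ratio described above; a real implementation would enforce the ratio at the mini-batch level rather than per document.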
4. Model Architecture, Optimization, and Hyperparameters
SegNSP employs a BERT-base encoder (Portuguese cased for CitiLink-Minutes, English uncased for WikiSection), followed by a single linear layer mapping the pooled representation to two logits and a softmax for classification. The entire model is fine-tuned with the segmentation-aware loss, using early stopping based on the validation boundary F1 (B-F1) score.
Key hyperparameters include:
- Learning rate:
- Batch size: 8
- Focal loss:
- Confidence penalty:
- Boundary loss:
- Maximum epochs: 12 (with early stopping)
- Boundary decision threshold: $\tau$, tuned on validation data (Isidro et al., 7 Jan 2026)
5. Evaluation Benchmarks and the Boundary F1 Metric
Performance is evaluated on two datasets:
- WikiSection_en_city: 19,539 English Wikipedia city articles, with 133,642 annotated segments. Preprocessing involves standard sentence tokenization and selection of the en_city partition.
- CitiLink-Minutes: 120 Portuguese city council minutes from six municipalities, grouping headings and their textual spans as segments, then sentence-tokenizing the result.
Segmentation accuracy is assessed via the Boundary F1 (B-F1) metric. Defining $B$ as the set of true boundary positions and $\hat{B}$ as the set of predicted boundaries, precision and recall are $P = |B \cap \hat{B}| / |\hat{B}|$ and $R = |B \cap \hat{B}| / |B|$, and B-F1 is their harmonic mean, $2PR/(P+R)$.
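The metric can be computed directly from the two boundary sets; a minimal sketch (function name hypothetical):

```python
def boundary_f1(true_boundaries, pred_boundaries):
    """Boundary precision, recall, and F1 over sets of boundary positions."""
    B, B_hat = set(true_boundaries), set(pred_boundaries)
    tp = len(B & B_hat)                      # exactly matched boundaries
    precision = tp / len(B_hat) if B_hat else 0.0
    recall = tp / len(B) if B else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Two of three predicted boundaries match two of three true ones:
print(boundary_f1({2, 5, 9}, {2, 5, 7}))  # P = R = F1 = 2/3
```

Unlike window-based metrics such as Pk or WindowDiff, this strict formulation gives no credit for near-miss boundaries, which makes B-F1 a demanding target.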
6. Experimental Results and Comparative Analysis
SegNSP demonstrates substantial improvements over both classical and neural segmentation baselines. The following table summarizes B-F1 scores:
| Model | CitiLink-Min. B-F1 | WikiSection B-F1 |
|---|---|---|
| TextTiling | 0.15 | 0.09 |
| Att+CNN | 0.34 | 0.14 |
| TopSeg | 0.42 | 0.48 |
| LumberChunker (LLM) | 0.10 | 0.42 |
| SegNSP | 0.79 | 0.65 |
- On CitiLink-Minutes, SegNSP achieves B-F1 = 0.79, outperforming TopSeg by +0.37.
- On WikiSection, SegNSP achieves B-F1 = 0.65, outperforming TopSeg by +0.17.
- Additional metrics: for CitiLink-Minutes, WD = 0.10 and B = 0.59; for WikiSection, WD = 0.18 and B = 0.47.
- Statistical significance against TopSeg is established with a paired bootstrap test on both datasets.
- Cross-municipality generalization (CitiLink leave-one-out) yields B-F1 between 0.24 and 0.77 depending on locality, indicating both robustness and some sensitivity to stylistic variance (Isidro et al., 7 Jan 2026).
7. Implications for Downstream NLP Tasks
SegNSP enhances downstream task performance through high-precision segment boundary induction:
- Summarization: Precise boundaries yield coherent segments, reducing topic drift and facilitating passage-level abstraction.
- Information Retrieval: Segment-level retrieval units allow for finer indexing, improving passage recall in retrieval-augmented generation pipelines.
- Question Answering: Segmented contexts decrease noise in retrieval and generation, leading to more accurate response extraction.
Overall, SegNSP provides a lightweight, label-agnostic, and cross-domain segmentation mechanism suited for diverse NLP pipelines and tasks requiring structured document representations (Isidro et al., 7 Jan 2026).