BP-Seg: Unsupervised Graphical Text Segmentation
- The paper introduces BP-Seg, an unsupervised segmentation method that leverages a probabilistic graphical model and belief propagation to group sentences based on semantic similarity.
- It integrates vector-based sentence embeddings and message passing to fuse both contiguous and non-contiguous semantic signals in long texts.
- Experiments on the Choi corpus show strong clustering quality (ARI 0.76, NMI 0.89 on subsets with segments of 6–8 sentences), outperforming GraphSeg and k-means at capturing thematic structure.
BP-Seg is an unsupervised text segmentation framework that formulates the problem as inference in a graphical model, enabling both contiguous and non-contiguous grouping of sentences based on semantic similarity. By leveraging belief propagation (BP) on a graph whose nodes represent sentences and whose edges encode semantic affinities, BP-Seg captures local coherence and global topic structure without supervision. The methodology integrates probabilistic graphical models, vector-based semantic representations, and message passing to robustly segment complex, long-form documents.
1. Graphical Model Formulation
The core of BP-Seg is a probabilistic graphical model constructed over the sentences in a document. Each sentence is embedded into a vector space via standard sentence embedding techniques, yielding embeddings $x_1, \dots, x_n$. Given $K$ clusters, cluster representatives $\mu_1, \dots, \mu_K$ are initialized from randomly sampled sentence embeddings. Each sentence $i$ is to be assigned a label $z_i \in \{1, \dots, K\}$ indicating its segment.
The model factorizes the joint distribution over assignments as
$$p(z_1, \dots, z_n) = \frac{1}{Z} \prod_{i=1}^{n} \phi_i(z_i) \prod_{i \neq j} \psi_{ij}(z_i, z_j),$$
where $Z$ is a partition function, the node potentials $\phi_i(z_i = k) \propto \exp(\mathrm{sim}(x_i, \mu_k))$ reward sentences similar to their cluster centroid, and the pairwise potentials
$$\psi_{ij}(z_i, z_j) = \begin{cases} 1, & z_i = z_j, \\ \exp(-\beta\, s_{ij}), & z_i \neq z_j, \end{cases}$$
express compatibility between assignments; $s_{ij} = \mathrm{sim}(x_i, x_j)$ is the cosine similarity and $\beta$ controls penalty decay. This structure unifies contiguous and non-contiguous segmentation by connecting all sentence pairs with strengths proportional to semantic similarity.
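The potential construction can be sketched in NumPy. The exponential node potential and the Potts-style pairwise penalty below are assumptions consistent with the description (reward centroid similarity; penalize splitting similar pairs), not necessarily the paper's exact forms:

```python
import numpy as np

def cosine_sim(A, B):
    """Cosine similarity between the rows of A and the rows of B."""
    An = A / np.linalg.norm(A, axis=1, keepdims=True)
    Bn = B / np.linalg.norm(B, axis=1, keepdims=True)
    return An @ Bn.T

def build_potentials(X, mu, beta=1.0):
    """Node potentials phi[i, k] = exp(sim(x_i, mu_k)), plus the pairwise
    disagreement penalty pen[i, j] = exp(-beta * s_ij), so that
    psi_ij(z_i, z_j) = 1 if z_i == z_j, else pen[i, j]."""
    phi = np.exp(cosine_sim(X, mu))   # reward similarity to cluster centroids
    s = cosine_sim(X, X)              # sentence-sentence cosine similarity
    np.fill_diagonal(s, 0.0)          # no self-edges
    pen = np.exp(-beta * s)           # penalty for splitting a similar pair
    return phi, s, pen
```

With this form, a highly similar pair ($s_{ij} \to 1$) incurs a strong penalty when assigned different labels, while a dissimilar pair ($s_{ij} \approx 0$) is essentially unconstrained.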
2. Message Passing via Belief Propagation
Belief propagation is applied to the graphical model to perform unsupervised segmentation. Each sentence node $i$ communicates messages to its neighbors $j \in N(i)$, with the message at iteration $t$ being
$$m_{i \to j}^{(t)}(z_j) = \sum_{z_i} \phi_i(z_i)\, \psi_{ij}(z_i, z_j) \prod_{k \in N(i) \setminus \{j\}} m_{k \to i}^{(t-1)}(z_i).$$
Messages are initialized uniformly, reflecting no prior preference across clusters. After a fixed number of iterations or upon convergence, node beliefs are computed as
$$b_i(z_i) \propto \phi_i(z_i) \prod_{k \in N(i)} m_{k \to i}(z_i),$$
and the segment assignment for sentence $i$ is $\hat{z}_i = \arg\max_{z_i} b_i(z_i)$. This message-passing mechanism allows sentences to integrate local coherence (from neighbors) and global semantic similarity (via non-local edges) when selecting segment labels.
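A self-contained sum-product sketch on a fully connected sentence graph, assuming Potts-style potentials (agreement weight 1, disagreement weight $e^{-\beta s_{ij}}$); the function name and exact potential forms are illustrative assumptions, not the reference implementation:

```python
import numpy as np

def bp_segment(phi, s, beta=1.0, iters=20):
    """Sum-product BP over a fully connected graph.
    phi: (n, K) node potentials; s: (n, n) cosine similarities.
    Pairwise potential: 1 if labels agree, exp(-beta * s_ij) otherwise."""
    n, K = phi.shape
    m = np.full((n, n, K), 1.0 / K)   # m[i, j]: message from i to j, uniform init
    for i in range(n):
        m[i, i] = 1.0                 # dummy self-message (contributes log 0)
    for _ in range(iters):
        logm = np.log(m)
        incoming = logm.sum(axis=0)   # incoming[i, z] = sum_k log m[k -> i](z)
        new_m = np.ones((n, n, K))
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                # h(z_i) = phi_i(z_i) * prod_{k != j} m[k -> i](z_i)
                log_h = np.log(phi[i]) + incoming[i] - logm[j, i]
                h = np.exp(log_h - log_h.max())   # rescale for stability
                pen = np.exp(-beta * s[i, j])
                # sum over z_i: weight 1 when z_i == z_j, pen otherwise
                msg = (1.0 - pen) * h + pen * h.sum()
                new_m[i, j] = msg / msg.sum()
        m = new_m
    # beliefs combine node potentials with all incoming messages
    logb = np.log(phi) + np.log(m).sum(axis=0)
    return logb.argmax(axis=1)
```

On an easy two-topic toy input, the marginals assign each sentence to the cluster favored jointly by its centroid similarity and its high-similarity neighbors.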
3. Modeling Local Coherence and Non-Contiguous Semantics
BP-Seg is distinguished by its unified treatment of local and non-local coherence. Adjacency is a special case: nearby sentences typically have higher semantic similarity, so the pairwise factors reinforce contiguity when the embeddings justify it. Non-contiguous semantic similarity is supported in the same way: if two sentences $i$ and $j$ are far apart in the document but share high cosine similarity, the pairwise factor $\psi_{ij}$ contributes strongly toward assigning them to the same segment.
The framework thus relaxes the traditional constraint of contiguous segmentation, enabling grouping of semantically coherent but scattered sentences—a property not generally exhibited by standard approaches like GraphSeg or k-means clustering.
4. Experimental Evaluation
BP-Seg was evaluated both qualitatively and quantitatively. In an illustrative example, for several choices of $K$, BP-Seg consistently grouped thematically related sentences (e.g., "tennis-related" ones) even when non-adjacent; by contrast, baseline methods favored contiguous clusters and typically failed to recover such groupings.
Quantitative results on the Choi corpus used Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI), adapted for non-contiguous segmentation. On subsets with segment lengths of 6–8 sentences, BP-Seg achieved ARI 0.76 and NMI 0.89, outperforming GraphSeg and k-means in both stability and clustering quality. This demonstrates that BP-Seg's message-passing inference effectively fuses distributed semantic signals across long documents, yielding segmentations that better align with the true thematic structure.
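Both metrics can be computed from a label contingency table. A plain NumPy sketch using the standard definitions (NMI here is normalized by the arithmetic mean of the entropies, one common convention; the paper's exact adaptation for non-contiguous segments may differ):

```python
import numpy as np

def _contingency(true, pred):
    """Counts C[a, b] of items with true label a and predicted label b."""
    t, p = np.unique(true), np.unique(pred)
    return np.array([[np.sum((true == a) & (pred == b)) for b in p]
                     for a in t], dtype=float)

def adjusted_rand_index(true, pred):
    C = _contingency(np.asarray(true), np.asarray(pred))
    n = C.sum()
    comb = lambda x: x * (x - 1) / 2.0       # pairs-of-items count
    sum_ij = comb(C).sum()
    sum_a = comb(C.sum(axis=1)).sum()
    sum_b = comb(C.sum(axis=0)).sum()
    expected = sum_a * sum_b / comb(n)       # chance-adjusted baseline
    max_index = 0.5 * (sum_a + sum_b)
    return (sum_ij - expected) / (max_index - expected)

def normalized_mutual_info(true, pred):
    C = _contingency(np.asarray(true), np.asarray(pred))
    n = C.sum()
    a, b = C.sum(axis=1), C.sum(axis=0)
    nz = C > 0                               # skip empty cells in the MI sum
    mi = (C[nz] / n * np.log(n * C[nz] / np.outer(a, b)[nz])).sum()
    h = lambda x: -(x[x > 0] / n * np.log(x[x > 0] / n)).sum()
    return mi / (0.5 * (h(a) + h(b)))
```

Both scores reach 1.0 for a perfect clustering up to label permutation, which is what makes them suitable for comparing segmentations with arbitrary segment identities.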
5. Practical Methodology and Implementation
The method proceeds as follows:
- Embed each sentence to obtain embeddings $x_1, \dots, x_n$.
- Randomly select $K$ embeddings as initial cluster representatives $\mu_1, \dots, \mu_K$.
- Build a fully (or selectively) connected graph with node and pairwise potentials as above.
- Initialize messages uniformly.
- Iteratively update messages according to the BP rule.
- At convergence, assign each sentence to the segment maximizing its marginal belief.
The process requires computing cosine similarities between sentences and between sentences and cluster centers. The pairwise potentials can be tuned via the decay parameter $\beta$, and the number of clusters $K$ is a model hyperparameter. The method does not depend on supervision and can be applied to arbitrary documents and embedding schemes.
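The steps above can be strung together end-to-end. To keep the sketch compact, this version substitutes iterated conditional modes (ICM) coordinate updates for full belief propagation; the random centroid initialization and the Potts-style objective are assumptions consistent with the description, and `segment_icm` is a hypothetical name:

```python
import numpy as np

def segment_icm(X, K, beta=2.0, sweeps=10, seed=0):
    """Unsupervised segmentation sketch: random centroid init, then ICM
    updates on score(i, k) = log phi_i(k) + beta * sum_{j: z_j == k} s_ij."""
    rng = np.random.default_rng(seed)
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    mu = Xn[rng.choice(len(X), size=K, replace=False)]  # random representatives
    s = Xn @ Xn.T                      # sentence-sentence cosine similarity
    np.fill_diagonal(s, 0.0)
    log_phi = Xn @ mu.T                # log node potential = centroid similarity
    z = log_phi.argmax(axis=1)         # start from node-potential-only labels
    for _ in range(sweeps):
        for i in range(len(X)):
            # centroid affinity plus pairwise pull toward similar sentences
            score = log_phi[i] + beta * np.array(
                [s[i, z == k].sum() for k in range(K)])
            z[i] = score.argmax()
    return z
```

ICM greedily maximizes the same joint objective that BP marginalizes over, so it illustrates the pipeline's data flow at the cost of weaker inference; swapping in a message-passing routine changes only the inner loop.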
6. Application Domains
BP-Seg's ability to perform unsupervised, non-contiguous, and semantically-aware segmentation enables various downstream applications:
- Information retrieval: Enhanced flexible indexing via grouping of semantically related but scattered sentences.
- Document summarization: Improved extraction of thematically coherent content blocks.
- Prompt pruning for LLMs: Identification of relevant prompt segments for context-efficient LLM input.
- Question answering and disclosure analysis: Precise localization of relevant answer or disclosure units, even when not contiguous.
The modular, graphical-model-based approach also supports integration into larger NLP pipelines, allowing further probabilistic modeling or adaptation to domain-specific constraints.
7. Significance and Limitations
BP-Seg advances text segmentation by providing a principled graphical framework that unifies local and non-local semantic information through belief propagation. Experimental evidence demonstrates robustness on both artificial and real-world benchmarks, particularly in non-contiguous scenarios where traditional methods are limited.
Limitations include sensitivity to embedding quality, the need to set hyperparameters (notably $K$ and $\beta$), and potential scalability constraints for very long documents when the graph is fully connected. Heuristic sparsification or efficient message-passing approximations may mitigate some of these computational concerns. A plausible implication is that domain-specific adaptation (e.g., pretraining embeddings) would further strengthen segmentation performance on specialized corpora.
BP-Seg exemplifies the fusion of probabilistic graphical modeling and neural semantic representations for unsupervised text segmentation, offering new capabilities for semantically structured document analysis (Li et al., 22 May 2025).