
BP-Seg: Unsupervised Graphical Text Segmentation

Updated 7 October 2025
  • The paper introduces BP-Seg, an unsupervised segmentation method that leverages a probabilistic graphical model and belief propagation to group sentences based on semantic similarity.
  • It integrates vector-based sentence embeddings and message passing to fuse both contiguous and non-contiguous semantic signals in long texts.
  • Experimental evaluations on the Choi corpus demonstrate strong clustering quality (ARI 0.76, NMI 0.89), outperforming GraphSeg and k-means at capturing thematic structure.

BP-Seg is an unsupervised text segmentation framework that formulates the problem as inference in a graphical model, enabling both contiguous and non-contiguous grouping of sentences based on semantic similarity. By leveraging belief propagation (BP) on a graph whose nodes represent sentences and whose edges encode semantic affinities, BP-Seg captures local coherence and global topic structure without supervision. The methodology integrates probabilistic graphical models, vector-based semantic representations, and message passing to robustly segment complex, long-form documents.

1. Graphical Model Formulation

The core of BP-Seg is a probabilistic graphical model constructed over the sentences $\{S_1, S_2, \dots, S_n\}$ in a document. Each sentence is embedded into a vector space via standard sentence embedding techniques, yielding $\{R_1, \dots, R_n\}$. Given $k$ clusters, $k$ cluster representatives $C = \{C_1, \dots, C_k\}$ are initialized from randomly sampled sentence embeddings. Each sentence is to be assigned a label $x_i \in \{1, \dots, k\}$ indicating its segment.

The model factorizes the joint distribution over assignments as

$$p(x_1, \dots, x_n) = \frac{1}{Z} \prod_{i=1}^n \psi_i(x_i) \prod_{(i,j)} \psi_{i,j}(x_i, x_j),$$

where $Z$ is the partition function, the node potentials $\psi_i(x_i) = \exp(\operatorname{sim}(R_i, C_{x_i}))$ reward sentences similar to their cluster centroid, and the pairwise potentials

$$\psi_{i,j}(x_i, x_j) = \begin{cases} 1, & \text{if } x_i = x_j \\ \exp\bigl(\lambda(\operatorname{sim}(R_i, R_j) - 1)\bigr), & \text{otherwise} \end{cases}$$

express compatibility between assignments; $\operatorname{sim}(\cdot,\cdot)$ is the cosine similarity and $\lambda$ controls the penalty decay. This structure unifies contiguous and non-contiguous segmentation by connecting all sentence pairs with strengths proportional to semantic similarity.
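As a concrete illustration, both potential families can be computed directly from the embeddings. The following NumPy sketch (not from the paper; it assumes the rows of $R$ and $C$ are unit-normalized, so dot products equal cosine similarities) builds the node potentials and the off-diagonal pairwise weights:

```python
import numpy as np

def potentials(R, C, lam=1.0):
    """Node and pairwise potentials for BP-Seg (sketch; rows of R and C unit-normalized)."""
    psi = np.exp(R @ C.T)              # psi_i(x) = exp(sim(R_i, C_x)), shape (n, k)
    W = np.exp(lam * (R @ R.T - 1.0))  # exp(lam * (sim(R_i, R_j) - 1)), shape (n, n)
    return psi, W                      # pairwise potential: 1 if labels agree, W[i, j] otherwise
```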

2. Message Passing via Belief Propagation

Belief propagation is applied to the graphical model to perform unsupervised segmentation. Each sentence node $i$ communicates messages to its neighbors $j$, with the message at iteration $t$ given by

$$m_{i \to j}^{(t)}(x_j) = \sum_{x_i} \psi_i(x_i)\, \psi_{i,j}(x_i, x_j) \prod_{k \neq j} m_{k \to i}^{(t-1)}(x_i).$$

Messages are initialized uniformly, reflecting no prior preference across clusters. After a fixed number of iterations or upon convergence, node beliefs are computed as

$$b_i(x_i) \propto \psi_i(x_i) \prod_{j \neq i} m_{j \to i}(x_i),$$

and the segment assignment for sentence $S_i$ is $x_i^* = \arg\max_{x_i} b_i(x_i)$. This message-passing mechanism allows sentences to integrate local coherence (from neighbors) and global semantic similarity (via non-local edges) when selecting segment labels.
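A minimal sketch of this loop, assuming the `potentials` helper above. Because the pairwise potential takes only two values (1 on agreement, $W_{ij}$ otherwise), the sum over $x_i$ collapses to a closed form, so each message update costs $O(k)$ rather than $O(k^2)$:

```python
import numpy as np

def bp_segment(psi, W, iters=20):
    """Loopy BP on a fully connected sentence graph: an illustrative sketch."""
    n, k = psi.shape
    msg = np.ones((n, n, k)) / k          # msg[i, j]: message i -> j over labels
    idx = np.arange(n)
    msg[idx, idx] = 1.0                   # neutral self-messages (no self-edges)
    for _ in range(iters):
        incoming = msg.prod(axis=0)       # prod_l m_{l->i}(x) per receiver, shape (n, k)
        new = msg.copy()
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                a = psi[i] * incoming[i] / msg[j, i]          # exclude m_{j->i}
                out = W[i, j] * a.sum() + (1 - W[i, j]) * a   # closed-form sum over x_i
                new[i, j] = out / out.sum()   # normalize; does not change the argmax
        msg = new
    beliefs = psi * msg.prod(axis=0)      # b_i(x) ∝ psi_i(x) prod_j m_{j->i}(x)
    return beliefs.argmax(axis=1)
```

For long documents the message products should be carried in log space to avoid underflow; the direct products above are adequate only for modest $n$.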

3. Modeling Local Coherence and Non-Contiguous Semantics

BP-Seg is distinguished by its unified treatment of local and non-local coherence. Adjacency is a special case: nearby sentences typically have higher semantic similarity, and edge weights decay quickly as similarity drops, so the pairwise factors reinforce contiguity when it is justified. Non-contiguous semantic similarity is naturally supported: if two sentences $S_i$ and $S_j$ are distant but share high cosine similarity, the pairwise factor $\psi_{i,j}(x_i, x_j)$ contributes strongly to assigning them to the same segment.

The framework thus relaxes the traditional constraint of contiguous segmentation, enabling grouping of semantically coherent but scattered sentences—a property not generally exhibited by standard approaches like GraphSeg or k-means clustering.

4. Experimental Evaluation

BP-Seg was evaluated both qualitatively and quantitatively. In an illustrative example, for several choices of $k$, BP-Seg consistently grouped together thematically related sentences (e.g., "tennis-related" ones) even when non-adjacent; by contrast, baseline methods favored contiguous clusters and typically failed to recover such groupings.

Quantitative results on the Choi corpus used Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI), adapted for non-contiguous segmentation. On subsets whose segments contain 6–8 sentences, BP-Seg achieved ARI 0.76 and NMI 0.89, outperforming GraphSeg and k-means in both stability and clustering quality. This demonstrates that BP-Seg's message-passing inference is effective at fusing distributed semantic signals across long documents, resulting in segmentations that better align with true thematic structure.
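Both metrics treat segmentation as a clustering problem, so they can be computed with standard library calls once gold and predicted labels are aligned per sentence. A sketch using scikit-learn, with hypothetical label arrays:

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

gold = [0, 0, 1, 1, 2, 2, 0]   # hypothetical per-sentence segment labels
pred = [1, 1, 0, 0, 2, 2, 1]   # model assignments; label names need not match
print(adjusted_rand_score(gold, pred))            # 1.0 here: same partition up to renaming
print(normalized_mutual_info_score(gold, pred))   # also permutation-invariant
```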

5. Practical Methodology and Implementation

The method proceeds as follows:

  1. Embed each sentence to obtain $R_i$.
  2. Randomly select $k$ embeddings as initial cluster representatives $C_j$.
  3. Build a fully (or selectively) connected graph with node and pairwise potentials as above.
  4. Initialize messages $m_{i \to j}$ uniformly.
  5. Iteratively update messages according to the BP rule.
  6. At convergence, assign each sentence to the segment maximizing its marginal belief.

The process requires computing cosine similarities between sentences and between sentences and cluster centers. The pairwise potentials can be tuned via $\lambda$, and $k$ is a model hyperparameter. The method does not depend on supervision and can be applied to arbitrary documents and embedding schemes.
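Putting the steps together, a minimal end-to-end driver might look as follows. This is a sketch, not the authors' code: it assumes the `sentence-transformers` package as the embedding backend and reuses the `potentials` and `bp_segment` helpers sketched above.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding backend

sentences = [
    "The serve is the most important shot in tennis.",
    "Quarterly revenue rose by eight percent.",
    "Topspin keeps fast groundstrokes inside the baseline.",
    "Operating margins improved despite higher costs.",
]
model = SentenceTransformer("all-MiniLM-L6-v2")
R = model.encode(sentences, normalize_embeddings=True)   # step 1: (n, d), unit-norm rows

k = 2
rng = np.random.default_rng(0)
C = R[rng.choice(len(R), size=k, replace=False)]         # step 2: random sentence init

psi, W = potentials(R, C, lam=1.0)                       # step 3: build potentials
labels = bp_segment(psi, W, iters=20)                    # steps 4-6: BP and assignment
print(labels)  # e.g. [0, 1, 0, 1]: tennis vs. finance sentences, despite interleaving
```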

6. Application Domains

BP-Seg's ability to perform unsupervised, non-contiguous, and semantically-aware segmentation enables various downstream applications:

  • Information retrieval: Flexible indexing through grouping of semantically related but scattered sentences.
  • Document summarization: Improved extraction of thematically coherent content blocks.
  • Prompt pruning for LLMs: Identification of relevant prompt segments for context-efficient LLM input.
  • Question answering and disclosure analysis: Precise localization of relevant answer or disclosure units, even when not contiguous.

The modular, graphical model-based approach (Editor's term) also supports integration into larger NLP pipelines, allowing for further probabilistic modeling or adaptation to domain-specific constraints.

7. Significance and Limitations

BP-Seg advances text segmentation by providing a principled graphical framework that unifies local and non-local semantic information through belief propagation. Experimental evidence demonstrates robustness on both artificial and real-world benchmarks, particularly in non-contiguous scenarios where traditional methods are limited.

Limitations include sensitivity to embedding quality, the need to set hyperparameters (notably $k$ and $\lambda$), and potential scalability constraints for very long documents if the graph is fully connected. Heuristic sparsification or efficient message-passing approximations may mitigate some computational concerns. A plausible implication is that domain-specific adaptation (e.g., pretraining embeddings) would further strengthen segmentation performance in specialized corpora.
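As one illustration of such sparsification (a heuristic, not prescribed by the paper), the fully connected graph can be pruned to each sentence's top-$m$ most similar neighbors before message passing, shrinking the edge set from $O(n^2)$ to $O(nm)$:

```python
import numpy as np

def knn_edges(R, m=10):
    """Keep only each sentence's m most similar neighbors (illustrative heuristic)."""
    S = R @ R.T                           # cosine similarities (unit-norm rows)
    np.fill_diagonal(S, -np.inf)          # exclude self-edges
    nbrs = np.argsort(-S, axis=1)[:, :m]  # top-m neighbors per sentence
    return {(i, int(j)) for i in range(len(R)) for j in nbrs[i]}
```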


BP-Seg exemplifies the fusion of probabilistic graphical modeling and neural semantic representations for unsupervised text segmentation, offering new capabilities for semantically structured document analysis (Li et al., 22 May 2025).
