- The paper introduces S2 Chunking, a hybrid framework that combines spatial (layout) and semantic (text embedding) analysis using graph clustering for improved document segmentation.
- Results show S2 Chunking achieves higher cohesion (0.92) and layout consistency (0.88) compared to semantic-only methods (cohesion ~0.8, layout ~0.5) and other baselines on test datasets.
- This framework is suitable for large language model applications, such as retrieval-augmented generation, by producing semantically and spatially coherent chunks that respect token constraints.
The paper introduces a hybrid framework, termed S2 Chunking, for document segmentation, integrating spatial and semantic analyses to enhance the cohesion and accuracy of document chunks. The approach addresses the limitations of traditional methods that often rely solely on semantic analysis, overlooking the importance of spatial layout in understanding relationships within complex documents.
The core innovation lies in leveraging bounding box (bbox) information and text embeddings to construct a weighted graph representation of document elements, which is then clustered using spectral clustering. This method ensures that chunks are both semantically coherent and spatially consistent. The framework also incorporates a dynamic clustering mechanism that respects token length constraints, making it suitable for applications with input size limitations, such as LLMs for retrieval-augmented generation (RAG).
The paper discusses several existing document chunking methods:
- Fixed-Size Chunking: This simple method divides text into chunks of a predefined size s, without considering the content or structure. The set of chunks C is defined as:
C = {T[i⋅s : (i+1)⋅s] ∣ i = 0, 1, …, ⌊∣T∣/s⌋}
where ∣T∣ represents the total length of the text. An overlap parameter o can be introduced to create overlapping chunks:
C = {T[i⋅(s−o) : i⋅(s−o)+s] ∣ i = 0, 1, …, ⌊(∣T∣−s)/(s−o)⌋}
- C: Set of chunks
- T: Input text
- s: Predefined size
- ∣T∣: Total length of text
- i: Index
- o: Overlap parameter
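A minimal sketch of fixed-size chunking with optional overlap (the function name and character-level slicing are my own illustration, not from the paper):

```python
def fixed_size_chunks(text: str, s: int, o: int = 0) -> list[str]:
    """Split text into chunks of size s; consecutive chunks overlap by o characters."""
    if o >= s:
        raise ValueError("overlap o must be smaller than chunk size s")
    step = s - o  # stride between chunk start positions
    return [text[i:i + s] for i in range(0, len(text), step)]
```

With `o = 0` this reduces to the non-overlapping definition above; with `o > 0` each chunk repeats the last `o` characters of its predecessor.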
- Recursive Chunking: This method divides text hierarchically using a set of separators S={s1,s2,…,sn}. The recursive chunking process is defined as:
C=RecursiveSplit(T,S)
where:
RecursiveSplit(T,S) = {T}                               if ∣T∣ ≤ s
RecursiveSplit(T,S) = ⋃si∈S RecursiveSplit(Tk,S)        otherwise
- C: Set of chunks
- T: Input text
- S: Set of separators
- si: Separator
- Tk: Substrings obtained by splitting T using the separator si
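The recursive definition above can be sketched as follows (separator ordering and the hard-split fallback for text with no applicable separator are my own assumptions):

```python
def recursive_split(text: str, separators: list[str], max_size: int) -> list[str]:
    """Recursively split text on the first applicable separator until
    every piece is at most max_size characters long."""
    if len(text) <= max_size:
        return [text]
    for sep in separators:
        if sep in text:
            parts = [p for p in text.split(sep) if p]
            return [chunk for part in parts
                    for chunk in recursive_split(part, separators, max_size)]
    # No separator applies: fall back to a hard character split
    return [text[i:i + max_size] for i in range(0, len(text), max_size)]
```

Separators are typically ordered from coarse to fine (e.g. paragraph breaks before sentence breaks before spaces), so the hierarchy of the document is respected.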
- Semantic Chunking: This method uses text embeddings to group semantically related content. The similarity between two embeddings ei and ej is computed using cosine similarity:
sim(ei,ej) = (ei⋅ej) / (∥ei∥ ∥ej∥)
The chunking process is defined as:
C={Tk∣sim(E(Tk),E(Tk+1))≥τ}
- ei: Embedding
- ej: Embedding
- sim(ei,ej): Similarity between embeddings ei and ej
- E: Embedding function
- Tk: Text segment
- τ: Similarity threshold
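A toy sketch of semantic chunking: consecutive segments are merged while their embedding similarity stays at or above τ. The bag-of-words "embedding" below is a stand-in for a real sentence encoder, and the greedy merge loop is my own reading of the set definition above, not the paper's implementation:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words embedding (a placeholder for a real encoder E)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity sim(ei, ej) = (ei . ej) / (|ei| |ej|)."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(segments: list[str], tau: float) -> list[list[str]]:
    """Group consecutive segments whose pairwise similarity is >= tau."""
    chunks = [[segments[0]]]
    for prev, cur in zip(segments, segments[1:]):
        if cosine(embed(prev), embed(cur)) >= tau:
            chunks[-1].append(cur)
        else:
            chunks.append([cur])
    return chunks
```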
The methodology involves region detection and layout ordering, followed by graph construction, weight calculation, and clustering. The document is represented as a graph G=(V,E), where V is the set of nodes corresponding to document elements, and E is the set of edges representing relationships between these elements. Edge weights are calculated using a combination of spatial and semantic information. Spatial weights are calculated using the Euclidean distance between bounding box centroids:
wspatial(i,j) = 1 / (1 + d(i,j))
- wspatial(i,j): Spatial weight between elements i and j
- d(i,j): Distance between centroids of elements i and j
Semantic weights are computed using text embeddings from a pre-trained LLM:
wsemantic(i,j)=cosine_similarity(embedding(i),embedding(j))
- wsemantic(i,j): Semantic weight between elements i and j
The final edge weights are the average of spatial and semantic weights:
wcombined(i,j) = (wspatial(i,j) + wsemantic(i,j)) / 2
- wcombined(i,j): Combined weight between elements i and j
The graph is then partitioned into cohesive chunks using spectral clustering.
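The weight construction and clustering steps can be sketched with NumPy. Note the simplifications: the sign of the Fiedler vector gives only a two-way cut (the paper uses full k-way spectral clustering with a dynamic, token-aware choice of k), and the centroid/embedding inputs are illustrative:

```python
import numpy as np

def spatial_weight(c1, c2):
    """w_spatial(i,j) = 1 / (1 + Euclidean distance between bbox centroids)."""
    return 1.0 / (1.0 + float(np.linalg.norm(np.asarray(c1) - np.asarray(c2))))

def combined_weights(centroids, embeddings):
    """Average of spatial and semantic (cosine) weights for every element pair."""
    E = np.asarray(embeddings, dtype=float)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize rows
    semantic = E @ E.T                                # cosine similarity matrix
    n = len(centroids)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                W[i, j] = 0.5 * (spatial_weight(centroids[i], centroids[j])
                                 + semantic[i, j])
    return W

def spectral_bipartition(W):
    """Two-way cut from the sign of the Fiedler vector of the graph Laplacian
    (a simplified stand-in for k-way spectral clustering)."""
    D = np.diag(W.sum(axis=1))
    L = D - W                                # unnormalized graph Laplacian
    vals, vecs = np.linalg.eigh(L)           # eigenvalues in ascending order
    fiedler = vecs[:, 1]                     # eigenvector of 2nd-smallest eigenvalue
    return (fiedler >= 0).astype(int)
```

Two elements that are both near each other on the page and semantically similar get a high combined weight, so the spectral cut tends to keep them in the same chunk.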
The authors evaluated the approach on datasets from PubMed and arXiv, selected for their diversity in content, layout, and domain-specific complexity. The performance metrics include:
- Cohesion Score: Measures the semantic coherence of chunks using the average pairwise cosine similarity of text embeddings within each chunk.
- Layout Consistency Score: Measures the spatial consistency of chunks using the average pairwise proximity of bounding boxes within each chunk.
- Purity: Measures how well chunks align with ground truth categories.
- Normalized Mutual Information (NMI): Measures the agreement between chunking results and ground truth labels.
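The first two metrics can be sketched directly from their definitions (per-chunk scores; averaging across chunks and the exact proximity function are my assumptions):

```python
import numpy as np
from itertools import combinations

def cohesion_score(chunk_embeddings):
    """Average pairwise cosine similarity of the embeddings inside one chunk."""
    E = np.asarray(chunk_embeddings, dtype=float)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize rows
    sims = [float(E[i] @ E[j]) for i, j in combinations(range(len(E)), 2)]
    return sum(sims) / len(sims)

def layout_consistency_score(chunk_centroids):
    """Average pairwise proximity 1/(1+d) of bbox centroids inside one chunk."""
    C = np.asarray(chunk_centroids, dtype=float)
    prox = [1.0 / (1.0 + float(np.linalg.norm(C[i] - C[j])))
            for i, j in combinations(range(len(C)), 2)]
    return sum(prox) / len(prox)
```

A chunk of near-identical embeddings scores close to 1 on cohesion; a chunk whose elements are scattered across the page scores low on layout consistency.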
The S2 Chunking method achieved a cohesion score of 0.85 and a layout consistency score of 0.82 on the PubMed dataset, and a cohesion score of 0.88 and a layout consistency score of 0.85 on the arXiv dataset, outperforming baseline methods such as fixed-size, recursive, and semantic chunking. For instance, semantic chunking achieved high cohesion scores (0.80 and 0.82) but much lower layout consistency scores (0.50 and 0.55), highlighting the advantage of integrating spatial information. A table in the paper reports that S2 Chunking achieves a Cohesion Score of 0.92, a Layout Consistency Score of 0.88, a Purity of 0.96, and an NMI of 0.93, all higher than those of the comparison methods.