- The paper introduces transformer-based architectures for text segmentation, demonstrating that local context can effectively identify semantic shifts.
- It details three models—Cross-Segment BERT, BERT+Bi-LSTM, and Hierarchical BERT—each leveraging different strategies for encoding context.
- Evaluation on standard datasets shows these models outperform previous methods, balancing performance with reduced computational load.
Overview
The paper "Text Segmentation by Cross Segment Attention" presents three transformer-based architectures for text segmentation, addressing both document and discourse segmentation tasks. These models achieve significant performance improvements over existing methods across standard datasets while maintaining fewer parameters, offering potential practical benefits.
Architectures
Cross-Segment BERT
The Cross-Segment BERT model leverages only the local context around a candidate segment break. It packs the token sequences on either side of the potential break into a single BERT input, relying on BERT's token and positional embeddings, and classifies whether a break occurs. Despite its simplicity, this method achieves surprisingly competitive results, suggesting that semantic shifts can often be detected from local context alone (a minimal sketch follows Figure 1).


Figure 1: The three proposed segmentation models, illustrated on the document segmentation task.
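A minimal sketch of this setup, assuming a Hugging Face `bert-base-uncased` encoder, a 128-token context budget, and a two-way break/no-break head (all illustrative choices, not the paper's exact configuration):

```python
import torch
from torch import nn
from transformers import BertTokenizerFast, BertModel

class CrossSegmentBreakClassifier(nn.Module):
    """Sketch: score whether a segment break occurs between two local contexts."""

    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.tokenizer = BertTokenizerFast.from_pretrained(model_name)
        self.encoder = BertModel.from_pretrained(model_name)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, 2)  # no-break / break

    def forward(self, left_context: str, right_context: str):
        # Pack the two contexts as a sentence pair: [CLS] left [SEP] right [SEP]
        inputs = self.tokenizer(
            left_context,
            right_context,
            truncation=True,
            max_length=128,          # assumed local-context budget
            return_tensors="pt",
        )
        outputs = self.encoder(**inputs)
        cls_vec = outputs.last_hidden_state[:, 0]   # [CLS] token representation
        return self.classifier(cls_vec)             # logits over {no-break, break}


model = CrossSegmentBreakClassifier()
logits = model("end of the previous paragraph.", "A new topic begins here.")
print(logits.softmax(dim=-1))
```

The key point the sketch illustrates is that the encoder only ever sees the tokens immediately surrounding the candidate break, packed as a standard sentence pair.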
BERT+Bi-LSTM
This model combines BERT with a Bi-LSTM layer: each sentence is encoded independently by BERT, and the resulting sentence representations are processed by a Bi-LSTM over the document. This two-level design captures dependencies across sentences while keeping encoding efficient, since BERT only processes one sentence at a time.
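A rough sketch of this two-level design, again with illustrative model and size choices rather than the paper's exact ones:

```python
import torch
from torch import nn
from transformers import BertTokenizerFast, BertModel

class BertBiLSTMSegmenter(nn.Module):
    """Sketch: per-sentence BERT encodings fed to a Bi-LSTM over the document."""

    def __init__(self, model_name="bert-base-uncased", lstm_hidden=256):
        super().__init__()
        self.tokenizer = BertTokenizerFast.from_pretrained(model_name)
        self.sentence_encoder = BertModel.from_pretrained(model_name)
        self.bilstm = nn.LSTM(
            input_size=self.sentence_encoder.config.hidden_size,
            hidden_size=lstm_hidden,
            bidirectional=True,
            batch_first=True,
        )
        self.classifier = nn.Linear(2 * lstm_hidden, 2)  # break after sentence: yes / no

    def forward(self, sentences: list[str]):
        # Each sentence is encoded independently; only its [CLS] vector is kept.
        batch = self.tokenizer(sentences, padding=True, truncation=True,
                               max_length=64, return_tensors="pt")
        cls_vectors = self.sentence_encoder(**batch).last_hidden_state[:, 0]
        # The Bi-LSTM then models the sequence of sentence vectors within the document.
        seq, _ = self.bilstm(cls_vectors.unsqueeze(0))  # (1, num_sentences, 2*hidden)
        return self.classifier(seq.squeeze(0))          # per-sentence break logits
```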
Hierarchical BERT
Hierarchical BERT adds a document-level encoder on top of BERT encodings of individual sentences, so that every segmentation decision can draw on the full document context and remain coherent across the document.
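A simplified sketch using a generic `nn.TransformerEncoder` as the document-level layer; the paper's hierarchical model is larger and trained differently, so this only shows the shape of the architecture:

```python
import torch
from torch import nn
from transformers import BertTokenizerFast, BertModel

class HierarchicalSegmenter(nn.Module):
    """Sketch: sentence-level BERT plus a document-level transformer encoder."""

    def __init__(self, model_name="bert-base-uncased", doc_layers=2):
        super().__init__()
        self.tokenizer = BertTokenizerFast.from_pretrained(model_name)
        self.sentence_encoder = BertModel.from_pretrained(model_name)
        hidden = self.sentence_encoder.config.hidden_size
        doc_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.document_encoder = nn.TransformerEncoder(doc_layer, num_layers=doc_layers)
        self.classifier = nn.Linear(hidden, 2)

    def forward(self, sentences: list[str]):
        batch = self.tokenizer(sentences, padding=True, truncation=True,
                               max_length=64, return_tensors="pt")
        sent_vecs = self.sentence_encoder(**batch).last_hidden_state[:, 0]
        # Document-level self-attention lets every sentence attend to every other.
        doc_vecs = self.document_encoder(sent_vecs.unsqueeze(0))
        return self.classifier(doc_vecs.squeeze(0))  # per-sentence break logits
```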
Evaluation and Results
The proposed models were evaluated on the Wiki-727K, Choi, and RST-DT datasets, where they outperformed previous state-of-the-art approaches. Notably, Cross-Segment BERT achieved strong results from local context alone, challenging the assumption that full document context is necessary for segmentation.
| Model | Wiki-727K (F1) | RST-DT (F1) | Choi (Pk, lower is better) |
| --- | --- | --- | --- |
| Cross-Segment BERT | 66.0 | 95.0 | 0.07 |
| BERT+Bi-LSTM | 59.9 | 95.2 | 0.17 |
| Hierarchical BERT | 66.5 | 95.2 | 0.38 |
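For reference, Pk is a sliding-window penalty metric where lower values are better; NLTK ships an implementation. A toy example with made-up boundary strings:

```python
from nltk.metrics.segmentation import pk

# '1' marks a sentence that ends a segment, '0' marks one that does not.
reference  = "0001000100"   # gold boundaries (toy example, not real data)
hypothesis = "0001010000"   # predicted boundaries

# Pk slides a window (by default, half the mean segment length) over the text
# and counts disagreements about whether the window spans a boundary.
print(pk(reference, hypothesis))   # lower is better; 0.0 means a perfect match
```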
Analysis and Ablation Studies
Context Length
Further analysis revealed the significance of context size for Cross-Segment BERT. Increasing the context length improved performance, with diminishing returns beyond a certain point. An ablation study isolated the impact of leading versus trailing context, confirming the necessity of viewing context on both sides of a candidate break.
Figure 2: Analysis of the importance of the right context length (solid red line). Dashed blue line denotes the hierarchical Bi-LSTM baseline encoding the full context.
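To make the ablation concrete, varying the context amounts to changing how many words on each side of the candidate break are kept before tokenization. A hedged sketch, with assumed word budgets and a `bert-base-uncased` tokenizer:

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def build_cross_segment_input(left_words, right_words, n_left=128, n_right=64):
    """Keep only the last n_left words before the candidate break and the
    first n_right words after it; the budgets are assumptions for this sketch."""
    left = " ".join(left_words[-n_left:])
    right = " ".join(right_words[:n_right])
    # Sentence-pair packing: [CLS] left [SEP] right [SEP]
    return tokenizer(left, right, truncation=True, return_tensors="pt")
```

Sweeping `n_right` (or `n_left`) while holding the other side fixed reproduces the kind of context-length analysis shown in Figure 2.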
Model Size
Comparisons across model configurations highlighted a trade-off between model size and performance. Distillation showed promise for reducing model size without a significant loss of accuracy, improving computational efficiency and making the models more practical to deploy.
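As a generic illustration of the mechanism (not necessarily the paper's exact recipe), the standard soft-label distillation objective trains a smaller student to match the temperature-softened predictions of a larger teacher while still fitting the gold break labels; the hyperparameters below are assumptions:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Standard soft-label distillation objective (a sketch, not the paper's recipe)."""
    # Soft targets: KL divergence between temperature-softened distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy on the gold break labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```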
Conclusion
The paper introduces robust transformer-based models for text segmentation, with Cross-Segment BERT showing that local context alone can be leveraged effectively. The findings suggest that simpler models suffice for certain segmentation tasks and encourage future work to weigh context size against computational constraints.