RST-LoRA: Lightweight EDU Segmentation

Updated 22 May 2026

RST-LoRA Architecture, as covered on this page, is a lightweight discourse segmentation system (called ESURF in the source paper) that identifies Elementary Discourse Units (EDUs), the minimal units from which RST trees are built.
It uses a lexically enhanced random forest classifier over a nine-token context window, combining lexical and character n-gram features to mark EDU boundaries.
The method reports high precision and recall on RST-DT and CNN/DailyMail, outperforming BERT- and XLNet-based segmenters while remaining computationally cheap.

Elementary Discourse Units (EDUs) are the minimal discourse segments—typically clauses or clause-like spans—that serve as leaf nodes in Rhetorical Structure Theory (RST) discourse trees. They form the atomic units over which discourse relations (e.g., Contrast, Elaboration) are defined, with accurate segmentation being a critical prerequisite for reliable discourse parsing and downstream applications such as information extraction, summarization, and argument mining. Automatic and efficient identification of EDU boundaries has traditionally required complex linguistic features or heavy neural architectures, motivating interest in lightweight yet robust alternatives.

1. Formal Definition and Role of EDUs in Discourse Structure

An Elementary Discourse Unit (EDU) is defined as the minimal contiguous text span—most often a clause or clause-like segment—that serves as a leaf in an RST discourse tree. In the RST formalism, a document is a hierarchical, binary tree where each leaf corresponds to an EDU, and internal nodes represent spans connected by rhetorical relations. Each relation assigns one segment as the nucleus (central content) and the other as the satellite (supporting material). The segmentation of a document into EDUs is a precursor to all subsequent discourse parsing decisions; errors in segmentation propagate to higher levels, degrading both relation classification and tree structure prediction. Therefore, the accuracy and efficiency of EDU boundary detection directly constrain the upper bound on overall discourse analysis performance (Sediqin et al., 13 Jan 2025).
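The tree structure described above can be sketched as a small data type. This is only an illustration: the class name `RSTNode`, its fields, and the example sentence and relation label are hypothetical, and multinuclear relations are left out for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RSTNode:
    """A node in a binary RST tree (hypothetical illustrative type)."""
    text: Optional[str] = None            # set only on leaves (EDUs)
    relation: Optional[str] = None        # set only on internal nodes
    nucleus: Optional["RSTNode"] = None   # central content
    satellite: Optional["RSTNode"] = None # supporting material
    nucleus_first: bool = True            # does the nucleus precede the satellite in the text?

    def is_leaf(self) -> bool:
        return self.text is not None

    def edus(self) -> List[str]:
        """Return the EDUs (leaves) under this node in text order."""
        if self.is_leaf():
            return [self.text]
        if self.nucleus_first:
            first, second = self.nucleus, self.satellite
        else:
            first, second = self.satellite, self.nucleus
        return first.edus() + second.edus()


# Illustrative two-EDU tree; the relation label is an assumption.
tree = RSTNode(
    relation="Contrast",
    nucleus=RSTNode(text="it segments accurately."),
    satellite=RSTNode(text="Although the model is small,"),
    nucleus_first=False,
)
print(tree.edus())
```

Because every leaf is an EDU, a wrong boundary changes the set of leaves and therefore every span and relation built on top of them.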

2. The ESURF Methodology: Lexicalized Random Forest Segmentation

ESURF (EDU Segmentation Using Random Forests) operationalizes EDU boundary detection as a local binary classification problem at each candidate boundary location (before every token $t_i$). For each $t_i$, the model considers a nine-token local window spanning tokens $(t_{i-3}, \ldots, t_{i+5})$. These are positionally labeled as “Before,” “Leading,” and “Continuing” to encode local context.
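A minimal sketch of the window extraction follows. The padding symbol and the mapping of roles to positions (three "Before" tokens, $t_i$ as "Leading", five "Continuing" tokens) are assumptions made for illustration; the source names the roles but the text here does not spell out which positions they cover.

```python
PAD = "<pad>"  # hypothetical padding symbol for positions outside the text


def context_window(tokens, i, left=3, right=5):
    """Return the tokens t_{i-3}..t_{i+5} around candidate boundary i, padded at the edges."""
    return [
        tokens[j] if 0 <= j < len(tokens) else PAD
        for j in range(i - left, i + right + 1)
    ]


def positional_roles(left=3, right=5):
    """Assumed role for each window slot, in the same order as context_window."""
    return ["Before"] * left + ["Leading"] + ["Continuing"] * right


tokens = "The model is small , but it works well .".split()
window = context_window(tokens, 5)  # candidate boundary before "but"
for role, tok in zip(positional_roles(), window):
    print(f"{role:<10} {tok}")
```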

The feature representation concatenates two components:

Lexical features: For each token in the window, the surface form is embedded (via one-hot or low-dimensional hashing), forming a position-aware context vector $L_i$.
Character n-gram features: For each token, character n-grams of length 2–4 are extracted. Only informative n-grams (those occurring in multiple but not all documents) are retained and encoded as position-specific binary indicators $C_i$.
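The character n-gram component can be sketched as below. Treating "multiple" as "at least two documents" is an assumption, and the function names are hypothetical; the source does not give the exact thresholds here.

```python
from collections import Counter


def char_ngrams(token, n_min=2, n_max=4):
    """All character n-grams of length n_min..n_max contained in a token."""
    return {
        token[k:k + n]
        for n in range(n_min, n_max + 1)
        for k in range(len(token) - n + 1)
    }


def informative_ngrams(documents, min_docs=2):
    """Keep n-grams seen in at least min_docs documents but not in every document."""
    doc_freq = Counter()
    for doc in documents:
        seen = set()
        for tok in doc:
            seen |= char_ngrams(tok)
        doc_freq.update(seen)  # count each n-gram once per document
    return {g for g, c in doc_freq.items() if min_docs <= c < len(documents)}


def position_specific_indicators(window, vocab):
    """Active binary features as (window position, n-gram) pairs."""
    return {
        (pos, g)
        for pos, tok in enumerate(window)
        for g in char_ngrams(tok)
        if g in vocab
    }


docs = [["running", "fast"], ["running", "slow"], ["walking", "slow"]]
vocab = informative_ngrams(docs)
print(sorted(position_specific_indicators(["slow", "run"], vocab)))
```

In this toy corpus "ing" appears in every document and "fast" in only one, so both are filtered out, while "run" and "ow" survive.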

The input vector for boundary candidate $i$ is $x_i = [L_i ; C_i] \in \{0,1\}^d$, where $d$ is the total number of feature dimensions. The random forest (RF) consists of $T$ trees $h_1, \ldots, h_T$ trained to predict the label $y_i \in \{0,1\}$ (1 if there is an EDU boundary before $t_i$, 0 otherwise) using the Gini impurity criterion for splits. Inference uses majority voting across trees for the final prediction: $\hat{y}_i = \arg\max_{c \in \{0,1\}} \sum_{k=1}^{T} \mathbb{1}[h_k(x_i) = c]$. The source recommends specific values for the number of trees $T$, the maximum depth, and the minimum samples per leaf, with Gini impurity as the split criterion (Sediqin et al., 13 Jan 2025).
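The two ingredients named above, Gini impurity for choosing splits and majority voting at inference, can be written out in a few lines. This is a plain-Python sketch of the standard definitions, not the authors' implementation; breaking ties toward "no boundary" is an assumption.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum_c p_c^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(left, right):
    """Size-weighted Gini impurity of a candidate split into two child nodes."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


def majority_vote(tree_predictions):
    """Return 1 if strictly more than half of the trees predict a boundary, else 0."""
    return int(2 * sum(tree_predictions) > len(tree_predictions))


print(gini_impurity([0, 0, 1, 1]))   # maximally mixed binary node
print(split_gini([0, 0], [1, 1]))    # a split that separates the classes perfectly
print(majority_vote([1, 1, 0]))      # two of three trees vote "boundary"
```

A tree prefers the split with the lowest weighted impurity, and the forest's final label is whichever class most trees predict.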

3. Empirical Performance and Benchmark Evaluation

ESURF was evaluated on two primary datasets:

RST Discourse Treebank (RST-DT): 347 training and 38 test articles; 10% of training held out for development.
CNN/DailyMail: $\sim$300,000 news articles; balanced sampling ensures 50% positive and 50% negative boundary windows for training/evaluation.
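One simple way to obtain a 50/50 split of boundary and non-boundary windows is to downsample the majority class. The sketch below assumes that strategy and a hypothetical `(features, label)` tuple layout; the source states only that sampling is balanced, not how.

```python
import random


def balance_windows(examples, seed=0):
    """Downsample the larger class so positives and negatives are equal in number.

    `examples` is a list of (features, label) pairs with label in {0, 1}.
    """
    rng = random.Random(seed)
    positives = [e for e in examples if e[1] == 1]
    negatives = [e for e in examples if e[1] == 0]
    k = min(len(positives), len(negatives))
    balanced = rng.sample(positives, k) + rng.sample(negatives, k)
    rng.shuffle(balanced)
    return balanced


# Toy data: 4 boundary windows and 16 non-boundary windows.
examples = [(i, 1 if i % 5 == 0 else 0) for i in range(20)]
balanced = balance_windows(examples)
print(len(balanced), sum(label for _, label in balanced))
```

Most token positions are not EDU boundaries, so without some balancing a classifier can score well on accuracy while missing most boundaries.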

Performance metrics include precision (P), recall (R), and $F_1$, computed as usual: $P = \frac{TP}{TP + FP}$, $R = \frac{TP}{TP + FN}$, $F_1 = \frac{2PR}{P + R}$. Key results:

On CNN/DailyMail:
- ESURF: Accuracy = 0.912, P = 0.894, R = 0.921, $F_1$ = …
- BERT-based segmenter: P = 0.888, …
- XLNet: …
On RST-DT:
- ESURF: Accuracy = 0.950, P = 0.935, R = 0.979, $F_1$ = 0.958
- Best prior: …

Thus, ESURF achieves a +0.3-point absolute $F_1$ improvement on RST-DT over the best prior method (Sediqin et al., 13 Jan 2025).
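For reference, the standard metric definitions can be computed as follows. The helper names are hypothetical, and the confusion counts in the first call are made up for illustration; only the P and R values in the second call come from the results listed above.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from true-positive, false-positive, false-negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def f1_from_pr(p, r):
    """F1 as the harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


# Illustrative counts (not from the paper): 8 correct boundaries, 2 spurious, 2 missed.
print(precision_recall_f1(tp=8, fp=2, fn=2))

# Harmonic mean of the CNN/DailyMail precision and recall reported for ESURF.
print(round(f1_from_pr(0.894, 0.921), 3))
```

The second call gives roughly 0.907, the $F_1$ implied by the stated precision and recall; a published figure may differ slightly because of rounding or averaging choices.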

When integrated into a state-of-the-art shift-reduce neural RST parser (Yu et al., 2022), ESURF led to incremental improvements in overall parsing metrics, including a +1.3 $F_1$ gain on relation prediction.

4. Feature Ablation and Linguistic Analysis

Ablation experiments demonstrate:

Removal of character n-grams: Reduces RST-DT $F_1$ from 0.958 to $\approx$0.953, a 0.5-point drop attributed to the loss of morphological boundary cues (especially for recall at inflection-marked boundaries).
Removal of lexical context: Causes a catastrophic $F_1$ collapse of … points, confirming that token identity in local context is essential for accurate segmentation.

The effectiveness of the simple feature set is attributed to:

Local lexical cues (discourse connectives: e.g., "however", "because")
Morphological markers (punctuation, verb suffixes)

The nine-token window empirically balances richness of local context with signal clarity, enabling the RF classifier to model both frequent and rare boundary patterns without overfitting or diluting informative features (Sediqin et al., 13 Jan 2025).

5. Limitations and Prospective Extensions

Identified limitations include:

Evaluation restricted to RST-DT and CNN/DailyMail; no current validation on PDTB or cross-genre datasets (e.g., GUM).
Gold-standard sentence segmentation is assumed; actual pipeline robustness against automatic sentence splitting is untested.
Hyperparameter and feature tuning is manual; future work could use Bayesian optimization.
Extension to semi-supervised learning on unlabeled corpora (e.g., GUM) is envisaged to develop lightweight, adaptable EDU segmenters for low-resource languages and reduce dependence on labor-intensive treebanking.

A plausible implication is that the ESURF approach, which exploits lightweight lexical and morphological features, is not only statistically robust but also computationally efficient, making it suitable for large-scale or low-resource discourse processing contexts (Sediqin et al., 13 Jan 2025).

6. Contextual Significance within Discourse Segmentation Research

ESURF signifies a methodological shift toward minimalist, training-efficient discourse segmentation architectures, contrasting with both heavyweight neural approaches and linguistically intensive rule-based systems. Its demonstrated ability to outperform deep transformer baselines such as BERT and XLNet underlines the centrality of local lexical and morphological cues in discourse segmentation—a finding consistent with broader linguistic evidence across languages and genres. The modular, language-independent nature of the feature set encourages practical adaptation to new domains, including news, biomedical, and low-resource languages. The method also provides a strong baseline for integration into more sophisticated discourse parsing or downstream applications, such as automatic summarization, question answering, and argument mining, where fine-grained and accurate EDU segmentation is foundational (Sediqin et al., 13 Jan 2025).

References (1)

Sediqin et al. (2025). ESURF: Simple and Effective EDU Segmentation.
