RST-LoRA: Lightweight EDU Segmentation

Updated 22 May 2026
  • RST-LoRA Architecture is a lightweight discourse segmentation system that defines EDUs as minimal discourse units essential for constructing RST trees.
  • It utilizes a lexically-enhanced random forest classifier with a nine-token context window to accurately delineate EDU boundaries through combined lexical and character features.
  • The method achieves high precision and recall on standard datasets, outperforming dense neural architectures while maintaining computational efficiency.

Elementary Discourse Units (EDUs) are the minimal discourse segments—typically clauses or clause-like spans—that serve as leaf nodes in Rhetorical Structure Theory (RST) discourse trees. They form the atomic units over which discourse relations (e.g., Contrast, Elaboration) are defined, with accurate segmentation being a critical prerequisite for reliable discourse parsing and downstream applications such as information extraction, summarization, and argument mining. Automatic and efficient identification of EDU boundaries has traditionally required complex linguistic features or heavy neural architectures, motivating interest in lightweight yet robust alternatives.

1. Formal Definition and Role of EDUs in Discourse Structure

An Elementary Discourse Unit (EDU) is defined as the minimal contiguous text span—most often a clause or clause-like segment—that serves as a leaf in an RST discourse tree. In the RST formalism, a document is a hierarchical, binary tree where each leaf corresponds to an EDU, and internal nodes represent spans connected by rhetorical relations. Each relation assigns one segment as the nucleus (central content) and the other as the satellite (supporting material). The segmentation of a document into EDUs is a precursor to all subsequent discourse parsing decisions; errors in segmentation propagate to higher levels, degrading both relation classification and tree structure prediction. Therefore, the accuracy and efficiency of EDU boundary detection directly constrain the upper bound on overall discourse analysis performance (Sediqin et al., 13 Jan 2025).
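The tree structure described above can be sketched as a small data type. This is an illustrative sketch only: the class name `RSTNode`, its fields, and the `"NS"`/`"SN"` nuclearity codes are assumptions introduced here, not part of the cited work.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RSTNode:
    """Node of a binary RST tree (hypothetical structure for illustration)."""
    text: Optional[str] = None        # set only on leaves, i.e. EDUs
    relation: Optional[str] = None    # e.g. "Elaboration"; set only on internal nodes
    nuclearity: Optional[str] = None  # assumed codes: "NS" = left child is nucleus, "SN" = right child is
    left: Optional["RSTNode"] = None
    right: Optional["RSTNode"] = None

    def is_leaf(self) -> bool:
        return self.text is not None

    def edus(self) -> List[str]:
        """Return the EDUs under this node in textual (left-to-right) order."""
        if self.is_leaf():
            return [self.text]
        return self.left.edus() + self.right.edus()


# Two EDUs joined by one relation; the first span is the nucleus.
tree = RSTNode(
    relation="Elaboration",
    nuclearity="NS",
    left=RSTNode(text="The firm cut prices,"),
    right=RSTNode(text="which boosted sales."),
)
leaves = tree.edus()
```

Because every leaf is an EDU, a wrong boundary changes the leaf set itself, which is why segmentation errors cannot be repaired by later relation or structure decisions.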

2. The ESURF Methodology: Lexicalized Random Forest Segmentation

ESURF (EDU Segmentation Using Random Forests) operationalizes EDU boundary detection as a local binary classification problem at each candidate boundary location (before every token $t_i$). For each $t_i$, the model considers a nine-token local window spanning tokens $(t_{i-3}, \dots, t_{i+5})$. These are positionally labeled as “Before,” “Leading,” and “Continuing” to encode local context.
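A minimal sketch of the window extraction follows. The padding symbol, the function names, and in particular the mapping of the three positional labels to window slots are assumptions made for illustration; the text names the labels but does not spell out which positions each one covers.

```python
PAD = "<PAD>"  # hypothetical padding symbol for positions outside the text


def context_window(tokens, i, before=3, after=5):
    """Return the nine tokens t_{i-3} .. t_{i+5} around candidate boundary i.

    Positions that fall outside the token list are filled with PAD.
    """
    return [
        tokens[j] if 0 <= j < len(tokens) else PAD
        for j in range(i - before, i + after + 1)
    ]


def position_roles(before=3, after=5):
    """Assumed role per window slot: tokens before t_i, t_i itself, tokens after."""
    return ["Before"] * before + ["Leading"] + ["Continuing"] * after


tokens = ["He", "left", "because", "it", "was", "raining", "."]
window = context_window(tokens, 2)  # candidate boundary before "because"
roles = position_roles()
```

Here `window` always has 3 + 1 + 5 = 9 entries, so every candidate boundary yields a feature vector with the same layout regardless of where it sits in the text.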

The feature representation concatenates two components:

  • Lexical features: For each token in the window, the surface form is embedded (via one-hot or low-dimensional hashing), forming a position-aware context vector $L_i$.
  • Character n-gram features: For each token, character n-grams of length 2–4 are extracted. Only informative n-grams (those occurring in multiple but not all documents) are retained and encoded as position-specific binary indicators $C_i$.
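The two feature families can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: "multiple but not all documents" is read as a document frequency of at least 2 and below the corpus size, the hashing variant is used for lexical features, and the bucket count, feature-name scheme, and function names are hypothetical.

```python
import zlib
from collections import Counter


def char_ngrams(token, n_min=2, n_max=4):
    """All character n-grams of length n_min..n_max in a token."""
    return {
        token[k:k + n]
        for n in range(n_min, n_max + 1)
        for k in range(len(token) - n + 1)
    }


def informative_ngrams(documents):
    """Keep n-grams seen in at least two documents but not in every document.

    documents: list of token lists.
    """
    doc_freq = Counter()
    for doc in documents:
        seen = set()
        for token in doc:
            seen |= char_ngrams(token)
        doc_freq.update(seen)  # count each n-gram once per document
    n_docs = len(documents)
    return {g for g, c in doc_freq.items() if 2 <= c < n_docs}


def window_features(window, kept_ngrams, n_buckets=1024):
    """Return the active (value 1) feature names of the binary vector [L_i ; C_i]."""
    active = set()
    for pos, token in enumerate(window):
        # Lexical part: hash the surface form into a position-specific bucket.
        bucket = zlib.crc32(token.encode("utf-8")) % n_buckets
        active.add(f"lex|{pos}|{bucket}")
        # Character part: position-specific indicator per retained n-gram.
        for gram in char_ngrams(token):
            if gram in kept_ngrams:
                active.add(f"chr|{pos}|{gram}")
    return active


docs = [["walking"], ["talking"], ["king"]]
kept = informative_ngrams(docs)
feats = window_features(["walking"], kept)
```

`zlib.crc32` is used instead of the built-in `hash` because string hashes in Python vary between processes by default, which would make the feature indices unstable across runs.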

The input vector for boundary candidate $i$ is $x_i = [L_i ; C_i] \in \{0,1\}^d$, where $d$ is the total number of feature dimensions. The random forest (RF) consists of $T$ trees trained to predict the label $y_i \in \{0,1\}$ (1 if there is an EDU boundary before $t_i$, 0 otherwise) using the Gini impurity criterion for splits. Inference uses majority voting across trees for the final prediction:

$$\hat{y}_i = \operatorname{mode}\{h_1(x_i), \dots, h_T(x_i)\}$$

where $h_k(x_i) \in \{0,1\}$ is the prediction of the $k$-th tree. Recommended hyperparameters cover the number of trees $T$, the maximum depth, and the minimum samples per leaf [values missing], with Gini impurity as the split criterion (Sediqin et al., 13 Jan 2025).
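The two ingredients named here, the Gini split criterion and majority voting, can be illustrated in a few lines. This is a sketch of the standard definitions rather than the authors' code; the function names are hypothetical, and breaking ties in favor of "no boundary" is an assumption.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum_c p_c^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_impurity(left_labels, right_labels):
    """Size-weighted Gini impurity of a candidate split; lower is better."""
    n = len(left_labels) + len(right_labels)
    return (
        len(left_labels) / n * gini_impurity(left_labels)
        + len(right_labels) / n * gini_impurity(right_labels)
    )


def majority_vote(tree_predictions):
    """Combine one 0/1 vote per tree; ties go to 0 (assumed tie rule)."""
    return 1 if 2 * sum(tree_predictions) > len(tree_predictions) else 0


mixed = gini_impurity([0, 0, 1, 1])          # maximally mixed binary node
pure_split = split_impurity([0, 0], [1, 1])  # split that separates the classes
decision = majority_vote([1, 0, 1])          # two of three trees vote "boundary"
```

In practice an off-the-shelf implementation would normally be used. Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than by counting hard votes, so its output can differ slightly from the strict majority rule sketched here.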

3. Empirical Performance and Benchmark Evaluation

ESURF was evaluated on two primary datasets:

  • RST Discourse Treebank (RST-DT): 347 training and 38 test articles; 10% of training held out for development.
  • CNN/DailyMail: approximately 300,000 news articles; balanced sampling ensures 50% positive and 50% negative boundary windows for training/evaluation.
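One simple way to obtain the 50/50 balance described above is to downsample the larger class. The text does not say how the balance was achieved, so this procedure, the function name, and the fixed seed are assumptions for illustration.

```python
import random


def balance_windows(examples, seed=0):
    """Downsample the larger class so both labels are equally frequent.

    examples: list of (features, label) pairs with label in {0, 1}.
    """
    positives = [e for e in examples if e[1] == 1]
    negatives = [e for e in examples if e[1] == 0]
    k = min(len(positives), len(negatives))
    rng = random.Random(seed)  # local generator keeps the sampling reproducible
    balanced = rng.sample(positives, k) + rng.sample(negatives, k)
    rng.shuffle(balanced)
    return balanced


# Toy data: 4 boundary windows and 16 non-boundary windows.
examples = [(i, 1 if i % 5 == 0 else 0) for i in range(20)]
balanced = balance_windows(examples)
```

Balancing matters because most token positions are not EDU boundaries; on the raw distribution a classifier could score high accuracy by predicting "no boundary" almost everywhere.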

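Performance metrics include precision (P), recall (R), and $F_1$, computed as usual:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2PR}{P + R}$$

where $TP$, $FP$, and $FN$ count true-positive, false-positive, and false-negative boundary predictions.

Key results: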

  • On CNN/DailyMail:
    • ESURF: Accuracy = 0.912, P = 0.894, R = 0.921, $F_1$ = [value missing]
    • BERT-based segmenter: P = 0.888, [remaining scores missing]
    • XLNet: [scores missing]
  • On RST-DT:
    • ESURF: Accuracy = 0.950, P = 0.935, R = 0.979, $F_1$ = 0.958
    • Best prior: [score missing]
    • Thus, ESURF achieves a +0.3 $F_1$ absolute improvement on RST-DT over the best prior method (Sediqin et al., 13 Jan 2025).
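The precision, recall, and F1 scores reported above follow the standard definitions over boundary decisions, which can be sketched as follows. The function name and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration.

```python
def boundary_prf1(gold, pred):
    """Precision, recall, and F1 for binary boundary labels (1 = boundary)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


gold = [1, 0, 1, 1, 0]
pred = [1, 1, 1, 0, 0]
p, r, f1 = boundary_prf1(gold, pred)  # tp = 2, fp = 1, fn = 1
```

With two true positives, one false positive, and one false negative, all three scores come out to 2/3 in this toy case.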

When integrated into a state-of-the-art shift-reduce neural RST parser (Yu et al., 2022), ESURF led to incremental improvements in overall parsing metrics, including a +1.3 $F_1$ gain on relation prediction.

4. Feature Ablation and Linguistic Analysis

Ablation experiments demonstrate:

  • Removal of character n-grams: Reduces RST-DT $F_1$ from 0.958 to approximately 0.953, a 0.5-point drop attributed to loss of morphological boundary cues (especially for recall in inflection-marked boundaries).
  • Removal of lexical context: Causes a catastrophic $F_1$ collapse [magnitude missing], confirming that token identity in local context is essential for accurate segmentation.

The effectiveness of the simple feature set is attributed to:

  • Local lexical cues (discourse connectives: e.g., "however", "because")
  • Morphological markers (punctuation, verb suffixes)

The nine-token window empirically balances richness of local context with signal clarity, enabling the RF classifier to model both frequent and rare boundary patterns without overfitting or diluting informative features (Sediqin et al., 13 Jan 2025).

5. Limitations and Prospective Extensions

Identified limitations and proposed extensions include:

  • Evaluation restricted to RST-DT and CNN/DailyMail; no current validation on PDTB or cross-genre datasets (e.g., GUM).
  • Gold-standard sentence segmentation is assumed; actual pipeline robustness against automatic sentence splitting is untested.
  • Hyperparameter and feature tuning is manual; future work could use Bayesian optimization.
  • Extension to semi-supervised learning on unlabeled corpora (e.g., GUM) is envisaged to develop lightweight, adaptable EDU segmenters for low-resource languages and reduce dependence on labor-intensive treebanking.

A plausible implication is that the ESURF approach, which exploits lightweight lexical and morphological features, is not only statistically robust but also computationally efficient, making it suitable for large-scale or low-resource discourse processing contexts (Sediqin et al., 13 Jan 2025).

6. Contextual Significance within Discourse Segmentation Research

ESURF signifies a methodological shift toward minimalist, training-efficient discourse segmentation architectures, contrasting with both heavyweight neural approaches and linguistically intensive rule-based systems. Its demonstrated ability to outperform deep transformer baselines such as BERT and XLNet underlines the centrality of local lexical and morphological cues in discourse segmentation—a finding consistent with broader linguistic evidence across languages and genres. The modular, language-independent nature of the feature set encourages practical adaptation to new domains, including news, biomedical, and low-resource languages. The method also provides a strong baseline for integration into more sophisticated discourse parsing or downstream applications, such as automatic summarization, question answering, and argument mining, where fine-grained and accurate EDU segmentation is foundational (Sediqin et al., 13 Jan 2025).

References (1)