
BeDiscovER: Benchmark of Discourse Understanding in the Era of Reasoning Language Models

Updated 24 November 2025
  • The paper introduces a benchmark that disambiguates discourse particles by framing the task as a classification problem with entropy-based polysemy measures.
  • It employs multiple modeling approaches, including zero-shot LLM prompting and supervised classifiers, to evaluate context-driven marker identification.
  • Feature-based methods combine lexical, syntactic, and genre signals to achieve nuanced, state-of-the-art performance across languages.

Discourse particle disambiguation is the process of determining, for any given occurrence of a polyfunctional particle or discourse marker (e.g., "just," "otherwise," "since," "but") in context, the specific discourse-semantic function or sense that it realizes. This task is central to computational models of coherence and discourse relation parsing, where explicit and implicit cues jointly signal fine-grained rhetorical and pragmatic relations across linguistic domains and languages.

1. Formalization and Sense Inventories

Discourse particle disambiguation is typically posed as a classification task: given an input span (most commonly a sentence or paired clauses) containing a marked particle, the goal is to assign a sense label l from a fixed function inventory L. Formally, the task is:

\hat{y} = \underset{l \in L}{\arg\max}~ p(l \mid S)

where S is the input context and L is the particle's sense inventory. High-value use cases include English adverbs such as "just," which can function as Exclusionary, Unelaboratory, Unexplanatory, Emphatic, Temporal, or Adjective, or conjunctions like "otherwise" with Consequence, Argumentation, Enumeration, and Exception functions (Li et al., 17 Nov 2025).
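The argmax formulation can be made concrete with a minimal sketch. The sense inventory for "otherwise" follows the labels cited above; the scorer is a hypothetical keyword heuristic standing in for any learned model p(l | S):

```python
# Sense inventory for "otherwise" (labels from the text above).
SENSES = ["Consequence", "Argumentation", "Enumeration", "Exception"]

def score(sense: str, context: str) -> float:
    """Hypothetical scorer: a toy cue-phrase heuristic standing in
    for a learned model's p(l | S)."""
    cues = {
        "Consequence": ["or else", "if not"],
        "Argumentation": ["because", "reason"],
        "Enumeration": ["also", "in addition"],
        "Exception": ["except", "apart from"],
    }
    return sum(context.lower().count(c) for c in cues[sense])

def disambiguate(context: str) -> str:
    """y_hat = argmax over l in L of p(l | S)."""
    return max(SENSES, key=lambda l: score(l, context))
```

Any real model (a fine-tuned classifier or a prompted LLM) slots in behind `score`; the decision rule stays the same.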

Polyfunctionality is often quantified by information-theoretic entropy: let X be the set of discourse relations associated with a given marker, with P(x_i) the empirical probability of relation x_i. The entropy

H(X) = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i)

expresses graded polysemy, with higher H indicating a more even distribution across senses (Wu et al., 22 Jul 2025).
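The entropy measure is straightforward to compute from relation counts. A stdlib-only sketch (the relation distributions below are invented for illustration):

```python
import math

def marker_entropy(counts: dict) -> float:
    """Shannon entropy H(X) = -sum P(x) log2 P(x) over a marker's
    observed discourse relations; higher H means more even polysemy."""
    total = sum(counts.values())
    probs = [c / total for c in counts.values() if c > 0]
    return -sum(p * math.log2(p) for p in probs)

# Hypothetical relation counts for two markers:
even = marker_entropy({"concession": 50, "contrast": 50})   # 1.0 bit: maximally even over 2 senses
skewed = marker_entropy({"condition": 98, "temporal": 2})   # ~0.14 bits: near-monofunctional
```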

2. Data Resources and Annotation Protocols

Specialized datasets for English include Just-Manual and Just-Subtitle corpora for "just" (manual/material vs. spoken/subtitle), and Otherwise for "otherwise," with each sense defined by a short formal gloss plus a prototype example (Li et al., 17 Nov 2025). The French Discourse Treebank (FDTB) and its source LEXCONN lexicon provide extensive coverage for French, supporting binary discrimination between DISCOURSE and NON-DISCOURSE uses (Laali et al., 2017).

For evaluation, only test portions are typically used when training or development splits are not provided, as in BeDiscovER (Li et al., 17 Nov 2025). Cross-linguistic approaches annotate marker functions within frameworks such as eRST, capturing both explicit markers and seven classes of non-DM relation signals, spanning up to forty-five subtypes in English (GUM corpus: 255 documents, sixteen genres) (Wu et al., 22 Jul 2025).

In morphologically rich languages, particle disambiguation involves differential object markers and their interaction with both semantic and discourse cues, as in Korean, where “-lul” (accusative), “-nun” (contrastive topic), and null-marking each evoke distinct implicature and exhaustivity profiles (Shin et al., 15 May 2024).

3. Modeling Approaches

Zero-Shot LLM Prompting

BeDiscovER evaluates LLMs via multiple-choice prompts, using different prompt styles:

  • Basic: sentence + question + label list
  • Def: adds a one-sentence definition per label
  • Def+Exp: further includes one labeled example per sense.

Open-source reasoning models (Qwen3 family, DeepSeek-R1) and proprietary models (GPT-5-mini) are tested in both "without reasoning" and "with reasoning" (chain-of-thought, extended token budget, multiple random seeds) modes. No dedicated architectures are introduced; all models operate in a zero-shot classification setting (Li et al., 17 Nov 2025).
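The three prompt styles can be sketched as a single template builder. The label glosses and examples below are illustrative placeholders, not the benchmark's actual annotation guidelines:

```python
def build_prompt(sentence, labels, defs=None, examples=None):
    """Basic: sentence + question + label list.
    Def: adds a one-sentence definition per label.
    Def+Exp: further adds one labeled example per sense."""
    lines = [f'Sentence: "{sentence}"',
             'Question: Which function does the marked particle serve here?']
    for lab in labels:
        entry = f"- {lab}"
        if defs:                       # "Def" style
            entry += f": {defs[lab]}"
        if examples:                   # "Def+Exp" style
            entry += f' (e.g., "{examples[lab]}")'
        lines.append(entry)
    return "\n".join(lines)

labels = ["Emphatic", "Temporal"]
defs = {"Emphatic": "intensifies the statement",
        "Temporal": "means 'a moment ago'"}
exps = {"Emphatic": "It's just wonderful.",
        "Temporal": "She just left."}

basic = build_prompt("He just arrived.", labels)               # Basic
full = build_prompt("He just arrived.", labels, defs, exps)    # Def+Exp
```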

Supervised and Latent-Variable Models

Maximum entropy classifiers over local lexical/syntactic features—connective string, sentence position, parse-tree context (SelfCat, parent, siblings)—achieve strong results in French (FDTB: 94.2% accuracy) and are generalizable to other languages (Laali et al., 2017). Feature ablation pinpoints the unique contribution of each context type.
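The local feature template behind such classifiers can be sketched as a feature-dict extractor; the flat argument signature here is a simplification (a real system reads SelfCat, parent, and siblings off the parse tree):

```python
def extract_features(connective, sent_position, self_cat, parent_cat, sibling_cats):
    """Binary feature dict for a MaxEnt DISCOURSE vs. NON-DISCOURSE classifier:
    connective string, sentence position, and local parse-tree context."""
    feats = {
        f"Conn={connective.lower()}": 1,
        f"Position={sent_position}": 1,     # e.g. "initial", "medial", "final"
        f"SelfCat={self_cat}": 1,           # syntactic category of the connective's node
        f"Parent={parent_cat}": 1,
    }
    for sib in sibling_cats:
        feats[f"Sibling={sib}"] = 1
    return feats

feats = extract_features("alors", "initial", "ADV", "VP", ["NP", "PP"])
```

Ablating one feature family at a time (dropping `Conn`, then `Position`, and so on) is how the per-context contributions mentioned above are isolated.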

Distributed Marker Representation (DMR) introduces a latent-variable neural model for English, treating the discourse sense z as a hidden variable and learning both p(z | s_1, s_2) and p(m | z). DMR is trained in an EM-like fashion and provides both sense probabilities and interpretable embeddings corresponding to sense clusters, yielding state-of-the-art accuracy for implicit relation recognition and interpretable disambiguation (Ru et al., 2023).
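The EM-like training loop can be illustrated on toy data. Here p(z | s_1, s_2) is a simple per-pair categorical table rather than a neural encoder, an assumption made to keep the sketch self-contained:

```python
import random

def dmr_em(data, n_senses, n_iters=50, seed=0):
    """EM sketch of the DMR idea: latent sense z mediates between argument
    pairs and observed markers m; learn p(z | pair) and p(m | z) jointly."""
    rng = random.Random(seed)
    markers = sorted({m for _, m in data})
    pairs = sorted({p for p, _ in data})

    def norm(d):
        s = sum(d.values())
        return {k: v / s for k, v in d.items()}

    # Random normalized initialization.
    p_z = {p: norm({z: rng.random() for z in range(n_senses)}) for p in pairs}
    p_m = {z: norm({m: rng.random() for m in markers}) for z in range(n_senses)}

    for _ in range(n_iters):
        counts_m = {z: {m: 1e-9 for m in markers} for z in range(n_senses)}
        counts_z = {p: {z: 1e-9 for z in range(n_senses)} for p in pairs}
        for pair, m in data:
            # E-step: posterior over senses, proportional to p(z|pair) * p(m|z).
            post = norm({z: p_z[pair][z] * p_m[z][m] for z in range(n_senses)})
            for z, w in post.items():
                counts_m[z][m] += w
                counts_z[pair][z] += w
        # M-step: re-normalize expected counts.
        p_m = {z: norm(counts_m[z]) for z in range(n_senses)}
        p_z = {p: norm(counts_z[p]) for p in pairs}
    return p_z, p_m

# Toy data: pair "A" co-occurs with contrastive markers, "B" with causal ones.
data = [("A", "but"), ("A", "however")] * 5 + [("B", "because"), ("B", "since")] * 5
p_z, p_m = dmr_em(data, n_senses=2)
```

With disjoint marker sets per pair, EM typically pushes the two pairs toward distinct latent senses, which is the sense-cluster interpretability the model is credited with.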

Feature-based models also incorporate explicit polysemy scores (entropy), counts and diversity of co-occurring non-DM signals (e.g., lexical, syntactic, semantic cues), and genre as categorical variables in regression or neural architectures (Wu et al., 22 Jul 2025).

Surprisal-Based and Pragmatic Probing

In Korean, context-dependent prediction of post-positional markers via conditional probabilities, mean surprisal, and forced-choice paradigms directly probes the interplay of semantic and discourse distinctions underlying marker selection and implicature cancelability (Shin et al., 15 May 2024).
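The surprisal and forced-choice machinery reduces to a few lines; the marker probabilities below are made-up illustrations, not model outputs:

```python
import math

def surprisal(p: float) -> float:
    """Surprisal in bits: -log2 p(marker | context)."""
    return -math.log2(p)

def forced_choice(probs: dict) -> str:
    """Forced-choice paradigm: pick the marker with lowest surprisal
    (equivalently, highest conditional probability)."""
    return min(probs, key=lambda m: surprisal(probs[m]))

# Hypothetical p(marker | context) for a Korean object position:
ctx_probs = {"-lul": 0.55, "-nun": 0.30, "null": 0.15}
winner = forced_choice(ctx_probs)
```

Comparing mean surprisal across contexts that do or do not license an exhaustivity implicature is what probes cancelability in this setup.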

4. Evaluation and Empirical Results

Standard metrics include accuracy, weighted precision/recall/F1, and ablation-based error reduction. Macro and micro F1 are used to assess relation-level disambiguation.
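A hand-rolled macro vs. micro F1 makes the two averaging schemes concrete (stdlib only; the gold/predicted labels are invented):

```python
from collections import Counter

def f1_scores(gold, pred):
    """Macro F1 averages per-label F1 (rare senses count equally);
    micro F1 pools TP/FP/FN over all labels (frequent senses dominate)."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    def f1(t, false_p, false_n):
        return 2 * t / (2 * t + false_p + false_n) if t else 0.0
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro

gold = ["Emphatic", "Temporal", "Emphatic", "Exclusionary"]
pred = ["Emphatic", "Emphatic", "Emphatic", "Exclusionary"]
macro, micro = f1_scores(gold, pred)  # macro = 0.6, micro = 0.75
```

The gap between the two is exactly why rare senses (missed entirely here, giving "Temporal" an F1 of 0) drag macro F1 down while barely moving micro F1.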

Notable findings from BeDiscovER (Li et al., 17 Nov 2025):

| Dataset | Best accuracy ("Def+Exp", with reasoning) |
| --- | --- |
| Just-Manual | 67.0% (DeepSeek-R1), 66.1% (GPT-5-mini) |
| Just-Subtitle | 63.4% (GPT-5-mini, high effort) |
| Otherwise | 71.8% (GPT-5-mini, high effort) |

Performance improves sharply with model scale (notable threshold ≈2B parameters). Definitional/contextual prompts benefit only the largest models (>14B); smaller models suffer from prompt overload, particularly with noisier, informal data. Rare senses remain challenging even for top models. Non-DM signals facilitate sense discrimination; genre-sensitive signal diversity, not quantity, is predictive (signal-diversity Model 1: β1 = 0.112, p = 0.005 for the entropy predictor) (Wu et al., 22 Jul 2025).

In Korean, large models and RLHF improve sensitivity to pragmatic constraints of object marking, yet struggle to capture cancelability distinctions jointly encoded by morphological and context cues (Shin et al., 15 May 2024).

5. Feature Engineering and Predictive Cues

Effective features for disambiguation include:

  • Lexical particle identity ("Conn")
  • Polysemy/entropy score for the marker
  • Local syntactic context (SelfCat, parent, siblings in the parse tree)
  • Counts and types of co-occurring non-DM signals
  • Explicit genre/register (16 genres in GUM corpus)

Regression and classification models demonstrate that diversity in signaling, rather than signal count, underlies successful disambiguation (Pearson r = 0.248, p = 0.0137 for entropy vs. total signals; adjusted R² = 0.071 for entropy predicting signal diversity) (Wu et al., 22 Jul 2025).
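The entropy-vs-diversity correlation is a plain Pearson r, computable with the stdlib; the per-marker measurements below are invented for illustration:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-marker values: (polysemy entropy, distinct co-occurring signal types)
entropy = [0.2, 0.9, 1.4, 1.8, 2.3]
diversity = [1, 2, 2, 4, 5]
r = pearson_r(entropy, diversity)
```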

Attention mechanisms or latent-variable models can further exploit these cues by weighting or representing marker-sense mappings in a soft, context-aware fashion (Ru et al., 2023).

6. Limitations, Error Analysis, and Future Directions

State-of-the-art LLMs and feature-based models achieve moderate (60–70%) accuracy on fine-grained function labeling of polyfunctional English adverbs and conjunctions. Informal, contextually underspecified cases, rare senses, and cross-sentential dependencies remain error-prone. Smaller models exhibit prompt overload, while even the largest lack robust lexical-pragmatic competence for low-frequency or highly contextualized senses (Li et al., 17 Nov 2025). Models trained only with surface token cues are susceptible to performance drops when explicit connectives are omitted or replaced (Li et al., 2023).

For languages with rich morphology and pragmatic marking, purely distributional LLMs do not yet match human patterns in implicature computation and cancelability, suggesting that specialized data or multi-task objectives are required (Shin et al., 15 May 2024).

Future work should explore:

  • Few-shot adapter or in-domain fine-tuning to close the performance gap in data-scarce settings
  • Explicit chains-of-thought to encourage pragmatic reasoning about exclusivity, intent, and context
  • Extension to other particles, languages, and related tasks (e.g., focus projection, scalar implicature)
  • Incorporation of genre-sensitive feature sets and signal diversity measures as inputs to neural architectures

7. Applications and Theoretical Significance

Discourse particle disambiguation is foundational for discourse parsing, rhetorical relation treebanking, and downstream NLP tasks involving document-level understanding, question answering, and summarization. The integration of entropy-based polysemy measures, diverse signal inventories, and interpretable latent-variable modeling is advancing the state of the art in both monolingual and cross-linguistic settings (Li et al., 17 Nov 2025, Wu et al., 22 Jul 2025, Ru et al., 2023, Laali et al., 2017, Shin et al., 15 May 2024).

Empirical results point to the importance of modeling both explicit markers and heterogeneous co-occurring signals, as well as aligning computational architectures with the subtle pragmatic reasoning evidenced in human discourse comprehension. Continued research is likely to focus on scaling models for better generalization, more nuanced handling of morphological-pragmatic interfaces, and transfer across genres and languages.
