Discourse Particle Disambiguation
- Discourse particle disambiguation is the task of assigning context-dependent discourse functions to ambiguous markers, vital for accurate NLP analysis.
- Approaches range from manual feature engineering and regression to neural and latent-variable models that leverage syntactic and pragmatic cues.
- Benchmarks and cross-linguistic studies reveal challenges such as polysemy and genre variability, guiding future research innovations.
Discourse particle disambiguation is the task of determining the precise discourse-semantic function that a polyfunctional discourse particle or marker realizes in natural language context. Discourse particles (also called discourse markers, connectives, or cue phrases) such as “just,” “otherwise,” “since,” “and,” or their cross-linguistic analogues, mediate a range of rhetorical, logical, pragmatic, or informational effects—temporal ordering, contrast, consequence, exclusion, emphasis, and more—within or between clauses. Because these particles are often ambiguous and overloaded, accurate disambiguation is critical for fine-grained discourse understanding, automated discourse parsing, and downstream NLP applications. The problem lies at the intersection of lexical semantics, pragmatics, and compositional syntax, requiring both context-sensitive analysis and robust modeling of subtle semantic and pragmatic signals.
1. Formal Definitions and Task Formulations
Discourse particle disambiguation targets the sense-level classification of polysemous discourse markers within a given context. For a particle token m occurring in a sentence or clause-pair c, the model selects a label from a particle-specific inventory Y_m of discourse functions, defined via formal glosses and annotated exemplars. The typical modeling objective is to predict ŷ = argmax_{y ∈ Y_m} P(y | c, m).
For English “just,” example senses include exclusionary, temporal, emphatic, unelaboratory, unexplanatory, and adjectival (“fair, lawful”); for “otherwise,” senses include consequence, argumentation, enumeration, and exception (Li et al., 17 Nov 2025).
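Concretely, the selection step amounts to an argmax over a particle-specific label inventory. The sketch below is illustrative only: the inventories reuse the sense labels listed above, but the scoring interface (`logits` standing in for any classifier's per-sense scores) is a stand-in, not any of the cited systems:

```python
import math

# Particle-specific sense inventories (labels from the benchmarks above).
SENSE_INVENTORY = {
    "just": ["exclusionary", "temporal", "emphatic",
             "unelaboratory", "unexplanatory", "adjectival"],
    "otherwise": ["consequence", "argumentation", "enumeration", "exception"],
}

def disambiguate(particle: str, logits: list[float]) -> str:
    """Select y-hat = argmax_y P(y | c, m) from the particle-specific
    inventory, given per-sense scores from some upstream model."""
    senses = SENSE_INVENTORY[particle]
    assert len(logits) == len(senses)
    # Softmax for interpretable probabilities (the argmax is unchanged).
    exps = [math.exp(x - max(logits)) for x in logits]
    probs = [e / sum(exps) for e in exps]
    best = max(range(len(senses)), key=lambda i: probs[i])
    return senses[best]

print(disambiguate("otherwise", [0.2, 1.5, -0.3, 0.1]))  # → argumentation
```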
Beyond English, the task extends to other languages and marker systems. In Korean, for instance, the selection among post-positional particles (−lul, −nun, null) reflects a mixture of discourse and grammatical features: −nun signals non-cancelable exhaustivity implicature (contrastive topic), −lul signals cancelable exhaustivity (canonical accusative), and null-marking signals discourse givenness (Shin et al., 15 May 2024).
A related binary task is connective-usage disambiguation: for each candidate connective in context, predict whether it signals a discourse relation at all (y ∈ {DISCOURSE, NON-DISCOURSE}) (Laali et al., 2017).
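For the binary task, a minimal feature-extraction sketch is shown below. The feature names are illustrative, loosely inspired by lexico-syntactic cues of the kind used in feature-based connective classifiers; this is not the cited implementation:

```python
def connective_features(tokens: list[str], idx: int) -> dict:
    """Toy feature extractor for classifying a candidate connective at
    position idx as DISCOURSE vs NON-DISCOURSE. Feature names are
    illustrative, not taken from any published system."""
    return {
        "connective": tokens[idx].lower(),
        "sentence_initial": idx == 0,
        "prev_token": tokens[idx - 1].lower() if idx > 0 else "<s>",
        "next_token": tokens[idx + 1].lower() if idx + 1 < len(tokens) else "</s>",
        "followed_by_comma": idx + 1 < len(tokens) and tokens[idx + 1] == ",",
    }

feats = connective_features("However , the plan failed".split(), 0)
print(feats["sentence_initial"], feats["followed_by_comma"])  # → True True
```

A feature dictionary of this shape feeds directly into any standard maximum-entropy (logistic regression) classifier.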
2. Datasets, Annotation, and Benchmarks
Recent work has introduced dedicated benchmarks and annotated corpora for discourse particle sense labeling:
- BeDiscovER (2025): Aggregates manually annotated test sets for polyfunctional English adverbs and connectives, including:
- Just-Manual: 90 written-source English examples (6-way sense annotation).
- Just-Subtitle: 139 movie subtitle examples.
- Otherwise: 294 clause-pairs with “otherwise,” annotated with 4-way sense labels.
- GUM Corpus (eRST Framework): ~250K tokens over 16 English genres, with explicit annotation of discourse markers, 45 non-marker signal subtypes, and rhetorical relations (Wu et al., 22 Jul 2025).
- French Discourse Treebank: LEXCONN connectives annotated for discourse vs. non-discourse usage; 10K positive and 40K negative instances (Laali et al., 2017).
- Korean DOM Corpus: Controlled stimuli contrasting −lul, −nun, and null in discourse contexts, annotated for implicature cancelability and pragmatic felicity (Shin et al., 15 May 2024).
Annotation schemes typically involve both categorical sense inventories and fine-grained signal coding, with inter-annotator agreement levels exceeding 98% in English and French benchmarks.
3. Modeling Approaches and Feature Sets
Approaches to discourse particle disambiguation fall into several paradigms:
Manual and Feature-based Models
- Lexico-syntactic Feature Engineering: MaxEnt classifiers using connective token, sentence position, parse tree context (parent, self-category, left/right sibling labels). Major cues include marker string, syntactic context, and positional features. For French, this yields up to 94.2% accuracy, with the lexical feature alone accounting for 89.1% and incremental syntactic context features boosting results further (Laali et al., 2017).
- Regression and Entropy-based Feature Augmentation: Signal diversity, marker entropy (Shannon H(X) over senses), genre identifiers, and counts of co-occurring non-marker “discourse relation signals” (e.g., lexical, semantic, syntactic cues) used as predictive features in multi-class or logistic regression frameworks (Wu et al., 22 Jul 2025).
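The marker-entropy feature above is straightforward to compute from an observed sense distribution; a stdlib-only sketch:

```python
import math
from collections import Counter

def marker_entropy(sense_labels: list[str]) -> float:
    """Shannon entropy H(X) = -sum_x p(x) log2 p(x) over a marker's
    observed sense distribution; higher values indicate a more
    polyfunctional (harder-to-disambiguate) marker."""
    counts = Counter(sense_labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A marker with a single sense has zero entropy; a uniformly
# two-way ambiguous marker has one bit.
print(marker_entropy(["temporal"] * 10))                  # → 0.0
print(marker_entropy(["contrast"] * 5 + ["result"] * 5))  # → 1.0
```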
Latent-Variable and Neural Models
- Distributed Marker Representation (DMR): Models the ambiguous mapping between surface markers m and latent discourse senses z, parameterizing both P(m | z) and P(z | c) via transformer-based encoders and sense/marker embedding matrices. Disambiguation is achieved as P(z | c, m) ∝ P(m | z) · P(z | c), yielding context-sensitive soft sense selection (Ru et al., 2023).
- LLMs: Off-the-shelf and reasoning-tuned LLMs (Qwen3, DeepSeek-R1, GPT-5-mini, Polyglot-Ko, GPT-3/4) evaluated via zero-shot prompting. Variants include basic prompts, definitional additions, and one-shot in-context exemplars. The best results come from prompts with explicit definitions and examples, especially for models above 14B parameters (Li et al., 17 Nov 2025, Shin et al., 15 May 2024).
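At inference time, the DMR factorization reduces to a Bayes-style combination of the two learned distributions. The toy sketch below uses hand-set probability vectors in place of encoder outputs, purely to illustrate the soft sense-selection step:

```python
def posterior_sense(p_z_given_c: list[float], p_m_given_z: list[float]) -> list[float]:
    """Bayes-style soft sense selection, P(z | c, m) ∝ P(m | z) · P(z | c).
    Inputs are toy distributions standing in for encoder outputs."""
    joint = [pm * pz for pm, pz in zip(p_m_given_z, p_z_given_c)]
    total = sum(joint)
    return [j / total for j in joint]

# Three latent senses: the context mildly prefers sense 2,
# the observed marker strongly prefers sense 1; the posterior
# combines both signals.
post = posterior_sense(p_z_given_c=[0.2, 0.3, 0.5],
                       p_m_given_z=[0.1, 0.8, 0.1])
print([round(p, 3) for p in post])  # → [0.065, 0.774, 0.161]
```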
Sequence Tagging and Attention
- CRF/BiLSTM/Attention-based Sequence Labeling: Frameworks that encode the sequence of signal tokens (DMs and non-DMs), genre-level features, and discourse context, allowing for neural attention over auxiliary cues (Wu et al., 22 Jul 2025).
4. Evaluation Metrics and Experimental Findings
Standard metrics include accuracy (proportion of correct sense labels), weighted/macro precision, recall, F1 (class-weighted), and specialized variants such as surprisal or log-probability of continuation in production settings (Li et al., 17 Nov 2025, Shin et al., 15 May 2024).
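Because sense distributions are heavily skewed, macro and class-weighted F1 can diverge sharply; a compact reference implementation of both (plain label lists, not any benchmark's official scorer):

```python
from collections import Counter

def per_class_f1(gold: list[str], pred: list[str], label: str) -> float:
    """F1 for one sense label from true/false positives and negatives."""
    tp = sum(g == p == label for g, p in zip(gold, pred))
    fp = sum(p == label and g != label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def macro_f1(gold: list[str], pred: list[str]) -> float:
    """Unweighted mean of per-class F1: rare senses count fully."""
    labels = sorted(set(gold))
    return sum(per_class_f1(gold, pred, x) for x in labels) / len(labels)

def weighted_f1(gold: list[str], pred: list[str]) -> float:
    """Per-class F1 weighted by gold class frequency."""
    counts = Counter(gold)
    n = len(gold)
    return sum(per_class_f1(gold, pred, x) * c / n for x, c in counts.items())

# A majority-class baseline looks far better under weighted F1.
gold = ["emphatic"] * 8 + ["temporal"] * 2
pred = ["emphatic"] * 10
print(round(macro_f1(gold, pred), 3), round(weighted_f1(gold, pred), 3))  # → 0.444 0.711
```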
Key results:
- LLMs:
- Performance rises steeply above 2B parameters, with DeepSeek-R1 and GPT-5-mini (“Def+Exp” prompts, reasoning mode) reaching 66–71% accuracy on six- and four-way disambiguation (Li et al., 17 Nov 2025).
- Smaller models (<2B) are harmed by rich prompts—“prompt overload.”
- Feature-based Models:
- French connectives: MaxEnt with lexical and syntactic context reaches 94.2% accuracy. The lexical (marker string) feature is dominant, but syntactic context adds substantial gains, especially on high-entropy markers (Laali et al., 2017).
- DMR:
- On implicit discourse relation recognition, DMR-large achieves 64.1% accuracy and 43.8 macro-F1, outperforming previous unsupervised systems (Ru et al., 2023).
- Non-DM signal diversity:
- Shannon entropy of DM sense correlates significantly with signal diversity (r = 0.248, p = 0.0137), but not with total signal count (Wu et al., 22 Jul 2025).
- Cross-linguistic/Pragmatic Encoding:
- Only GPT-3/4 and RLHF-tuned models approach humanlike sensitivity to non-cancelable discourse implicatures (e.g., −nun in Korean). However, even LLMs underperform in robustly integrating semantic and pragmatic cues for every marker (Shin et al., 15 May 2024).
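The entropy–diversity result above is a plain Pearson correlation; for reference, a stdlib-only sketch on toy inputs (the study itself, of course, computes r over corpus-derived entropy and diversity values):

```python
import math

def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two paired samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Perfectly linear toy data yields r = 1.0.
print(pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # → 1.0
```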
Common error patterns include confusion of rare or contextually subtle senses, heightened challenge in informal registers (subtitles, dialogue), and undersensitivity to discourse-pragmatic constraints absent explicit cues (Li et al., 17 Nov 2025, Shin et al., 15 May 2024).
5. Role of Polysemy, Signal Diversity, and Genre
Polysemy in discourse markers—measured by Shannon entropy over sense distributions—is a central factor in disambiguation difficulty. Markers with high entropy (e.g., “so,” “and,” “if”) necessitate sensitivity to a greater variety of co-occurring signals. Regression models show that entropy alone explains ~7% of variance in signal diversity, with genre interactions raising this to 9% (Wu et al., 22 Jul 2025).
Non-DM cues—lexical, graphical, reference, syntactic, semantic—serve as critical context for disambiguation. The diversity (not just frequency) of such cues predicts difficulty and model performance. Spoken genres (court, podcast, vlog) exhibit strong context interaction, and genre-specific signaling strategies are evident from statistical analyses.
A summary of predictive features and their utility in classifier construction is given below:
| Feature Type | Predictive Value | Reference |
|---|---|---|
| Marker Entropy | Sense diversity | (Wu et al., 22 Jul 2025) |
| Signal Diversity | Disambiguation accuracy | (Wu et al., 22 Jul 2025) |
| Genre/Categorical | Modulates effect sizes | (Wu et al., 22 Jul 2025) |
| Lexical/Syntactic | High precision/recall | (Laali et al., 2017) |
6. Challenges, Limitations, and Open Problems
Several sources highlight the fundamental challenge posed by fine-grained, pragmatic, and context-dependent senses:
- Rare Functions: Certain senses (e.g., adjectival “just”) remain difficult for all models due to lack of training data and weaker contextual signatures (Li et al., 17 Nov 2025).
- Pragmatic-Discourse Tension: In languages with differential marking (Korean, Turkish, Spanish), discourse particle choice depends on intricate balancing of semantics (animacy, specificity) and pragmatics (exhaustivity, contrastiveness); LLMs struggle to jointly model these (Shin et al., 15 May 2024).
- Prompt/Cognitive Overload: Excessive prompting (rich definitions, in-context examples) can harm smaller models, reflecting limited context or representational capacity (Li et al., 17 Nov 2025).
- Robustness to Perturbation: Omission or incorrect substitution of connectives—in both training and evaluation—reveals that many NLP systems are brittle and rely heavily on surface tokens rather than deep semantic inference (Li et al., 2023).
- Absence of Dedicated Architectures: Most recent advances rely on general-purpose LLMs and repurposed encoders; specialized or joint training for discourse particle sense remains rare (Li et al., 17 Nov 2025, Ru et al., 2023).
- Implicit/Non-explicit Marking: Disambiguation is substantially harder when cues are indirect (implicit discourse relations, non-canonical signal types) (Wu et al., 22 Jul 2025).
7. Prospects and Future Directions
Research identifies several avenues to improve discourse particle disambiguation:
- Few-shot/adaptive fine-tuning: Targeted in-domain fine-tuning or adapter modules on sense-annotated data to close the gap for rare senses and informal genres (Li et al., 17 Nov 2025).
- Chain-of-thought Augmentation: Prompts or architectures that elicit or encode explicit inference about exclusivity, speaker intent, and pragmatic cues (Li et al., 17 Nov 2025).
- Cross-linguistic Extension: Transfer protocols to other polyfunctional markers and language families (e.g., applying methods developed for Korean −lul/−nun/null to Turkish, Spanish, Hindi) (Shin et al., 15 May 2024).
- Integration of Non-Particle Cues: Joint modeling of marker tokens and non-marker discourse signals, leveraging diversity indices and genre features (Wu et al., 22 Jul 2025).
- Structural and Training Innovations: Combining attention-based models with structural features (syntactic parse trees, clause segmentation), connective-perturbation augmentation, and explicit connective-sense objectives (Li et al., 2023).
- Joint Multi-task Objectives: Multi-task training on related tasks such as focus-projection, scalar implicature, and discourse parsing may offer gains in function labeling (Li et al., 17 Nov 2025).
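As an illustration of the chain-of-thought augmentation idea above, a hypothetical zero-shot prompt template for "just" disambiguation (the wording is invented for illustration and is not taken from the cited papers):

```python
# Hypothetical chain-of-thought prompt template; wording is illustrative.
COT_TEMPLATE = (
    "Sentence: {sentence}\n"
    "Question: Which discourse function does 'just' serve here?\n"
    "Options: {options}\n"
    "First, reason step by step about exclusivity, speaker intent, and "
    "pragmatic cues in the context. Then answer with exactly one option.\n"
    "Reasoning:"
)

prompt = COT_TEMPLATE.format(
    sentence="I just wanted to say thanks.",
    options="exclusionary, temporal, emphatic, unelaboratory, "
            "unexplanatory, adjectival",
)
print(prompt)
```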
Substantial progress has been made in benchmarking, formalization, and high-level system evaluation, but robust compositional models for discourse particle disambiguation, especially in pragmatically complex or low-resource settings, remain an open research challenge. The field continues to move toward unified frameworks that integrate lexical, syntactic, semantic, and pragmatic information to approach the flexibility and nuance of human discourse processing.