
Neural Architectures for Discourse Modeling

Updated 12 September 2025
  • Neural architectures for discourse modeling are systems that use hierarchical, recursive, and latent variable methods to capture text structure beyond individual sentences.
  • They integrate compositional cues and inter-sentence relations to enhance tasks such as dialogue act classification, summarization, and document understanding.
  • Recent advances include discourse-aware pre-trained models and efficient parsers that approach human agreement in parsing complex discourse structures.

Neural architectures for discourse modeling comprise a spectrum of model designs and learning paradigms that encode and reason about text structure beyond the sentence level. These systems aim to capture compositional semantics, inter-sentential relations, discourse coherence, structural hierarchy, entity transitions, and conversational pragmatics inherent in extended texts and dialogues. Progress over the past decade includes hierarchical encoders, conditioned recurrent models, variational generative frameworks, explicit structure-aware neural parsers, and architectures integrating discourse theory, yielding advances in tasks such as dialogue act classification, discourse parsing, coherence scoring, document understanding, and abstractive summarization.

1. Principles of Discourse Representation in Neural Architectures

Central to neural discourse modeling is the abstraction of the compositional principle: texts are modeled not only as sequences of words but as structures in which words form sentences, sentences combine into paragraphs or conversational turns, and these, in turn, yield document- or dialogue-level meaning. Early neural approaches implement hierarchical composition, e.g., mapping word vectors to sentence vectors and then sequentially encoding sentence vectors to produce discourse representations (Kalchbrenner et al., 2013), or, more generally, leveraging hierarchical encoders with multiple granularity levels such as token, section, and document (Cohan et al., 2018).
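
The sketch below illustrates this word-to-sentence-to-document composition pattern in generic form. It is a minimal illustration assuming GRU encoders at both levels; the class name, dimensions, and encoder choices are illustrative assumptions rather than the exact architecture of any cited paper.

```python
# Minimal sketch of hierarchical composition (word -> sentence -> document),
# in the spirit of the hierarchical encoders discussed above.
# All dimensions and module choices are illustrative assumptions.
import torch
import torch.nn as nn

class HierarchicalDocumentEncoder(nn.Module):
    def __init__(self, vocab_size=10_000, emb_dim=128, sent_dim=256, doc_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        # Sentence-level encoder: words -> one vector per sentence
        self.sent_rnn = nn.GRU(emb_dim, sent_dim, batch_first=True, bidirectional=True)
        # Discourse-level encoder: sentence vectors -> document states
        self.doc_rnn = nn.GRU(2 * sent_dim, doc_dim, batch_first=True)

    def forward(self, docs):
        # docs: LongTensor [batch, n_sents, n_words] of word ids (0 = padding)
        b, s, w = docs.shape
        words = self.embed(docs.view(b * s, w))        # [b*s, w, emb_dim]
        _, h = self.sent_rnn(words)                    # h: [2, b*s, sent_dim]
        sent_vecs = torch.cat([h[0], h[1]], dim=-1)    # [b*s, 2*sent_dim]
        sent_vecs = sent_vecs.view(b, s, -1)           # [b, s, 2*sent_dim]
        doc_states, _ = self.doc_rnn(sent_vecs)        # [b, s, doc_dim]
        return doc_states                              # one state per sentence/turn

enc = HierarchicalDocumentEncoder()
doc_states = enc(torch.randint(1, 10_000, (2, 5, 12)))   # 2 docs, 5 sentences, 12 tokens
print(doc_states.shape)                                  # torch.Size([2, 5, 256])
```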

The explicit modeling of discourse relations—be they rhetorical (RST), shallow (e.g., PDTB connectives), or pragmatic (dialogue acts)—has led to the development of latent variable models (Ji et al., 2016, Zhang et al., 2016) and discriminative sequence models (Dai et al., 2018) that mediate word prediction, sentence generation, or explicit relation classification by considering inter-unit dependencies as latent or labeled variables.

More recently, architectures have begun to directly inject global or local discourse dependencies, employ discourse-aware attention, or augment pretrained language models (PLMs) with mechanisms such as predictive coding that enforce discourse-level prediction objectives (Araujo et al., 2021).

2. Hierarchical and Recursive Models for Discourse Compositionality

Hierarchical models operationalize compositionality at multiple levels:

  • The Hierarchical Convolutional Neural Network (HCNN) computes sentence vectors by convolving feature-wise across word vectors, using kernels of increasing size to integrate local and global word groupings. This forms the foundation for discourse-level RNNs that treat sentence vectors as input to recurrent compositions incorporating speaker conditioning (Kalchbrenner et al., 2013).
  • Recursive neural networks, particularly in conjunction with structures induced by RST parses, enable document representations that respect the hierarchical and relational structure of discourse (Ji et al., 2017). Here, leaves (EDUs) are embedded via bidirectional LSTMs, and internal nodes aggregate these embeddings recursively, with per-relation composition matrices and attention, to represent larger discourse spans; a minimal composition sketch follows this list.
  • For Chinese discourse, entity-driven recursive architectures account for both tree-structured sentence representations and explicit entity (noun) overlap across adjacent sentences, leading to improved coherence modeling (measured via sentence ordering and translation quality) (Xu et al., 2017).
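
As referenced above, the following is a minimal sketch of relation-specific recursive composition over a binary discourse tree. It assumes the leaf EDU vectors have already been produced (e.g., by a bidirectional LSTM); the toy relation set, tree encoding, and composition function are illustrative assumptions, not the exact model of Ji et al. (2017).

```python
# Minimal sketch of relation-specific recursive composition over a discourse tree.
# Tree format, relation inventory, and composition function are illustrative assumptions.
import torch
import torch.nn as nn

RELATIONS = ["Elaboration", "Contrast", "Attribution"]   # assumed toy relation set

class RecursiveDiscourseComposer(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        # One composition transform per relation, applied to the concatenated children
        self.compose = nn.ModuleDict({r: nn.Linear(2 * dim, dim) for r in RELATIONS})

    def forward(self, node):
        # node is either a leaf {"edu": Tensor[dim]} (e.g., from a BiLSTM over the EDU)
        # or an internal node {"relation": str, "left": node, "right": node}
        if "edu" in node:
            return node["edu"]
        left = self.forward(node["left"])
        right = self.forward(node["right"])
        combined = torch.cat([left, right], dim=-1)
        return torch.tanh(self.compose[node["relation"]](combined))

composer = RecursiveDiscourseComposer()
tree = {
    "relation": "Contrast",
    "left": {"edu": torch.randn(128)},
    "right": {
        "relation": "Elaboration",
        "left": {"edu": torch.randn(128)},
        "right": {"edu": torch.randn(128)},
    },
}
doc_vec = composer(tree)      # single vector for the whole discourse span
print(doc_vec.shape)          # torch.Size([128])
```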

This paradigm enables models to integrate both local syntactic/semantic cues and broad discourse-level relations, supporting tasks from dialogue act tagging (where speaker and sequential information is crucial) to document categorization.

3. Latent Variable and Generative Discourse Models

A major strand of research extends RNN language models with latent variables representing discourse relations between adjacent sentences. The Latent Variable RNN (LVRNN) posits a latent discrete variable $z_t$ for the discourse relation between $y_{t-1}$ and $y_t$, integrating this into both relation classification and word prediction via the joint factorization:

$$p(y_{1:T}, z_{1:T}) = \prod_t p(z_t \mid y_{t-1}) \cdot p(y_t \mid z_t, y_{t-1})$$

Training objectives maximize either the joint or conditional log-likelihood, and marginalizing the latent discourse variables at test time improves perplexity for language modeling (Ji et al., 2016).
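
A minimal sketch of this factorization and the test-time marginalization over the discrete relation variable is shown below. It assumes a sentence encoder has already produced fixed-size vectors for $y_{t-1}$ and $y_t$, and a per-relation bilinear compatibility score stands in for a full word-level decoder; these are simplifying assumptions, not the LVRNN's exact parameterization.

```python
# Minimal sketch of the LVRNN-style factorization with marginalization over the
# latent discourse relation z_t. Dimensions and the scoring heads are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

K, D = 4, 64                      # number of latent relations, sentence-vector size

relation_prior = nn.Linear(D, K)  # scores p(z_t | y_{t-1})
pair_scorers = nn.ModuleList(     # one compatibility scorer per relation,
    [nn.Bilinear(D, D, 1) for _ in range(K)]   # standing in for p(y_t | z_t, y_{t-1})
)

def marginal_log_likelihood(prev_sent, next_sent):
    # prev_sent, next_sent: Tensor[batch, D] representations of y_{t-1} and y_t
    log_p_z = torch.log_softmax(relation_prior(prev_sent), dim=-1)            # [batch, K]
    log_p_y_given_z = torch.stack(
        [F.logsigmoid(s(prev_sent, next_sent)).squeeze(-1) for s in pair_scorers],
        dim=-1,
    )                                                                          # [batch, K]
    # log p(y_t | y_{t-1}) = logsumexp_z [ log p(z | y_{t-1}) + log p(y_t | z, y_{t-1}) ]
    return torch.logsumexp(log_p_z + log_p_y_given_z, dim=-1)                 # [batch]

ll = marginal_log_likelihood(torch.randn(3, D), torch.randn(3, D))
print(ll.shape)   # torch.Size([3])
```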

The Variational Neural Discourse Relation Recognizer (VarNDRR) further employs a continuous latent variable $z$ to generate both the discourse argument pair $(x_1, x_2)$ and the discourse relation $y$, with factorization $p(x, y, z) = p(x \mid z)\, p(y \mid z)\, p(z)$. Approximations to the posterior and prior over $z$ are learned via neural networks, and parameters are updated to maximize a variational lower bound via the reparameterization trick (Zhang et al., 2016). This approach achieves competitive F1 scores against strong feature-driven baselines, especially on PDTB Expansion and Comparison relations.
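
The sketch below shows a corresponding VAE-style training step under the reparameterization trick, assuming bag-of-words argument representations and a small relation set; the encoder/decoder shapes, unweighted loss terms, and the class name VarNDRRSketch are illustrative assumptions rather than the published VarNDRR architecture.

```python
# Minimal VAE-style sketch: a continuous latent z generates bag-of-words argument
# vectors (x1, x2) and the relation label y. All design details are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VarNDRRSketch(nn.Module):
    def __init__(self, vocab=5000, z_dim=32, n_rel=4, hid=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(2 * vocab + n_rel, hid), nn.Tanh())
        self.mu, self.logvar = nn.Linear(hid, z_dim), nn.Linear(hid, z_dim)
        self.dec_x = nn.Linear(z_dim, 2 * vocab)   # reconstructs both arguments
        self.dec_y = nn.Linear(z_dim, n_rel)       # predicts the relation

    def forward(self, x, y_onehot):
        h = self.enc(torch.cat([x, y_onehot], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization
        recon = F.binary_cross_entropy_with_logits(self.dec_x(z), x, reduction="sum")
        rel_nll = F.cross_entropy(self.dec_y(z), y_onehot.argmax(-1), reduction="sum")
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return recon + rel_nll + kl    # negative variational lower bound to minimize

model = VarNDRRSketch()
x = torch.randint(0, 2, (8, 10_000)).float()         # 8 pairs, 2*vocab binary features
y = F.one_hot(torch.randint(0, 4, (8,)), 4).float()  # 8 one-hot relation labels
loss = model(x, y)
loss.backward()
```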

Unsupervised generative models have also been introduced for microblog conversations, jointly modeling latent topics and discourse roles as distinct latent variables and explicitly minimizing their mutual information to enforce decorrelation, optimizing a variational lower bound including reconstruction, KL divergence, and an MI penalty (Zeng et al., 2019).
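
Schematically, and under notational assumptions (topic variable $t$, discourse-role variable $d$, penalty weight $\lambda$), the objective described above takes the following form; this is a hedged paraphrase, not the paper's exact formulation:

```latex
% Hedged schematic of an ELBO with a mutual-information penalty between latent
% topic (t) and discourse-role (d) variables; symbols and weighting are assumptions.
\mathcal{L} \;=\; \mathbb{E}_{q(t, d \mid x)}\!\left[\log p(x \mid t, d)\right]
\;-\; \mathrm{KL}\!\left(q(t \mid x) \,\|\, p(t)\right)
\;-\; \mathrm{KL}\!\left(q(d \mid x) \,\|\, p(d)\right)
\;-\; \lambda\, I(t; d)
```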

4. Discourse Structure Parsing and Explicit Structural Modeling

Parsing explicit discourse structure, especially as defined by RST, has motivated the development of efficient neural parsers:

  • Pointer Network-based frameworks for both EDU segmentation and discourse tree construction operate in linear time, exploiting shared encoders and dynamic attention over potential split points or boundaries (Lin et al., 2019).
  • Top-down neural architectures cast segmentation and structure building as recursive split-point ranking tasks, using encoder–decoder architectures with internal stacks and biaffine attention to select splits and jointly predict nuclearity and relation labels (Zhang et al., 2020); a minimal split-selection sketch follows this list.
  • The effectiveness of these approaches in full-tree parsing is demonstrated by results approaching human agreement (segmentation F1: 95.4, parsing F1: 81.7); the modularity and linear-time operation make such models tractable for long documents and support broad applicability, including MT, summarization, and NLU (Lin et al., 2019).
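
As referenced in the list above, the following sketch shows the shared core of these parsers: scoring candidate split points within a span and recursing on the two halves to build a binary tree over EDUs. The scoring function, greedy decoding loop, and omission of nuclearity/relation labeling are illustrative simplifications, not the exact models of Lin et al. (2019) or Zhang et al. (2020).

```python
# Minimal sketch of top-down discourse tree construction by recursive split-point
# scoring over EDU representations. All design details are illustrative assumptions.
import torch
import torch.nn as nn

class SplitScorer(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.span_query = nn.Linear(2 * dim, dim)   # summarizes the current span
        self.boundary_key = nn.Linear(dim, dim)     # scores each candidate boundary

    def forward(self, edu_states, lo, hi):
        # edu_states: Tensor[n_edus, dim]; score boundaries strictly inside [lo, hi)
        span = torch.cat([edu_states[lo], edu_states[hi - 1]], dim=-1)
        q = self.span_query(span)                            # [dim]
        keys = self.boundary_key(edu_states[lo + 1 : hi])    # [hi-lo-1, dim]
        return keys @ q                                      # one score per split point

def build_tree(scorer, edu_states, lo, hi):
    # Greedily returns a nested span tree over EDUs in [lo, hi)
    if hi - lo <= 1:
        return ("leaf", lo)
    scores = scorer(edu_states, lo, hi)
    split = lo + 1 + int(torch.argmax(scores))               # chosen boundary
    return ("node",
            build_tree(scorer, edu_states, lo, split),
            build_tree(scorer, edu_states, split, hi))

scorer = SplitScorer()
tree = build_tree(scorer, torch.randn(6, 128), 0, 6)         # 6 EDUs -> binary tree
print(tree)
```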

Lightweight architectures such as LiMNet demonstrate that robust discourse parsing can be achieved without deep feature extractors, instead relying on fixed PLM embeddings and two self-attention modules to build both local and global sentence representations; this improves generalizability and reduces overfitting while maintaining strong macro- and micro-F1 scores on profiling, RST, and PDTB parsing (Li et al., 2022).
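
A minimal sketch of this lightweight pattern is shown below: frozen, precomputed sentence embeddings are passed through two self-attention modules, one restricted to a local window and one unrestricted, before a small classification head. The windowed-masking implementation of the local/global split, the dimensions, and the class name are illustrative assumptions rather than LiMNet's exact design.

```python
# Minimal sketch of a lightweight discourse model over frozen PLM sentence embeddings
# with separate local and global self-attention views. Details are assumptions.
import torch
import torch.nn as nn

class LightweightDiscourseNet(nn.Module):
    def __init__(self, dim=768, n_labels=4, window=3):
        super().__init__()
        self.window = window
        self.local_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.classifier = nn.Linear(2 * dim, n_labels)

    def forward(self, sent_embs):
        # sent_embs: Tensor[batch, n_sents, dim] -- fixed embeddings from a frozen PLM
        n = sent_embs.size(1)
        idx = torch.arange(n)
        # Local view: each sentence attends only to a small neighborhood (True = blocked)
        local_mask = (idx[None, :] - idx[:, None]).abs() > self.window
        local, _ = self.local_attn(sent_embs, sent_embs, sent_embs, attn_mask=local_mask)
        # Global view: unrestricted self-attention over the whole document
        global_, _ = self.global_attn(sent_embs, sent_embs, sent_embs)
        return self.classifier(torch.cat([local, global_], dim=-1))   # per-sentence logits

net = LightweightDiscourseNet()
logits = net(torch.randn(2, 10, 768))   # e.g., precomputed vectors for 10 sentences
print(logits.shape)                     # torch.Size([2, 10, 4])
```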

5. Contextual, Coherence, and Conversation Modeling

Discourse coherence and conversation modeling have benefited from both discriminative and generative frameworks:

  • Discriminative models classify the coherence of sentence cliques using LSTM-encoded representations, discriminating between coherent and incoherent permutations (Li et al., 2016); a minimal classifier sketch follows this list.
  • Generative models, including variational latent variable models (VLV-GM), capture latent discourse dependencies across sentences, with variational objectives that incentivize the latent representation $z_n$ to encode gradually evolving discourse states. Such models excel in sentence (or paragraph) ordering and adversarial sentence generation (Li et al., 2016).
  • In conversation modeling, hierarchical encoder–decoder frameworks extend seq2seq models by introducing an additional RNN layer over utterance representations (Nseq2seq+A), capturing long-range discourse across dialogue turns. The application of attention at the utterance level supports the generation of contextually coherent conversational outputs, with quantitative gains in perplexity and qualitative improvements in discourse marker usage (such as deixis and logical consequence) as context length increases (Pierre et al., 2016).
  • For multi-party dialogue parsing, sequential models build discourse dependency trees by predicting parent-child links and relation types for each EDU in order, leveraging both local and global structured representations and a speaker highlighting mechanism. This sequential, incremental approach achieves state-of-the-art performance in parsing the discourse structure of complex dialogues (Shi et al., 2018).
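
The classifier sketch referenced in the first item above is given here: an LSTM reads a window of sentence vectors and emits a single coherence score, trained against order-permuted negatives. Constructing negatives by shuffling, and the specific dimensions, are illustrative assumptions rather than the cited model's exact setup.

```python
# Minimal sketch of a discriminative coherence classifier over sentence windows:
# higher scores for original orderings, lower for permuted ones.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoherenceClassifier(nn.Module):
    def __init__(self, sent_dim=256, hid=256):
        super().__init__()
        self.lstm = nn.LSTM(sent_dim, hid, batch_first=True)
        self.score = nn.Linear(hid, 1)

    def forward(self, sent_vecs):
        # sent_vecs: Tensor[batch, n_sents, sent_dim] (e.g., from a sentence encoder)
        _, (h, _) = self.lstm(sent_vecs)
        return self.score(h[-1]).squeeze(-1)      # higher = judged more coherent

clf = CoherenceClassifier()
coherent = torch.randn(4, 5, 256)                 # 4 windows of 5 sentence vectors
shuffled = coherent[:, torch.randperm(5)]         # order-permuted negatives
logits = torch.cat([clf(coherent), clf(shuffled)])
labels = torch.cat([torch.ones(4), torch.zeros(4)])
loss = F.binary_cross_entropy_with_logits(logits, labels)
loss.backward()
```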

6. Integration of Discourse Structure with Pretrained Language Models

Recent proposals focus on integrating explicit discourse-level signals into pretrained language models (PLMs):

  • Augmenting BERT-style encoders with predictive coding introduces explicit top-down connections via a GRU-based autoregressive module, producing context vectors that generate predictions of future sentence representations. Training includes an InfoNCE contrastive objective between predicted and actual sentence representations, effectively improving discourse-level representation quality on benchmarks such as DiscoEval, PDTB, RST, and SciDTB-DE (Araujo et al., 2021); a minimal sketch of such a predictive-coding head follows this list.
  • Modulating PLMs via lightweight, learnable self-attention (as in LiMNet) allows robust performance for discourse tasks without full finetuning or complex feature stacking, preserving PLM generalization and substantially reducing parameter count and computation cost (Li et al., 2022).
  • Document modeling is further enhanced by explicitly injecting discourse features derived from neural RST parsing—either as shallow nuclearity/relation scores or as latent BiLSTM-derived features—into downstream tasks such as summarization and popularity prediction, often via concatenation, auxiliary BiLSTMs, or modification of attention weights. Such discourse-informed representations consistently yield recall and F1 improvements in summarization and reduced error in regression tasks (Koto et al., 2019).
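
The predictive-coding sketch referenced in the first item above is shown here. It assumes the PLM has already produced one vector per sentence; the GRU context size, single-step prediction horizon, and in-batch InfoNCE formulation are illustrative assumptions rather than the cited paper's exact training setup.

```python
# Minimal sketch of a predictive-coding head over PLM sentence representations:
# a GRU summarizes the sentences seen so far and predicts the next sentence's
# representation, trained with an in-batch InfoNCE contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PredictiveCodingHead(nn.Module):
    def __init__(self, dim=768, ctx=256):
        super().__init__()
        self.ar = nn.GRU(dim, ctx, batch_first=True)   # top-down context over sentences
        self.predict = nn.Linear(ctx, dim)             # predicts the next sentence vector

    def forward(self, sent_reps):
        # sent_reps: Tensor[batch, n_sents, dim] from the PLM (one vector per sentence)
        ctx, _ = self.ar(sent_reps)                    # context state after each sentence
        preds = self.predict(ctx[:, :-1])              # predict sentence t+1 from context t
        targets = sent_reps[:, 1:]                     # the actual next-sentence vectors
        return preds, targets

def info_nce(preds, targets, temperature=0.1):
    # Contrast each prediction against all target sentence vectors in the batch
    p = F.normalize(preds.reshape(-1, preds.size(-1)), dim=-1)
    t = F.normalize(targets.reshape(-1, targets.size(-1)), dim=-1)
    logits = p @ t.T / temperature                     # [N, N] similarity matrix
    labels = torch.arange(p.size(0))                   # positives on the diagonal
    return F.cross_entropy(logits, labels)

head = PredictiveCodingHead()
sent_reps = torch.randn(2, 6, 768)                     # stand-in for PLM sentence vectors
loss = info_nce(*head(sent_reps))
loss.backward()
```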

7. Empirical Outcomes, Challenges, and Future Directions

Empirical evaluation across dialogue act tagging (RCNN: 73.9% accuracy) (Kalchbrenner et al., 2013), implicit discourse relation classification (LVRNN: ~59.5% accuracy, improvement over 55–57% baselines) (Ji et al., 2016), discourse parsing (segmentation: 95.4 F1, parsing: 81.7 F1) (Lin et al., 2019), and text coherence (VLV-GM outperforming baselines in Kendall's τ) (Li et al., 2016) demonstrates the impact of neural architectures that explicitly model discourse-level properties.

Challenges remain: current PLMs, even those sensitive to discourse context, may encode but not exploit discourse structure in all tasks; e.g., implicit causality influences reference but not syntax in transformer LMs (Davis et al., 2020). The effectiveness of structure-aware models can be sensitive to parsing quality and genre mismatch (Ji et al., 2017). Lightweight models trade off a small drop in maximal performance for gains in efficiency and robustness (Li et al., 2022).

Ongoing work targets the deeper integration of explicit structural supervision (RST, dependency parses), improved unsupervised and variational approaches for low-resource settings, more generalizable models across genres and languages, and joint models that unify parsing, relation classification, and downstream document understanding. Incorporating discourse-aware mechanisms into PLMs, exploring new global modeling objectives, and expanding annotated resources for emergent discourse phenomena are areas of active research.


In summary, neural architectures for discourse modeling have evolved from hierarchical and recursive neural networks to latent variable generative frameworks, structure-aware parsers, and discourse-augmented pretrained models. These advances enable deep integration of local compositionality, global structural signals, and latent discourse states, supporting a host of core and applied NLP tasks that require robust understanding beyond the sentence boundary.
