Topical Hierarchical Recurrent Encoder Decoder
- THRED is a dialogue generation architecture that combines hierarchical RNN models with topic modeling and multi-level attention to produce topically coherent responses.
- It employs a two-level encoder to capture both utterance-level and contextual features while integrating LDA or NMF-derived topical signals through joint attention mechanisms.
- Empirical results show enhanced response diversity and contextual relevance, validated using metrics like Semantic Similarity, Response Echo Index, and distinct n-gram measures.
A Topical Hierarchical Recurrent Encoder Decoder (THRED) is a class of dialogue generation architectures that augment hierarchical recurrent encoder–decoder models with mechanisms for topic modeling and topic-aware attention, enabling the generation of diverse, contextually appropriate, and topically coherent responses in multi-turn conversational settings. Multiple variants have been proposed; the most prominent are (1) context-aware topical attention approaches that integrate LDA-derived topic signals via joint attention mechanisms (Dziri et al., 2018) and (2) topic-coherent diversification schemes that combine global latent variables with word-level topic biases to further enhance diversity without sacrificing topical relevance (Hu et al., 2019).
1. Core Architecture and Hierarchical Modeling
The THRED architecture extends canonical sequence-to-sequence (Seq2Seq) models with hierarchical encoding and joint attention. In one formulation (Dziri et al., 2018), the model encodes multi-turn dialogue by first processing each utterance via a (bi/unidirectional) GRU to obtain utterance-level representations, then passing these through a second-level GRU (the context encoder) to capture the flow of conversation across turns. This two-level hierarchical structure allows better modeling of discourse dependencies and conversational structure compared to flat encoders.
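A minimal PyTorch sketch of this two-level encoder, assuming single-layer unidirectional GRUs over padded token tensors; the class, module names, and dimensions are illustrative assumptions, not the authors' reference implementation.

```python
# A two-level (utterance -> context) GRU encoder, assuming padded token tensors.
# Illustrative module names and dimensions; not the reference THRED code.
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, utt_dim=512, ctx_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.utterance_rnn = nn.GRU(emb_dim, utt_dim, batch_first=True)  # word level
        self.context_rnn = nn.GRU(utt_dim, ctx_dim, batch_first=True)    # turn level

    def forward(self, dialog):                         # dialog: (batch, turns, max_words)
        b, m, n = dialog.shape
        words = self.embed(dialog.view(b * m, n))      # (b*m, n, emb_dim)
        word_states, utt_final = self.utterance_rnn(words)
        utt_summary = utt_final.squeeze(0).view(b, m, -1)    # (b, m, utt_dim)
        ctx_states, _ = self.context_rnn(utt_summary)        # (b, m, ctx_dim)
        # word_states feed message-level attention; ctx_states feed context-level attention
        return word_states.view(b, m, n, -1), ctx_states
```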
A distinct line of work applies a similar hierarchical recurrent backbone but uses LSTMs at both token- and utterance-level, and further incorporates variational (global) latent variables, as in VHRED (Hu et al., 2019). This supports richer, more diverse generation by enabling sampling of the global conversational context at inference time.
2. Joint Attention and Topical Signal Integration
At decoding time, THRED models implement a multi-pronged attention scheme:
- Message-level (word) attention attends to word states within each utterance to form utterance summaries.
- Context-level attention attends over the sequence of past utterance summaries, aggregating information across dialogue turns.
- Topic-level attention incorporates topical concept vectors extracted by unsupervised topic modeling (LDA- or NMF-based). Specifically, the dialogue history is assigned to its most probable topic, from which the top-n topic words are selected and embedded. These topic embeddings are aggregated by attention, producing a topic-context vector that is fed into the decoder and/or biases the output distribution toward topical vocabulary (Dziri et al., 2018); a minimal sketch follows this list.
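The sketch below shows additive (MLP-scored) attention of the decoder state over the top-n topic word embeddings, producing the topic-context vector; the parameterization and dimensions are assumptions for illustration.

```python
# Additive attention of the decoder state over top-n topic word embeddings,
# yielding a topic-context vector. Illustrative parameterization only.
import torch
import torch.nn as nn

class TopicAttention(nn.Module):
    def __init__(self, dec_dim=512, emb_dim=300, att_dim=256):
        super().__init__()
        self.proj_state = nn.Linear(dec_dim, att_dim)
        self.proj_topic = nn.Linear(emb_dim, att_dim)
        self.score = nn.Linear(att_dim, 1)

    def forward(self, dec_state, topic_embs):
        # dec_state: (batch, dec_dim); topic_embs: (batch, n_topic_words, emb_dim)
        energy = torch.tanh(self.proj_state(dec_state).unsqueeze(1)
                            + self.proj_topic(topic_embs))               # (b, n, att_dim)
        weights = torch.softmax(self.score(energy).squeeze(-1), dim=-1)  # (b, n)
        return torch.bmm(weights.unsqueeze(1), topic_embs).squeeze(1)    # (b, emb_dim)
```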
In the diversification-oriented THRED variant, topical information is encoded as a dense matrix constructed via NMF on a word-word Positive Pointwise Mutual Information (PPMI) matrix, producing latent topic vectors for each word (Hu et al., 2019). Per-turn topic distributions are then derived and injected into the decoder, enhancing topical control at generation time.
3. Mathematical Formulation
Let $D = (u_1, \ldots, u_M)$ denote a multi-turn dialogue comprising $M$ utterances, where $u_i = (w_{i,1}, \ldots, w_{i,N_i})$.
Utterance Encoder (per Dziri et al., 2018):
For the $i$-th utterance, a GRU produces word-level states $h_{i,j} = \mathrm{GRU}(h_{i,j-1}, e(w_{i,j}))$, where $w_{i,j}$ is the $j$-th word of $u_i$ and $e(\cdot)$ denotes the word-embedding lookup.
Word (Message)-Level Attention:
At decoder step $t$, for each utterance $u_i$, attention weights $\alpha_{i,j}^{t} = \mathrm{softmax}_j\big(\eta(s_{t-1}, h_{i,j})\big)$ yield the utterance summary $\tilde{h}_i^{t} = \sum_j \alpha_{i,j}^{t} h_{i,j}$, where $\eta$ is an MLP scoring function and $s_{t-1}$ is the previous decoder state.
Context Encoder:
$c_i^{t} = \mathrm{GRU}(c_{i-1}^{t}, \tilde{h}_i^{t})$, aggregating the attended utterance summaries across turns.
Joint Attention in the Decoder:
- Context Attention: $\beta_i^{t} = \mathrm{softmax}_i\big(\eta'(s_{t-1}, c_i^{t})\big)$, giving the context vector $c^{t} = \sum_i \beta_i^{t} c_i^{t}$.
- Topic Attention:
Let $k_1, \ldots, k_n$ be the top-$n$ topic word embeddings from the inferred topic. Weights $\gamma_j^{t} = \mathrm{softmax}_j\big(\eta''(s_{t-1}, k_j)\big)$ produce the topic vector $o^{t} = \sum_j \gamma_j^{t} k_j$.
Decoder state:
$s_t = \mathrm{GRU}\big(s_{t-1}, [e(y_{t-1}); c^{t}; o^{t}]\big)$, where $y_{t-1}$ is the previously generated token.
The output distribution over the response and topic vocabularies is the normalized sum of two MLP heads, biasing the generation toward on-topic content.
In the diversification-oriented THRED (Hu et al., 2019), the decoder LSTM also receives a sampled global latent variable $z \sim \mathcal{N}(\mu, \sigma^2 I)$ and the per-utterance topic vector $t_i$: $s_t = \mathrm{LSTM}\big(s_{t-1}, [e(y_{t-1}); c^{t}; z; t_i]\big)$.
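The following sketch mirrors one decoder step of this formulation, concatenating the previous token embedding with the attended context and topic vectors. For brevity it combines the two output heads as a masked logit sum rather than the papers' normalized sum of probabilities; names and sizes are illustrative assumptions.

```python
# One decoder step: previous token embedding concatenated with the attended
# context and topic vectors, with a topic-biased output head. Illustrative only.
import torch
import torch.nn as nn

class TopicalDecoderStep(nn.Module):
    def __init__(self, emb_dim=300, ctx_dim=512, topic_dim=300, dec_dim=512, vocab=20000):
        super().__init__()
        self.cell = nn.GRUCell(emb_dim + ctx_dim + topic_dim, dec_dim)
        self.msg_head = nn.Linear(dec_dim, vocab)     # general-vocabulary scores
        self.topic_head = nn.Linear(dec_dim, vocab)   # topic-word bias scores

    def forward(self, prev_emb, ctx_vec, topic_vec, prev_state, topic_mask):
        # topic_mask: (vocab,) indicator over the top-n topic words
        state = self.cell(torch.cat([prev_emb, ctx_vec, topic_vec], dim=-1), prev_state)
        logits = self.msg_head(state) + self.topic_head(state) * topic_mask
        return torch.log_softmax(logits, dim=-1), state
```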
4. Topical Concept Extraction and Topic Modeling
In (Dziri et al., 2018), Latent Dirichlet Allocation (LDA) with 150 topics is trained via collapsed Gibbs sampling on large conversational corpora (1M Reddit dialogues and OpenSubtitles). For each new dialogue, the most probable topic is inferred; the top-$n$ words by topic-word probability are selected, embedded, and used for topic-level attention during decoding.
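A sketch of this extraction step using gensim's LdaModel (which trains with online variational Bayes, standing in for the collapsed Gibbs sampler cited above). The toy corpus, number of topics, and top-n value are illustrative assumptions; the paper uses 150 topics on much larger corpora.

```python
# Infer the most probable topic for a dialogue history and return its top-n words.
from gensim import corpora, models

tokenized_dialogs = [
    ["the", "team", "lost", "the", "game", "last", "night"],
    ["what", "movie", "should", "i", "watch", "this", "weekend"],
]
dictionary = corpora.Dictionary(tokenized_dialogs)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_dialogs]
lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=2)

def topic_words_for(history_tokens, topn=100):
    """Top-n words of the most probable topic for a dialogue history."""
    bow = dictionary.doc2bow(history_tokens)
    topic_id, _ = max(lda.get_document_topics(bow), key=lambda tp: tp[1])
    return [word for word, _ in lda.show_topic(topic_id, topn=topn)]
```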
Alternatively, (Hu et al., 2019) constructs a word–topic matrix via NMF on a positive pointwise mutual information (PPMI) matrix, providing continuous-valued topic distributions for all vocabulary items. The local topic vector of an utterance $u$ is computed by aggregating the topic rows of its words and normalizing via softmax: $t_u = \mathrm{softmax}\big(\tfrac{1}{|u|}\sum_{w \in u} T_w\big)$, where $T_w$ is the topic vector of word $w$.
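A sketch of this construction, assuming a dense word–word co-occurrence count matrix is available; sklearn's NMF and the hyperparameters stand in as illustrative choices rather than the authors' exact pipeline.

```python
# Word-topic matrix from NMF over a PPMI matrix, plus the softmax-normalized
# local topic vector of an utterance. Counts and hyperparameters are illustrative.
import numpy as np
from sklearn.decomposition import NMF

def ppmi(cooc):
    """Positive PMI from a dense word-word co-occurrence count matrix."""
    total = cooc.sum()
    p_wc = cooc / total
    p_w = cooc.sum(axis=1, keepdims=True) / total
    p_c = cooc.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.where(p_wc > 0, np.log(p_wc / (p_w * p_c)), 0.0)
    return np.maximum(pmi, 0.0)

def word_topic_matrix(cooc, n_topics=100):
    """Rows are continuous topic vectors T_w for each vocabulary word."""
    return NMF(n_components=n_topics, init="nndsvd").fit_transform(ppmi(cooc))

def local_topic_vector(word_ids, topics):
    """Aggregate the topic rows of an utterance's words and softmax-normalize."""
    scores = topics[word_ids].mean(axis=0)
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()
```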
5. Training Objectives and Optimization
In both THRED variants, the primary objective is maximum likelihood estimation (minimizing the negative log-likelihood) over training triples of (dialogue history, inferred topic, next utterance):
$$\mathcal{L}_{\mathrm{MLE}} = -\sum_{t} \log p_\theta\big(y_t \mid y_{<t}, D, \mathrm{topic}(D)\big).$$
The diversification-oriented version maximizes a variational lower bound with a KL-divergence term for the latent variable $z$:
$$\mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_{q_\phi(z \mid D, y)}\big[\log p_\theta(y \mid D, z)\big] - \mathrm{KL}\big(q_\phi(z \mid D, y)\,\|\,p_\theta(z \mid D)\big).$$
A local topic regularization encourages topic coherence between context and generated response, using the KL divergence between their local topic distributions:
$$\mathcal{L}_{\mathrm{topic}} = \mathrm{KL}\big(t_{\mathrm{context}} \,\|\, t_{\mathrm{response}}\big).$$
The final loss is a weighted sum of the global and local objectives.
All model parameters, including word embeddings, RNN weights, and MLP heads, are trained end-to-end, typically with the Adam optimizer and a dropout rate of 0.2.
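A compact sketch of how the combined objective above might be assembled for the diversification-oriented variant; the weighting coefficients, tensor shapes, and KL direction are assumptions for illustration, not values from the papers.

```python
# Reconstruction NLL + KL for the global latent variable + local topic-coherence KL.
import torch
import torch.nn.functional as F

def thred_vae_loss(logits, targets, mu, logvar, ctx_topics, resp_topics,
                   kl_weight=1.0, topic_weight=0.1, pad_id=0):
    # logits: (batch, seq, vocab); targets: (batch, seq); topic vectors: (batch, K)
    nll = F.cross_entropy(logits.transpose(1, 2), targets, ignore_index=pad_id)
    kl_latent = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    kl_topic = F.kl_div(resp_topics.log(), ctx_topics, reduction="batchmean")
    return nll + kl_weight * kl_latent + topic_weight * kl_topic
```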
6. Evaluation Metrics and Quantitative Performance
Two novel automated metrics are introduced in (Dziri et al., 2018):
- Semantic Similarity (SS):
SS between generated response and previous utterances is measured as a brevity/dullness-penalized, cosine-based distance between Universal Sentence Encoder embeddings. Lower SS implies higher semantic coherence to context.
- Response Echo Index (REI):
Maximal Jaccard similarity (after lemmatization and stop-word removal) between each response and a held-out subset of training utterances. Lower REI indicates less propensity to echo training data verbatim.
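A minimal sketch of REI under this definition, using spaCy for lemmatization and stop-word filtering as an assumed preprocessing choice rather than the authors' exact pipeline.

```python
# Response Echo Index: maximal Jaccard overlap between a generated response and
# a sample of training utterances, after lemmatization and stop-word removal.
import spacy

nlp = spacy.load("en_core_web_sm")

def content_lemmas(text):
    return {tok.lemma_.lower() for tok in nlp(text) if not (tok.is_stop or tok.is_punct)}

def response_echo_index(response, training_utterances):
    resp = content_lemmas(response)
    best = 0.0
    for utt in training_utterances:
        ref = content_lemmas(utt)
        union = resp | ref
        if union:
            best = max(best, len(resp & ref) / len(union))
    return best
```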
In (Hu et al., 2019):
- TopicDiv:
KL-divergence between the post and response topic vectors; lower values indicate tighter topical alignment.
- Distinct-n:
Fraction of unique n-grams in the output (diversity measure).
- F-score for Diversity–Coherence Tradeoff:
Combines Distinct-n and (1 − TopicDiv) in an F-measure (harmonic mean of the two).
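For concreteness, a small sketch of the Distinct-n computation over a set of generated responses (whitespace tokenization assumed).

```python
# Distinct-n: fraction of unique n-grams among all n-grams produced across
# a set of generated responses.
def distinct_n(responses, n=2):
    total, unique = 0, set()
    for resp in responses:
        tokens = resp.split()
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0
```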
Empirical results show that THRED exhibits:
- Substantially lower SS and REI than strong baselines (e.g., THRED/SS = 0.649 vs. HRED = 0.720, THRED/REI = 0.546 vs. HRED = 0.617).
- Enhanced diversity (Distinct-2 improved by 37% over reference models).
- Perplexity close to, or marginally higher than, diversity-unaware baselines.
- Statistically significant improvements in human ratings, e.g., mean score 2.20 (THRED) vs. 1.88 (best baseline).
- Ablation shows both hierarchical architecture and topic attention contribute independently to gains in contextuality and topicality (Dziri et al., 2018, Hu et al., 2019).
7. Significance, Applications, and Extensions
THRED advances conversational response generation by jointly modeling multi-turn context and latent topical structure, directly addressing limitations of generic, context-insensitive sequence generation. The dual attention and topic-biasing mechanisms lead to greater diversity and more topic-aligned dialogue, as validated by both automatic and human assessment.
Although (Dziri et al., 2018) and (Hu et al., 2019) employ LDA and NMF for topic modeling respectively, the modularity of the framework admits alternative topic information sources. The combination of hierarchical context representation, topic-aware attention, and diversity-promoting global variables produces outputs that better reflect both dialog history and latent topical structure, applicable in domains requiring informative, contextually appropriate system utterances.
A plausible implication is the potential for further advances in conversational modeling by integrating more nuanced conversational context, dynamic topic tracking, or end-to-end differentiability in topic extraction modules.
References:
- "Augmenting Neural Response Generation with Context-Aware Topical Attention" (Dziri et al., 2018)
- "Diversifying Topic-Coherent Response Generation for Natural Multi-turn Conversations" (Hu et al., 2019)