Adversarial Learning Augmentations in hredGAN
- The paper introduces hredGAN, a model that augments HRED with adversarial noise injection to enhance dialogue response diversity and informativity.
- It leverages a dual network architecture where a generator and a discriminator are jointly trained using both MLE and GAN objectives for improved context relevance.
- Empirical evaluations on datasets like MTC and UDC show significant improvements in perplexity, BLEU, ROUGE, and human evaluation scores compared to baseline models.
hredGAN is an adversarial learning approach to multi-turn dialogue response generation. Using conditional generative adversarial networks (GANs), it augments hierarchical recurrent encoder–decoder (HRED) frameworks with adversarial training to improve response diversity, informativeness, and relevance, particularly in settings with limited supervision or training data. The approach injects stochastic noise into the generator's latent space, so the system can produce a range of plausible responses conditioned on dialogue history; the final output is selected by a discriminator network that evaluates sequence realism and context relevance (Olabiyi et al., 2018).
1. Adversarial Learning Augmentation Framework
hredGAN is built upon a modified HRED sequence modeling backbone. At each conversational turn $i$, the generator $G$ models the conditional distribution $P_G(Y_i \mid X_i, Z_i)$, where $Y_i$ is the response, $X_i$ denotes the dialogue context, and $Z_i$ is an injected noise vector, drawn either at the utterance level ($Z_i \sim \mathcal{N}(0, I)$) or the word level ($Z_i^j \sim \mathcal{N}(0, I)$ for each step $j$).
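The difference between the two noise granularities can be illustrated with a minimal sketch. It assumes Gaussian noise and uses a hypothetical helper, `sample_noise`, that is not part of the original work: utterance-level noise draws one vector and reuses it at every decoding step, while word-level noise draws a fresh vector per step.

```python
import random

def sample_noise(num_steps, dim, level="word", sigma=1.0, rng=None):
    """Return one noise vector per decoding step.

    level="utterance": a single vector is drawn and reused at every step.
    level="word": an independent vector is drawn for each step.
    (Gaussian noise with standard deviation `sigma` is an assumption.)
    """
    rng = rng or random.Random()

    def draw():
        return [rng.gauss(0.0, sigma) for _ in range(dim)]

    if level == "utterance":
        z = draw()
        return [list(z) for _ in range(num_steps)]
    if level == "word":
        return [draw() for _ in range(num_steps)]
    raise ValueError(f"unknown noise level: {level!r}")

rng = random.Random(0)
z_u = sample_noise(num_steps=4, dim=3, level="utterance", rng=rng)
z_w = sample_noise(num_steps=4, dim=3, level="word", rng=rng)
```

In this sketch every row of `z_u` is identical, whereas the rows of `z_w` differ, which is what lets word-level noise perturb each decoding step independently.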
The discriminator $D$ is a word-level bidirectional RNN, sharing both the context-RNN and word embeddings with $G$ for tight parameter coupling. For a given dialogue history $X_i$ and response candidate $Y$ (real or generated), $D$ outputs per-word authenticity scores and aggregates them across the sequence: $D(X_i, Y) = \left[\prod_{j=1}^{J} D_{\mathrm{RNN}}(h_i, E(Y^j))\right]^{1/J}$, where $h_i$ is the context-RNN encoding of $X_i$, $E(\cdot)$ is the shared word embedding, and $J$ is the length of $Y$. Training proceeds via the minimax GAN objective: $D$ learns to distinguish real from synthetic responses, while $G$ seeks both to maximize log-likelihood under teacher forcing and to fool $D$ into accepting generated content as real.
2. Mathematical Formalism
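With $X_i$ the dialogue context at turn $i$, $Y_i$ the response of length $J$, and $Z_i$ the injected noise, the generator factorizes the conditional generation as
$$P_G(Y_i \mid X_i, Z_i) = \prod_{j=1}^{J} P_G\left(Y_i^j \mid \hat{Y}_i^{1:j-1}, X_i, Z_i\right)$$
Teacher forcing replaces the generated prefix $\hat{Y}_i^{1:j-1}$ with the true prefix $Y_i^{1:j-1}$ in training.
The conditional-GAN loss is
$$\mathcal{L}_{\mathrm{cGAN}}(G, D) = \mathbb{E}_{X_i, Y_i}\left[\log D(X_i, Y_i)\right] + \mathbb{E}_{X_i, Z_i}\left[\log\left(1 - D(X_i, G(X_i, Z_i))\right)\right]$$
Combined with maximum likelihood estimation,
$$\mathcal{L}_{\mathrm{MLE}}(G) = \mathbb{E}_{X_i, Y_i, Z_i}\left[-\log P_G(Y_i \mid X_i, Z_i)\right]$$
the joint training objective is
$$G^*, D^* = \arg\min_G \max_D \left(\lambda_G \mathcal{L}_{\mathrm{cGAN}}(G, D) + \lambda_M \mathcal{L}_{\mathrm{MLE}}(G)\right)$$
where typically $\lambda_G = \lambda_M = 1$ (Olabiyi et al., 2018).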
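As a rough numerical illustration of how the adversarial and likelihood terms combine, the sketch below evaluates both on toy probabilities. It assumes the standard conditional-GAN form (log D on real pairs plus log(1 - D) on generated pairs) and equal loss weights; the function names and the weights `lambda_g` and `lambda_m` are illustrative, not taken from the original implementation.

```python
import math

def cgan_loss(d_real, d_fake):
    """Batch mean of log D(real) + log(1 - D(fake)).

    The discriminator tries to maximize this value; the generator tries
    to minimize it. Inputs are discriminator probabilities in (0, 1).
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def mle_loss(token_probs):
    """Mean negative log-likelihood of ground-truth tokens (teacher forcing)."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def generator_objective(d_fake, token_probs, lambda_g=1.0, lambda_m=1.0):
    """Generator-side terms of the joint objective (those that depend on G)."""
    adversarial = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return lambda_g * adversarial + lambda_m * mle_loss(token_probs)

# Toy values: the discriminator is unsure about both real and generated pairs.
toy_cgan = cgan_loss(d_real=[0.5], d_fake=[0.5])
# The generator objective falls when the discriminator is fooled (high d_fake).
fooled = generator_objective(d_fake=[0.9], token_probs=[0.5])
caught = generator_objective(d_fake=[0.1], token_probs=[0.5])
```

Here `fooled` is smaller than `caught`, reflecting that the generator is rewarded both for assigning high likelihood to the reference tokens and for producing responses the discriminator scores as real.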
3. System Architecture
Generator:
- Four GRU-based RNNs (3 layers each, hidden size 512):
  - eRNN (utterance encoder; bidirectional)
  - cRNN (context encoder; unidirectional)
  - aRNN (attention encoder; bidirectional)
  - dRNN (decoder; unidirectional)
- Shared 512-dimensional word embeddings
- Local attention (Bahdanau or Luong) over the last input utterance, computing an attention context vector at each decoding step
- Noise injection: either utterance-level or word-level; the noise vector is concatenated to the decoder input
Discriminator:
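- Shares eRNN, aRNN, cRNN, and word embeddings with $G$
- 3-layer bidirectional GRU (hidden size 512) as $D_{\mathrm{RNN}}$, initialized from cRNN's final state $h_i$
- Aggregates word-level predictions over the $J$ words $Y^j$ of a candidate $Y$, with $E$ the shared embedding: $D(X_i, Y) = \left[\prod_{j=1}^{J} D_{\mathrm{RNN}}(h_i, E(Y^j))\right]^{1/J}$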
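The aggregation of word-level predictions into one sequence-level score can be sketched as follows. This assumes the aggregate is a geometric mean of per-word probabilities, which is an assumption of this example; `aggregate_word_scores` is a hypothetical helper name. Working in log space avoids underflow for long responses.

```python
import math

def aggregate_word_scores(word_scores):
    """Geometric mean of per-word probabilities, computed in log space.

    Assumes each score is a probability in (0, 1] that the corresponding
    word belongs to a real (human-written) response.
    """
    if not word_scores:
        raise ValueError("need at least one word score")
    if any(not 0.0 < s <= 1.0 for s in word_scores):
        raise ValueError("scores must lie in (0, 1]")
    mean_log = sum(math.log(s) for s in word_scores) / len(word_scores)
    return math.exp(mean_log)

sequence_score = aggregate_word_scores([0.25, 1.0])
```

A geometric mean normalizes for response length, so a long candidate is not penalized simply for containing more words than a short one.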
4. Inference and Candidate Ranking
During inference, for dialogue context $X_i$, $L$ noise vectors $Z_i^{(1)}, \dots, Z_i^{(L)}$ are sampled; increasing the noise variance parameter $\alpha$ ($Z \sim \mathcal{N}(0, \alpha I)$) expands response diversity. Each $Z_i^{(l)}$ produces a candidate response $Y_i^{(l)}$ via greedy decoding. All candidates are scored by $D$, optionally fused with the generator's log-probability of the candidate. The highest-ranked candidate is output as the response (Olabiyi et al., 2018).
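The sample-then-rank procedure can be sketched with caller-supplied stand-ins for the trained networks. Everything named here (`rank_candidates`, `toy_generate`, `toy_score`, and the default values of `num_samples`, `dim`, and `alpha`) is hypothetical and chosen only to make the control flow concrete; the optional log-probability fusion is omitted.

```python
import random

def rank_candidates(context, generate, score, num_samples=5, dim=4,
                    alpha=2.0, seed=0):
    """Sample noise, decode one candidate per sample, and rank by score.

    generate(context, z) -> response and score(context, response) -> float
    are stand-ins for the trained generator and discriminator.
    """
    rng = random.Random(seed)
    sigma = alpha ** 0.5  # N(0, alpha * I) has standard deviation sqrt(alpha)
    scored = []
    for _ in range(num_samples):
        z = [rng.gauss(0.0, sigma) for _ in range(dim)]
        response = generate(context, z)
        scored.append((score(context, response), response))
    # Highest discriminator score first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[0][1], scored

# Toy stand-ins, purely for illustration.
def toy_generate(context, z):
    return f"reply-{sum(z):.3f}"

def toy_score(context, response):
    return -abs(float(response.split("-", 1)[1]))

best, ranked = rank_candidates("hello", toy_generate, toy_score)
```

Raising `alpha` widens the noise distribution, so the sampled candidates differ more from one another before the discriminator picks among them.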
5. Training Strategy and Hyperparameter Configuration
- Optimizer: stochastic gradient descent (SGD), initial learning rate 0.5, decayed by 0.99 if adversarial loss plateaus for two iterations
- Mini-batch size: 64 conversations; gradient clipping at norm 5.0
- Vocabulary size: 50,000; sampled softmax for training, full softmax for evaluation
- Discriminator update protocol: if D-accuracy < 0.99, update $D$; if D-accuracy < 0.75, update $G$ using only MLE; otherwise update $G$ using both MLE and GAN losses
- Xavier initialization for all RNNs
- Loss weights: $\lambda_G = \lambda_M = 1$
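The accuracy-gated update rule and the learning-rate decay listed above can be sketched as two small helpers. The function names are hypothetical, and the plateau test in `decay_learning_rate` (no improvement beyond a tolerance over the last two iterations) is an assumption, since the exact plateau criterion is not specified here.

```python
def select_updates(d_accuracy, d_threshold=0.99, g_threshold=0.75):
    """Decide which networks to update for one mini-batch."""
    if d_accuracy < g_threshold:
        generator_losses = ("mle",)        # discriminator too weak: MLE only
    else:
        generator_losses = ("mle", "gan")  # use both losses
    return {
        "update_discriminator": d_accuracy < d_threshold,
        "generator_losses": generator_losses,
    }

def decay_learning_rate(lr, adv_losses, factor=0.99, patience=2, tol=1e-4):
    """Multiply lr by `factor` when the adversarial loss has plateaued.

    Plateau (an assumption of this sketch): the best loss in the last
    `patience` iterations is not better than the earlier best by `tol`.
    """
    if len(adv_losses) <= patience:
        return lr
    recent_best = min(adv_losses[-patience:])
    earlier_best = min(adv_losses[:-patience])
    return lr * factor if recent_best > earlier_best - tol else lr

plan = select_updates(d_accuracy=0.9)
new_lr = decay_learning_rate(0.5, [1.0, 1.0, 1.0])
```

Gating the adversarial term on discriminator accuracy keeps the generator from chasing gradients produced by a discriminator that cannot yet separate real from generated responses.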
6. Empirical Results
Extensive evaluation on Movie Triples Corpus (MTC) and Ubuntu Dialogue Corpus (UDC) demonstrates hredGAN's empirical gains over baseline HRED and variational VHRED (summarized below):
| Model | MTC Perplexity | UDC Perplexity | BLEU-2 (MTC/UDC) | ROUGE-2 (MTC/UDC) | Human Eval (MTC/UDC) |
|---|---|---|---|---|---|
| HRED | 31.9/36.0 | 69.4/86.4 | 0.0474/0.0177 | 0.0384/0.0483 | 0.256/0.347 |
| VHRED | 42.6/45.0 | 98.5/105.2 | 0.0606/0.0171 | 0.1181/0.0855 | 0.391/0.405 |
| hredGAN_u | 23.6/23.5 | 56.8/57.3 | 0.0493/0.0137 | 0.2416/0.0716 | 0.558/0.613 |
| hredGAN_w | 24.2/24.1 | 47.7/48.2 | 0.0613/0.0216 | 0.3244/0.1168 | 0.787/0.691 |
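Both hredGAN variants achieve lower perplexity than HRED and VHRED on both corpora. The word-level variant (hredGAN_w) also has the highest BLEU-2 and ROUGE-2 in the table, although its BLEU-2 margin over VHRED on MTC is small (0.0613 vs. 0.0606). The utterance-level variant (hredGAN_u) improves ROUGE-2 over HRED but trails VHRED on BLEU-2 and on UDC ROUGE-2. Gains in Distinct-n are also reported but are not listed in the table. Word-level noise injection delivers the strongest improvements in informativeness, utterance relevance, and topic coverage. Human evaluation (normalized quality score, 0–1 scale) corroborates the automatic metrics, with hredGAN_w attaining 0.787 (MTC) and 0.691 (UDC), compared to 0.256/0.347 for HRED and 0.391/0.405 for VHRED (Olabiyi et al., 2018).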
7. Extensions: Persona Conditioning and phredGAN
Subsequent research extends hredGAN to persona-conditioned dialogue generation (phredGAN) by incorporating external attributes such as speaker identity, location, or subtopic into both the encoder and decoder RNNs (Olabiyi et al., 2019). Persona attributes are embedded and concatenated at each turn, conditioning sequential context representations and driving the generator toward speaker-consistent output modes. On the TV data, phredGAN improves BLEU-4 over the persona seq2seq baselines at comparable perplexity (25.9 vs. 25.0 and 25.4); on UDC, it improves perplexity, ROUGE-2, and Distinct-1 over hredGAN, while Distinct-2 is lower:
| Model | TV Perplexity | TV BLEU-4 (%) | TV ROUGE-2 | TV Distinct-1/2 | UDC Perplexity | UDC ROUGE-2 | UDC Distinct-1/2 |
|---|---|---|---|---|---|---|---|
| Speaker-only | 25.0 | 1.88 | - | - | - | - | - |
| Speaker-Addressee | 25.4 | 1.90 | - | - | - | - | - |
| phredGAN_u | 25.9 | 3.00 | 0.4044 | 0.1765/0.2164 | - | - | - |
| hredGAN_w | - | - | - | - | 48.18 | 0.1252 | 14.05/31.24 |
| phredGAN_w | - | - | - | - | 27.30 | 0.1692 | 20.12/24.53 |
phredGAN yields persona-consistent and informative multi-turn dialogues in both entertainment and customer service domains. A plausible implication is that explicit attribute conditioning supports robust persona imitation and enhances contextually appropriate response generation, though it relies on the availability of accurate persona annotations (Olabiyi et al., 2019).
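The persona conditioning described in this section, where an attribute embedding is concatenated to the inputs at each turn, can be sketched in a few lines. The embedding table, its dimensions, and the helper name `condition_on_persona` are hypothetical and serve only to illustrate the concatenation step.

```python
def condition_on_persona(token_embeddings, persona_embedding):
    """Append the same persona vector to every token embedding of a turn."""
    return [list(tok) + list(persona_embedding) for tok in token_embeddings]

# Hypothetical 2-d persona table and 3-d token embeddings, for illustration only.
persona_table = {"speaker_a": [1.0, 0.0], "speaker_b": [0.0, 1.0]}
turn = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
conditioned = condition_on_persona(turn, persona_table["speaker_a"])
```

Each conditioned vector carries both the token content and the speaker attribute, so downstream recurrent layers can learn speaker-specific behavior from the same input stream.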