Machine Reading Comprehension for Answer Re-Ranking in Customer Support Chatbots (1902.04574v2)

Published 12 Feb 2019 in cs.CL

Abstract: Recent advances in deep neural networks, language modeling, and language generation have introduced new ideas to the field of conversational agents. As a result, deep neural models such as sequence-to-sequence, Memory Networks, and the Transformer have become key ingredients of state-of-the-art dialog systems. While those models are able to generate meaningful responses even in unseen situations, they need a lot of training data to build a reliable model. Thus, most real-world systems stick to traditional approaches based on information retrieval and even hand-crafted rules, due to their robustness and effectiveness, especially for narrow-focused conversations. Here, we present a method that adapts a deep neural architecture from the domain of machine reading comprehension to re-rank the suggested answers from different models using the question as context. We train our model using negative sampling based on question-answer pairs from the Twitter Customer Support Dataset. The experimental results show that our re-ranking framework can improve the performance in terms of word overlap and semantics both for individual models as well as for model combinations.

This paper "Machine Reading Comprehension for Answer Re-Ranking in Customer Support Chatbots" (Hardalov et al., 2019 ) addresses the challenge of building robust customer support chatbots, particularly when faced with limited training data for specific domains. While generative models like sequence-to-sequence and Transformer architectures can produce novel responses, they require large datasets. Information Retrieval (IR) based systems are often more robust in data-scarce environments but lack the flexibility of generation.

The authors propose a framework that combines the strengths of different models by using a Machine Reading Comprehension (MRC) approach for answer re-ranking. The core idea is to train a classifier that evaluates the "goodness" of a candidate question-answer pair. This classifier then scores potential answers provided by various source models (e.g., IR, seq2seq), and these scores are used to select the best response.

Model Architecture and Adaptation

The re-ranking model is based on QANet [yu2018qanet], a state-of-the-art architecture for MRC. Typically, MRC models take a context paragraph and a question and output a start and end index within the context corresponding to the answer span. In this work, the model is adapted to treat the user's question as the "context" and a candidate answer as the "potential answer span". The goal is not to extract a span, but to classify the relationship between the question and the candidate answer.

The QANet architecture used includes:

  1. Embedding Layer: Converts words into dense vectors using either pre-trained GloVe [pennington2014glove] or ELMo [Peters:2018:ELMo] embeddings. A highway network [srivastava2015highway] is added on top for gated information flow. Embeddings for questions and answers are learned separately.
  2. Embedding Encoder: Processes the embeddings using a combination of convolution, self-attention [NIPS2017_7181:transformer], and a feed-forward network within residual blocks [ba2016layer].
  3. Attention Layer: Computes bidirectional attention between the encoded question and answer representations (Answer-to-Question and Question-to-Answer attention).
  4. Model Layer: Takes the concatenated representations (answer embedding, A2Q attention, element-wise product of answer and A2Q, element-wise product of answer and Q2A) and passes them through residual blocks.
  5. Output Layer: A linear layer applied to the output of the model layer produces a score for the question-answer pair. This score is then passed through a sigmoid function to get a probability indicating the likelihood that the candidate answer is a good fit for the question.

The model is trained to minimize a binary cross-entropy loss on question-answer pairs.
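
The paper does not include reference code for this step; the following is a minimal PyTorch sketch of the adapted output layer and the binary cross-entropy objective. The names (PairScorer, hidden_dim) and the pooled pair representation fed into the scorer are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn

class PairScorer(nn.Module):
    """Hypothetical output layer: maps the model-layer representation of a
    (question, answer) pair to a single goodness logit."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)  # linear layer producing one score per pair

    def forward(self, pair_repr: torch.Tensor) -> torch.Tensor:
        # pair_repr: (batch, hidden_dim) pooled output of the model layer
        return self.linear(pair_repr).squeeze(-1)  # raw logit; sigmoid applied inside the loss

# Binary cross-entropy over good (1) and negatively sampled (0) pairs.
scorer = PairScorer(hidden_dim=128)
criterion = nn.BCEWithLogitsLoss()
pair_repr = torch.randn(8, 128)                # stand-in for encoded (question, answer) pairs
labels = torch.randint(0, 2, (8,)).float()     # 1 = good pair, 0 = "bad" (sampled) pair
loss = criterion(scorer(pair_repr), labels)
loss.backward()
```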

Data Preparation and Negative Sampling

The dataset used is the Twitter Customer Support Dataset [Twitter], focusing on Apple support conversations. Since the original dataset only contains "good" (user question, support answer) pairs, negative sampling [NIPS2013:w2v] is employed to create "bad" pairs by matching a user question with a random answer from another conversation in the training set. A cosine-similarity check ensures that randomly sampled answers that are too similar to the original correct answer are not labeled as "bad". The dataset is split chronologically for training and testing to simulate a real-world scenario. Preprocessing uses a specialized Twitter tokenizer [manning2014stanford] and handles Twitter-specific elements such as URLs, mentions, and hashtags. Questions and answers are trimmed to fixed lengths (60 and 70 words, respectively).
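
As an illustration of this negative-sampling procedure, the sketch below pairs each question with a random answer from another conversation and discards candidates that are too close to the gold answer under cosine similarity. The embed function, the 0.8 threshold, and the retry cap are assumptions made for the example, not values reported in the paper.

```python
import random
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def sample_negatives(pairs, embed, sim_threshold=0.8, max_tries=20, seed=0):
    """pairs: list of (question, gold_answer); embed: sentence -> vector.
    Returns (question, random_answer, 0) triples usable as 'bad' training examples."""
    rng = random.Random(seed)
    answers = [a for _, a in pairs]
    negatives = []
    for question, gold in pairs:
        gold_vec = embed(gold)
        for _ in range(max_tries):
            candidate = rng.choice(answers)
            # keep the candidate only if it is genuinely different from the gold answer
            if candidate != gold and cosine(embed(candidate), gold_vec) < sim_threshold:
                negatives.append((question, candidate, 0))
                break
    return negatives
```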

Answer Re-Ranking and Selection

Given a user question and a set of candidate answers generated by different source models (like IR or seq2seq), the trained QANet classifier scores each candidate answer. The candidates are then re-ranked based on these scores.

Two answer selection strategies are explored:

  1. Max Strategy: Selects the candidate answer with the highest goodness score predicted by the QANet model.
  2. Proportional Sampling (Softmax): Treats the raw scores (before the sigmoid activation) of all candidate answers as logits of a categorical distribution and samples an answer probabilistically, with the selection probability given by a softmax over the candidate scores. This strategy aims to introduce diversity and implicitly up-vote answers that appear frequently among the top candidates from different source models.
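
The two strategies can be summarized by the short sketch below: select_max is the greedy baseline, while select_softmax treats the raw scores as logits and samples from the resulting categorical distribution. This is an illustrative reading of the selection step, not code from the paper.

```python
import numpy as np

def select_max(candidates, scores):
    """Max strategy: return the candidate with the highest goodness score."""
    return candidates[int(np.argmax(scores))]

def select_softmax(candidates, scores, rng=None):
    """Proportional sampling: treat raw (pre-sigmoid) scores as logits of a
    categorical distribution and sample one candidate from it."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(scores, dtype=float)
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    probs /= probs.sum()
    return candidates[rng.choice(len(candidates), p=probs)]
```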

Implementation Details and Experimental Setup

  • Optimizer: Adam [kingma2015adam] with a decaying learning rate.
  • Regularization: Dropout [srivastava2014dropout] and L2 weight decay.
  • Evaluation: The performance is evaluated using both word-overlap metrics (BLEU@2 [papineni2002bleu], ROUGE-L [lin-och:2004:ACL]) and semantic similarity metrics (Embedding Average [lowe2015ubuntu], Greedy Matching [rus2012comparison], Vector Extrema [forgues2014bootstrapping]) using pre-trained word2vec embeddings.
  • Source Models: IR (ElasticSearch with BM25 [Robertson:2009:PRF:1704809.1704810]), seq2seq (bi-directional LSTM), and Transformer models are used to generate initial candidate sets.
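
For reference, word-overlap and embedding-based metrics of the kind listed above can be computed with standard tooling. The sketch below shows BLEU@2 via NLTK and a simple Embedding Average similarity; the word_vectors mapping from tokens to pre-trained word2vec vectors is an assumption made for the example.

```python
import numpy as np
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu2(reference: str, hypothesis: str) -> float:
    # BLEU@2: equal weights on unigrams and bigrams, smoothed for short texts
    return sentence_bleu([reference.split()], hypothesis.split(),
                         weights=(0.5, 0.5),
                         smoothing_function=SmoothingFunction().method1)

def embedding_average(sentence: str, word_vectors: dict, dim: int = 300) -> np.ndarray:
    # Mean of the word vectors for in-vocabulary tokens
    vecs = [word_vectors[w] for w in sentence.split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def embedding_average_sim(reference: str, hypothesis: str, word_vectors: dict) -> float:
    u = embedding_average(reference, word_vectors)
    v = embedding_average(hypothesis, word_vectors)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))
```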

Results and Findings

  • Auxiliary Classification Task: The QANet model achieves high accuracy (up to 85.45%) in distinguishing good from bad question-answer pairs. Contextualized ELMo embeddings, specifically sentence-level ELMo, perform better than GloVe or token-level ELMo for this task.
  • Individual Models: seq2seq outperforms IR and Transformer on the tested customer support dataset, consistent with previous work [hardalov:10.1007/978-3-319-99344-7_5:customer].
  • Re-ranking Individual Models: Applying QANet re-ranking to the top-K candidates from a single IR model significantly improves performance across all metrics compared to using the top IR result directly.
  • Multi-Source Re-ranking: Re-ranking the top-K candidates from a combination of IR and seq2seq models shows further improvements over individual models. The QANet re-ranker with sentence-level ELMo embeddings performs best overall, achieving the highest scores on BLEU@2, ROUGE-L, and Greedy Matching.
  • Softmax Sampling: The proportional sampling strategy based on softmax consistently yields better results than the greedy "max" strategy, especially on word-overlap metrics. This supports the hypothesis that introducing diversity and considering the relative scores rather than just the maximum improves the overall quality and naturalness of the chatbot's responses.

Practical Implications for Implementation

The paper demonstrates that re-ranking can be a highly effective strategy for improving chatbot performance by leveraging outputs from multiple potentially simpler or specialized source models. Instead of relying solely on a single, potentially brittle, complex model, a re-ranking approach allows for combining diverse answer generation mechanisms.

Implementing this framework involves:

  1. Setting up Source Models: Have one or more existing or purpose-built models (IR, seq2seq, etc.) capable of generating a list of candidate answers for a given user query.
  2. Data Collection and Labeling: Gather real-world question-answer pairs. This dataset will be used to train the re-ranker.
  3. Negative Sampling Implementation: Develop a process to automatically generate negative samples by pairing questions with incorrect answers. This is crucial for training the binary classifier.
  4. QANet Implementation: Adapt an existing QANet or MRC model implementation for the classification task. This involves modifying the output layer to predict a single score rather than span indices. Libraries like TensorFlow or PyTorch can be used. Pre-trained embeddings (GloVe, ELMo) should be integrated.
  5. Training the Re-ranker: Train the adapted QANet model on the prepared dataset using the question as context and the candidate answer as the text to be scored.
  6. Integration into Chatbot Pipeline: When a user asks a question:
    • Pass the question to the source models to get a list of top-K candidate answers.
    • Pass each (question, candidate answer) pair through the trained QANet re-ranker to get a score for each candidate.
    • Apply either the max or proportional sampling strategy based on the scores to select the final answer.
  7. Deployment Considerations: The re-ranking model adds latency to the response time. The computational cost depends on the QANet model size and the number of candidates being re-ranked. Optimizing the model and running it on appropriate hardware (GPUs) is necessary for real-time performance. The size of the candidate set (K) is a trade-off between potential answer quality and inference speed.
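
Putting these steps together, a hypothetical serving loop might look like the sketch below. It assumes each source model exposes a top_k(question, k) method, the re-ranker exposes a score(question, answer) method, and the selection helpers from the earlier sketch are in scope; none of these interfaces come from the paper.

```python
def answer(question, source_models, reranker, k=5, strategy="softmax"):
    """Gather top-k candidates from every source model, score each
    (question, candidate) pair with the trained re-ranker, and pick the
    final reply with the chosen selection strategy."""
    candidates = []
    for model in source_models:                       # e.g. IR, seq2seq
        candidates.extend(model.top_k(question, k))   # assumed candidate API
    scores = [reranker.score(question, c) for c in candidates]  # raw logits
    if strategy == "max":
        return select_max(candidates, scores)         # greedy selection
    return select_softmax(candidates, scores)         # proportional sampling
```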

The use of contextualized embeddings like ELMo is shown to be beneficial, suggesting that incorporating more advanced language representations is a good implementation choice. The finding that proportional sampling improves results implies that building a robust re-ranking system should not solely rely on selecting the single highest-scoring answer, but potentially incorporate diversity or confidence weighting.

Future implementation efforts could explore integrating reinforcement learning as suggested by the authors to refine the answer selection policy based on user feedback or task completion signals, moving beyond static probabilistic sampling.

Authors (3)
  1. Momchil Hardalov (23 papers)
  2. Ivan Koychev (33 papers)
  3. Preslav Nakov (253 papers)
Citations (14)