Towards Automated Customer Support (1809.00303v1)

Published 2 Sep 2018 in cs.CL

Abstract: Recent years have seen growing interest in conversational agents, such as chatbots, which are a very good fit for automated customer support because the domain in which they need to operate is narrow. This interest was in part inspired by recent advances in neural machine translation, esp. the rise of sequence-to-sequence (seq2seq) and attention-based models such as the Transformer, which have been applied to various other tasks and have opened new research directions in question answering, chatbots, and conversational systems. Still, in many cases, it might be feasible and even preferable to use simple information retrieval techniques. Thus, here we compare three different models: (i) a retrieval model, (ii) a sequence-to-sequence model with attention, and (iii) Transformer. Our experiments with the Twitter Customer Support Dataset, which contains over two million posts from customer support services of twenty major brands, show that the seq2seq model outperforms the other two in terms of semantics and word overlap.

This paper, "Towards Automated Customer Support" (Hardalov et al., 2018), explores the use of automated systems for handling customer support inquiries, specifically in the context of social media platforms like Twitter. The increasing volume and variety of communication channels make traditional manual support costly and challenging to scale, especially for 24/7 availability. Chatbots are presented as a suitable solution due to their automation capabilities and the relatively narrow domain of customer support interactions.

The research compares three different approaches for building a customer support chatbot:

  1. Information Retrieval (IR): This method retrieves the most similar question from a historical dataset of user questions and support answers and returns the corresponding answer (see the sketch after this list).
  2. Sequence-to-Sequence (Seq2Seq) with Attention: A neural network model that encodes the user's question and decodes a generated response, using an attention mechanism to focus on relevant parts of the input.
  3. Transformer: A neural network model based solely on attention mechanisms, known for state-of-the-art performance in machine translation and other sequence-to-sequence tasks.
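
To make the retrieval idea in (i) concrete, here is a minimal sketch using the rank_bm25 package; the paper's actual system uses ElasticSearch's BM25 (described under the model configurations below), and the toy question-answer pairs here are invented for illustration:

```python
from rank_bm25 import BM25Okapi

# Historical (question, answer) pairs; whitespace tokenization kept for brevity.
pairs = [
    ("my iphone will not charge", "Try another cable and outlet, and DM us the results."),
    ("battery drains fast after the update", "Let's check your battery settings together."),
]
bm25 = BM25Okapi([q.split() for q, _ in pairs])

def ir_answer(query: str) -> str:
    """Return the answer paired with the historical question most similar to the query."""
    scores = bm25.get_scores(query.split())
    best = max(range(len(pairs)), key=lambda i: scores[i])
    return pairs[best][1]
```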

The paper utilizes the Customer Support on Twitter Dataset, a large corpus of tweets and replies from customer support services of various brands. For their experiments, the authors focused specifically on the Apple support interactions, the largest brand subset in the corpus. The dataset was filtered to remove conversations that redirect users to other channels and was split temporally, with earlier posts for training and the latest posts for testing, to simulate a real-world scenario where models respond to recent queries. The resulting dataset contained 49,626 dialog tuples (45,582 for training, 4,044 for testing). Statistics show that dialogs are relatively short (2.6 turns on average) and that answers tend to be slightly longer than questions.
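
A temporal split of this kind takes only a few lines of pandas; a minimal sketch follows, in which the file name and column names are hypothetical and the cutoff is chosen to reproduce the reported 45,582/4,044 split:

```python
import pandas as pd

# Hypothetical input: one dialog tuple per row with a timestamp column.
df = pd.read_csv("apple_support_tuples.csv", parse_dates=["created_at"])
df = df.sort_values("created_at")

# Earlier posts train, latest posts test (45,582 / 4,044 of 49,626 total).
cutoff = len(df) - 4044
train, test = df.iloc[:cutoff], df.iloc[cutoff:]
```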

Key implementation and preprocessing steps included:

  • Using a specialized Twitter tokenizer to handle tweet-specific formatting.
  • Replacing shorthand, slang, URLs, user mentions (<user>), and hashtags (<hashtag>) with standardized tokens.
  • Concatenating previous turns to the current question to provide conversational context to the models.
  • Limiting the vocabulary to the top N words by frequency and replacing all other words with a special unknown token (<unk>) for efficient training of the neural models (see the sketch after this list).
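
A minimal sketch of this preprocessing, assuming simple regex-based normalization in place of the specialized Twitter tokenizer and slang handling the paper uses (the <url> token name is an assumption, following the <user>/<hashtag> convention above):

```python
import re
from collections import Counter

URL_RE, USER_RE, TAG_RE = (re.compile(p) for p in
                           (r"https?://\S+", r"@\w+", r"#\w+"))

def preprocess(tweet: str) -> list[str]:
    # Normalize tweet entities to placeholder tokens, then tokenize naively.
    tweet = URL_RE.sub("<url>", tweet)
    tweet = USER_RE.sub("<user>", tweet)
    tweet = TAG_RE.sub("<hashtag>", tweet)
    return tweet.lower().split()

def build_vocab(token_lists: list[list[str]], n: int = 8192) -> set[str]:
    # Keep the N most frequent words; everything else becomes <unk>.
    counts = Counter(tok for toks in token_lists for tok in toks)
    return {w for w, _ in counts.most_common(n)}

def apply_vocab(tokens: list[str], vocab: set[str]) -> list[str]:
    return [t if t in vocab else "<unk>" for t in tokens]
```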

The specific model configurations used were:

  • IR: Implemented using ElasticSearch with the BM25 retrieval algorithm. The preprocessed training data was indexed, and previous turns were appended to both training and testing queries as context. Retrieval involved finding the top-ranked matching question for a test query and returning its associated answer.
  • Seq2Seq: Employed a single-layer bidirectional LSTM with 512 units per direction (1,024 total) for the encoder and two unidirectional LSTM layers for the decoder. Word embeddings were 200-dimensional, combining pre-trained GloVe vectors (trained on 27B tokens of Twitter data) for known words with learned positional embeddings for unknown words; encoder and decoder embeddings were not shared. The vocabulary was limited to the top 8,192 words, and sequence length was capped at 60 words. Dropout (0.8 keep probability) and the Adam optimizer (initial learning rate 1e-3, decayed by 0.99 per epoch) were used. A structural sketch follows this list.
  • Transformer: Consisted of two identical encoder and decoder layers, each with four attention heads. Model dimensionality ($d_{model}$) was 256, and inner dimensionality ($d_{inner}$) was 512. Input and output embeddings were learned separately, with sinusoidal positional encoding. Dropout keep probability was 0.9. The Adam optimizer was used with a learning rate varied according to $lrate = d_{model}^{-0.5} \cdot \min\left(step\_num^{-0.5},\ step\_num \cdot warmup\_steps^{-1.5}\right)$; a sketch of this schedule also follows the list.
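
A rough PyTorch rendering of the Seq2Seq configuration, as a sketch under stated assumptions: attention, GloVe initialization, and the embeddings for unknown words are omitted, and the mapping of the bidirectional encoder state into the decoder is a guess, not a detail from the paper:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    # 1-layer BiLSTM encoder (512 units per direction), 2-layer LSTM decoder
    # (1,024 units), 200-d embeddings, 8,192-word vocabulary, as in the paper.
    def __init__(self, vocab_size=8192, emb_dim=200, hidden=512):
        super().__init__()
        self.src_emb = nn.Embedding(vocab_size, emb_dim)  # not shared with decoder
        self.tgt_emb = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, 2 * hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(2 * hidden, vocab_size)

    def forward(self, src, tgt):
        _, (h, c) = self.encoder(self.src_emb(src))
        # Concatenate the two directions (512 + 512 = 1,024) and copy the state
        # to both decoder layers -- an assumption, not the paper's stated design.
        h = torch.cat([h[0], h[1]], dim=-1).unsqueeze(0).repeat(2, 1, 1)
        c = torch.cat([c[0], c[1]], dim=-1).unsqueeze(0).repeat(2, 1, 1)
        out, _ = self.decoder(self.tgt_emb(tgt), (h, c))
        return self.proj(out)  # logits over the vocabulary
```

And the Transformer's learning-rate schedule written out as a function; the warmup_steps value is not given in the summary, so the original Transformer paper's default of 4,000 is assumed here:

```python
def transformer_lrate(step: int, d_model: int = 256, warmup_steps: int = 4000) -> float:
    # lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```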

Chatbot evaluation is challenging, and standard machine translation and text summarization metrics (like BLEU and ROUGE), which focus on word overlap, may not fully capture semantic similarity in conversational contexts. Therefore, the authors used a combination of:

  • Word Overlap Measures: BLEU@2 and ROUGE-L.
  • Semantic Evaluation Measures: Embedding Average (cosine similarity of averaged word embeddings), Greedy Matching (average cosine similarity of best word matches), and Vector Extrema (cosine similarity of vectors built from the coordinate-wise max/min of word embeddings). For the semantic measures, pre-trained Google word2vec embeddings were used to avoid bias from the training data. A sketch of the Embedding Average computation follows this list.
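
As an example of the semantic measures, a minimal sketch of Embedding Average scoring, assuming `emb` is a dict-like word-vector lookup (e.g. gensim's KeyedVectors loaded with the Google word2vec model):

```python
import numpy as np

def embedding_average(tokens, emb):
    # Mean of the word vectors for tokens present in the embedding table.
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0) if vecs else None

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

# score = cosine(embedding_average(hyp_tokens, emb),
#                embedding_average(ref_tokens, emb))
```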

The experimental results showed that the Seq2Seq model performed best across all five evaluation measures, achieving the highest scores on both the word-overlap measures (BLEU@2, ROUGE-L) and the semantic measures (Embedding Average, Greedy Matching, Vector Extrema). The Transformer model ranked second on most measures (Greedy Matching, Vector Extrema, ROUGE-L) but, surprisingly, performed slightly worse than IR on BLEU@2 and Embedding Average. The IR model was second best on BLEU@2 and Embedding Average but the lowest performer on Greedy Matching, Vector Extrema, and ROUGE-L.

The discussion highlights that while Transformer achieved state-of-the-art results in other domains, Seq2Seq with attention was more effective for this specific customer support task, possibly due to dataset characteristics or model tuning. The lower scores of Transformer on BLEU@2 and ROUGE-L suggest it might be generating semantically relevant responses but with different vocabulary or word order compared to the gold answers. The IR model's strength lies in its ability to provide accurate answers when an almost identical question exists in the training data, particularly useful for rare or specific issues where neural models might hallucinate irrelevant responses. However, IR struggles when the user's query isn't a close match to any historical question.

In conclusion, the paper demonstrates that generative neural models, particularly Seq2Seq with attention, outperform retrieval-based methods for automated customer support on Twitter based on the chosen dataset and evaluation metrics. However, they note that neural models can fail when faced with questions unlike those in the training data, a scenario where IR can sometimes be more reliable if a good match exists. Future work suggested includes exploring ensemble models combining the strengths of different approaches and developing methods to handle answers that change over time.

Authors: Momchil Hardalov, Ivan Koychev, Preslav Nakov