
Going beyond research datasets: Novel intent discovery in the industry setting (2305.05474v1)

Published 9 May 2023 in cs.CL and cs.LG

Abstract: Novel intent discovery automates the process of grouping similar messages (questions) to identify previously unknown intents. However, current research focuses on publicly available datasets, which contain only the question field and differ significantly from real-life datasets. This paper proposes methods to improve the intent discovery pipeline deployed on a large e-commerce platform. We show the benefit of pre-training LLMs on in-domain data, both self-supervised and with weak supervision. We also devise the best method to utilize the conversational structure (i.e., question and answer) of real-life datasets during fine-tuning for clustering tasks, which we call Conv. Combined, our methods fully utilize real-life datasets and give up to a 33pp performance boost over the state-of-the-art Constrained Deep Adaptive Clustering (CDAC) model for question-only data. By comparison, the CDAC model for question-only data gives at most a 13pp performance boost over the naive baseline.

Authors (5)
  1. Aleksandra Chrabrowa (2 papers)
  2. Tsimur Hadeliya (2 papers)
  3. Dariusz Kajtoch (10 papers)
  4. Robert Mroczkowski (4 papers)
  5. Piotr Rybak (10 papers)
Citations (2)
