
Pchatbot: A Large-Scale Dataset for Personalized Chatbot (2009.13284v3)

Published 28 Sep 2020 in cs.CL and cs.AI

Abstract: Natural language dialogue systems have attracted great attention recently. As many dialogue models are data-driven, high-quality datasets are essential to these systems. In this paper, we introduce Pchatbot, a large-scale dialogue dataset that contains two subsets collected from Weibo and judicial forums, respectively. To adapt the raw dataset to dialogue systems, we carefully normalize it through anonymization, deduplication, segmentation, and filtering. Pchatbot is significantly larger than existing Chinese datasets, which might benefit data-driven models. Moreover, current dialogue datasets for personalized chatbots usually contain several persona sentences or attributes. In contrast, Pchatbot provides anonymized user IDs and timestamps for both posts and responses. This enables the development of personalized dialogue models that directly learn implicit user personality from the user's dialogue history. Our preliminary experimental study benchmarks several state-of-the-art dialogue models to provide a comparison for future work. The dataset is publicly available on GitHub.

Authors (9)
  1. Hongjin Qian (23 papers)
  2. Xiaohe Li (8 papers)
  3. Hanxun Zhong (3 papers)
  4. Yu Guo (186 papers)
  5. Yueyuan Ma (2 papers)
  6. Yutao Zhu (63 papers)
  7. Zhanliang Liu (1 paper)
  8. Zhicheng Dou (113 papers)
  9. Ji-Rong Wen (299 papers)
Citations (39)
