LiveChat: A Large-Scale Personalized Dialogue Dataset Automatically Constructed from Live Streaming (2306.08401v1)

Published 14 Jun 2023 in cs.CL

Abstract: Open-domain dialogue systems have made promising progress in recent years. While the state-of-the-art dialogue agents are built upon large-scale text-based social media data and large pre-trained models, there is no guarantee these agents will also perform well in fast-growing scenarios, such as live streaming, due to the bounded transferability of pre-trained models and the biased distributions of public datasets from Reddit, Weibo, etc. To improve the essential capability of responding and establish a benchmark in the live open-domain scenario, we introduce the LiveChat dataset, composed of 1.33 million real-life Chinese dialogues, averaging almost 3,800 sessions per persona across 351 personas, with fine-grained profiles for each persona. LiveChat is automatically constructed by processing numerous live videos on the Internet and naturally falls within the scope of multi-party conversations, where the issues of Who says What to Whom should be considered. Therefore, we target two critical tasks, response modeling and addressee recognition, and propose retrieval-based baselines grounded on advanced techniques. Experimental results have validated the positive effects of leveraging persona profiles and larger average sessions per persona. In addition, we also benchmark the transferability of advanced generation-based models on LiveChat and pose some future directions for current challenges.

Authors (5)
  1. Jingsheng Gao (16 papers)
  2. Yixin Lian (7 papers)
  3. Ziyi Zhou (33 papers)
  4. Yuzhuo Fu (24 papers)
  5. Baoyuan Wang (46 papers)
