Doc2Bot: Accessing Heterogeneous Documents via Conversational Bots (2210.11060v3)

Published 20 Oct 2022 in cs.CL

Abstract: This paper introduces Doc2Bot, a novel dataset for building machines that help users seek information via conversations. This is of particular interest for companies and organizations that own a large number of manuals or instruction books. Despite its potential, the nature of our task poses several challenges: (1) documents contain various structures that hinder the ability of machines to comprehend, and (2) user information needs are often underspecified. Compared to prior datasets that either focus on a single structural type or overlook the role of questioning to uncover user needs, the Doc2Bot dataset is developed to target such challenges systematically. Our dataset contains over 100,000 turns based on Chinese documents from five domains, larger than any prior document-grounded dialog dataset for information seeking. We propose three tasks in Doc2Bot: (1) dialog state tracking to track user intentions, (2) dialog policy learning to plan system actions and contents, and (3) response generation which generates responses based on the outputs of the dialog policy. Baseline methods based on the latest deep learning models are presented, indicating that our proposed tasks are challenging and worthy of further research.

Authors (8)
  1. Haomin Fu
  2. Yeqin Zhang
  3. Haiyang Yu
  4. Jian Sun
  5. Fei Huang
  6. Luo Si
  7. Yongbin Li
  8. Cam-Tu Nguyen
Citations (28)