The TechQA Dataset (1911.02984v1)
Abstract: We introduce TechQA, a domain-adaptation question answering dataset for the technical support domain. The TechQA corpus highlights two real-world issues from the automated customer support domain. First, it contains actual questions posed by users on a technical forum, rather than questions generated specifically for a competition or a task. Second, it has a real-world size (600 training, 310 dev, and 490 evaluation question/answer pairs), thus reflecting the cost of creating large labeled datasets with actual data. Consequently, TechQA is meant to stimulate research in domain adaptation rather than being a resource to build QA systems from scratch. The dataset was obtained by crawling the IBM Developer and IBM DeveloperWorks forums for questions with accepted answers that appear in a published IBM Technote, a technical document that addresses a specific technical issue. We also release a collection of the 801,998 publicly available Technotes as of April 4, 2019 as a companion resource that might be used for pretraining, to learn representations of the IT domain language.
- Vittorio Castelli (24 papers)
- Rishav Chakravarti (11 papers)
- Saswati Dana (6 papers)
- Anthony Ferritto (10 papers)
- Radu Florian (54 papers)
- Martin Franz (9 papers)
- Dinesh Garg (20 papers)
- Dinesh Khandelwal (13 papers)
- Scott McCarley (6 papers)
- Mike McCawley (1 paper)
- Mohamed Nasr (3 papers)
- Lin Pan (23 papers)
- Cezar Pendus (4 papers)
- John Pitrelli (1 paper)
- Saurabh Pujar (14 papers)
- Salim Roukos (41 papers)
- Andrzej Sakrajda (2 papers)
- Avirup Sil (45 papers)
- Rosario Uceda-Sosa (8 papers)
- Todd Ward (4 papers)