Efficient Data Generation for Source-grounded Information-seeking Dialogs: A Use Case for Meeting Transcripts (2405.01121v3)
Abstract: Automating data generation with LLMs has become increasingly popular. In this work, we investigate the feasibility and effectiveness of LLM-based data generation in the challenging setting of source-grounded information-seeking dialogs, with response attribution, over long documents. Our source texts consist of long and noisy meeting transcripts, adding to the task complexity. Since automating attribution remains difficult, we propose a semi-automatic approach: dialog queries and responses are generated with LLMs, followed by human verification and identification of attribution spans. Using this approach, we created MISeD (Meeting Information Seeking Dialogs), a dataset of information-seeking dialogs focused on meeting transcripts. Models finetuned with MISeD demonstrate superior performance compared to off-the-shelf models, even those of larger size. Finetuning on MISeD gives comparable response generation quality to finetuning on fully manual data, while improving attribution quality and reducing time and effort.
- Lotem Golany
- Filippo Galgani
- Maya Mamo
- Nimrod Parasol
- Omer Vandsburger
- Nadav Bar
- Ido Dagan