AugSumm: towards generalizable speech summarization using synthetic labels from large language model (2401.06806v1)
Abstract: Abstractive speech summarization (SSUM) aims to generate human-like summaries from speech. Given variations in the information captured and in phrasing, a recording can be summarized in multiple valid ways. It is therefore more reasonable to consider a probabilistic distribution over all potential summaries than a single summary. However, conventional SSUM models are mostly trained and evaluated with a single ground-truth (GT) human-annotated deterministic summary per recording. Generating multiple human references would better represent this distribution statistically, but is impractical because annotation is expensive. We tackle this challenge by proposing AugSumm, a method that leverages large language models (LLMs) as a proxy for human annotators to generate augmented summaries for training and evaluation. First, we explore prompting strategies to generate synthetic summaries from ChatGPT. We validate the quality of synthetic summaries using multiple metrics including human evaluation, where we find that summaries generated using AugSumm are perceived as more valid by humans. Second, we develop methods to utilize synthetic summaries in training and evaluation. Experiments on How2 demonstrate that pre-training on synthetic summaries and fine-tuning on GT summaries improves ROUGE-L by 1 point on both GT and AugSumm-based test sets. AugSumm summaries are available at https://github.com/Jungjee/AugSumm.
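To make the core idea concrete, here is a minimal sketch of AugSumm-style augmentation: prompting an LLM to paraphrase an existing ground-truth summary so that each recording gains an additional synthetic reference. The paper's exact prompts are not reproduced here; the prompt wording, the model choice, and the `paraphrase_summary` helper below are illustrative assumptions, written against the OpenAI Python client.

```python
# Sketch of AugSumm-style summary augmentation (illustrative, not the
# paper's exact prompt or settings): ask an LLM to paraphrase a
# ground-truth summary, yielding one extra synthetic reference.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def paraphrase_summary(gt_summary: str, model: str = "gpt-3.5-turbo") -> str:
    """Return one synthetic (paraphrased) summary for a GT summary."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "You paraphrase video summaries while preserving meaning."},
            {"role": "user",
             "content": f"Paraphrase this summary in one paragraph:\n{gt_summary}"},
        ],
        temperature=0.7,  # some diversity: multiple valid summaries exist
    )
    return response.choices[0].message.content.strip()


# Example: augment a single How2-style ground-truth summary.
gt = "The speaker demonstrates how to restring an acoustic guitar."
print(paraphrase_summary(gt))
```

In the paper's best-performing recipe, such synthetic summaries serve as targets for a pre-training stage, after which the SSUM model is fine-tuned on the original GT summaries.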
Authors: Jee-weon Jung, Roshan Sharma, William Chen, Bhiksha Raj, Shinji Watanabe