An End-to-End Speech Summarization Using Large Language Model (2407.02005v1)

Published 2 Jul 2024 in cs.CL, cs.SD, and eess.AS

Abstract: Abstractive Speech Summarization (SSum) aims to generate human-like text summaries from spoken content. It encounters difficulties in handling long speech input and in capturing the intricate cross-modal mapping between long speech inputs and short text summaries. Research on LLMs and multimodal information fusion has provided new insights for addressing these challenges. In this paper, we propose an end-to-end SSum model that uses Q-Former as a connector between the audio and text modalities and employs an LLM to generate text summaries directly from speech features. We adopt a multi-stage training approach that includes LLM-based ASR and Text Summarization (TSum) as auxiliary tasks. The ASR task is used to align the feature spaces and to enhance the LLM's ability to handle longer speech. We then apply a curriculum learning strategy to ease the model's transition from TSum to SSum. Finally, our model achieves competitive performance on the How-2 dataset.
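The abstract describes a pipeline in which a speech encoder's frame-level features are compressed by a Q-Former into a short sequence of vectors that an LLM then decodes into a summary. The following is a minimal PyTorch sketch of such a connector, not the authors' implementation: the module name `QFormerConnector`, all dimensions, and the use of `nn.TransformerDecoder` as a stand-in for the Q-Former (learned queries cross-attending to speech features) are assumptions for illustration.

```python
# Hypothetical sketch of a Q-Former-style audio-text connector.
# A frozen speech encoder yields (B, T, speech_dim) features; learned queries
# attend over them, and the compressed tokens are projected into the LLM's
# embedding space to serve as a soft prompt for summary generation.
import torch
import torch.nn as nn

class QFormerConnector(nn.Module):
    def __init__(self, speech_dim=1024, llm_dim=4096, num_queries=64, depth=2):
        super().__init__()
        # Learned query embeddings that will attend over the speech features.
        self.queries = nn.Parameter(torch.randn(num_queries, speech_dim) * 0.02)
        layer = nn.TransformerDecoderLayer(
            d_model=speech_dim, nhead=8, batch_first=True)
        # TransformerDecoder provides self-attention over the queries plus
        # cross-attention into the speech features (the Q-Former idea).
        self.blocks = nn.TransformerDecoder(layer, num_layers=depth)
        # Map the compressed speech tokens into the LLM embedding space.
        self.proj = nn.Linear(speech_dim, llm_dim)

    def forward(self, speech_feats):                  # (B, T, speech_dim)
        B = speech_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        q = self.blocks(tgt=q, memory=speech_feats)   # (B, num_queries, speech_dim)
        return self.proj(q)                           # (B, num_queries, llm_dim)

# Usage: the connector output would be concatenated with an embedded text
# prompt (e.g. a summarization instruction) and fed to the LLM decoder.
connector = QFormerConnector()
speech_feats = torch.randn(2, 1500, 1024)  # dummy encoder output
soft_prompt = connector(speech_feats)      # (2, 64, 4096)
```

In the multi-stage training the paper describes, the same connector output could first be supervised with an ASR target (to align the feature spaces) before switching, via curriculum learning, from text summarization to speech summarization targets.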

Authors (8)
  1. Hengchao Shang (22 papers)
  2. Zongyao Li (23 papers)
  3. Jiaxin Guo (40 papers)
  4. Shaojun Li (13 papers)
  5. Zhiqiang Rao (12 papers)
  6. Yuanchang Luo (13 papers)
  7. Daimeng Wei (31 papers)
  8. Hao Yang (328 papers)