Sentence-wise Speech Summarization: Task, Datasets, and End-to-End Modeling with LM Knowledge Distillation
The paper "Sentence-wise Speech Summarization: Task, Datasets, and End-to-End Modeling with LM Knowledge Distillation" addresses the emerging challenge of summarizing spoken documents in real time via a proposed Sentence-wise Speech Summarization (Sen-SSum) framework. The research introduces new datasets, Mega-SSum and CSJ-SSum, and evaluates multiple model architectures including cascade and end-to-end (E2E) approaches, supplemented by a novel knowledge distillation technique.
Introduction to Sen-SSum
Traditional Automatic Speech Recognition (ASR) systems transcribe spoken words verbatim, producing outputs that can be verbose and convoluted because of the disfluencies and redundancies inherent in spontaneous speech. Existing speech summarization (SSum) techniques, by contrast, produce concise summaries but operate in batch mode, consuming complete spoken documents, which makes them unsuitable for real-time applications.
Sen-SSum blends the real-time transcript generation of ASR with the concise output of SSum by summarizing each spoken sentence individually and incrementally. This enables the summary to be updated after every utterance, without waiting for the entire spoken document to finish (see the sketch below).
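The processing loop can be pictured as follows. This is a minimal sketch in which sentence detection and the summarization model are abstracted behind placeholder callables; neither is a component specified by the paper.

```python
from typing import Callable, Iterable, Iterator

def sen_ssum_stream(
    sentence_audio: Iterable[bytes],      # audio segments, one per detected sentence
    summarize: Callable[[bytes], str],    # any Sen-SSum model (cascade or E2E)
) -> Iterator[str]:
    """Emit one summary per spoken sentence as soon as that sentence ends."""
    for audio in sentence_audio:
        # Unlike batch SSum, each summary is produced immediately,
        # without waiting for the rest of the spoken document.
        yield summarize(audio)
```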
Datasets for Sen-SSum
Two datasets were introduced to explore Sen-SSum:
- Mega-SSum: A large-scale English dataset derived from the Gigaword corpus, containing 3.8 million triplets of synthesized speech, transcription, and summary. The speech is synthesized with a state-of-the-art multi-speaker text-to-speech model.
- CSJ-SSum: A Japanese dataset based on the Corpus of Spontaneous Japanese (CSJ), containing 38,000 triplets of real speech, transcriptions, and summaries. This dataset tests the method's applicability to real recorded speech and to a second language.
Together, these datasets enable evaluation across languages, speech types (synthetic versus real), and domains. Both corpora share the same triplet structure, sketched below.
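A minimal container for one example might look like the following; the field names here are chosen for illustration and are not taken from the dataset releases.

```python
from dataclasses import dataclass

@dataclass
class SenSSumTriplet:
    speech_path: str   # synthesized audio (Mega-SSum) or real audio (CSJ-SSum)
    transcript: str    # verbatim transcription of the spoken sentence
    summary: str       # reference sentence-level summary
```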
Methodology
Models
Two model families are evaluated:
- Cascade Models: An ASR model followed by a text summarization (TSum) model. The TSum model is pre-trained on extensive text data, which strengthens its summarization capability (see the sketch after this list).
- End-to-End (E2E) Models: A single encoder-decoder that maps speech directly to a summary, improving computational efficiency and reducing latency.
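A cascade can be assembled from off-the-shelf components. The sketch below uses Hugging Face pipelines with illustrative stand-in models; the paper trains its own ASR and TSum models, which these are not.

```python
from transformers import pipeline

# Illustrative stand-ins, not the models trained in the paper.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
tsum = pipeline("summarization", model="facebook/bart-large-cnn")

def cascade_summarize(audio_path: str) -> str:
    transcript = asr(audio_path)["text"]        # step 1: transcribe the sentence
    return tsum(transcript)[0]["summary_text"]  # step 2: summarize the transcript
```

An E2E model would replace both steps with a single speech-to-summary encoder-decoder, which is where the efficiency and latency advantages come from.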
Knowledge Distillation
A key contribution of this research is a knowledge distillation approach for the E2E models. Because paired speech-summary data is scarce, pseudo-summaries generated by the cascade model on unpaired speech are used to augment E2E training. This transfers the linguistic knowledge embedded in the pre-trained TSum model into the E2E model, as sketched below.
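In effect this is sequence-level distillation: the cascade acts as a teacher that labels unpaired speech, and the E2E student trains on those labels. A minimal sketch, reusing the hypothetical cascade_summarize from the previous example:

```python
def build_pseudo_summary_set(audio_paths: list[str]) -> list[tuple[str, str]]:
    """Label unpaired speech with cascade-generated pseudo-summaries."""
    return [(path, cascade_summarize(path)) for path in audio_paths]

# The E2E student is then trained on these (speech, pseudo-summary) pairs
# exactly as it would be on human-annotated data, inheriting the linguistic
# knowledge of the pre-trained TSum teacher.
```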
Experimental Results
Mega-SSum Experiments
- Baseline Performance: The cascade model (R-L: 36.0, BScr: 62.6) outperformed the standard E2E model (R-L: 30.7, BScr: 58.0), demonstrating the value of pre-trained summarization capability. R-L and BScr denote ROUGE-L and BERTScore; see the metric sketch after this list.
- Impact of WavLM: Integrating WavLM-Large features did not substantially improve E2E performance; its effect was smaller than that of training on pseudo-summaries.
- Knowledge Distillation: Training with 3.75M cascade-generated pseudo-summaries substantially improved the E2E model (R-L: 35.6, BScr: 61.9), approaching the cascade model's performance.
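The scores above appear to be reported on a 0-100 scale. A hedged sketch of how ROUGE-L and BERTScore are commonly computed, using the rouge-score and bert-score packages; the paper's exact evaluation configuration may differ.

```python
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "stocks fall on weak earnings"
hypothesis = "stocks drop after a weak earnings report"

# ROUGE-L F-measure: longest-common-subsequence overlap with the reference
rl = rouge_scorer.RougeScorer(["rougeL"]).score(reference, hypothesis)["rougeL"].fmeasure

# BERTScore F1: similarity of contextual token embeddings
_, _, f1 = bert_score([hypothesis], [reference], lang="en")

print(f"R-L: {100 * rl:.1f}  BScr: {100 * f1.item():.1f}")
```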
CSJ-SSum Experiments
- In-domain vs. Out-of-domain: On the in-domain eval-CSJ set, the cascade model remained superior, though E2E models with knowledge distillation improved markedly. On the out-of-domain eval-TED set, the gains from distillation were more pronounced, indicating better generalization.
Implications and Future Work
The findings have various implications:
- Practical Applications: Sen-SSum can be widely applied in scenarios like live meeting transcription, online lectures, and other real-time information summarization tasks.
- Model Development: Knowledge distillation could be refined further, for example with methods that reduce reliance on cascade-generated pseudo-summaries.
- Contextual Awareness: Future models could incorporate contextual information over sequences of sentences to ensure coherent and logically consistent summarization outputs across longer sessions.
In conclusion, the research presents a practical approach to real-time speech summarization, advancing both the theoretical understanding and the practical deployment of AI-supported speech processing.