- The paper demonstrates that pre-trained end-to-end ASR features, fed to an RNN sentiment decoder with self-attention, raise sentiment classification accuracy on IEMOCAP from the previous state of the art of 66.6% to 71.7% and reach 70.1% on SWBD-sentiment.
- The methodology combines linguistic and acoustic cues through a simple RNN Transducer architecture and explores multiple sentiment decoders for interpretable predictions.
- This work paves the way for enhanced real-world applications in customer service, healthcare, and education by providing richer emotional insights from conversational speech.
Speech Sentiment Analysis via Pre-Trained Features from End-to-End ASR Models
The paper addresses speech sentiment analysis as a downstream task, leveraging pre-trained features from end-to-end (e2e) Automatic Speech Recognition (ASR) models. Speech sentiment analysis involves classifying speech utterances into sentiment categories such as positive, negative, or neutral. It poses significant challenges due to variations in speakers, acoustic conditions, and the inherent complexity of emotional signals in speech. The novelty lies in utilizing e2e ASR models that jointly encode acoustic and text information, providing rich features for sentiment classification.
The approach uses an RNN with self-attention as the sentiment classifier, which offers interpretable predictions through attention visualization. Two datasets are employed for evaluation: IEMOCAP, a well-benchmarked dataset, and SWBD-sentiment, a newly developed large-scale speech sentiment dataset with over 49,500 utterances. The results are notable: accuracy on IEMOCAP improves from the previous state of the art of 66.6% to 71.7%, and the method achieves 70.1% accuracy on SWBD-sentiment.
Methodology
The proposed approach integrates acoustic and language information by pre-training an end-to-end ASR model and reusing its features. An RNN Transducer (RNN-T) is selected for its simplicity, since its encoder-decoder architecture produces state-of-the-art results in ASR tasks. The encoder captures both linguistic and acoustic characteristics in a hidden representation that is well suited for sentiment analysis.
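As a rough illustration of this feature-extraction setup, the sketch below stands in for the pre-trained RNN-T encoder with a plain stack of unidirectional LSTMs whose frozen activations are reused as utterance features; the class name, dimensions, and the `asr_encoder.pt` checkpoint path are hypothetical assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class PretrainedASREncoder(nn.Module):
    """Stand-in for an RNN-T encoder pre-trained on ASR (illustrative only)."""
    def __init__(self, feat_dim=80, hidden_dim=512, num_layers=4):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers, batch_first=True)

    def forward(self, filterbanks):
        # filterbanks: (batch, time, feat_dim) log-mel features
        hidden_states, _ = self.lstm(filterbanks)
        return hidden_states  # (batch, time, hidden_dim) reused as ASR features

# Freeze the pre-trained encoder and reuse its activations downstream.
encoder = PretrainedASREncoder()
# encoder.load_state_dict(torch.load("asr_encoder.pt"))  # hypothetical checkpoint
for p in encoder.parameters():
    p.requires_grad = False

with torch.no_grad():
    feats = encoder(torch.randn(2, 300, 80))  # 2 utterances, 300 frames each
print(feats.shape)  # torch.Size([2, 300, 512])
```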
Several sentiment decoders are explored:
- MLP with pooling: This approach involves applying average or max pooling to convert variable-length ASR features into fixed-length embeddings for sentiment classification.
- RNN with pooling: This decoder applies a bi-directional LSTM, capturing context in sequential data before pooling is applied.
- RNN with multi-head self-attention: This method captures long-term dependencies and provides a mechanism for visualizing predictions through attention weights, enhancing interpretability (a minimal sketch of this decoder follows the list).
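Below is a minimal sketch of such an attentive decoder, assuming frozen ASR encoder activations as input and using PyTorch's `nn.MultiheadAttention` as a stand-in for the paper's self-attention formulation; the module names, dimensions, and three-class output are illustrative assumptions rather than the authors' exact architecture.

```python
import torch
import torch.nn as nn

class AttentiveSentimentDecoder(nn.Module):
    """Bi-LSTM + multi-head self-attention over ASR features (illustrative)."""
    def __init__(self, feat_dim=512, hidden_dim=256, num_heads=4, num_classes=3):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden_dim, num_heads,
                                          batch_first=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, asr_feats):
        # asr_feats: (batch, time, feat_dim) frozen encoder activations
        h, _ = self.bilstm(asr_feats)
        # Self-attention: queries, keys, and values are all the LSTM states.
        ctx, attn_weights = self.attn(h, h, h, need_weights=True)
        pooled = ctx.mean(dim=1)          # average pooling over time
        logits = self.classifier(pooled)  # e.g. positive / negative / neutral
        return logits, attn_weights       # weights can be plotted for interpretation

decoder = AttentiveSentimentDecoder()
logits, weights = decoder(torch.randn(2, 300, 512))
print(logits.shape, weights.shape)  # torch.Size([2, 3]) torch.Size([2, 300, 300])
```

Replacing the attention block with simple mean or max pooling over the Bi-LSTM states gives the "RNN with pooling" variant from the list above.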
To mitigate overfitting, SpecAugment is employed, masking random blocks of time steps and frequency channels in the acoustic features during training.
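The sketch below applies SpecAugment-style time and frequency masking to a batch of features; the mask counts and maximum widths are illustrative placeholders, not the paper's settings.

```python
import torch

def spec_augment(feats, num_time_masks=2, num_freq_masks=2,
                 max_time_width=20, max_freq_width=10):
    """Apply simple SpecAugment-style time/frequency masking (illustrative values)."""
    feats = feats.clone()
    batch, time, freq = feats.shape
    for b in range(batch):
        for _ in range(num_time_masks):
            width = torch.randint(0, max_time_width + 1, (1,)).item()
            start = torch.randint(0, max(1, time - width), (1,)).item()
            feats[b, start:start + width, :] = 0.0   # mask a block of frames
        for _ in range(num_freq_masks):
            width = torch.randint(0, max_freq_width + 1, (1,)).item()
            start = torch.randint(0, max(1, freq - width), (1,)).item()
            feats[b, :, start:start + width] = 0.0   # mask a band of channels
    return feats

augmented = spec_augment(torch.randn(2, 300, 80))  # batch of log-mel features
```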
Results and Contributions
The adoption of pre-trained ASR features significantly improves sentiment classification performance. On the IEMOCAP dataset, the method achieves 71.7% accuracy, and on the SWBD-sentiment dataset, whose label distribution and natural conversational style make it challenging, it achieves 70.1% accuracy.
The paper's contributions are substantial:
- Improved Performance: Demonstrated efficacy of e2e ASR features in yielding superior sentiment classification results.
- Dataset Development: Creation of the SWBD-sentiment dataset, facilitating advancements in large-scale speech sentiment analysis research.
- Interpretable Models: Introduction of attention visualization, aiding in better understanding model predictions and integrating linguistic and acoustic information effectively.
Implications and Future Work
Practically, the advancements in speech sentiment analysis can enhance interactive intelligent systems in customer service, healthcare, and education by providing more accurate sentiment understanding. Theoretically, the approach sets a precedent for utilizing e2e ASR models in diverse downstream tasks beyond sentiment analysis, such as speaker identification and diarization.
Future directions involve exploring unsupervised learning techniques for speech feature extraction and extending the application of e2e ASR features to additional domains. The work stimulates further research into refining sentiment decoders and strengthening the fusion of acoustic and language data for sentiment analysis.