- The paper demonstrates that ASR models achieve improved zero-shot audio classification using template-based prompting and unsupervised reweighting, with Whisper yielding a 9% accuracy gain.
- The study employs a methodology that converts log-likelihoods into class probabilities without additional training, leveraging prior matching and null-input calibration to address biases.
- Experiments across eight datasets reveal that larger ASR models enhance task generalization and outperform traditional baselines in diverse audio classification tasks.
Investigating the Emergent Audio Classification Ability of ASR Foundation Models
This paper examines the zero-shot audio classification capabilities of Automatic Speech Recognition (ASR) foundation models such as Whisper and MMS. The approach requires no additional training and introduces no new parameters: classes are scored with template-based text prompts and evaluated on eight datasets. The paper reports performance gains from unsupervised reweighting and analyses trends in model size and task generalization.
Zero-Shot Prompting of ASR Models
The authors employ a template-based method that converts the log-likelihoods an ASR decoder assigns to prompted transcripts into class probabilities. Whisper shows promising improvements over existing zero-shot baselines, achieving an average accuracy gain of 9% without any fine-tuning.
Figure 1: This paper looks at zero-shot prompting of ASR foundation models for audio classification, without any further training or introducing any new parameters. We use task-specific prompts and evaluate on various downstream tasks and datasets.
The probability of each class is calculated using generated likelihoods from ASR decoding:
P(y = w_k | s) = P(t(w_k) | s) / ∑_j P(t(w_j) | s)

where P is the likelihood assigned by the ASR model, t(·) maps a class word w_k to its prompt template, and s is the input audio.
The highest class probability yields the predicted class:
ŷ = argmax_{w_k} P(y = w_k | s)
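As a minimal sketch (not the authors' code), the scoring rule above amounts to a softmax over per-class log-likelihoods followed by an argmax. The class names and log-likelihood values below are hypothetical placeholders for scores obtained from ASR decoding:

```python
import numpy as np

def zero_shot_classify(class_log_likelihoods):
    """Convert per-class log-likelihoods log P(t(w_k) | s) from ASR decoding
    into normalized class probabilities, and return the argmax class.

    `class_log_likelihoods` maps class name -> log-likelihood of the
    prompted transcript for that class under the model (hypothetical values).
    """
    names = list(class_log_likelihoods)
    logps = np.array([class_log_likelihoods[n] for n in names])
    # Softmax over classes implements P(t(w_k)|s) / sum_j P(t(w_j)|s)
    probs = np.exp(logps - logps.max())
    probs /= probs.sum()
    pred = names[int(np.argmax(probs))]
    return pred, dict(zip(names, probs))

# Example with made-up scores: the least negative log-likelihood wins
pred, probs = zero_shot_classify({"dog": -12.3, "rain": -10.1, "siren": -15.8})
```

Subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the resulting probabilities.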
Task Calibration Methodologies
Significant emphasis is placed on mitigating zero-shot biases present in ASR models. The authors explore prior-matching and null-input-based calibration:
- Prior Matching: Reweights output probabilities using unsupervised data to ensure output priors match the expected true priors.
- Null-Input Calibration: Utilizes null-input estimates to approximate biases across class distributions, achieving notable gains without additional data requirements.
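Both calibration ideas can be sketched roughly as follows, assuming per-example class probabilities are already available. The fixed-point update used here for prior matching is an illustrative simplification, not the paper's exact optimization procedure:

```python
import numpy as np

def null_input_calibrate(probs, null_probs):
    """Divide class probabilities by those obtained from a null (content-free)
    input to cancel the model's class bias, then renormalize."""
    adjusted = probs / null_probs
    return adjusted / adjusted.sum()

def prior_match_weights(pred_probs, true_prior, iters=100):
    """Find per-class reweighting factors so that the average reweighted
    prediction over unlabelled data matches the expected prior.
    A simple multiplicative fixed-point sketch (hypothetical, for illustration).
    """
    w = np.ones(pred_probs.shape[1])
    for _ in range(iters):
        reweighted = pred_probs * w
        reweighted /= reweighted.sum(axis=1, keepdims=True)
        avg = reweighted.mean(axis=0)  # current average output prior
        w *= true_prior / avg          # push it toward the target prior
    return w
```

Null-input calibration needs no extra data beyond one content-free input, while prior matching uses unlabelled data plus an assumed target prior (e.g. uniform).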
Both calibration methods yield a more balanced predicted class distribution:
Figure 2: Predicted class distribution for Whisper large-v2 on RAVDESS. Bar width is proportional to the fraction of decisions per class.
Experimental Results and Analysis
Evaluations cover eight datasets and show Whisper performing well above random and other baselines. With prior matching, Whisper reaches an average accuracy of 48.2% across tasks, and it compares favourably with large-scale audio-text models such as CLAP.
Figure 3: Accuracy on individual audio classification tasks across different sizes of Whisper models.
Robustness to Prompts: Results are sensitive to prompt wording, and an ensemble of prompts generally outperforms any single formulation, indicating the adaptability of the zero-shot setup.
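A prompt ensemble can be sketched by averaging the class-probability vectors produced with different templates (the two distributions below are hypothetical):

```python
import numpy as np

def ensemble_prompts(per_prompt_probs):
    """Average class-probability vectors obtained with different prompt
    templates; the ensemble is usually more robust than any single wording."""
    return np.stack(per_prompt_probs).mean(axis=0)

# Two hypothetical templates giving slightly different class distributions
ensembled = ensemble_prompts([np.array([0.7, 0.3]), np.array([0.5, 0.5])])
```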
Figure 4: Percentage of model predictions for each class with different calibration methods. On ESC-50, we only plot the top 15 classes predicted by the uncalibrated results for illustration.
Scaling and Model Variants
Performance scales with model size, with larger Whisper models generalizing better across tasks, and multilingual versions outperform English-only models in many configurations. In contrast, ASR models like MMS show limited zero-shot classification ability, which the paper attributes to decoding differences: Whisper's attention-based decoder can assign likelihoods to arbitrary prompt text, whereas MMS's token-alignment-based decoding cannot.
Figure 5: Parameter size vs average accuracy (with prior-matching) for different versions of Whisper models.
Audio Question Answering
Preliminary experiments extend the zero-shot setup to audio question answering on datasets such as Clotho-AQA. Whisper produces predictions meaningfully above random baselines, and precision-recall curves show that effective decision thresholds can be chosen.
Figure 6: Zero-shot audio question answering method.
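One way the yes/no QA decision could be sketched, assuming the model returns a log-likelihood for each candidate answer; the function and threshold here are illustrative, not the paper's exact method:

```python
def answer_yes_no(logp_yes, logp_no, threshold=0.0):
    """Compare the model's log-likelihoods for the 'yes' and 'no' answers and
    decide by thresholding the margin; in practice the threshold would be
    tuned on a precision-recall curve (values here are hypothetical)."""
    return "yes" if (logp_yes - logp_no) > threshold else "no"
```

Raising the threshold trades recall for precision, which is exactly the trade-off a precision-recall curve visualizes.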
Conclusion
This study provides evidence of emergent zero-shot audio classification ability in ASR foundation models. It shows how such models can be leveraged with minimal data and no parameter changes, with calibration techniques further improving performance. The authors conclude that, with scale, ASR models exhibit cross-domain capabilities paralleling those seen in their NLP counterparts.
Overall, the work improves our understanding of ASR models' generalizability and motivates further exploration of zero-shot methods on audio tasks such as question answering and scene recognition.