WhisperNER: Unified Open Named Entity and Speech Recognition

Published 12 Sep 2024 in cs.CL and cs.LG | (2409.08107v1)

Abstract: Integrating named entity recognition (NER) with automatic speech recognition (ASR) can significantly enhance transcription accuracy and informativeness. In this paper, we introduce WhisperNER, a novel model that allows joint speech transcription and entity recognition. WhisperNER supports open-type NER, enabling recognition of diverse and evolving entities at inference. Building on recent advancements in open NER research, we augment a large synthetic dataset with synthetic speech samples. This allows us to train WhisperNER on a large number of examples with diverse NER tags. During training, the model is prompted with NER labels and optimized to output the transcribed utterance along with the corresponding tagged entities. To evaluate WhisperNER, we generate synthetic speech for commonly used NER benchmarks and annotate existing ASR datasets with open NER tags. Our experiments demonstrate that WhisperNER outperforms natural baselines on both out-of-domain open type NER and supervised finetuning.

Summary

  • The paper presents a joint model that integrates open-type NER into ASR, eliminating separate pipeline stages and reducing error propagation.
  • It leverages a synthetic audio dataset with innovative techniques like negative sampling and entity dropout to enhance generalization to unseen entities.
  • Experimental results show significant improvements in NER F1 scores and competitive word error rates across zero-shot and supervised benchmarks.

An Overview of WhisperNER: Unified Open Named Entity and Speech Recognition

The paper introduces WhisperNER, a novel joint model for automatic speech recognition (ASR) and named entity recognition (NER). The work builds on recent advances in ASR and LLMs, focusing on integrating open-type NER directly into the ASR process. Unlike traditional pipeline architectures, which run ASR and NER as separate stages, WhisperNER eliminates the intermediate transcription step and thus reduces the error propagation typical of sequential processing.

Methodology

WhisperNER extends the Whisper ASR model to emit NER tags directly within its decoding framework. Training relies on a large synthetic text dataset augmented with synthetic speech samples. At decoding time, the model is prompted with the entity types of interest and outputs the transcription with the corresponding entity annotations inline, which allows it to generalize to new and unseen entity types at inference. To improve robustness, training also employs negative sampling (prompting with types absent from the utterance) and entity-type dropout, which reduce the risk of entity hallucination during inference.
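The prompting and augmentation scheme described above can be sketched as follows. The tag format, prompt token, and function names here are illustrative assumptions for a minimal example, not the paper's released implementation:

```python
import random

def build_example(transcript, entities, candidate_types,
                  n_negatives=2, dropout_p=0.0, rng=None):
    """Sketch of a WhisperNER-style training example.

    transcript: the utterance text.
    entities: list of (span_text, entity_type) pairs present in the transcript.
    candidate_types: pool of unrelated types used as negative samples,
        which the model must learn to leave untagged.
    Returns a (prompt, target) pair of strings (format assumed).
    """
    rng = rng or random.Random(0)
    # Entity-type dropout: occasionally hide a gold type so the model
    # learns not to hallucinate tags for types it was not prompted with.
    kept = [(s, t) for (s, t) in entities if rng.random() >= dropout_p]
    prompt_types = sorted({t for _, t in kept})
    # Negative sampling: add types with no occurrence in this utterance.
    negatives = [t for t in candidate_types if t not in prompt_types]
    prompt_types += rng.sample(negatives, min(n_negatives, len(negatives)))
    # Target: the transcript with inline tags around the kept entity spans.
    target = transcript
    for span, etype in kept:
        target = target.replace(span, f"<{etype}>{span}</{etype}>", 1)
    prompt = "<|ner|> " + ", ".join(prompt_types)
    return prompt, target
```

In this sketch the model would be trained to copy the transcript while tagging only the prompted types that actually occur, leaving the negatively sampled types untouched.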

Evaluation and Results

The paper reports extensive experiments in both zero-shot and supervised fine-tuning setups. WhisperNER is evaluated on three open-type NER speech datasets: VoxPopuli-NER, LibriSpeech-NER, and Fleurs-NER. Across these benchmarks, WhisperNER outperforms baseline models on entity-level NER F1 while maintaining competitive word error rates (WER). The gains are most pronounced on zero-shot open-type NER, where it substantially improves over traditional pipeline baselines.
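As a reference point for the reported metric, entity-level micro-averaged F1 over (span, type) pairs can be computed as below. This is the standard NER definition; the paper's exact span-matching rules may differ:

```python
def entity_f1(gold, pred):
    """Micro-averaged entity-level F1.

    gold, pred: parallel lists (one item per utterance) of sets of
    (span_text, entity_type) tuples. An entity counts as correct only
    on an exact (span, type) match.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # exact matches
        fp += len(p - g)   # predicted but not in gold
        fn += len(g - p)   # gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```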

In the supervised setting, WhisperNER continues to lead, achieving the highest F1 scores and the lowest WER across the MIT NER benchmarks. These results underscore the model's ability both to recognize novel entities and to transcribe speech accurately.

Implications and Future Directions

The integration of open-type NER directly into ASR systems, as demonstrated by WhisperNER, presents significant theoretical and practical implications. This approach not only improves transcription accuracy but also enhances the informativeness of speech-driven applications, making it highly adaptable to various real-world scenarios that require both speech and language processing.

Future research could refine the handling of closed-set zero-shot scenarios, given the observed tendency of WhisperNER to tag non-entity spans incorrectly. Exploring other negative sampling methods or broadening the diversity of the training data could further improve generalization.

In conclusion, WhisperNER represents a significant step toward the joint optimization of speech and language tasks. By integrating NER directly into the ASR decoding process, it sets a precedent for future multifunctional AI systems and promises more robust, comprehensive solutions for speech-driven applications. The authors commit to fostering further research by releasing the source code, datasets, and models to the community.
