SALMon: A Suite for Acoustic Language Model Evaluation
This paper presents "SALMon," a novel evaluation suite designed to assess speech language models (SLMs) across a broad spectrum of acoustic characteristics. The suite aims to fill a gap in current benchmarks, which primarily emphasize semantic coherence and language modeling, by introducing tests that cover essential acoustic elements such as background noise, sentiment, speaker identity, and room impulse response (RIR).
Overview of SALMon
SALMon evaluates SLMs using two principal task families: acoustic consistency and acoustic-semantic alignment. The first family, acoustic consistency, assesses a model's ability to recognize when an acoustic property remains consistent throughout an audio sequence; it includes speaker consistency, gender consistency, sentiment consistency, and RIR consistency tasks. The second family, acoustic-semantic alignment, evaluates whether a model prefers audio whose acoustic properties match the spoken content, covering background-noise alignment and sentiment alignment.
These tasks are designed to challenge the current capabilities of SLMs and provide a more nuanced picture of their performance beyond semantic accuracy. The benchmark adopts a modeling-based approach: each task reduces to comparing the likelihoods a model assigns to paired samples, which is computationally efficient and straightforward to run even with large models, in contrast to text-generation-based evaluation methods that are more computationally intensive.
Methodology
The paper meticulously outlines the methodology involved in creating the SALMon benchmark. Key aspects include:
- Acoustic Consistency: For each task, paired samples are created; in the negative sample a specific acoustic property is altered in the middle of the recording, while in the positive sample it remains consistent. The performance metric is the model's ability to assign a higher likelihood to the positive sample than to the negative one.
- Acoustic-Semantic Alignment: This task involves generating audio samples where the acoustics align with the semantic content, for example, ensuring that the acoustic sentiment of the speech matches the sentiment of its text. These tasks are harder for SLMs, as they test joint understanding of semantic content and acoustic detail.
All samples and tasks were generated using a combination of automatic and manual methods, ensuring high relevance and challenge for state-of-the-art models. The authors provide an evaluation script and data to facilitate easy replication and extension of the benchmark.
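The likelihood-comparison metric at the heart of both task families can be sketched in a few lines. This is an illustrative sketch, not the authors' released evaluation script: `salmon_accuracy` and `log_likelihood` are hypothetical names, and `log_likelihood` stands in for whatever scoring function a given SLM exposes.

```python
# Illustrative sketch of SALMon-style modeling-based evaluation (names are
# hypothetical, not the authors' actual API): for each (positive, negative)
# pair, the model under test scores both samples, and accuracy is the
# fraction of pairs where the consistent/aligned sample scores higher.

from typing import Callable, Sequence, Tuple

def salmon_accuracy(
    pairs: Sequence[Tuple[str, str]],        # (positive_id, negative_id)
    log_likelihood: Callable[[str], float],  # scorer: audio sample -> log p(audio)
) -> float:
    """Fraction of pairs where the positive sample is judged more likely."""
    wins = sum(
        1 for pos, neg in pairs
        if log_likelihood(pos) > log_likelihood(neg)
    )
    return wins / len(pairs)

# Toy usage with a dummy score table standing in for a real model:
dummy_scores = {"pos_a": -10.0, "neg_a": -12.5, "pos_b": -8.1, "neg_b": -7.9}
acc = salmon_accuracy(
    [("pos_a", "neg_a"), ("pos_b", "neg_b")],
    log_likelihood=lambda sample: dummy_scores[sample],
)
print(acc)  # 0.5 — one of the two pairs is scored correctly
```

A chance-level model scores 0.5 on such paired comparisons, which is why the paper can meaningfully contrast SLM scores against near-ceiling human performance.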
Results and Implications
The evaluation of several leading SLMs on the SALMon benchmark yields clear findings. Human performance on these tasks far exceeds that of current SLMs, illustrating the challenges that lie ahead in developing more acoustically aware models.
- Acoustic Consistency: Models like pGSLM, which includes prosodic features in its architecture, performed better than models operating solely on HuBERT units, such as TWIST and LAST. This suggests the importance of explicit prosodic features for certain acoustic tasks.
- Acoustic-Semantic Alignment: Current SLMs show poor performance on these tasks, indicating an area where substantial progress is needed. Even large models struggled with these tasks, emphasizing the need for models that incorporate a better understanding of both semantics and acoustics.
Future Directions
The SALMon benchmark sets the stage for several future research directions:
- Enhancement of Prosody Modeling: Given the superior performance of pGSLM in tasks involving emotional and prosodic features, future models could benefit from improved and more nuanced prosody modeling techniques.
- Training on Diverse Data: The poor performance of current SLMs on background and RIR consistency tasks highlights the need for training data that includes a wider variety of acoustic scenarios.
- Hybrid Models: Models that can seamlessly integrate text and audio modalities while capturing the intricate details of both could perform better on the SALMon benchmark.
- Efficient Evaluation Metrics: The modeling-based metrics used in SALMon provide a fast and objective way to assess models, which could influence future benchmark designs for other multimodal tasks.
Conclusion
SALMon represents a critical step towards holistic evaluation of SLMs, emphasizing the importance of acoustic properties in speech processing. By providing a detailed and challenging benchmark, the authors make a significant contribution to guiding future research in acoustically aware language modeling, helping drive the field towards more capable speech language models.