
Audio-Aware Large Language Models as Judges for Speaking Styles (2506.05984v1)

Published 6 Jun 2025 in eess.AS, cs.AI, and cs.CL

Abstract: Audio-aware LLMs (ALLMs) can understand both the textual and non-textual information in audio input. In this paper, we explore using ALLMs as automatic judges to assess the speaking style of speech. We use ALLM judges to evaluate speech generated by spoken LLMs (SLMs) on two tasks: voice style instruction following and role-playing. The speaking styles we consider include emotion, volume, speaking pace, word emphasis, pitch control, and non-verbal elements. We use four SLMs to complete the two tasks and use humans and ALLMs to judge the SLMs' responses. We compare two ALLM judges, GPT-4o-audio and Gemini-2.5-pro, with human evaluation results and show that the agreement between Gemini and human judges is comparable to the agreement between human evaluators. These promising results show that ALLMs can be used as judges to evaluate SLMs. Our results also reveal that current SLMs, even GPT-4o-audio, still have room for improvement in controlling speaking style and generating natural dialogue.

Authors (11)
  1. Cheng-Han Chiang (18 papers)
  2. Xiaofei Wang (138 papers)
  3. Chung-Ching Lin (36 papers)
  4. Kevin Lin (98 papers)
  5. Linjie Li (89 papers)
  6. Radu Kopetz (1 paper)
  7. Yao Qian (37 papers)
  8. Zhendong Wang (60 papers)
  9. Zhengyuan Yang (86 papers)
  10. Hung-yi Lee (327 papers)
  11. Lijuan Wang (133 papers)