
On Training Targets and Objective Functions for Deep-Learning-Based Audio-Visual Speech Enhancement (1811.06234v1)

Published 15 Nov 2018 in eess.AS, cs.LG, cs.SD, and eess.IV

Abstract: Audio-visual speech enhancement (AV-SE) is the task of improving speech quality and intelligibility in a noisy environment using audio and visual information from a talker. Recently, deep learning techniques have been adopted to solve the AV-SE task in a supervised manner. In this context, the choice of the training target, i.e. the quantity to be estimated, and of the objective function, which quantifies the quality of this estimate, is critical for performance. This work is the first to present an experimental study of a range of different targets and objective functions used to train a deep-learning-based AV-SE system. The results show that approaches that directly estimate a mask perform best overall in terms of estimated speech quality and intelligibility, although the model that directly estimates the log magnitude spectrum performs equally well in terms of estimated speech quality.
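To make the distinction between training targets concrete, here is a minimal sketch (not the paper's exact formulation) contrasting a mask-based target with a direct log-magnitude target, both scored with a mean-squared-error objective. The helper names, the clipping range, and the toy spectrogram shapes are illustrative assumptions, not details from the paper.

```python
import numpy as np

def amplitude_mask(clean_mag, noisy_mag, eps=1e-8):
    # Mask-based target: ratio of clean to noisy STFT magnitude,
    # clipped to a bounded range for training stability (assumed range).
    return np.clip(clean_mag / (noisy_mag + eps), 0.0, 1.0)

def log_magnitude(mag, eps=1e-8):
    # Direct-mapping target: log magnitude spectrum of the clean speech.
    return np.log(mag + eps)

def mse(estimate, target):
    # Objective function: quantifies the quality of the estimate.
    return np.mean((estimate - target) ** 2)

# Toy data standing in for STFT magnitudes of shape (frames, freq bins).
rng = np.random.default_rng(0)
clean = np.abs(rng.normal(size=(100, 257)))
noise = np.abs(rng.normal(size=(100, 257)))
noisy = clean + noise

mask_target = amplitude_mask(clean, noisy)  # mask-estimation target
logmag_target = log_magnitude(clean)        # log-magnitude-estimation target

# During training, a network's output would replace these placeholders.
print(mse(np.full_like(mask_target, 0.5), mask_target))
print(mse(np.zeros_like(logmag_target), logmag_target))
```

The practical difference is that a mask estimator predicts a bounded multiplicative gain applied to the noisy input, while a direct-mapping model regresses the clean spectrum itself; the paper's experiments compare variants of both families.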

Authors (4)
  1. Daniel Michelsanti (9 papers)
  2. Zheng-Hua Tan (85 papers)
  3. Sigurdur Sigurdsson (6 papers)
  4. Jesper Jensen (41 papers)
Citations (21)
