S2Cap: A Benchmark and a Baseline for Singing Style Captioning (2409.09866v2)
Abstract: Singing voices carry much richer information than ordinary speech, including diverse vocal and acoustic characteristics. However, existing open-source audio-text datasets for singing voices capture only a limited set of attributes and lack acoustic features, limiting their utility for downstream tasks such as style captioning. To fill this gap, we formally define the task of singing style captioning and introduce S2Cap, a singing voice dataset with comprehensive descriptions of diverse vocal, acoustic, and demographic attributes. Based on this dataset, we develop a simple yet effective baseline algorithm for singing style captioning. The algorithm uses two novel technical components: CRESCENDO, which mitigates misalignment between pretrained unimodal models, and demixing supervision, which regularizes the model to focus on the singing voice. Despite its simplicity, the proposed method outperforms state-of-the-art baselines.