Papers
Topics
Authors
Recent
2000 character limit reached

S2Cap: A Benchmark and a Baseline for Singing Style Captioning

Published 15 Sep 2024 in cs.CL, cs.AI, cs.LG, cs.SD, and eess.AS | (2409.09866v2)

Abstract: Singing voices contain much richer information than common voices, such as diverse vocal and acoustic characteristics. However, existing open-source audio-text datasets for singing voices capture only a limited set of attributes and lacks acoustic features, leading to limited utility towards downstream tasks, such as style captioning. To fill this gap, we formally consider the task of singing style captioning and introduce S2Cap, a singing voice dataset with comprehensive descriptions of diverse vocal, acoustic and demographic attributes. Based on this dataset, we develop a simple yet effective baseline algorithm for the singing style captioning. The algorithm utilizes two novel technical components: CRESCENDO for mitigating misalignment between pretrained unimodal models, and demixing supervision to regularize the model to focus on the singing voice. Despite its simplicity, the proposed method outperforms state-of-the-art baselines.

Summary

Paper to Video (Beta)

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Continue Learning

We haven't generated follow-up questions for this paper yet.

Authors (2)

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 1 tweet with 0 likes about this paper.