Improving Sequence-to-Sequence Acoustic Modeling by Adding Text-Supervision (1811.08111v1)

Published 20 Nov 2018 in cs.SD and eess.AS

Abstract: This paper presents methods of making use of text supervision to improve the performance of sequence-to-sequence (seq2seq) voice conversion. Compared with conventional frame-to-frame voice conversion approaches, the seq2seq acoustic modeling method proposed in our previous work achieved higher naturalness and similarity. In this paper, we further improve its performance by utilizing the text transcriptions of parallel training data. First, a multi-task learning structure is designed which adds auxiliary classifiers to the middle layers of the seq2seq model and predicts linguistic labels as a secondary task. Second, a data-augmentation method is proposed which utilizes text alignment to produce extra parallel sequences for model training. Experiments are conducted to evaluate our proposed methods with training sets of different sizes. Experimental results show that multi-task learning with linguistic labels is effective at reducing the errors of seq2seq voice conversion. The data-augmentation method can further improve the performance of seq2seq voice conversion when only 50 or 100 training utterances are available.
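For illustration, a minimal PyTorch sketch of the multi-task structure described above: an auxiliary classifier taps a middle encoder layer and predicts frame-level linguistic labels as a secondary task alongside the primary spectral regression. All names, layer sizes, the label inventory, the loss weight, and the frame-synchronous recurrent architecture (standing in for the paper's attention-based seq2seq model) are illustrative assumptions, not the paper's actual configuration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Seq2SeqVCWithAuxTask(nn.Module):
        # Encoder-decoder acoustic model with an auxiliary linguistic-label
        # classifier attached to a middle encoder layer (multi-task learning).
        def __init__(self, n_mels=80, hidden=256, n_linguistic_labels=40):
            super().__init__()
            self.encoder_lower = nn.LSTM(n_mels, hidden, batch_first=True,
                                         bidirectional=True)
            self.encoder_upper = nn.LSTM(2 * hidden, hidden, batch_first=True,
                                         bidirectional=True)
            # Auxiliary head on the middle representation: predicts a
            # frame-level linguistic label (e.g. a phone identity).
            self.aux_classifier = nn.Linear(2 * hidden, n_linguistic_labels)
            self.decoder = nn.LSTM(2 * hidden, hidden, batch_first=True)
            self.out_proj = nn.Linear(hidden, n_mels)

        def forward(self, src_mels):                  # (B, T, n_mels)
            mid, _ = self.encoder_lower(src_mels)     # middle-layer features
            aux_logits = self.aux_classifier(mid)     # secondary task
            top, _ = self.encoder_upper(mid)
            dec, _ = self.decoder(top)
            return self.out_proj(dec), aux_logits

    def multitask_loss(pred_mels, tgt_mels, aux_logits, labels, aux_weight=0.1):
        # Primary spectral loss plus a weighted auxiliary classification loss;
        # aux_weight is a hypothetical hyperparameter.
        main = F.l1_loss(pred_mels, tgt_mels)
        aux = F.cross_entropy(aux_logits.reshape(-1, aux_logits.size(-1)),
                              labels.reshape(-1))
        return main + aux_weight * aux

Under this frame-synchronous simplification the predicted and target mel sequences share the source length; the paper's attention-based seq2seq model does not carry that constraint, but the auxiliary-loss wiring is the same in spirit.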

Authors (6)
  1. Jing-Xuan Zhang (12 papers)
  2. Zhen-Hua Ling (114 papers)
  3. Yuan Jiang (48 papers)
  4. Li-Juan Liu (20 papers)
  5. Chen Liang (140 papers)
  6. Li-Rong Dai (26 papers)
Citations (27)
