VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing (2408.05758v1)
Abstract: Deep learning has brought significant improvements to cross-modal representation learning. For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired, one that emphasizes the semantic content of the text modality while de-emphasizing the paralinguistic information of the speech modality. We propose a method called "Vector Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP)", which uses a cross-modal aligned sequence transcoder to bring text and speech into a joint multimodal space, learning how to connect text and speech at the frame level. VQ-CTAP is a paradigm for cross-modal sequence representation learning, offering a promising solution for fine-grained generation and recognition tasks in speech processing, and it can be applied directly to VC and ASR tasks without fine-tuning or additional structures. We propose a sequence-aware semantic connector that links multiple frozen pre-trained modules for the TTS task, exhibiting plug-and-play capability. We design a stepping optimization strategy that ensures effective model convergence by gradually injecting and adjusting the influence of the various loss components. Furthermore, we propose a semantic-transfer-wise paralinguistic consistency loss that enhances representational capability, allowing the model to generalize better to unseen data and capture the nuances of paralinguistic information. In addition, VQ-CTAP achieves high-compression speech coding at a rate of 25 Hz from 24 kHz input waveforms, a 960-fold reduction in sampling rate. The audio demo is available at https://qiangchunyu.github.io/VQCTAP/
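
The abstract does not give implementation details, but the two core ingredients it names, a vector-quantization bottleneck on frame-level speech representations and a frame-level contrastive objective that ties speech frames to aligned text tokens, can be illustrated with a minimal PyTorch sketch. Everything below (function names, tensor shapes, the linear ramp used to mimic the "stepping" loss schedule) is an illustrative assumption, not the authors' code.

```python
# Minimal sketch (assumed, not the authors' implementation) of a VQ bottleneck
# plus a frame-level contrastive loss between speech frames and text tokens.
import torch
import torch.nn.functional as F


def vector_quantize(z, codebook):
    """Nearest-neighbour VQ with a straight-through estimator.
    z: (B, T, D) frame-level speech encodings; codebook: (K, D)."""
    dist = torch.cdist(z.reshape(-1, z.size(-1)), codebook)      # (B*T, K)
    idx = dist.argmin(dim=-1)
    z_q = codebook[idx].view_as(z)
    commit_loss = F.mse_loss(z, z_q.detach()) + F.mse_loss(z.detach(), z_q)
    z_q = z + (z_q - z).detach()                                  # straight-through gradient
    return z_q, commit_loss


def frame_level_contrastive_loss(speech_frames, text_tokens, temperature=0.07):
    """Symmetric InfoNCE over time steps: frame t should match its aligned
    text token t. Both inputs: (B, T, D), already length-aligned."""
    s = F.normalize(speech_frames, dim=-1).reshape(-1, speech_frames.size(-1))
    t = F.normalize(text_tokens, dim=-1).reshape(-1, text_tokens.size(-1))
    logits = s @ t.T / temperature                                # (B*T, B*T)
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))


def loss_weight(step, start, ramp):
    """One plausible reading of the "stepping optimization strategy":
    a loss term is switched on at `start` and ramped linearly to full
    weight over `ramp` steps (an assumption about the schedule)."""
    return min(max(step - start, 0) / ramp, 1.0)


# Compression arithmetic stated in the abstract: 24 kHz waveform -> 25 Hz codes,
# i.e. 24000 / 25 = 960 waveform samples per quantized token.
assert 24000 // 25 == 960
```

In use, the total training loss would combine the contrastive and commitment terms (plus any reconstruction or consistency losses), each scaled by its own `loss_weight` schedule so that components are introduced gradually rather than all at once.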
- Chunyu Qiang (21 papers)
- Wang Geng (4 papers)
- Yi Zhao (222 papers)
- Ruibo Fu (54 papers)
- Tao Wang (700 papers)
- Cheng Gong (51 papers)
- Tianrui Wang (23 papers)
- Qiuyu Liu (2 papers)
- Jiangyan Yi (77 papers)
- Zhengqi Wen (69 papers)
- Chen Zhang (403 papers)
- Hao Che (10 papers)
- Longbiao Wang (46 papers)
- Jianwu Dang (41 papers)
- Jianhua Tao (139 papers)