VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing (2408.05758v1)
Abstract: Deep learning has brought significant improvements to cross-modal representation learning. For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired, one that emphasizes the semantic content of the text modality while de-emphasizing the paralinguistic information of the speech modality. We propose a method called "Vector Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP)", which uses a cross-modal aligned sequence transcoder to bring text and speech into a joint multimodal space, learning how to connect text and speech at the frame level. VQ-CTAP is a paradigm for cross-modal sequence representation learning, offering a promising solution for fine-grained generation and recognition tasks in speech processing, and it can be applied directly to VC and ASR tasks without fine-tuning or additional structures. We propose a sequence-aware semantic connector that links multiple frozen pre-trained modules for the TTS task, exhibiting plug-and-play capability. We design a stepping optimization strategy that ensures effective model convergence by gradually injecting and adjusting the influence of the various loss components. Furthermore, we propose a semantic-transfer-wise paralinguistic consistency loss that enhances representational capability, allowing the model to generalize better to unseen data and capture the nuances of paralinguistic information. In addition, VQ-CTAP achieves high-compression speech coding at a rate of 25 Hz from 24 kHz input waveforms, a 960-fold reduction in sampling rate. The audio demo is available at https://qiangchunyu.github.io/VQCTAP/
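
The abstract does not give implementation details, but the two core ingredients it names, a vector-quantization bottleneck on frame-level speech representations and a frame-level contrastive objective that ties speech frames to aligned text tokens, can be illustrated with a minimal PyTorch sketch. Everything below (function names, tensor shapes, the linear ramp used to mimic the "stepping" loss schedule) is an illustrative assumption, not the authors' code.

```python
# Minimal sketch (assumed, not the authors' implementation) of a VQ bottleneck
# plus a frame-level contrastive loss between speech frames and text tokens.
import torch
import torch.nn.functional as F


def vector_quantize(z, codebook):
    """Nearest-neighbour VQ with a straight-through estimator.
    z: (B, T, D) frame-level speech encodings; codebook: (K, D)."""
    dist = torch.cdist(z.reshape(-1, z.size(-1)), codebook)      # (B*T, K)
    idx = dist.argmin(dim=-1)
    z_q = codebook[idx].view_as(z)
    commit_loss = F.mse_loss(z, z_q.detach()) + F.mse_loss(z.detach(), z_q)
    z_q = z + (z_q - z).detach()                                  # straight-through gradient
    return z_q, commit_loss


def frame_level_contrastive_loss(speech_frames, text_tokens, temperature=0.07):
    """Symmetric InfoNCE over time steps: frame t should match its aligned
    text token t. Both inputs: (B, T, D), already length-aligned."""
    s = F.normalize(speech_frames, dim=-1).reshape(-1, speech_frames.size(-1))
    t = F.normalize(text_tokens, dim=-1).reshape(-1, text_tokens.size(-1))
    logits = s @ t.T / temperature                                # (B*T, B*T)
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))


def loss_weight(step, start, ramp):
    """One plausible reading of the "stepping optimization strategy":
    a loss term is switched on at `start` and ramped linearly to full
    weight over `ramp` steps (an assumption about the schedule)."""
    return min(max(step - start, 0) / ramp, 1.0)


# Compression arithmetic stated in the abstract: 24 kHz waveform -> 25 Hz codes,
# i.e. 24000 / 25 = 960 waveform samples per quantized token.
assert 24000 // 25 == 960
```

In use, the total training loss would combine the contrastive and commitment terms (plus any reconstruction or consistency losses), each scaled by its own `loss_weight` schedule so that components are introduced gradually rather than all at once.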
- Chunyu Qiang (21 papers)
- Wang Geng (4 papers)
- Yi Zhao (222 papers)
- Ruibo Fu (54 papers)
- Tao Wang (700 papers)
- Cheng Gong (51 papers)
- Tianrui Wang (23 papers)
- Qiuyu Liu (2 papers)
- Jiangyan Yi (77 papers)
- Zhengqi Wen (69 papers)
- Chen Zhang (403 papers)
- Hao Che (10 papers)
- Longbiao Wang (46 papers)
- Jianwu Dang (41 papers)
- Jianhua Tao (139 papers)