Soft Alignment of Modality Space for End-to-end Speech Translation (2312.10952v1)

Published 18 Dec 2023 in cs.CL, cs.AI, cs.SD, and eess.AS

Abstract: End-to-end Speech Translation (ST) aims to convert speech into target text within a unified model. The inherent differences between speech and text modalities often impede effective cross-modal and cross-lingual transfer. Existing methods typically employ hard alignment (H-Align) of individual speech and text segments, which can degrade textual representations. To address this, we introduce Soft Alignment (S-Align), using adversarial training to align the representation spaces of both modalities. S-Align creates a modality-invariant space while preserving individual modality quality. Experiments on three languages from the MuST-C dataset show S-Align outperforms H-Align across multiple tasks and offers translation capabilities on par with specialized translation models.

Authors (7)

Yuhao Zhang
Kaiqi Kou
Bei Li
Chen Xu
Chunliang Zhang
Tong Xiao
Jingbo Zhu
