STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation (2203.10426v1)

Published 20 Mar 2022 in cs.CL and cs.AI

Abstract: How to learn a better speech representation for end-to-end speech-to-text translation (ST) with limited labeled data? Existing techniques often attempt to transfer powerful machine translation (MT) capabilities to ST, but neglect the representation discrepancy across modalities. In this paper, we propose the Speech-TExt Manifold Mixup (STEMM) method to calibrate such discrepancy. Specifically, we mix up the representation sequences of different modalities, and take both unimodal speech sequences and multimodal mixed sequences as input to the translation model in parallel, and regularize their output predictions with a self-learning framework. Experiments on MuST-C speech translation benchmark and further analysis show that our method effectively alleviates the cross-modal representation discrepancy, and achieves significant improvements over a strong baseline on eight translation directions.
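The core idea of STEMM is to mix representation sequences from the two modalities and regularize the translation model so that unimodal speech input and the mixed multimodal input yield consistent predictions. The sketch below illustrates this idea with NumPy: a per-position mixup of aligned speech and text representations, plus a KL-based consistency term. The function names, the per-position replacement scheme, and the exact form of the regularizer are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def manifold_mixup(speech_repr, text_repr, p_mix=0.5, rng=None):
    """Mix aligned speech and text representations position by position.

    Hypothetical sketch: with probability p_mix, each position's speech
    representation is replaced by the corresponding text embedding,
    producing a multimodal mixed sequence. Assumes the two sequences are
    already aligned to the same length and dimension.
    """
    rng = rng or np.random.default_rng(0)
    assert speech_repr.shape == text_repr.shape
    mask = rng.random(speech_repr.shape[0]) < p_mix  # per-position choice
    mixed = np.where(mask[:, None], text_repr, speech_repr)
    return mixed, mask

def kl_regularizer(p_unimodal, p_mixed, eps=1e-9):
    """One plausible consistency loss: KL(p_mixed || p_unimodal),
    averaged over sequence positions, pushing the unimodal (speech-only)
    predictions toward those from the mixed multimodal input."""
    p = np.clip(p_mixed, eps, 1.0)
    q = np.clip(p_unimodal, eps, 1.0)
    return float(np.mean(np.sum(p * np.log(p / q), axis=-1)))
```

In the paper's framework, both the unimodal speech sequence and the mixed sequence are fed to the translation model in parallel, and a consistency term of this kind lets the speech path learn from the (text-enriched) mixed path.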

Authors (5)
  1. Qingkai Fang (19 papers)
  2. Rong Ye (20 papers)
  3. Lei Li (1293 papers)
  4. Yang Feng (230 papers)
  5. Mingxuan Wang (83 papers)
Citations (87)