Shiftable Context: Addressing Training-Inference Context Mismatch in Simultaneous Speech Translation (2307.01377v1)
Abstract: Transformer models that use segment-based processing have proven effective for simultaneous speech translation. However, such models create a context mismatch between training and inference, limiting the translation accuracy they can achieve. We address this issue with Shiftable Context, a simple yet effective scheme that keeps segment and context sizes consistent between training and inference, even when segments are only partially filled due to the streaming nature of simultaneous translation. Shiftable Context is also broadly applicable to segment-based transformers for streaming tasks. Our experiments on the English-German, English-French, and English-Spanish language pairs from the MuST-C dataset show that, when applied to the Augmented Memory Transformer, a state-of-the-art model for simultaneous speech translation, the proposed scheme achieves average improvements of 2.09, 1.83, and 1.95 BLEU across wait-k values for the three language pairs, respectively, with minimal impact on computation-aware Average Lagging.
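To make the core idea concrete, below is a minimal sketch (not the authors' implementation) of how a shiftable left context could be computed for a streaming encoder: when the newest segment is only partially filled, the left-context boundary is shifted back by the shortfall so the total attended span matches the training configuration. The function name `shiftable_context_window`, the tensor layout, and the specific segment/context parameters are illustrative assumptions.

```python
import torch

def shiftable_context_window(frames: torch.Tensor,
                             segment_size: int,
                             context_size: int) -> torch.Tensor:
    """Return the frames attended to for the newest (possibly partial) segment.

    frames:       (T, D) encoder states received so far in the stream.
    segment_size: segment length used during training.
    context_size: left-context length used during training.
    """
    T = frames.size(0)
    # Frames belonging to the newest segment; a full segment if T is a multiple.
    newest = T % segment_size or segment_size
    # How many frames the partial segment is short of a full training segment.
    shortfall = segment_size - newest
    # Shift the left-context boundary back by the shortfall so that
    # context + segment spans the same number of frames as during training.
    span = context_size + shortfall + newest
    start = max(0, T - span)
    return frames[start:]


if __name__ == "__main__":
    frames = torch.randn(14, 256)  # 14 frames received so far
    window = shiftable_context_window(frames, segment_size=4, context_size=8)
    print(window.shape)            # torch.Size([12, 256])
```

For example, with a training segment size of 4, a left context of 8, and 14 frames received so far, the newest segment holds only 2 frames; shifting the context back by 2 frames yields a 12-frame window, the same span seen during training, rather than the 10 frames a fixed context boundary would cover.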
- MuST-C: A multilingual corpus for end-to-end speech translation. Computer Speech & Language, 66:101155, 2021. ISSN 0885-2308. doi: https://doi.org/10.1016/j.csl.2020.101155. URL https://www.sciencedirect.com/science/article/pii/S0885230820300887.
- Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019.
- Self-attention aligner: A latency-control end-to-end model for ASR using self-attention network and chunk-hopping. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5656–5660. IEEE, 2019.
- Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework. arXiv preprint arXiv:1810.08398, 2018.
- SimulEval: An evaluation toolkit for simultaneous translation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 144–150, Online, October 2020a. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-demos.19. URL https://aclanthology.org/2020.emnlp-demos.19.
- SimulMT to SimulST: Adapting simultaneous text translation to end-to-end simultaneous speech translation. arXiv preprint arXiv:2011.02048, 2020b.
- Streaming simultaneous speech translation with augmented memory transformer. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7523–7527. IEEE, 2021.
- fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations, 2019.
- BLEU: a method for automatic evaluation of machine translation. In Annual Meeting of the Association for Computational Linguistics, 2002.
- A call for clarity in reporting BLEU scores. arXiv preprint arXiv:1804.08771, 2018.
- Implicit memory transformer for computationally efficient simultaneous speech translation. In Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, 2023. Association for Computational Linguistics.
- Self-attention with relative position representations. arXiv preprint arXiv:1803.02155, 2018.
- Emformer: Efficient memory transformer based acoustic model for low latency streaming speech recognition. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6783–6787. IEEE, 2021.
- Attention is all you need. Advances in neural information processing systems, 30, 2017.
- fairseq S2T: Fast speech-to-text modeling with fairseq. In Proceedings of the 2020 Conference of the Asian Chapter of the Association for Computational Linguistics (AACL): System Demonstrations, 2020.
- Streaming transformer-based acoustic models using self-attention with augmented memory. arXiv preprint arXiv:2005.08042, 2020.
- Matthew Raffel
- Drew Penney
- Lizhong Chen