Unified Segment-to-Segment Framework for Simultaneous Sequence Generation (2310.17940v4)
Abstract: Simultaneous sequence generation is a pivotal task for real-time scenarios, such as streaming speech recognition, simultaneous machine translation and simultaneous speech translation, where the target sequence is generated while receiving the source sequence. The crux of achieving high-quality generation with low latency lies in identifying the optimal moments for generating, accomplished by learning a mapping between the source and target sequences. However, existing methods often rely on task-specific heuristics for different sequence types, limiting the model's capacity to adaptively learn the source-target mapping and hindering the exploration of multi-task learning for various simultaneous tasks. In this paper, we propose a unified segment-to-segment framework (Seg2Seg) for simultaneous sequence generation, which learns the mapping in an adaptive and unified manner. During the process of simultaneous generation, the model alternates between waiting for a source segment and generating a target segment, making the segment serve as the natural bridge between the source and target. To accomplish this, Seg2Seg introduces a latent segment as the pivot between source and target and explores all potential source-target mappings via the proposed expectation training, thereby learning the optimal moments for generating. Experiments on multiple simultaneous generation tasks demonstrate that Seg2Seg achieves state-of-the-art performance and exhibits better generality across various tasks.
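The alternation between waiting for a source segment and generating a target segment can be illustrated with a minimal Python sketch. This is not the Seg2Seg implementation: the function `simultaneous_generate` and its two callbacks, `is_segment_end` and `generate_segment`, are hypothetical names introduced here, and the toy policy (close a segment every two tokens, "translate" by uppercasing) stands in for the learned latent segmentation and the trained decoder described in the abstract.

```python
from typing import Callable, Iterable, Iterator, List


def simultaneous_generate(
    source_stream: Iterable[str],
    is_segment_end: Callable[[List[str]], bool],
    generate_segment: Callable[[List[str], List[str]], List[str]],
) -> Iterator[List[str]]:
    """Alternate between waiting for a source segment and emitting a target segment.

    Both callbacks are hypothetical stand-ins: in a real system the boundary
    decision and the generation step would come from a trained model.
    """
    received: List[str] = []  # every source token read so far
    pending: List[str] = []   # tokens of the source segment still being read
    emitted: List[str] = []   # every target token generated so far

    for token in source_stream:
        # WAIT: keep reading until the policy closes the current source segment.
        received.append(token)
        pending.append(token)
        if is_segment_end(pending):
            # GENERATE: emit one target segment conditioned on what has been read.
            segment = generate_segment(received, emitted)
            emitted.extend(segment)
            pending = []
            yield segment

    # The source has ended: flush whatever remains as a final segment.
    if pending:
        segment = generate_segment(received, emitted)
        emitted.extend(segment)
        yield segment


# Toy usage (assumed policy, for illustration only).
toy_source = ["a", "b", "c", "d", "e"]
segments = list(
    simultaneous_generate(
        toy_source,
        is_segment_end=lambda seg: len(seg) == 2,
        generate_segment=lambda src, tgt: [t.upper() for t in src[len(tgt):]],
    )
)
print(segments)  # [['A', 'B'], ['C', 'D'], ['E']]
```

With a fixed boundary rule like the one above, latency is set by hand. The abstract's point is that the boundary decision is instead learned, so the moments for generating adapt to the input.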