SBAAM! Eliminating Transcript Dependency in Automatic Subtitling (2405.10741v1)
Abstract: Subtitling plays a crucial role in enhancing the accessibility of audiovisual content and encompasses three primary subtasks: translating spoken dialogue, segmenting translations into concise textual units, and estimating timestamps that govern their on-screen duration. Past attempts to automate this process rely, to varying degrees, on automatic transcripts, which are used in different ways for the three subtasks. In response to the acknowledged limitations of this reliance on transcripts, recent research has shifted towards transcription-free solutions for translation and segmentation, leaving the direct generation of timestamps as uncharted territory. To fill this gap, we introduce the first direct model capable of producing automatic subtitles without any dependence on intermediate transcripts, including for timestamp prediction. Experimental results, backed by manual evaluation, show that our solution achieves new state-of-the-art performance across multiple language pairs and diverse conditions.
Authors: Marco Gaido, Sara Papi, Matteo Negri, Mauro Cettolo, Luisa Bentivogli