Efficient Ensemble for Multimodal Punctuation Restoration using Time-Delay Neural Network (2302.13376v2)
Abstract: Punctuation restoration plays an essential role in post-processing the output of automatic speech recognition, and model efficiency is a key requirement for this task. To that end, we present EfficientPunct, an ensemble method with a multimodal time-delay neural network that outperforms the current best model by 1.0 F1 points while using less than a tenth of its inference network parameters. We streamline a speech recognizer to efficiently output hidden-layer acoustic embeddings for punctuation restoration, and use BERT to extract meaningful text embeddings. By using forced alignment and temporal convolutions, we eliminate the need for attention-based fusion, greatly increasing computational efficiency while also raising performance. EfficientPunct sets a new state of the art with an ensemble that weights BERT's purely language-based predictions slightly more than the multimodal network's predictions. Our code is available at https://github.com/lxy-peter/EfficientPunct.
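The abstract describes two mechanisms: fusing frame-aligned text and acoustic embeddings with temporal convolutions (a TDNN) rather than attention, and ensembling the multimodal predictions with BERT-only predictions, weighting BERT slightly more. The following is a minimal sketch of those two steps, not the authors' implementation; all dimensions, layer sizes, the number of punctuation classes, and the 0.55 ensemble weight are illustrative assumptions rather than values from the paper.

```python
# Sketch (assumed, not the paper's code) of TDNN-based fusion and ensembling.
import torch
import torch.nn as nn

NUM_CLASSES = 4  # e.g., no punctuation, comma, period, question mark (assumed)

class TDNNFusion(nn.Module):
    """Temporal convolutions over concatenated text + acoustic frames."""
    def __init__(self, text_dim=768, acoustic_dim=256, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            # Dilated 1-D convolutions play the role of a TDNN: each layer
            # widens the temporal context without any attention mechanism.
            nn.Conv1d(text_dim + acoustic_dim, hidden,
                      kernel_size=3, dilation=1, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, NUM_CLASSES, kernel_size=1),
        )

    def forward(self, text_emb, acoustic_emb):
        # text_emb:     (batch, frames, text_dim), BERT embeddings repeated
        #               per frame via forced alignment
        # acoustic_emb: (batch, frames, acoustic_dim), ASR hidden-layer output
        x = torch.cat([text_emb, acoustic_emb], dim=-1).transpose(1, 2)
        return self.net(x).transpose(1, 2)  # (batch, frames, NUM_CLASSES)

def ensemble(bert_logits, multimodal_logits, bert_weight=0.55):
    """Weighted ensemble; BERT's text-only predictions count slightly more."""
    probs = (bert_weight * bert_logits.softmax(dim=-1)
             + (1 - bert_weight) * multimodal_logits.softmax(dim=-1))
    return probs.argmax(dim=-1)

# Toy usage with random tensors standing in for real embeddings.
fusion = TDNNFusion()
text = torch.randn(1, 100, 768)    # 100 frames of aligned BERT embeddings
audio = torch.randn(1, 100, 256)   # 100 frames of acoustic embeddings
mm_logits = fusion(text, audio)
bert_logits = torch.randn(1, 100, NUM_CLASSES)
print(ensemble(bert_logits, mm_logits).shape)  # torch.Size([1, 100])
```

Because forced alignment assigns each BERT token embedding to a span of acoustic frames, the two streams arrive already synchronized, which is what lets plain temporal convolutions replace attention-based fusion here.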