Align With Purpose: Optimize Desired Properties in CTC Models with a General Plug-and-Play Framework (2307.01715v3)
Abstract: Connectionist Temporal Classification (CTC) is a widely used criterion for training supervised sequence-to-sequence (seq2seq) models. It enables learning the relations between input and output sequences, termed alignments, by marginalizing over perfect alignments (that yield the ground truth), at the expense of imperfect alignments. This binary differentiation of perfect and imperfect alignments falls short of capturing other essential alignment properties that hold significance in other real-world applications. Here we propose $\textit{Align With Purpose}$, a $\textbf{general Plug-and-Play framework}$ for enhancing a desired property in models trained with the CTC criterion. We do that by complementing the CTC with an additional loss term that prioritizes alignments according to a desired property. Our method does not require any intervention in the CTC loss function, enables easy optimization of a variety of properties, and allows differentiation between both perfect and imperfect alignments. We apply our framework in the domain of Automatic Speech Recognition (ASR) and show its generality in terms of property selection, architectural choice, and scale of training dataset (up to 280,000 hours). To demonstrate the effectiveness of our framework, we apply it to two unrelated properties: emission time and word error rate (WER). For the former, we report an improvement of up to 570ms in latency optimization with a minor reduction in WER, and for the latter, we report a relative improvement of 4.5% WER over the baseline models. To the best of our knowledge, these applications have never been demonstrated to work on a scale of data as large as ours. Notably, our method can be implemented using only a few lines of code, and can be extended to other alignment-free loss functions and to domains other than ASR.
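The abstract describes adding a second loss term to CTC that prefers alignments with more of a desired property, and notes that this takes only a few lines of code. The following is a minimal sketch of that idea in plain Python, not the authors' implementation. It assumes a hinge-style ranking term over a pair of frame-level alignments; the helper names (`emit_earlier`, `property_hinge`, `total_loss`), the margin, and the weight are hypothetical and chosen only for illustration.

```python
import math

BLANK = 0  # assumed index of the CTC blank symbol


def collapse(alignment, blank=BLANK):
    """Standard CTC collapse: merge repeated symbols, then drop blanks."""
    out, prev = [], None
    for tok in alignment:
        if tok != prev and tok != blank:
            out.append(tok)
        prev = tok
    return out


def alignment_log_prob(log_probs, alignment):
    """Log-probability of one alignment: sum of per-frame log-probs."""
    return sum(frame[tok] for frame, tok in zip(log_probs, alignment))


def emit_earlier(alignment, blank=BLANK):
    """Hypothetical property function for lower latency.

    If the first frame is blank, drop it and pad a blank at the end, so every
    token is emitted one frame earlier and the collapsed transcript is
    unchanged. Otherwise return a copy of the alignment as-is.
    """
    if alignment and alignment[0] == blank:
        return alignment[1:] + [blank]
    return list(alignment)


def property_hinge(log_probs, sampled, preferred, margin=1.0):
    """Assumed hinge term: positive unless `preferred` beats `sampled`
    by at least `margin` in log-probability."""
    gap = alignment_log_prob(log_probs, sampled) - alignment_log_prob(log_probs, preferred)
    return max(0.0, margin + gap)


def total_loss(ctc_loss, hinge, weight=0.1):
    """The CTC loss is left untouched; the property term is simply added."""
    return ctc_loss + weight * hinge


# Toy example: 3 frames, vocabulary {0: blank, 1: "a"}.
log_probs = [
    [math.log(0.9), math.log(0.1)],
    [math.log(0.2), math.log(0.8)],
    [math.log(0.7), math.log(0.3)],
]
sampled = [BLANK, 1, BLANK]        # "a" emitted at frame 1
preferred = emit_earlier(sampled)  # "a" emitted at frame 0
hinge = property_hinge(log_probs, sampled, preferred)
```

In this sketch the CTC loss is treated as a given scalar, which mirrors the abstract's point that the CTC loss function itself needs no intervention. A different property (for example, fewer word errors) would only require swapping the hypothetical `emit_earlier` for another function that maps an alignment to a preferred one.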
Authors:
- Eliya Segev
- Maya Alroy
- Ronen Katsir
- Noam Wies
- Ayana Shenhav
- Yael Ben-Oren
- David Zar
- Oren Tadmor
- Jacob Bitterman
- Amnon Shashua
- Tal Rosenwein