Variational Connectionist Temporal Classification for Order-Preserving Sequence Modeling (2309.11983v3)

Published 21 Sep 2023 in cs.LG

Abstract: Connectionist temporal classification (CTC) is commonly adopted for sequence modeling tasks like speech recognition, where it is necessary to preserve order between the input and target sequences. However, CTC has so far only been applied to deterministic sequence models, whose latent spaces are discontinuous and sparse, making them less capable of handling data variability than variational models. In this paper, we integrate CTC with a variational model and derive loss functions that can be used to train more generalizable sequence models that preserve order. Specifically, we derive two versions of the novel variational CTC based on two reasonable assumptions: first, that the variational latent variables at each time step are conditionally independent; and second, that these latent variables are Markovian. We show that both loss functions allow direct optimization of the variational lower bound for the model log-likelihood, and present computationally tractable forms for implementing them.
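
To make the conditional-independence variant concrete, here is a minimal PyTorch sketch of an ELBO-style objective that pairs a per-time-step Gaussian posterior with the standard CTC loss. The class name VariationalCTCModel, the standard-normal prior, the single-sample Monte Carlo estimate, and the kl_weight parameter are illustrative assumptions for this sketch, not details taken from the paper.

# Illustrative sketch (not the authors' implementation): a variational CTC
# objective under the conditional-independence assumption, where each time
# step has its own Gaussian latent q(z_t | x_t) and sampled latents are
# decoded to label posteriors scored with standard CTC.
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence

class VariationalCTCModel(nn.Module):
    def __init__(self, input_dim, latent_dim, num_labels):
        super().__init__()
        self.encoder = nn.Linear(input_dim, 2 * latent_dim)  # -> mean, log-variance
        self.decoder = nn.Linear(latent_dim, num_labels)     # latent -> label logits
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, x, targets, input_lengths, target_lengths, kl_weight=1.0):
        # x: (batch, time, input_dim); targets: padded label sequences
        # q(z_t | x_t): per-step diagonal Gaussian (conditional independence)
        mean, log_var = self.encoder(x).chunk(2, dim=-1)
        q = Normal(mean, (0.5 * log_var).exp())
        z = q.rsample()  # reparameterized sample keeps gradients flowing
        # Reconstruction term: CTC over the sampled latent sequence
        log_probs = self.decoder(z).log_softmax(-1).transpose(0, 1)  # (T, B, C)
        ctc_loss = self.ctc(log_probs, targets, input_lengths, target_lengths)
        # KL term against a standard-normal prior, summed over time and latent dims
        prior = Normal(torch.zeros_like(mean), torch.ones_like(mean))
        kl = kl_divergence(q, prior).sum(dim=(1, 2)).mean()
        # Negative ELBO-style objective: CTC loss plus weighted KL regularizer
        return ctc_loss + kl_weight * kl

A single reparameterized sample per time step keeps the estimator differentiable; the paper additionally derives a version for Markovian latent variables, which this sketch does not cover.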

Authors (4)
  1. Zheng Nan (5 papers)
  2. Ting Dang (18 papers)
  3. Vidhyasaharan Sethu (11 papers)
  4. Beena Ahmed (14 papers)
Citations (2)