Test-Time Training for Speech (2309.10930v2)

Published 19 Sep 2023 in cs.SD, cs.LG, and eess.AS

Abstract: In this paper, we study the application of Test-Time Training (TTT) as a solution for handling distribution shifts in speech applications. In particular, we introduce distribution shifts into the test datasets of standard speech classification tasks -- for example, speaker identification and emotion detection -- and explore how TTT can help adjust to them. In our experiments, which include distribution shifts due to background noise and natural variations in speech such as gender and age, we identify key challenges with TTT, including sensitivity to optimization hyperparameters (e.g., the number of optimization steps and the subset of parameters chosen for TTT) and scalability (e.g., since each example gets its own set of parameters, TTT does not scale). Finally, we propose using BitFit -- a parameter-efficient fine-tuning algorithm, originally proposed for text applications, that fine-tunes only the bias parameters -- as a solution to these challenges, and demonstrate that it is consistently more stable than fine-tuning all the parameters of the model.
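The core idea the abstract combines -- test-time training restricted to bias parameters, in the spirit of BitFit -- can be illustrated with a deliberately tiny sketch. This is an illustrative toy, not the paper's implementation: the one-dimensional "model" `f = w * x + b`, the self-supervised objective of matching a training-time feature statistic `mu`, and all names here are assumptions made for exposition. Only the bias `b` is updated; the weight `w` stays frozen, mirroring BitFit's parameter selection.

```python
def ttt_adapt_bias(w, b, x, mu, steps=20, lr=0.1):
    """Toy test-time training step: adapt only the bias (BitFit-style).

    w  : frozen weight of a 1-D linear feature extractor (not updated)
    b  : bias parameter, the only thing TTT is allowed to change
    x  : a single (possibly distribution-shifted) test input
    mu : feature mean observed on clean training data; the self-supervised
         objective is to pull the test-time feature back toward it
    """
    for _ in range(steps):
        f = w * x + b                 # feature on the shifted test input
        grad_b = 2.0 * (f - mu)       # d/db of the loss (f - mu)^2
        b -= lr * grad_b              # gradient step on the bias ONLY
    return b

# A test input shifted by +2 relative to training data: the adapted bias
# absorbs the shift, moving toward mu - w*x = 0.5 - 2.5 = -2.0.
adapted = ttt_adapt_bias(w=1.0, b=0.0, x=2.5, mu=0.5)
```

Because each test example gets its own copy of `b`, even this toy makes the scalability concern from the abstract concrete: per-example state grows linearly with the test set, which is why restricting TTT to the small bias subset (rather than all parameters) matters.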

