
AIx Speed: Playback Speed Optimization Using Listening Comprehension of Speech Recognition Models

Published 5 Mar 2024 in cs.CL, cs.HC, cs.LG, cs.SD, and eess.AS | (2403.02938v1)

Abstract: Because humans can comprehend audio and video at speeds faster than the original recording, we often play such content at higher speeds to take in more of it in less time. To exploit this ability further, systems have been developed that automatically adjust playback speed according to the user's condition and the type of content, helping users comprehend time-series content more efficiently. However, these systems could extend human speed-listening ability further by generating speech whose playback speed is optimized over even finer time units. In this study, we examine whether humans can understand such optimized speech, and we propose a system that automatically adjusts playback speed at units as small as phonemes while preserving speech intelligibility. The system uses a speech recognizer's score as a proxy for how well a human can hear a given unit of speech and raises the playback speed as far as a human can still hear it. This method can produce fast but intelligible speech. In the evaluation experiment, we compared speech played back at a constant fast speed with the flexibly sped-up speech generated by the proposed method in a blind test and confirmed that the proposed method produced speech that was easier to listen to.
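
The core idea in the abstract, raising the speed of each speech unit only as far as a recognizer score allows, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the candidate speed grid, the threshold of 0.8, and the `toy_score` stand-in for a real speech recognizer are all assumptions made for illustration only.

```python
from typing import Callable, List, Sequence


def pick_segment_speeds(
    segments: Sequence[str],
    score_fn: Callable[[str, float], float],
    candidate_speeds: Sequence[float] = (1.0, 1.25, 1.5, 1.75, 2.0, 2.5, 3.0),
    threshold: float = 0.8,
) -> List[float]:
    """Return, per segment, the fastest candidate speed whose score is
    at or above `threshold`; fall back to the slowest candidate otherwise.

    `score_fn(segment, speed)` is a hypothetical proxy for how well the
    segment can still be heard at that speed (higher is better).
    """
    ordered = sorted(candidate_speeds)
    speeds = []
    for seg in segments:
        best = ordered[0]  # fallback: slowest candidate
        for speed in ordered:
            if score_fn(seg, speed) >= threshold:
                best = speed  # keep the highest speed that passes
        speeds.append(best)
    return speeds


def toy_score(segment: str, speed: float) -> float:
    """Toy stand-in for a recognizer score (an assumption, not a real model):
    longer segments lose intelligibility faster as speed increases."""
    return max(0.0, 1.0 - 0.05 * len(segment) * (speed - 1.0))


if __name__ == "__main__":
    units = ["a", "hello", "intelligibility"]
    print(pick_segment_speeds(units, toy_score))  # [3.0, 1.75, 1.25]
```

In a real system the toy scorer would be replaced by a score obtained from a speech recognition model run on the time-stretched audio of each unit, and the units would be short spans of audio rather than strings.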

Authors (2)
