AnnoTheia: A Semi-Automatic Annotation Toolkit for Audio-Visual Speech Technologies (2402.13152v1)

Published 20 Feb 2024 in cs.CV and cs.CL

Abstract: More than 7,000 known languages are spoken around the world. However, due to the lack of annotated resources, only a small fraction of them are currently covered by speech technologies. Although self-supervised speech representations, recent massive speech corpora collections, and the organization of challenges have alleviated this inequality, most studies are still benchmarked mainly on English. This situation is aggravated when tasks involving both the acoustic and visual speech modalities are addressed. In order to promote research on low-resource languages for audio-visual speech technologies, we present AnnoTheia, a semi-automatic annotation toolkit that detects when a person is speaking in a scene and provides the corresponding transcription. In addition, to show the complete process of preparing AnnoTheia for a language of interest, we also describe the adaptation of a pre-trained active speaker detection model to Spanish, using a database not originally conceived for this type of task. The AnnoTheia toolkit, tutorials, and pre-trained models are available on GitHub.
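To make the described workflow concrete, here is a minimal sketch (in Python) of the kind of semi-automatic annotation loop the abstract outlines: an active speaker detection (ASD) model proposes speech segments with confidence scores, an ASR system drafts transcriptions, and low-confidence segments are routed to a human annotator. All function names and the stub logic below are illustrative assumptions, not AnnoTheia's actual API; the real toolkit is available on GitHub.

```python
"""Hypothetical sketch of a semi-automatic audio-visual annotation loop.

This is NOT the AnnoTheia implementation. It only illustrates the general
pipeline the abstract describes: active speaker detection + transcription,
with low-confidence candidates deferred to a human annotator.
"""

from dataclasses import dataclass


@dataclass
class Segment:
    start: float          # segment start time, in seconds
    end: float            # segment end time, in seconds
    speaker_score: float  # ASD confidence that the on-screen face is speaking
    transcript: str       # draft transcription proposed by an ASR model


def propose_segments(video_path: str) -> list[Segment]:
    """Stand-in for the automatic stage of the pipeline.

    A real pipeline would detect and track faces, score each face track
    with an audio-visual ASD model, and transcribe the audio with an ASR
    system. Here we just return a fixed dummy candidate.
    """
    return [Segment(start=0.0, end=2.5, speaker_score=0.93,
                    transcript="hola, buenos dias")]


def review(segments: list[Segment], threshold: float = 0.8) -> list[Segment]:
    """Keep high-confidence segments; flag the rest for manual correction.

    This deferral to a human annotator is the 'semi-automatic' step.
    """
    accepted = []
    for seg in segments:
        if seg.speaker_score >= threshold:
            accepted.append(seg)
        else:
            # A real toolkit would open an annotation GUI at this point.
            print(f"needs review: {seg.start:.1f}-{seg.end:.1f}s "
                  f"-> {seg.transcript!r}")
    return accepted


if __name__ == "__main__":
    candidates = propose_segments("example_video.mp4")
    dataset = review(candidates)
    print(f"{len(dataset)} segment(s) accepted automatically")
```

The confidence threshold is the key design knob in such a loop: raising it shifts work toward the human annotator but yields cleaner labels, which matters when bootstrapping corpora for low-resource languages.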

