Towards Event Extraction from Speech with Contextual Clues (2401.15385v1)
Abstract: While text-based event extraction has been an active research area with successful applications in many domains, extracting semantic events directly from speech is an under-explored problem. In this paper, we introduce the Speech Event Extraction (SpeechEE) task and construct three synthetic training sets and one human-spoken test set. Compared to event extraction from text, SpeechEE poses greater challenges, mainly because speech signals are continuous and lack word boundaries. Additionally, unlike perceptible sound events, semantic events are more subtle and require deeper understanding. To tackle these challenges, we introduce a sequence-to-structure generation paradigm that produces events from speech signals in an end-to-end manner, together with a conditioned generation method that uses speech recognition transcripts as contextual clues. We further propose to represent events in a flat format to make the outputs more natural-language-like. Our experimental results show that our method brings significant improvements on all datasets, achieving a maximum F1 gain of 10.7%. The code and datasets are released at https://github.com/jodie-kang/SpeechEE.
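To make the abstract's two central ideas concrete, here is a minimal Python sketch of (a) flattening a structured event record into a natural-language-like target string and (b) prepending an ASR transcript as the contextual clue for conditioned generation. This is not the authors' released code; the field names, separator, and task prefix are illustrative assumptions, not the paper's exact format.

```python
# Sketch of the flat event format and transcript-conditioned input.
# All field names and separators below are illustrative assumptions.

def linearize_event(event: dict) -> str:
    """Flatten a structured event into a single natural-language-like string."""
    parts = [f"event type is {event['type']}", f"trigger is {event['trigger']}"]
    for role, argument in event.get("arguments", {}).items():
        parts.append(f"{role} is {argument}")
    return "; ".join(parts)

def build_conditioned_input(transcript: str, task_prefix: str = "extract events:") -> str:
    """Prepend the ASR transcript so the decoder can attend to it as a contextual clue."""
    return f"{task_prefix} {transcript}"

if __name__ == "__main__":
    event = {
        "type": "Attack",
        "trigger": "fired",
        "arguments": {"attacker": "the soldiers", "place": "Baghdad"},
    }
    transcript = "the soldiers fired on protesters in Baghdad yesterday"
    print(build_conditioned_input(transcript))
    # -> extract events: the soldiers fired on protesters in Baghdad yesterday
    print(linearize_event(event))
    # -> event type is Attack; trigger is fired; attacker is the soldiers; place is Baghdad
```

Note that in the paper's end-to-end setting the transcript conditions a speech-to-structure decoder rather than a plain text model; this sketch only shows how a conditioning input and a flat target string could be assembled.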